
Parallel Stages



Processing stages

27/02/2009

PROCESSING STAGES:

Aggregator Stage:

Aggregator stage is an Active stage. It accepts 1 input and 1 output. It is used for aggregating data based on a key column: we group by on a column and do various calculations on that group.

This stage takes data from the input and does aggregations on each group. We can do calculations like Sum, Average, Min, Max, Percentage and also Count the values based on the key columns from the input.
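
For illustration only, here is a minimal Python sketch of the group-and-calculate logic this stage performs (this is not DataStage syntax; the dept and salary columns are hypothetical):

from itertools import groupby

# Hypothetical rows arriving on the input link.
rows = [
    {"dept": "A", "salary": 100}, {"dept": "A", "salary": 300},
    {"dept": "B", "salary": 200},
]

rows.sort(key=lambda r: r["dept"])        # the stage expects data sorted on the group key

for dept, grp in groupby(rows, key=lambda r: r["dept"]):
    values = [r["salary"] for r in grp]
    # Calculate type: Sum, Average, Min, Max and Count for each group.
    print(dept, sum(values), sum(values) / len(values), min(values), max(values), len(values))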

For better performance, the key column should be sorted before aggregation and the data hash partitioned.

Properties:

-- Group key: It is the key by which we group the data. It arranges similar records together, which makes any subsequent calculation easier.

-- Aggregate type: Here we have Calculate / Count / Re-calculation options.

Calculate: We can do calculations like Sum, Average, Percentage, Minimum, Maximum, Mean and so on.

Re-calculation: This option is used when we have to do more than one calculation. For this scenario we need to connect one Aggregator stage followed by another Aggregator. The output data set from the first stage becomes the input for the second stage, where the re-calculation is done.

COPY stage:

Copy stage is an Active stage. It can have a single input link and any number of output links. It is used to copy a single input data set to multiple output data sets. Thus, each record from the input data is copied to every output data set. This stage is used to take a backup of a data set to some other location on the disk, or when we want multiple copies of the data for different processes in other jobs. The Copy stage cannot alter the data or its order.

Properties:

-- Force = True / False

When we have a single input and a single output in the job, set Force to True; when we have a single input and multiple outputs, set it to False. By default it is set to False.

Funnel:

Funnel stage is an Active stage. It is used to copy multiple input data sets to a single output data set. It accepts any number of input links and only one output link. For this stage all input data sets must have the same metadata.

Properties:

Funnel type: Continuous Funnel / Sort Funnel / Sequential Funnel

Continuous Funnel: It picks the input records in no particular order. It takes one record from each input link in turn. If data is not available in one input data set, the stage skips to the next link rather than waiting for the data.

Sort Funnel: It combines the input records in the order defined by the key columns, and the order of the output records is determined by these sorting keys. All input data sets for a sort funnel must be hash partitioned before they are sorted. This ensures that all records with the same key column value are located in the same partition and will be processed by the same node.

Sequential Funnel: It copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.
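
A minimal Python sketch of the three funnel behaviours, assuming two equal-length input links (illustration only, not DataStage syntax):

from itertools import chain
from heapq import merge

link1 = [{"id": 1}, {"id": 4}]
link2 = [{"id": 2}, {"id": 3}]

# Sequential Funnel: all records from the first link, then the second, and so on.
sequential = list(chain(link1, link2))

# Sort Funnel: inputs already sorted on the key are merged so the output stays ordered.
sort_funnel = list(merge(link1, link2, key=lambda r: r["id"]))

# Continuous Funnel: one record is taken from each link in turn, so no overall order is guaranteed.
continuous = [r for pair in zip(link1, link2) for r in pair]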

Remove Duplicates stage:

It is an Active stage. It accepts one input and a single output. It takes sorted data set as input, removes duplicate records and populates the resultant records to an output data set.

The input data to this stage should be sorted so that all records with identical key values are adjacent.

Properties:

Key: The key based on which the record is considered as a duplicate. We can simply select the columns from the drop-down list.

Duplicates to Retain: First / Last

This option decides which duplicate record is retained by the stage. By default it is set to First.
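
A minimal Python sketch of the remove-duplicates logic on a sorted input, with the Duplicates to Retain option (hypothetical cust_id key; not DataStage syntax):

from itertools import groupby

# Sorted input: records with identical key values are adjacent.
rows = [
    {"cust_id": 1, "city": "NY"}, {"cust_id": 1, "city": "LA"},
    {"cust_id": 2, "city": "SF"},
]

retain = "First"                      # Duplicates to Retain: First / Last
output = []
for key, grp in groupby(rows, key=lambda r: r["cust_id"]):
    grp = list(grp)
    output.append(grp[0] if retain == "First" else grp[-1])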

Filter stage:

Filter stage is an Active stage. It can have one input link, any number of output links, and a single reject link.

This stage transforms the input data with respect to the given conditions and filters out the records which do not satisfy the conditions. The filtered-out records can also be sent to the reject link.

Properties:

Where clause: Here we can provide a condition such as Store_Loc='New York'. Now only the records which satisfy this condition are populated to the output. We can give multiple conditions for different columns from the input link.

Output rejects: True / False
If we set the option to True, a record which does not match the condition is sent to the reject link; if the option is set to False, the unmatched records are ignored.

Output records only once: True / False
When we give multiple conditions to the records from the input, sometimes a single record may satisfy more than one condition. We can then send the same record to multiple outputs by setting this option to False, which allows the same valid record to be propagated to multiple outputs. Setting the option to True does not allow a record to go to multiple outputs even when it satisfies multiple conditions.

Sort stage:

Sort stage is an Active stage. This stage can have one input link which carries the data to be sorted, and a single output link carrying the sorted data. We specify sorting keys as the criteria on which to perform the sort. We can specify more than one key for sorting. The first key is called the primary key and the others are secondary key columns. If multiple records have the same value for the primary key column, then this stage uses the secondary columns to sort the records. The stage uses temporary disk space when performing the sort operation.

Properties:

Sort key: A key is a column on which the data is to be sorted. For example, if we sort the data on the City_Code column, the data is sorted on this key and populated to the output.

Allow Duplicates: True / False

If multiple records have identical sorting key values, only one record is retained if this is set to False; duplicate records are also populated to the output if set to True. It is set to True by default. If Stable Sort is True, then the first record is retained. This property is not available for the UNIX sort type.

Sort Utility: DataStage / UNIX

--- DataStage: This is the default. It uses the built-in DataStage sorter; you do not require any additional software to use this option.

--- UNIX: This specifies that the UNIX sort command is used to perform the sort.

Stable Sort: It is applicable when we select the Sort Utility as DataStage. True guarantees that this sort operation will not rearrange records that are already in a properly sorted data set. If set to False, no prior ordering of records is guaranteed to be preserved by the sorting operation.

Output Statistics: True / False
If set to True, it causes the sort operation to output statistics. This property is not available for the UNIX sort type. It is set to False by default.

Surrogate key generator stage:

Surrogate Key stage is an Active stage. It can have one input link and a single output link. It generates key columns for an existing data set. This stage generates sequentially incrementing unique integers from a given starting point. The existing columns of the data set are passed straight through the stage. Key values are generated per node, so the input data partitions should be evenly balanced across the nodes; this can be achieved using the round robin partitioning method. If the stage is operating in parallel, each node will increment the key by the number of partitions being written to.

Properties:

Surrogate key name: The new column name given to the surrogate key field.

Output type: 16 bit / 32 bit / 64 bit

Start Value: Here we can provide the number from where the new key should be generated.

Change Capture stage:

Change Capture Stage is a processing stage. It accepts two inputs and a single output link. The stage compares two data sets and makes a record of the differences. Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The table definition of the change data set is derived from the after data set's table definition, with an additional change code column whose values encode the four actions: insert, delete, copy, and edit.

The comparison is based on a set of key columns; rows from the two data sets are compared on these columns. The stage assumes that the incoming data is hash-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage.

We can use both sequential and parallel modes of execution for the Change Capture stage. We can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to reproduce the after data set.

It generates the codes for Copy, Edit, Delete and Inserted records. The default codes are: Copy - 0, Insert - 1, Delete - 2, Update - 3.

Properties:

Change key: Specifies the name of a difference key input column. We can specify multiple difference key input columns here. This is the key on which we identify a record as a new/updated/deleted record.

--- Sort order: Ascending / Descending
We can sort the key values in ascending or descending order for better performance.

Change Value: The name of a value input column whose data is compared to identify the changes in the input records. We can select a column from the drop-down list.

Change Mode: Explicit Keys & Values / All keys, Explicit values / Explicit Keys, All Values
This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined, but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that key columns must be defined but all other columns are value columns unless they are excluded.

Drop output for Copy: True / False

Drops the copied records from the output data set if set to True and populates them to the output data set if set to False.

Drop output for Insert: True / False

Drops the inserted records from the output data set if set to True and populates them to the output data set if set to False.

Drop output for Delete: True / False

Drops the deleted records from the output data set if set to True and populates them to the output data set if set to False.

Drop output for Update: True / False

Drops the updated records from the output data set if set to True and populates them to the output data set if set to False.
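
A minimal Python sketch of the change-capture comparison, using the default change codes listed above (hypothetical id key and name value column; not DataStage syntax):

COPY, INSERT, DELETE, UPDATE = 0, 1, 2, 3        # default change codes

before = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
after  = {1: {"id": 1, "name": "a"}, 3: {"id": 3, "name": "c"}}

changes = []
for key, row in after.items():
    if key not in before:
        changes.append({**row, "change_code": INSERT})
    elif row == before[key]:
        changes.append({**row, "change_code": COPY})     # dropped when Drop Output for Copy = True
    else:
        changes.append({**row, "change_code": UPDATE})
for key, row in before.items():
    if key not in after:
        changes.append({**row, "change_code": DELETE})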

Change Apply stage:

Change Apply Stage is a processing stage. It accepts two input links and a single output link. It takes the change data set from the Change Capture stage and the before data set, and applies the changes to the before data set to compute an after data set. The Change Apply Stage follows the Change Capture Stage.

Properties:

Change key: Specifies the name of a difference key input column. We can specify multiple key input columns here. This is the key on which we identify a record as a new/updated/deleted record.

--- Sort order: Ascending / Descending
We can sort the key values in ascending or descending order for better performance.

Change Value: The name of a value input column whose data is compared to identify the changes in the input records. We can select the column from the drop-down list.

Change Mode: Explicit Keys & Values / All keys, Explicit values / Explicit Keys, All Values

This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined, but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that key columns must be defined but all other columns are value columns unless they are excluded.

Check value columns on delete:

Log statistics: True / False

Modify Stage:

Modify stage is a processing stage. It can have one input link and a single output link. Modify stage alters the record schema of its input data set. The modified data set is then output. We can change data types, rename columns, drop columns, keep columns and handle nulls.

For example, you can use the conversion hours_from_time to convert a time to an int8, or to an int16, int32 and so on. We have to provide the specification for the destination column in the properties window.

Syntax for conversion:

new_columnname:new_type = conversion_function (old_columnname)

Column_Name=Handle_Null('Column_Name',Value)

Properties:

Options:

Specification: Here we need to specify the conversion for the output column. Ex: HIREDATE = date_from_timestamp (HIREDATE)
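
A minimal Python sketch of the kind of per-row conversions the Modify specification describes, a date extraction and null handling (hypothetical column names and default value; not DataStage syntax):

row = {"HIREDATE": "2009-02-27 00:00:00", "SAL": None}

out = {
    # new_columnname:new_type = conversion_function(old_columnname)
    "HIREDATE": row["HIREDATE"].split(" ")[0],        # like date_from_timestamp(HIREDATE)
    "SAL": 0 if row["SAL"] is None else row["SAL"],   # like Handle_Null('SAL', 0)
}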

Switch Stage:

Switch stage is a processing stage. It can have one input link, up to 128 output links and a single reject link. The Switch stage takes a single data set as input and assigns each input row to an output data set based on the value of a selector field (column name). This stage performs an operation similar to a C switch statement. Rows that satisfy none of the cases are populated to the reject link.

Properties:

Selector: Specifies the input column that the switch applies to. Unlike Filter stage, we can specify multiple conditions on a single column here.

Selector Mode: Auto / Hash / User-defined Mapping

Specifies how you are going to define the case statements for the switch. We can choose one from the above options.

Auto can be used where there are as many distinct selector values as there are output links. Here we cannot say which record is populated to which output link; the records are randomly sent to the output data sets. We can have a reject link in this mode.

Hash: The incoming rows are hashed on the selector column modulo the number of output links and assigned to an output link accordingly. We cannot have a Reject link in this mode.

User-defined Mapping means you must provide explicit mappings from case values to outputs. If you use this mode you specify the switch expression under the User-defined Mapping category. This is the default option in the stage.

When we select the User-defined Mapping option, it will ask for Case value which is nothing but the number of outputs to populate the resultant data.

Case: This property appears if you have chosen a Selector Mode of User-defined Mapping. You must specify a selector value for each value of the input column that you want to direct to an output column. Repeat the Case property to specify multiple values. You can omit the output link label if the value is intended for the same output link.

Syntax for the expression: Selector_Value = Output_Link_Label_Number

Options Category:

If not found: Fail / Drop / Output

Specifies the action to take if a row fails to match any of the case statements. It is not visible if you choose a Selector Mode of Hash. We can choose between the following options:

Fail: Causes the job to fail.

Drop: Drops the record.

Output: Record will be sent to the Reject link.
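
A minimal Python sketch of the switch behaviour described above, with a hypothetical store_loc selector column and case mapping (not DataStage syntax):

def switch(row, mode="User-defined Mapping", if_not_found="Output"):
    cases = {"NY": 0, "LA": 1}                 # Selector_Value = Output_Link_Label_Number
    selector = row["store_loc"]
    if mode == "Hash":
        return hash(selector) % len(cases)     # hashed on the selector modulo the number of links
    if selector in cases:
        return cases[selector]
    if if_not_found == "Fail":
        raise ValueError("no matching case")   # Fail: the job aborts
    if if_not_found == "Drop":
        return None                            # Drop: the record is discarded
    return "reject"                            # Output: the record goes to the reject link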

Pivot Stage:

Pivot stage is an Active stage. It accepts one input and a single output. It converts multiple columns into rows. The data types of the input columns being pivoted should be the same, and the output column is created with that same data type.

Scenario: Let us assume that Mark-1 and Mark-2 are two columns in the input data set. Now we need to convert these two columns into one column with the name "Marks". So, we have to provide the derivation in the derivation field of the output column Marks during the process. Thus a new column "Marks" is derived from the input columns Mark-1 and Mark-2. We need to provide the following derivation for the Marks column:

(Marks = Mark-1, Mark-2)
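
A minimal Python sketch of this horizontal pivot, turning the two mark columns into one Marks column (hypothetical rows; not DataStage syntax):

rows = [{"name": "Ram", "Mark-1": 60, "Mark-2": 75}]

output = []
for r in rows:
    for source in ("Mark-1", "Mark-2"):        # derivation: Marks = Mark-1, Mark-2
        output.append({"name": r["name"], "Marks": r[source]})
# output: [{'name': 'Ram', 'Marks': 60}, {'name': 'Ram', 'Marks': 75}]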

JOIN Stage:

Join stage is a processing stage. It has any number of input links and a single output link. It does not allow a reject link. It performs join operations on two or more data sets input to the stage and then outputs the resulting data set. The input data sets are called the left set, the right set and the intermediate sets. You can specify which is which.

Join stage can perform four join operations: Inner Join, Left Outer Join, Right Outer Join and Full Outer Join. The default is Inner Join.

The data sets input to the Join stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. Choosing the Auto partitioning method will ensure that partitioning and sorting are done. If sorting and partitioning are carried out in a separate stage before the Join stage, DataStage in Auto mode will detect this and will not repartition the data again.
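
A minimal Python sketch of the inner and left outer join behaviour on a hypothetical id key (the real stage works on sorted, key-partitioned links; this is illustration only):

left  = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "city": "NY"}, {"id": 3, "city": "LA"}]
right_by_key = {r["id"]: r for r in right}

inner, left_outer = [], []
for row in left:
    match = right_by_key.get(row["id"])
    if match:
        inner.append({**row, **match})              # Inner: only rows whose key values match
    left_outer.append({**row, **(match or {})})     # Left Outer: every left row is kept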

Properties:

Join Key: This is the Column name on which the input tables are joined together and matched data is sent to the output data set. We can select multiple keys for joining tables.

Join type: Inner / Left Outer / Right Outer / Full Outer

Inner Join: Transfers records from input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped

Left Outer Join: Transfers all values from the left data set but transfers values from the right data set and intermediate data sets only where key columns match. The stage drops the key column from the right and intermediate data sets.

Right Outer Join: Transfers all values from the right data set and transfers values from the left data set and intermediate data sets only where key columns match. The stage drops the key column from the left and intermediate data sets.

Full Outer Join: Transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set. It also transfers records whose key columns contain unequal values from both input data sets to the output data set. Full outer joins do not support more than two input links.

Merge Stage:

Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links.

Merge Stage combines a sorted master data set with one or more update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records.

Unlike the Join stage and Lookup stage, the Merge stage allows you to specify several reject links. You must have the same number of reject links as you have update links. You can also specify whether to drop unmatched master rows, or output them on the output data link.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node.
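
A minimal Python sketch of merging a master data set with one update data set, including the unmatched-master and reject handling covered by the properties below (hypothetical id key; not DataStage syntax):

master = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
update = [{"id": 2, "grade": "A"}, {"id": 9, "grade": "B"}]
upd_by_key = {u["id"]: u for u in update}

merged = []
for m in master:
    u = upd_by_key.pop(m["id"], None)
    if u:
        merged.append({**m, **u})    # master columns plus the extra update columns
    else:
        merged.append(m)             # Unmatched Master Mode = Keep (Drop would skip the row)
rejects = list(upd_by_key.values())  # unmatched update rows go to that update link's reject link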

Properties:

Merge Key: It is the key column that exists in both the master and update records. We can select the common key from the drop-down list. We can give multiple key columns to merge the tables.

Sort order: Ascending / Descending

Options:

Unmatched master mode = Drop / Keep
If set to Keep, it specifies that unmatched rows from the master link are output to the merged data set. If set to Drop, the unmatched master records are dropped. It is set to Keep by default.

Warn on reject updates = True / False

Warn on Unmatched Masters = True / False

Warn On Unmatched Masters: This will warn you when records from the master link are not matched. Set it to False to receive no warnings. It is set to True by default.

Warn On Reject Updates: It will warn you when bad records from any update links are rejected. Set it to False to receive no warnings. It is set to True by default.

Lookup stage:

Lookup stage is a processing stage. It can have a reference link, a single input link, a single output link, and a single reject link. Depending upon the type and setting of the stage(s) providing the look up information, it can have multiple reference links.

Lookup Stage performs lookup operations on a data set read into memory from any other Parallel job stage that can output data. It can also perform lookups directly in a DB2 or Oracle database or in a lookup table contained in a Lookup File Set stage. Lookups can also be used for validation of a row. If there is no entry in the lookup table corresponding to the key's values, the row is rejected.

There are two types of lookup available with this stage: Normal lookup and Sparse lookup. Normal lookup is used when the reference data is small in size compared to the master data set; when the reference data is huge, we need to set the lookup type to Sparse. Alternatively, in that case it is better to go for a Join stage.
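
A minimal Python sketch of a normal lookup, where the reference data is held in memory and the Lookup Failure action decides what happens to unmatched rows (hypothetical id key and region column; not DataStage syntax):

reference = {r["id"]: r for r in [{"id": 1, "region": "East"}]}

def lookup(row, lookup_failure="Reject"):
    ref = reference.get(row["id"])
    if ref:
        return {**row, **ref}
    if lookup_failure == "Continue":
        return {**row, "region": None}        # pass the row on with nulls for the lookup columns
    if lookup_failure == "Drop":
        return None                           # Drop: the row is discarded
    if lookup_failure == "Fail":
        raise RuntimeError("lookup failed")   # Fail: the job aborts
    return "reject"                           # Reject: the row goes to the reject link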

We can set the lookup operations based on the following properties available in the stage tab.

Condition Not Met and Lookup Failure:

Choose an option from the Condition Not Met drop-down list. Possible actions are:

Continue: It continues processing any further lookups before sending the row to the output link.

Drop: Drops the row and continues with the next lookup.

Fail: Causes the job to issue a fatal error in the log file and stop.

Reject: Sends the row to the reject link.

To specify the action taken if a lookup on a link fails, choose an action from the Lookup Failure drop-down list. Possible actions are:

Continue: Continues processing any further lookups before sending the row to the output link.

Drop: Drops the row and continues with the next lookup.

Fail: Causes the job to issue a fatal error and stop.

Reject: Sends the row to the reject link.

Difference between Lookup, Merge & Join Stage:

These three stages combine two or more input links according to values of user-designated "key" column(s). They differ mainly in:

1. Memory usage

2. Treatment of rows with unmatched key values
3. Input requirements (sorted, de-duplicated)

The main difference between joiner and lookup is in the way they handle the data and the reject links. Lookup provides a reject link. Join does not allow reject links.

Lookup is used if the data being looked up can fit in the available temporary memory. If the volume of data is too huge to fit into memory, it is safer to go for the Join stage and avoid Lookup, as paging can occur when Lookup is used.

Join requires the input dataset to be key partitioned and sorted. Lookup does not have this requirement.

Merge allows us to capture failed lookups from each reference input separately. It also requires identically sorted and partitioned inputs and, if there is more than one reference input, de-duplicated reference inputs.

For the Merge stage, duplicates should be removed from the master dataset before processing, and also from the update datasets if there is more than one update dataset. The above-mentioned step is not required for the Join and Lookup stages.

Dataset stage:

Dataset stage is a file stage. It allows you to read data from or write data to a data set. The stage can have one input link or a single output link; it will not allow both input and output links at the same time. The data in the Dataset is stored in an internal format. Using datasets wisely can be key to good performance in a set of linked jobs.

A Dataset consists of two parts:

1. Descriptor file: Contains metadata and data location.

2. Data file: Contains the data.

Datasets are operating system files, each referred to by a control file, which has the suffix .ds. Parallel jobs use datasets to manage data within a job. They allow you to store data in persistent form, which can then be used by other jobs. We can also manage data sets independently of a job using the Data Set Management utility, available from the DataStage Director and Manager.

A Dataset is saved across nodes using the partitioning method selected, so it is always faster when used as a source or target. It can be configured to execute in parallel or sequential mode. If the Data Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. By default the stage partitions in Auto mode.

A sequential file used as the source or target needs to be repartitioned, as it is (as the name suggests) a single sequential stream of data.

Properties:

Source category:

File: The name of the control file for the data set. We can browse for the file or enter a job parameter. By convention this file has the suffix .ds.

Target category: (When used as a Target stage)

File: The name of the control file for the data set. You can enter the file name manually, browse for the file, or enter it via a job parameter if the file already exists. By convention, the file has the suffix .ds.

Update Policy: Specifies what action will be taken if the data set you are writing to already exists.

Append: Appends any new data to the existing data.
Create (Error if exists): DataStage reports an error if the data set already exists.
Overwrite: Overwrites any existing data with new data. The default is Overwrite.

Sequential File Stage:

Sequential File stage is a file stage. The stage can have a single input link or a single output link, and a single reject link.

It allows you to read data from or write data to one or more flat files. We can access text files and .csv files from this stage. The stage executes in parallel mode if reading multiple files but executes sequentially if it is reading only one file. By default a complete file will be read by a single node.

The Sequential File stage cannot handle nulls by itself; we have to handle nulls at the time of the source extract. We can provide the format of the flat file to which we are writing with the help of the Format tab available in the stage.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on the execution modes of the current and preceding stages and how many nodes are specified in the Configuration file. If the Sequential File stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method. If the Sequential File stage is set to execute in parallel (i.e., is writing to multiple files), then we can set a partitioning method by selecting from the Partition type drop-down list. This will override the current partitioning.

Properties:

Source category:

File: The name of the source flat file. We can browse for the file or enter a job parameter.

Read method: Specific files / File pattern

Specific files: Specify the pathname of the file being read from (repeat this for reading multiple files).
File pattern: Specify the pattern of the files to read from.

Options:

First line is column name: True / False

The stage considers the first line of the file as column names if set to True and ignores the first line if set to False.

Keep file partitions: True / False

It keeps the partitioning available in the source and does not re-partition if set to True.

Missing file Mode: Depends / Error / OK

Depends means the default is Error unless the file has...
Error to stop the job if one of the files mentioned does not exist.
OK to skip the file.

Reject mode: Continue / Fail / Output
Continue means the stage discards any rejected records.
Fail to stop the job if any record is rejected.
Output to send the rejected records to a reject link.

Report Progress: Yes / No

Enables or disables logging of a progress report at intervals.

Column Generator stage:

Column Generator stage is a development / debugging stage. It can have one input link and a single output link. The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed. The new data set is then considered as an output.

The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

Options:

Column method: Explicit / Schema file

Explicit means you should specify the meta data for the columns you want to generate on the Output page Columns tab. If you use the Explicit method, you also need to specify which of the output link columns you are generating. You can repeat this property to specify multiple columns. If you use the Schema File method, you should specify the schema file.

A schema file is a plain text file in which the meta data for a stage is specified.

A Schema consists of a record definition. The following is an example for record schema:

record(

name:string[];

address:nullable string[];

date:date[];

)

Row Generator stage:

Row Generator stage is a Development/Debugging stage. It has no input links, and a single output link. It is used to generate the Mock data. We need to tell the stage how many columns to be generated and what data type each column has. We do this in the Output page Columns tab by specifying the column name and column definitions.

By default the Row Generator stage runs sequentially, generating data in a single partition. You can, however, configure it to run in parallel, and you can use the partition number when you are generating data to, for example, increment a value by the number of partitions. You will also get the Number of Records you specify in each partition.
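
A minimal Python sketch of generating mock rows per partition, incrementing an id by the number of partitions so values stay unique across a parallel run (hypothetical columns; not DataStage syntax):

def generate(partition_number, num_partitions, num_records=10):
    return [
        {"id": partition_number + i * num_partitions, "name": "row_%d" % i}
        for i in range(num_records)
    ]

# A two-node run: partition 0 gets ids 0, 2, 4, ... and partition 1 gets 1, 3, 5, ...
p0 = generate(0, 2)
p1 = generate(1, 2)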

Properties:

Options:

Number of Records: The number of records you want your generated data set to contain. The default number is 10.

Schema File: (optional) By default the stage will base the mock data set on the meta data defined on the output link, but we can specify the column definitions in a schema file if required. We can browse for the schema file or specify a job parameter.

Peek stage:

Peek stage is a development / debug stage. It has one input link and any number of output links.

The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. This can be helpful for monitoring the progress of your application or to diagnose a bug in your application.

Properties:

Rows Category:

All Records (After Skip): True / False

Prints all records from each partition if set to True. It is set to False by default.

Number of Records (Per Partition): Specifies the number of records to print from each partition. The number of records is 10 by default.

Columns Category:

Peek All Input Columns = True / False

Set to False to specify that only selected columns will be printed, and specify these columns using the Input Column to Peek property. It is set to True by default, which prints all the input columns to the output.

Input Column to Peek: If you have set Peek All Input Columns to False, use this property to specify a column to be printed. Repeat the property to specify multiple columns.

Partitions Category:

All Partitions: True / False

Set to False to specify that only certain partitions should have columns printed, and specify which partitions using the Partition Number property. It is set to True by default.

Partition Number: If you have set All Partitions to False, use this property to specify which partition you want to print columns from. Repeat the property to specify multiple partitions.

Options Category:

Peek Records Output Mode: Job log / Output

Specifies whether the output should go to an output column (the Peek Records column) or to the job log.

Show Column Names: True / False

If set as True, causes the stage to print the column name, followed by a colon, followed by the column value. If set to False, the stage prints only the column value, followed by a space. It is set to True by default.

Transformer stage:

Transformer stage is a processing stage. Transformer stage can have one input link and any number of output links. It can also have a reject link that takes any rows which have not been written to any of the output links by reason of a write failure or expression evaluation failure.

Transformer stages do not extract data or write data to a target database. They are used to handle extracted data, perform any conversions required, and pass data to another Transformer stage or to a stage that writes data to the target.

In the Transformer Editor window, we can create new columns, delete columns from a link, move columns within a link, edit column meta data, define output column derivations, define link constraints, specify the order in which the links are to be processed, and define local stage variables.

We can simply drag and drop the metadata from input link to the output link. We can specify derivations for columns in the output pane where we can provide System variables, functions, job_parameters, ds_macros and ds_routines.

Stage constraints: A constraint is an expression that specifies criteria that data must meet before it is passed to the output link. If the constraint expression evaluates to TRUE for an input row, the data row is output on that link. Rows that are not output on any of the links can be output on the otherwise link. Constraint expressions on different links are independent.

Stage variables: This provides a method of defining expressions which can be reused in the output column derivations. These values are not passed to the output.

The stage variables in the transformer must be kept in the required order if they have dependencies on other stage variables. For example, if you have three stage variables called A, B and C, and B depends upon A, then you need to maintain them in the order A, B, C. Otherwise you will get very strange and wrong results.

If the result of a stage variable is a single character, then keep the length of the variable as Varchar 1; otherwise, you will not get proper results.

The following is the order of execution at the time of processing records:

Stage variables -> Stage constraints -> Column derivations
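
A minimal Python sketch of that per-row order, stage variables first (in dependency order), then the constraint, then the output column derivations (hypothetical qty and price columns; not DataStage syntax):

def process(row):
    # Stage variables, evaluated in their declared order (svB depends on svA).
    svA = row["qty"] * row["price"]
    svB = svA * 0.1
    # Link constraint: the row is written to the link only if the expression is true.
    if svA <= 0:
        return None          # the row falls through to the otherwise/reject link
    # Output column derivations can reuse the stage variables.
    return {"amount": svA, "tax": svB}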

Oracle Enterprise stage:

Oracle Enterprise Stage is a database stage. It allows you to read data from and write data to an Oracle database. The Oracle Enterprise Stage can have one input link and a single output link, or a single reject link or output reference link.

This stage helps us in creating tables, loading data into an existing table, updating the table, deleting records from the table and truncating the table. Once you have given the source table name in the properties window, you need to load the metadata of the table so that data can be retrieved from the table.

For better performance, always go for user-defined queries rather than auto-generated queries when you have joins on the tables. Create indexes on tables. Sometimes partitioning at the database level also helps performance.

Properties:

Table: Specifies the name of the table to write to. We can specify a job parameter if required. It appears only when Write Method = Load.

Write Method: Load / Delete Rows / Upsert

Load: To load the data to the target table.

Delete Rows: Allows you to specify how the delete statement is to be derived, using the Auto-generated Delete or User-defined Delete actions.

Upsert: Allows you to provide the insert and update SQL statements for writing records. We can restrict the Oracle stage to take actions like update only, or update and insert. These tasks can be achieved with the User-defined update or Auto-generated update actions available in the Oracle stage.

Write Mode: Append / Create / Replace / Truncate
It appears only when Write Method = Load.

Append: New records are appended to an existing table. This is the default option.

Create: It creates a new table. If the Oracle table already exists, an error occurs and the job terminates. You must specify this mode if the Oracle table does not exist.

Replace: The existing table is first dropped and an entirely new table is created in its place. Oracle uses the default partitioning method for the new table.

Truncate: The existing table attributes (including schema) and the Oracle partitioning keys are retained, but any existing records are discarded. New records are then appended to the table.

Connection Category:

DB Options: Specify a user name and password for connecting to the database.

DB Options Mode: Auto generate / User defined

Here we can provide the user name and password for connecting to the remote server. If you select User-defined, you have to edit the database options yourself.

Options Category:

Disable Constraints: Set to True to disable all enabled constraints on a table when loading, and then attempt to re-enable them at the end of the load.

Silently Drop Columns Not in Table: This only appears for the Load Write Method. It is False by default. Set to True to silently drop all input columns that do not correspond to columns in an existing Oracle table. Otherwise the stage reports an error and terminates the job.

Truncate Column Names: This only appears for the Load Write Method. Set this property to True to truncate column names to 30 characters.

DATASTAGE PX
