Accenture Ab Initio Training
Introduction to Ab Initio
Prepared By: Ashok Chanda

04 Join Component


Ab Initio Session 4

Join

Join Basic Definition

Join performs inner, outer, and semi-joins with multiple flows of data records.

JOIN

Join reads data from two or more input ports, combines records with matching keys according to the transform you specify, and sends the transformed records to the output port. Additional ports allow you to collect rejected and unused records. There can be as many as 20 input ports.

Alphabetical List of Join Parameters

count                 maintain-order
dedupn                max-core
driving               max-memory
join-type             override-keyn
key                   ramp
limit                 record-requiredn
logging               reject-threshold
log_input             selectn
log_intermediate      sorted-input
log_output            transform
log_reject

Parameter Descriptions for Join

count: (integer, required)
An integer n from 2 to 20 specifying the total number of inputs (in ports) to Join. This in turn determines the number of the following ports and parameters:
- unused ports
- reject ports
- error ports
- record-required parameters
- dedup parameters
- select parameters
- override-key parameters
Default is 2.

Each in port (always 2 or more) has a number n appended. There can be as many as 20 in ports altogether. Each outn, unusedn, rejectn, and errorn port corresponds to an inn port.

sorted-input: (choice, required)
- When set to In memory: Input need not be sorted, Join accepts unsorted input, and permits the use of the maintain-order parameter.
- When set to Inputs must be sorted, Join requires sorted input, and the maintain-order parameter is not available.
Default is Inputs must be sorted.

key: (key specifier, required)
Name(s) of the field(s) in the input records that must have matching values for Join to call the transform function.

transform: (filename or string, required)
Either the name of the file containing the transform function, or a transform string. In the file specified in the transform parameter or in the transform string, create a transform function that has the following characteristics:
- The transform function takes the number of input arguments specified in the count parameter.
- The first argument is a data record with the record format of the in0 port. The second argument is a data record with the record format of the in1 port, and so on.
- The transform function has an explicit or implicit rule that assigns a value to every field in the output record.

join-type: (choice, required)
Choose from the following:
- Inner join: sets the record-requiredn parameters for all ports to True. Inner join is the default. The GDE does not display the record-requiredn parameters because they all have the same value.
- Outer join: sets the record-requiredn parameters for all ports to False. The GDE does not display the record-requiredn parameters because they all have the same value.
- Explicit: allows you to set the record-requiredn parameter for each port individually.

If you set the dedupn parameter to True on the driving input, set the join-type parameter to Inner join. (The driving input is the largest input, as specified by the driving parameter.)
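
The relationship between join-type and the record-requiredn flags can be sketched in Python (this is not Ab Initio code). The function names are hypothetical, and the rule in `transform_is_called` is an assumption about what "record required" means: the transform runs for a key only when every required port supplies a record.

```python
def record_required_flags(join_type, count, explicit=None):
    """Return one record-required flag per in port for the given join-type."""
    if join_type == "inner":
        return [True] * count
    if join_type == "outer":
        return [False] * count
    if join_type == "explicit":
        if explicit is None or len(explicit) != count:
            raise ValueError("explicit join needs one flag per in port")
        return list(explicit)
    raise ValueError(f"unknown join type: {join_type}")


def transform_is_called(present, required):
    """Assumed rule: call the transform for a key value only if at least one
    port has a record and every record-required port has a record."""
    return any(present) and all(p for p, r in zip(present, required) if r)


# A two-input explicit join with in0 required and in1 optional behaves
# like a left outer join under this assumption.
left_outer = record_required_flags("explicit", 2, explicit=[True, False])
```

With `left_outer`, a key present only on in0 still reaches the transform, while a key present only on in1 does not.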

dedupn: (boolean, required)
Set the dedupn parameter to True to remove duplicates from the corresponding inn port before joining. This allows you to choose only one record from a group with matching key values as the argument to the transform function. Default is False, which does not remove duplicates.

If you remove duplicates on this input port before joining it to the driving input, set the record-requiredn parameter to True on all other ports. (The driving input is the largest input, as specified by the driving parameter.)

There is one dedupn parameter associated with each inn port.
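
As an illustration of removing duplicates before the join, here is a small Python sketch. It assumes the first record of each key group is the one kept; the slides only say that one record per group is chosen, so that choice, the function name, and the field names are hypothetical.

```python
def dedup_first(records, key):
    """Keep the first record seen for each key value.

    Returns (kept, dropped). Keeping the FIRST record is an assumption.
    """
    seen = set()
    kept, dropped = [], []
    for rec in records:
        k = rec[key]
        if k in seen:
            dropped.append(rec)
        else:
            seen.add(k)
            kept.append(rec)
    return kept, dropped


transactions = [
    {"tran_id": "T1", "cust_id": "C2", "amt": 34},
    {"tran_id": "T2", "cust_id": "C1", "amt": 45},
    {"tran_id": "T3", "cust_id": "C1", "amt": 67},
    {"tran_id": "T4", "cust_id": "C2", "amt": 23},
    {"tran_id": "T5", "cust_id": "C4", "amt": 33},
]
kept, dropped = dedup_first(transactions, "cust_id")
```

Here `kept` holds T1, T2, and T5, and `dropped` holds T3 and T4, the records with duplicate key values.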

selectn: (expression, optional)
Filter applied to records before the join function. There is one selectn parameter per inn port; n represents the number of an in port.

override-keyn: (key specifier, optional)
Alternative name(s) for the key field(s) for a particular inn port.

Supported for Co>Operating System Version:
- 2.1 and higher with the sorted-input parameter set to In memory: Input need not be sorted
- 2.2.2 and higher with the sorted-input parameter set to Inputs must be sorted

There is one override-keyn parameter per inn port. The n corresponds to the number of an in port.

To use key field(s) other than the key field(s) specified in the key parameter for a particular inn port, specify the key field(s) you want to use in the corresponding override-keyn parameter. Default is 0.0.
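
The idea of a per-port key override can be sketched in Python as follows. The field names (`cust_id`, `customer_no`) and the helper `key_value` are hypothetical; the sketch only shows that a port with an override uses its own field names while the other ports fall back to the shared key.

```python
def key_value(record, port, key, override_keys):
    """Return the join key of a record as a tuple.

    key           -- field names from the shared key parameter
    override_keys -- {port number: field names} for ports that override it
    """
    fields = override_keys.get(port, key)
    return tuple(record[f] for f in fields)


key = ("cust_id",)
override_keys = {1: ("customer_no",)}  # in1 names its key field differently

in0_rec = {"cust_id": "C1", "name": "Name1"}
in1_rec = {"tran_id": "T2", "customer_no": "C1", "amt": 45}

same_key = key_value(in0_rec, 0, key, override_keys) == key_value(in1_rec, 1, key, override_keys)
```

The two records match on key value even though the key field has a different name on each port.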

max-memory: (integer, required)
Maximum memory usage in bytes before Join writes temporary files to disk. Only available when the sorted-input parameter is set to Inputs must be sorted.

The default value is 8388608 bytes (8 megabytes). Start by using this default, and then adjust to higher or lower values as necessary if you encounter performance difficulties. It is very unlikely you will ever need to change the value of this parameter.

driving: (integer, required)
Number of the port to which you connect the driving input. The driving input is the largest input. All other inputs are read into memory.

The driving parameter is only available when the sorted-input parameter is set to In memory: Input need not be sorted. For example, suppose the largest input to be joined is on the in1 port. Specify a port number of 1 as the value of the driving parameter. The Join component reads all other inputs to the join, for example, in0 and in2, into memory.

Default is 0, which specifies that the driving input is on port in0.
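
A rough Python sketch of the in-memory approach described above: every non-driving input is loaded into a hash table keyed on the join key, and the driving input is streamed past those tables. This is an illustration of the general hash-join technique, not Ab Initio's implementation; the function and field names are hypothetical, and only the inner-join case is shown.

```python
from collections import defaultdict
from itertools import product


def in_memory_inner_join(inputs, key, driving, transform):
    """inputs: one list of records per in port. inputs[driving] is streamed;
    all other inputs are held in memory as hash tables."""
    tables = {}
    for port, records in enumerate(inputs):
        if port == driving:
            continue
        table = defaultdict(list)
        for rec in records:
            table[rec[key]].append(rec)
        tables[port] = table

    out = []
    for drv in inputs[driving]:
        # One candidate group per port, in port order (in0, in1, ...).
        groups = [
            [drv] if port == driving else tables[port].get(drv[key], [])
            for port in range(len(inputs))
        ]
        # product() yields nothing if any group is empty, giving inner-join behavior.
        for combo in product(*groups):
            out.append(transform(*combo))
    return out


customers = [{"cust_id": c, "name": n} for c, n in [("C1", "Name1"), ("C2", "Name2"), ("C3", "Name3")]]
transactions = [
    {"cust_id": c, "amt": a} for c, a in [("C2", 34), ("C1", 45), ("C1", 67), ("C2", 23), ("C4", 33)]
]
rows = in_memory_inner_join(
    [customers, transactions], "cust_id", driving=1,
    transform=lambda c, t: (c["cust_id"], c["name"], t["amt"]),
)
```

Because the larger transactions input is the driving input, only the three customer records are held in memory, and the output follows the order of the driving input.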

maintain-order: (boolean, required)
Set to True to ensure that records remain in the original order of the driving input. (The driving input is the largest input, as specified by the driving parameter.) Default is False.

Only available when the sorted-input parameter is set to In memory: Input need not be sorted. If the sorted-input parameter is set to Inputs must be sorted, and all inputs are sorted on the fields given in the key parameter, then the output maintains the sort order on that key without the use of this parameter.

If any inputs, other than the driving input, are too large to fit within the memory limit specified by max-core, the behavior depends on the setting of maintain-order:
- False: Join stores some of its intermediate results in temporary files on disk. This alters the order of records in the driving input.
- True: Join stops execution of the graph.

Even if you leave maintain-order set to False, Join groups together all output records for a given driving input record. If the driving input is grouped by key value, you can still use components downstream that require records grouped by key value.

max-core: (integer, required)
Maximum memory usage in bytes. Only available when the sorted-input parameter is set to In memory: Input need not be sorted. The default value is 67108864 bytes (64 megabytes).

If the total size of the intermediate results Join holds in memory exceeds the number of bytes specified in the max-core parameter, Join writes temporary files to disk.

Runtime Behavior of Join

The Join component:
- Reads data records from multiple inn ports.
- Applies the expression in any defined selectn parameter to the records on the corresponding inn port.
  - If the expression evaluates to 0 for a record, Join does not process the record, and it does not appear on any output port.
  - If the expression produces NULL for a particular record, Join writes a descriptive error message and stops graph execution.
  - If the expression evaluates to anything other than 0 or NULL for a particular record, Join processes the record.
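
The three outcomes of a select expression can be mimicked in a few lines of Python. This is only a sketch: Python's `None` stands in for NULL, raising an exception stands in for stopping the graph, and `apply_select` and the `amt` field are hypothetical names.

```python
def apply_select(records, select_expr):
    """Return the records that pass select_expr.

    select_expr is a callable returning 0, None (standing in for NULL),
    or any other value. Passing select_expr=None means no filter is defined.
    """
    if select_expr is None:
        return list(records)
    selected = []
    for rec in records:
        result = select_expr(rec)
        if result is None:
            # Stand-in for "writes an error message and stops graph execution".
            raise RuntimeError(f"select expression produced NULL for record {rec!r}")
        if result != 0:
            selected.append(rec)
    return selected


records = [{"amt": 34}, {"amt": 45}, {"amt": 23}]
large = apply_select(records, lambda r: r["amt"] > 30)
```

Records for which the expression is 0 (or `False`) simply disappear from the join; they are not written to any output port.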

If you do not supply an expression for a selectn parameter, Join processes all the records on the corresponding inn port.

Join then operates on the records that have matching key values using a multi-input transform function, and writes the result to the out port.

If you connect a flow to an unusedn port, Join writes to the unusedn port, from the corresponding inn port, any of the selected records that it does not pass through the transform function. In other words, Join writes the following records to unusedn ports:
- For an inner join: all unmatched records
- For an outer join: no records, since Join passes all records through the transform function
- For an explicit join: records for which the transform is not called
- For an input port with the dedupn parameter set to True: records with duplicate key values

Thus, the set of records that Join passes through the transform function is mutually exclusive with the set of records that come out the unusedn port, and the two sets are also collectively exhaustive. The result is that all selected records are accounted for exactly once.
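
The "exactly once" property can be checked with a small Python sketch for a two-input join. The rule used to decide whether the transform is called (every record-required port must supply a record for the key) is an assumption, and the function and field names are hypothetical.

```python
def split_used_unused(in0, in1, key, required0=True, required1=True):
    """Partition each input into records passed to the transform ("used")
    and records that would go to the unused port."""
    keys0 = {rec[key] for rec in in0}
    keys1 = {rec[key] for rec in in1}

    def transform_called(k):
        # Assumed rule: a record-required port must supply a record for key k.
        ok0 = k in keys0 or not required0
        ok1 = k in keys1 or not required1
        return ok0 and ok1

    result = {"used0": [], "unused0": [], "used1": [], "unused1": []}
    for port, records in ((0, in0), (1, in1)):
        for rec in records:
            bucket = "used" if transform_called(rec[key]) else "unused"
            result[f"{bucket}{port}"].append(rec)
    return result


customers = [{"cust_id": c} for c in ("C1", "C2", "C3")]
transactions = [{"tran_id": t, "cust_id": c} for t, c in
                [("T1", "C2"), ("T2", "C1"), ("T3", "C1"), ("T4", "C2"), ("T5", "C4")]]

inner = split_used_unused(customers, transactions, "cust_id")
outer = split_used_unused(customers, transactions, "cust_id", required0=False, required1=False)
```

For the inner join, customer C3 and transaction T5 land on the unused ports; for the outer join, both unused lists are empty. In either case each input record appears in exactly one of the two lists for its port.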

If the transform function returns NULL, Join writes:
- Each input record to the corresponding rejectn port.

Join Types

- Inner Join (Equi-Join)
- Full Outer Join (keeps unmatched records from every input; this is not a Cartesian product)
- Explicit Join (Left or Right Outer Join)

Join Types: Inner & Outer (diagram slide)

Join Types: Explicit (diagram slide)

Joining of Data (diagram slide)

Input data sorted before Join (diagram slide)

The Join Component

Join performs a join of inputs. By default, the inputs to join must be sorted and an inner join is computed.

Note: The following slides and the on-line example assume the join-type parameter is set to 'Outer', and thus compute an outer join.

Building the Output Record

in0:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

in1:
record
  decimal(4) id;
  date("YYMMDD") dt;
  decimal(9.2) cost;
end

out:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end
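
To make the field mapping concrete, here is a Python sketch of a two-argument transform that builds the out record from matching in0 and in1 records. It is not DML: records are modeled as dicts, `build_output` is a hypothetical name, and the two-digit year is expanded using Python's `%y` rule (00 to 68 become 20xx, 69 to 99 become 19xx), which is an assumption and may differ from how Ab Initio converts dates.

```python
from datetime import datetime


def build_output(in0, in1):
    """Assign a value to every field of the out record."""
    dt = datetime.strptime(in1["dt"], "%y%m%d")  # in1.dt is in YYMMDD form
    return {
        "id": in0["id"],
        "city": in0["city"],
        "amount": in0["amount"],
        "dt": dt.strftime("%Y/%m/%d"),  # out.dt is in YYYY/MM/DD form
    }


in0 = {"id": 7, "name": "Smith", "city": "Boston", "amount": 120}
in1 = {"id": 7, "dt": "990131", "cost": "12.50"}
out = build_output(in0, in1)
```

The name field from in0 and the cost field from in1 are not carried over, because the out record format has no such fields.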

What if the in1 record is missing?

in0:
record
  decimal(4) id;
  string(6) name;
  string(8) city;
  decimal(3) amount;
end

in1 (missing: ???):
record
  decimal(4) id;
  date("YYMMDD") dt;
  decimal(9.2) cost;
end

out:
record
  decimal(4) id;
  string(8) city;
  decimal(3) amount;
  date("YYYY/MM/DD") dt;
end
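
One way to think about the missing in1 record is that the transform still has to assign something to out.dt. The Python sketch below leaves dt empty (`None`, standing in for NULL) in that case. This is only one possible choice, shown as an assumption; the slide leaves the question open, and `build_output_outer` is a hypothetical name.

```python
from datetime import datetime


def build_output_outer(in0, in1):
    """Build the out record for an outer join; in1 may be None (no match)."""
    if in1 is None:
        dt = None  # assumption: leave dt empty when there is no in1 record
    else:
        dt = datetime.strptime(in1["dt"], "%y%m%d").strftime("%Y/%m/%d")
    return {"id": in0["id"], "city": in0["city"], "amount": in0["amount"], "dt": dt}


in0 = {"id": 7, "name": "Smith", "city": "Boston", "amount": 120}
unmatched = build_output_outer(in0, None)
matched = build_output_outer(in0, {"id": 7, "dt": "990131", "cost": "12.50"})
```

A default date could be substituted for `None` instead; either way, the transform must handle the case explicitly so that every output field receives a value.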

Ports of Join Component

- unused0
- unused1
- reject0
- reject1
- error0
- error1
- Log Port

Join Type: Scenario

Customer
Cust Id   Cust Name
C1        Name1
C2        Name2
C3        Name3

Transaction
Tran Id   Cust Id   Tran Amt
T1        C2        34
T2        C1        45
T3        C1        67
T4        C2        23
T5        C4        33

Key: Cust Id

Fields in Output of Join: Cust Id, Cust Name, Tran Amt

How many records will be generated at the output of Join?
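
The question can be explored with a short Python sketch that joins the two tables on Cust Id under different record-required settings. It assumes conventional join semantics (each matching customer/transaction pair yields one output record, and an unmatched record is kept only when the other side is not required); the function and parameter names are hypothetical.

```python
from collections import defaultdict

customers = [("C1", "Name1"), ("C2", "Name2"), ("C3", "Name3")]
transactions = [("T1", "C2", 34), ("T2", "C1", 45), ("T3", "C1", 67),
                ("T4", "C2", 23), ("T5", "C4", 33)]


def join(customers, transactions, need_customer, need_transaction):
    """Return (cust_id, cust_name, tran_amt) rows; None marks a missing side."""
    names, amounts = defaultdict(list), defaultdict(list)
    for cust_id, name in customers:
        names[cust_id].append(name)
    for _tran_id, cust_id, amt in transactions:
        amounts[cust_id].append(amt)

    out = []
    for cust_id in sorted(set(names) | set(amounts)):
        ns, ams = names.get(cust_id, []), amounts.get(cust_id, [])
        if (not ns and need_customer) or (not ams and need_transaction):
            continue  # a required side has no record for this key
        for n in ns or [None]:
            for a in ams or [None]:
                out.append((cust_id, n, a))
    return out


inner = join(customers, transactions, True, True)
outer = join(customers, transactions, False, False)
customer_required_only = join(customers, transactions, True, False)
```

Under these assumptions the inner join yields 4 records, the outer join 6 (adding C3 with no amount and C4 with no name), and the explicit join with only the customer side required yields 5.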

Join Type: Scenario

Customer
Cust Id   Cust Name
C1        Name1
C2        Name2
C3        Name3

Transaction
Tran Id   Cust Id   Tran Amt
T1        C2        34
T2        C1        45
T3        C1        67
T4        C2        23
T5        (blank)   33

Key: Cust Id

Fields in Output of Join: Cust Id, Cust Name, Tran Amt

How many records will be generated at the output of Join?

Join Type: Scenario

Customer
Cust Id   Cust Name
C1        Name1
C2        Name2
(blank)   Name3

Transaction
Tran Id   Cust Id   Tran Amt
T1        C2        34
T2        C1        45
T3        C1        67
T4        C2        23
T5        C4        33

Key: Cust Id

Fields in Output of Join: Cust Id, Cust Name, Tran Amt

How many records will be generated at the output of Join?

Minimize operations on large data

When joining a very small dataset to a very large dataset, it may be more efficient to broadcast the small dataset or use it as a lookup file rather than repartition and re-sort the large dataset.
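
The lookup idea can be sketched in Python: the small dataset is loaded once into a dictionary, and the large dataset is streamed through it without being sorted or repartitioned. This is a generic illustration, not an Ab Initio lookup file; `enrich_with_lookup` and the field names are hypothetical, and unmatched records are kept with an empty value.

```python
def enrich_with_lookup(large_records, small_records, key, field):
    """Stream large_records, attaching `field` from the small dataset by key."""
    lookup = {rec[key]: rec[field] for rec in small_records}  # small side in memory
    for rec in large_records:
        # .get returns None when the key is absent from the small dataset.
        yield {**rec, field: lookup.get(rec[key])}


customers = [{"cust_id": "C1", "name": "Name1"}, {"cust_id": "C2", "name": "Name2"}]
transactions = [{"cust_id": "C2", "amt": 34}, {"cust_id": "C1", "amt": 45}, {"cust_id": "C4", "amt": 33}]
enriched = list(enrich_with_lookup(transactions, customers, "cust_id", "name"))
```

The large input is read exactly once and in its original order, which is the saving the advice above is pointing at.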

Minimize the use of sorted Join components and, where possible, replace them with in-memory (hash) joins.

If both inputs are huge, use a sorted join; otherwise use a hash join with the proper driving port.
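
For contrast with the in-memory approach, here is a Python sketch of a merge join over two inputs that are already sorted on the key. It holds only one key group per side at a time, which is why a sorted join suits inputs too large for memory. This is the textbook sort-merge technique, shown as an illustration rather than as Ab Initio's implementation; names are hypothetical and only the inner-join case is covered.

```python
def merge_join(left, right, key, transform):
    """Inner join of two lists that are both sorted ascending on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the matching key group on each side.
            i_end = i
            while i_end < len(left) and left[i_end][key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            for l_rec in left[i:i_end]:
                for r_rec in right[j:j_end]:
                    out.append(transform(l_rec, r_rec))
            i, j = i_end, j_end
    return out


customers = [{"cust_id": "C1", "name": "Name1"}, {"cust_id": "C2", "name": "Name2"},
             {"cust_id": "C3", "name": "Name3"}]
transactions = [{"cust_id": "C1", "amt": 45}, {"cust_id": "C1", "amt": 67},
                {"cust_id": "C2", "amt": 34}, {"cust_id": "C2", "amt": 23},
                {"cust_id": "C4", "amt": 33}]
joined = merge_join(customers, transactions, "cust_id", lambda c, t: (c["name"], t["amt"]))
```

The cost of this approach is the sort that must precede it, which is the operation the advice above tries to avoid when one input is small enough to hold in memory.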

Thank You

End of Session 4