MANAPPS
V 1.1 2008/10/20 ETL Benchmarks
Pg 1
ETL Benchmarks
Comparing
• DATASTAGE SERVER 7.5
• DATASTAGE PX 7.5
• TALEND OPEN STUDIO 2.4.1
• INFORMATICA 8.1.1
• PENTAHO DATA INTEGRATION 3.0.0
This document is published under the Creative Commons license:
http://creativecommons.org/licenses/by/3.0/us/
You are free:
to Share — to copy, distribute, display, and perform the work
to Remix — to make derivative works
Under the following conditions:
Attribution. You must attribute the work in the manner specified by the author or
licensor (but not in any way that suggests that they endorse you or your use of the
work).
• For any reuse or distribution, you must make clear to others the license terms of this work.
The best way to do this is with a link to this web page.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Apart from the remix rights granted under this license, nothing in this license impairs or
restricts the author's moral rights.
Table of Contents
General comments
Hardware Configuration
Test 1: File Input Delimited > File Output Delimited
  Scenario:
  Test results:
Test 2: File Input Delimited > Table MySQL Output
  Scenario:
  Test results:
Test 3: Table Oracle Input > File Output Delimited
  Scenario:
  Test results:
Test 4: File Input Delimited > Table Output Oracle BULK
  Scenario:
  Test results:
Test 5: File Input Delimited > Transform > File Output Delimited
  Scenario:
  Test results:
Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
  Scenario:
  Test results:
Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
  Scenario:
  Test results:
Test 8: File Input Delimited > Sort > File Output Delimited
  Scenario:
  Test results:
Test 9: File Input Delimited > Aggregate > File Output Delimited
  Scenario:
  Test results:
Test 10: File Input Delimited > Lookup > File Output Delimited
  Scenario:
  Test results:
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
  Scenario:
  Test results:
General comments
• For the DataStage PX tests, we used 2 nodes to take advantage of the dual-core CPU and of
the tool's parallelization feature.
• In terms of intuitiveness and ease of use, Talend Open Studio and DataStage Server are
ahead of the pack. DataStage PX comes third, Informatica fourth, and Pentaho Data
Integration is the least intuitive. Our assessment of Pentaho is mainly due to the many
parameters that have to be learned. However, we think that with a large investment of time
it could become a powerful tool.
• Open Source ETL & Parallelization: Pentaho Data Integration takes first place here. It is
easier to parallelize with PDI. We did, however, find some issues: the tool lets you
parallelize all the components, but some results are inconsistent.
• ELT: Informatica has an ELT mode named Pushdown Optimization, but we could not figure
out how to use it. Thus, ELT processes were implemented as ETL with Informatica. Only
Talend Open Studio makes the ELT mode easy to use.
Hardware Configuration
• OS: Windows XP Pro SP2
• CPU: Intel Core 2 Duo 2 GHz
• JVM 1.6.0_87
• RAM: 4 GB
Test 1: File Input Delimited > File Output Delimited
Scenario:
Reads X lines from a delimited input file and writes them to a delimited output file.
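As a rough illustration of this scenario (not the code generated by any of the tested tools), a straight copy of a delimited file might be sketched in Python as follows. The `;` delimiter, the function name `copy_delimited` and the two sample rows are assumptions made for the example.

```python
import csv
import io


def copy_delimited(source, target, delimiter=";"):
    """Copy every row from a delimited input stream to a delimited output stream.

    Returns the number of rows copied. The delimiter is an assumption.
    """
    reader = csv.reader(source, delimiter=delimiter)
    writer = csv.writer(target, delimiter=delimiter, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical two-line sample standing in for the benchmark input file.
sample = io.StringIO("1;John;Doe\n2;Jane;Roe\n")
out = io.StringIO()
copied = copy_delimited(sample, out)
```

The rows are streamed one at a time, so memory use stays flat regardless of the number of lines.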
Extract of the delimited input file:
TALEND OPEN STUDIO
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE SERVER
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
INFORMATICA
Job name: file_input_delimited__file_output_delimited
Job
Schema of file_input_delimited
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2           2               3.4                40.67
1 000 000         1.99        0.51            1.54               5.77
5 000 000         2.14        0.32            1.02               1.39
20 000 000        2.58        0.41            0.93               0.75
Test 2: File Input Delimited > Table MySQL Output
Scenario:
Reads X lines from a delimited input file and writes them into a MySQL output table.
Comments:
DataStage 7.5, DataStage PX 7.5 and Informatica 8.1.1 were not tested for this use case. The
test was first run with default parameters. To improve performance, the commit parameter was
then tuned. Finally, the job was parallelized. To parallelize with TOS 2.4.1, we simply split the
delimited input file (with the header and limit parameters) and run two sub-jobs in parallel.
With PDI 3.0.0, we simply increase the number of copies.
TOS 2.4.1 supports extended inserts, a MySQL feature that limits the number of database
accesses and improves performance. With this feature, TOS 2.4.1 is 6 times faster.
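To make the extended insert idea concrete, here is a minimal Python sketch that groups rows into multi-row `INSERT ... VALUES (...), (...)` statements, so that one statement carries a whole batch. It is not Talend's generated code. The standard `sqlite3` module stands in for MySQL so the example runs without a server, and the table `client`, its columns and the batch size of 100 are hypothetical.

```python
import sqlite3


def extended_insert_sql(table, columns, row_count):
    """Build one INSERT statement carrying `row_count` rows of placeholders."""
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    values = ", ".join(one_row for _ in range(row_count))
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values}"


def insert_in_batches(conn, table, columns, rows, batch_size=100):
    """Insert rows with one multi-row statement per batch.

    Returns the number of statements sent to the database.
    """
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = extended_insert_sql(table, columns, len(batch))
        flat = [value for row in batch for value in row]
        conn.execute(sql, flat)
        statements += 1
    conn.commit()  # a single commit at the end, another assumption of this sketch
    return statements


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL target
conn.execute("CREATE TABLE client (id INTEGER, name TEXT)")
rows = [(i, f"name{i}") for i in range(250)]
statements = insert_in_batches(conn, "client", ["id", "name"], rows)
stored = conn.execute("SELECT COUNT(*) FROM client").fetchone()[0]
```

With 250 rows and batches of 100, only three statements reach the database instead of 250, which is the kind of reduction in database accesses described above.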
TALEND OPEN STUDIO
Job name: file_input_delimited__table_output_mysql
Job (Multi-Thread Execution checked on Job Settings)
Schema of file_input_delimited
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__table_output_mysql
Job
Schema of file_input_delimited
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   TOS 2.4.1 Extended Insert
100 000           0.98        0.18
1 000 000         1.05        0.17
5 000 000         1.15        0.18
Test 3: Table Oracle Input > File Output Delimited
Scenario:
Reads X lines from an Oracle input table and writes them to a delimited output file.
TALEND OPEN STUDIO
Job name: table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
PENTAHO DATA INTEGRATION
Job name: table_input_oracle__file_output_delimited
Job
SCHEMA VIEWER NOT POSSIBLE
Schema of table_input_oracle
DATASTAGE SERVER
Job name: table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
DATASTAGE PX
Job name: PX_table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
INFORMATICA
Job name: table_input_oracle__file_output_delimited
Job
Schema of table_input_oracle
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2.12        1.78            1.78               19.26
500 000           3.39        1.76            1.28               7.67
1 000 000         2.62        1.33            1.05               3.56
Test 4: File Input Delimited > Table Output Oracle BULK
Scenario:
Reads X lines from a delimited input file and writes them into an Oracle output table in bulk mode.
TALEND OPEN STUDIO
Job name: file_input_delimited__table_output_oracle_bulk
Job
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
DATASTAGE SERVER
Job name: file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
DATASTAGE PX
Job name: PX_file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
INFORMATICA
Job name: file_input_delimited__table_output_oracle_bulk
Job
Schema of file_input_delimited
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           0.6         0.69            1.38               11.93
1 000 000         1.38        0.81            1.22               2.71
2 000 000         1.46        0.8             1.11               1.61
Test 5: File Input Delimited > Transform > File Output Delimited
Scenario:
Reads X lines from a delimited input file and writes them to a delimited output file after some
changes.
Changes list:
• The content of the field `rate` is multiplied by 100.
• The new field `name` is a concatenation (`firstname` + " " + `lastname`).
• The content of the field `address` is converted to uppercase.
Comments:
Pentaho Data Integration has no graphical component for transforming data, so we had to
use a custom code component, written in JavaScript. The four other ETL tools have a
transformer component for this. Talend Open Studio also has custom code components, named
tJavaRow and tPerlRow.
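The three changes listed above can be sketched as a per-row function in Python. This is only an illustration under stated assumptions, not the code of any tested tool: rows are represented as dictionaries, `rate` arrives as text and is parsed as a float, the input fields are kept in the output next to the new `name` field, and the function name `transform_row` is hypothetical.

```python
def transform_row(row):
    """Apply the three Test 5 changes to one input row (a dict of strings)."""
    out = dict(row)  # keep the input fields; whether the tools did so is an assumption
    out["rate"] = float(row["rate"]) * 100
    out["name"] = row["firstname"] + " " + row["lastname"]
    out["address"] = row["address"].upper()
    return out


# Hypothetical sample row.
example = transform_row(
    {"firstname": "John", "lastname": "Doe", "address": "1 main street", "rate": "0.25"}
)
```

Applying such a function row by row is the same shape of work as a tMap, a Transformer stage or a JavaScript step: one input row in, one output row out, with no state shared between rows.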
TALEND OPEN STUDIO
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
tMap
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
JavaScript Custom Code
Select Values
Select Values
DATASTAGE SERVER
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
Transformer
DATASTAGE PX
Job name: PX_file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
Transformer
INFORMATICA
Job name: file_input_delimited__transformation__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
Mapping
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           4.07        1.54            3.65               31.15
1 000 000         6           1.18            1.33               5.06
5 000 000         6.02        1.3             0.95               1.01
20 000 000        6.16        0.97            0.84               0.97
Test 6: Table Input Oracle > Aggregation > Table Output Oracle (ELT)
Scenario:
Reads X lines from Oracle input tables and writes them into other Oracle output tables (ELT
mode).
Comments:
Only Talend Open Studio supports an ELT mode. Informatica has Pushdown Optimization,
but I could not find this feature in the tool.
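The difference between ELT and ETL can be illustrated with a short Python sketch: in ELT mode the aggregation is expressed as a single SQL statement executed inside the database, so no row passes through the ETL engine. This is an assumption-laden sketch rather than what Talend generates. The standard `sqlite3` module stands in for Oracle, and the tables `table_input` and `table_output` and the column `nb` are hypothetical; only the group-by-age count comes from the job names of this test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Oracle database
conn.execute("CREATE TABLE table_input (id INTEGER, age INTEGER)")
conn.execute("CREATE TABLE table_output (age INTEGER, nb INTEGER)")
conn.executemany(
    "INSERT INTO table_input VALUES (?, ?)",
    [(1, 30), (2, 30), (3, 45)],  # hypothetical sample rows
)

# ELT: the database reads, aggregates and writes in one statement.
conn.execute(
    "INSERT INTO table_output (age, nb) "
    "SELECT age, COUNT(*) FROM table_input GROUP BY age"
)
conn.commit()

result = conn.execute("SELECT age, nb FROM table_output ORDER BY age").fetchall()
```

An ETL implementation of the same job would instead fetch every input row over the connection, aggregate in the tool, and send the result back with separate inserts.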
TALEND OPEN STUDIO
Job name: ELT__table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job (ELT)
Schema of table_input_oracle
PENTAHO DATA INTEGRATION
Job name: table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
SCHEMA VIEWER NOT POSSIBLE
Schema of table_input_oracle
DATASTAGE SERVER
Job name: table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
Schema of table_input_oracle
DATASTAGE PX
Job name: PX_table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
Schema of table_input_oracle
INFORMATICA
Job name: table_input_oracle__aggregate_group_by_age_count__table_output_oracle
Job
Schema of table_input_oracle
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           3.44        1.94            6.45               39.52
500 000           15.9        5.71            8.57               36.43
1 000 000         28.28       8.09            10.36              30.77
Test 7: Tables Input Oracle > Transformation > Tables Output Oracle (ELT)
Scenario:
Reads X lines from Oracle input tables and writes them into other Oracle output tables (ELT
mode) after some changes.
TALEND OPEN STUDIO
Job name: table_input_oracle__elt__table_output_oracle
Job (ELT)
Schema of table_lookup_oracle
Schema of table_input_oracle
PENTAHO DATA INTEGRATION
Job name: table_input_oracle__elt__table_output_oracle
Job
SCHEMA VIEWER NOT POSSIBLE
Schema of table_lookup_oracle
SCHEMA VIEWER NOT POSSIBLE
Schema of table_input_oracle
DATASTAGE SERVER
Job name: table_input_oracle__elt__table_output_oracle
Job
Schema of table_lookup_oracle
Schema of table_input_oracle
DATASTAGE PX
Job name: PX_table_input_oracle__elt__table_output_oracle
Job
Schema of table_lookup_oracle
Schema of table_input_oracle
INFORMATICA
Job name: table_input_oracle__elt__table_output_oracle
Job
Schema of table_lookup_oracle
Schema of table_input_oracle
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           6.4         2.12            2.5                8.93
500 000           8.67        2.79            1.31               3.05
1 000 000         7.26        2.2             0.9                1.9
Test 8: File Input Delimited > Sort > File Output Delimited
Scenario:
Reads X lines from a delimited input file and writes them, sorted, to a delimited output file.
Sort list:
• Order by the integer field `age` ASC.
• Order by the string field `firstname` ASC.
• Order by the fields `age` and `firstname` ASC.
Comments:
With the version used, I could not sort in memory with Pentaho Data Integration, but the
feature is present in the latest version.
In Talend Open Studio, for large volumes (5 000 000 and 20 000 000 lines), we have to use the
tExternalSort component, which relies on GNU sort, an external sorting program.
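The three sort orders can be sketched in Python as follows. This is an in-memory sort using the built-in `sorted`, so it illustrates the small-volume case only and is not the external sort that tExternalSort delegates to GNU sort. The function name `sort_rows` and the sample rows are hypothetical. Fields are read as text, and `age` is converted to an integer so that "7" sorts before "42".

```python
def sort_rows(rows, by):
    """Return the rows sorted ascending on the fields listed in `by`.

    `age` is compared as an integer, every other field as a string.
    """
    def key(row):
        return tuple(int(row[f]) if f == "age" else row[f] for f in by)

    return sorted(rows, key=key)


# Hypothetical sample rows.
rows = [
    {"age": "42", "firstname": "Zoe"},
    {"age": "7", "firstname": "Adam"},
    {"age": "42", "firstname": "Bob"},
]

by_age = sort_rows(rows, ["age"])
by_firstname = sort_rows(rows, ["firstname"])
by_age_and_firstname = sort_rows(rows, ["age", "firstname"])
```

Because `sorted` is stable, rows with the same `age` keep their input order in `by_age`; the two-field sort breaks such ties on `firstname`.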
TALEND OPEN STUDIO
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
PENTAHO DATA INTEGRATION
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE SERVER
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
DATASTAGE PX
Job names:
• PX_file_input_delimited__sort_on_age__file_output_delimited
• PX_file_input_delimited__sort_on_firstname__file_output_delimited
• PX_file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
INFORMATICA
Job names:
• file_input_delimited__sort_on_age__file_output_delimited
• file_input_delimited__sort_on_firstname__file_output_delimited
• file_input_delimited__sort_on_firstname_and_age__file_output_delimited
Job
Schema of file_input_delimited
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2.51        2.92            2.78               28.82
1 000 000         2.09        3.86            1.03               3.93
5 000 000         0.83        1.42            0.34               1.12
20 000 000        0.66        +++             0.48               0.64
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2.01        3.55            2.37               24.9
1 000 000         1.73        3.21            0.89               3.45
5 000 000         0.93        2.53            0.34               1.26
20 000 000        0.69        +++             0.58               0.77
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2.42        5.51            3.38               31.58
1 000 000         1.68        3.45            0.94               3.52
5 000 000         0.71        1.6             0.26               0.95
20 000 000        0.84        +++             0.58               0.76
Test 9: File Input Delimited > Aggregate > File Output Delimited
Scenario:
Reads X lines from a delimited input file, performs an aggregation and writes the result to a
delimited output file.
1 – Group by the field `age`; Operation: COUNT.
2 – Group by the field `age`; Operations: COUNT, SUM(rate), AVG(rate), MIN(rate),
MAX(rate).
3 – Group by the field `firstname`; Operation: COUNT.
Comments:
When the output flow is too large (here, aggregating by firstname on a large volume), we have
to use the tSortedAggregateRow component in Talend Open Studio. This component sorts the
rows before the aggregation. In this case, Pentaho Data Integration failed.
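The second aggregation (group by `age` with COUNT, SUM, AVG, MIN and MAX on `rate`) can be sketched in Python with one running accumulator per group. This is a hash-based aggregation that keeps one entry per distinct key in memory, which is a different technique from the sort-then-aggregate approach mentioned above for large outputs. The function name `aggregate_by_age` and the sample rows are hypothetical.

```python
def aggregate_by_age(rows):
    """Group rows by `age` and compute count, sum, min, max and avg of `rate`."""
    stats = {}
    for row in rows:
        age, rate = int(row["age"]), float(row["rate"])
        entry = stats.get(age)
        if entry is None:
            stats[age] = {"count": 1, "sum": rate, "min": rate, "max": rate}
        else:
            entry["count"] += 1
            entry["sum"] += rate
            entry["min"] = min(entry["min"], rate)
            entry["max"] = max(entry["max"], rate)
    # The average is derived once per group after the single pass.
    for entry in stats.values():
        entry["avg"] = entry["sum"] / entry["count"]
    return stats


# Hypothetical sample rows.
result = aggregate_by_age(
    [
        {"age": "30", "rate": "1.0"},
        {"age": "45", "rate": "2.5"},
        {"age": "30", "rate": "3.0"},
    ]
)
```

Memory use grows with the number of distinct keys, which is small for `age` but can become large for `firstname`; that is the situation in which a sorted aggregation becomes attractive.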
TALEND OPEN STUDIO
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Job using the tExternalSortRow component
Schema of file_input_delimited
Schema of file_output_delimited
file_input_delimited__aggregate_group_by_age_count__file_output_delimited
PENTAHO DATA INTEGRATION
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
file_input_delimited__aggregate_group_by_age_count__file_output_delimited
DATASTAGE SERVER
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
file_input_delimited__aggregate_group_by_age_count__file_output_delimited
DATASTAGE PX
Job names:
• PX_file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• PX_file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• PX_file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
file_input_delimited__aggregate_group_by_age_count__file_output_delimited
INFORMATICA
Job names:
• file_input_delimited__aggregate_group_by_age_count__file_output_delimited
• file_input_delimited__aggregate_group_by_age_count_sum_avg_min_max__file_output_delimited
• file_input_delimited__aggregate_group_by_firstname_count__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_output_delimited
file_input_delimited__aggregate_group_by_age_count__file_output_delimited
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           4.35        3.23            6.45               63.71
1 000 000         3.8         0.86            0.93               5.77
5 000 000         4.47        0.7             0.71               1.49
20 000 000        3.76        1.03            0.63               0.56
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           3.1         2.38            13.39              47.23
1 000 000         3.39        1.48            2.06               5.6
5 000 000         3.68        1.33            0.89               1.3
20 000 000        3.06        1.32            1.91               0.65
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           3.14        2.33            5.23               47.67
1 000 000         3.76        1.77            1.39               12.13
5 000 000         0.82        0.34            0.2                +++
20 000 000        0.59        0.46            0.54               +++
Test 10: File Input Delimited > Lookup > File Output Delimited
Scenario:
Reads X lines from a delimited input file and looks up 4 fields in another delimited input file
using the id_client column. The join result is written to a delimited output file.
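A lookup of this kind is commonly implemented as a hash join: the lookup file is loaded into an in-memory index keyed on `id_client`, and each main row is matched against it. The Python sketch below illustrates that idea and is not the implementation of any tested tool. The function name `lookup_join` and the sample field `city` are hypothetical, and dropping unmatched rows (an inner join) is an assumption, since the scenario does not say what happens to them.

```python
def lookup_join(main_rows, lookup_rows, key="id_client"):
    """Join each main row with the lookup row that has the same key value."""
    # Build the index once; later duplicates of a key overwrite earlier ones.
    index = {row[key]: row for row in lookup_rows}
    joined = []
    for row in main_rows:
        match = index.get(row[key])
        if match is not None:
            joined.append({**row, **match})  # lookup fields are added to the row
    return joined


# Hypothetical sample data.
main_rows = [
    {"id_client": "1", "firstname": "John"},
    {"id_client": "2", "firstname": "Jane"},
]
lookup_rows = [{"id_client": "1", "city": "Paris"}]
joined = lookup_join(main_rows, lookup_rows)
```

The index costs memory proportional to the size of the lookup file, while the main file is streamed row by row.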
TALEND OPEN STUDIO
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
tMap Component
PENTAHO DATA INTEGRATION
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Mapping Component
DATASTAGE SERVER
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
Transformer Component
DATASTAGE PX
Job name: PX_file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
Transformer Component
INFORMATICA
Job name: file_input_delimited__file_lookup_delimited__file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema file_output_delimited
Transformer Component
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2.86        3.45            3.45               30.34
1 000 000         3.35        1.66            1.91               8.87
5 000 000         3.05        1.15            1.39               4.13
20 000 000        2.67        1.28            1.13               3.18
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2.03        7.18            1.79               13.72
1 000 000         2.76        3.71            1.46               7.93
5 000 000         3.01        1.73            1.24               4.7
20 000 000        2.52        1.69            1.05               4.28
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           1.47        6.93            0.94               5.48
1 000 000         2.26        5.61            1.05               5.22
5 000 000         3.02        2.64            1.04               4.49
20 000 000        4.01        1.67            1.01               4.67
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           Failed      6.53            0.42               1.71
1 000 000         Failed      5.89            0.43               1.58
5 000 000         Failed      2.49            0.28               1.18
20 000 000        Failed      1.75            0.24               1.13
Test 11: File Input Delimited > Lookup > File Output Delimited && rejects
Scenario:
Reads X lines from a delimited input file and looks up 4 fields in another delimited input file
using the id_client column. The join result is written to a delimited output file and the
rejects to other delimited output files.
1 – Filter rejects: `age` content < 18
2 – Filter rejects: `age` content < 18 and inner join reject
Comments:
Talend Open Studio and DataStage Server are the most ergonomic tools for managing
expression filter rejects and inner join rejects, using the Transformer component (tMap in
Talend Open Studio). For DataStage PX, Pentaho Data Integration and Informatica, we have to
use filter components.
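The second variant (filter rejects plus inner join rejects) can be sketched in Python by routing each main row to one of three outputs. This is an illustration under assumptions, not the logic of any tested tool: the function name `split_with_rejects` and the sample field `city` are hypothetical, `age` is assumed to live in the main file, and the inner join check is assumed to run before the age filter.

```python
def split_with_rejects(main_rows, lookup_rows, key="id_client", min_age=18):
    """Route each main row to the accepted, age-reject or join-reject output."""
    index = {row[key]: row for row in lookup_rows}
    accepted, age_rejects, join_rejects = [], [], []
    for row in main_rows:
        match = index.get(row[key])
        if match is None:
            join_rejects.append(row)  # no lookup row: inner join reject
        elif int(row["age"]) < min_age:
            age_rejects.append({**row, **match})  # filter reject: age < 18
        else:
            accepted.append({**row, **match})
    return accepted, age_rejects, join_rejects


# Hypothetical sample data.
main_rows = [
    {"id_client": "1", "age": "30"},
    {"id_client": "2", "age": "12"},
    {"id_client": "3", "age": "40"},
]
lookup_rows = [
    {"id_client": "1", "city": "Paris"},
    {"id_client": "2", "city": "Lyon"},
]
accepted, age_rejects, join_rejects = split_with_rejects(main_rows, lookup_rows)
```

Handling both reject kinds in one pass corresponds to what a single tMap or Transformer does; with separate filter components the same rows would be routed by a chain of steps instead.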
TALEND OPEN STUDIO
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited (age>=18)
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
tMap Component
PENTAHO DATA INTEGRATION
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Mapping Component
DATASTAGE SERVER
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Transformer Component
DATASTAGE PX
Job name:
PX_file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Transformer Component
INFORMATICA
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Transformer Component
Test results:
Statistics:
Ratio compared with TOS 2.4.1
Number of lines   PDI 3.0.0   DataStage 7.5   DataStage PX 7.5   Informatica 8.1.1
100 000           2.19        3.97            4.64               29.8
1 000 000         2.54        1.56            2.08               8.61
5 000 000         2.65        1.22            1.39               4.37
20 000 000        3           1.42            1.35               3.71
Statistics (ratio compared with TOS 2.4.1):
Number of lines    PDI 3.0.0    DataStage 7.5    DataStage PX 7.5    Informatica 8.1.1
100 000            1,83         6,71             1,76                11,03
1 000 000          2,21         3,66             1,54                7,54
5 000 000          2,51         1,76             1,38                5,23
20 000 000         2,77         1,54             1,39                4,58
Statistics (ratio compared with TOS 2.4.1):
Number of lines    PDI 3.0.0    DataStage 7.5    DataStage PX 7.5    Informatica 8.1.1
100 000            1,38         6,47             0,88                5,78
1 000 000          2,13         4,47             1,18                5,45
5 000 000          2,91         1,7              1,33                4,92
20 000 000         2,52         1,74             1,21                4,75
TALEND OPEN STUDIO
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited (age>=18)
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
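The job name and the three output schemas listed above suggest the following data flow: each input row is joined to the lookup file, rows with no lookup match go to the inner join rejects file, matched rows with age < 18 go to the rejects file, and the remaining rows go to the main output. A minimal Python sketch of that flow is given here for illustration only. The column names (id, name, age, city), the join key, the ';' delimiter, and the order in which the join and age checks are applied are all assumptions; the actual schemas are not reproduced in this text.

```python
import csv
import io


def read_delimited(text, delimiter=";"):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def run_lookup_job(input_rows, lookup_rows, key="id"):
    """Route rows to three outputs that share one schema.

    Assumed routing (not confirmed by the source):
      - no lookup match      -> inner join rejects (lookup column left empty)
      - match and age < 18   -> rejects
      - match and age >= 18  -> main output
    """
    lookup = {row[key]: row for row in lookup_rows}
    accepted, age_rejects, join_rejects = [], [], []
    for row in input_rows:
        match = lookup.get(row[key])
        if match is None:
            join_rejects.append({**row, "city": ""})
            continue
        out = {**row, "city": match["city"]}
        if int(row["age"]) >= 18:
            accepted.append(out)
        else:
            age_rejects.append(out)
    return accepted, age_rejects, join_rejects


# Hypothetical sample data, invented for this sketch.
INPUT = "id;name;age\n1;Ann;34\n2;Bob;12\n3;Cid;51\n"
LOOKUP = "id;city\n1;Paris\n2;Lyon\n"

accepted, age_rejects, join_rejects = run_lookup_job(
    read_delimited(INPUT), read_delimited(LOOKUP)
)
```

With this sample data, Ann lands in the main output, Bob in the age rejects, and Cid (no lookup entry) in the inner join rejects.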
tMap Component
PENTAHO DATA INTEGRATION
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
Mapping Component
DATASTAGE SERVER
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
Transformer Component
DATASTAGE PX
Job name:
PX_file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
Transformer Component
INFORMATICA
Job name:
file_input_delimited__file_lookup_delimited__file_output_delimited__rejects_and_innerjoin_rejects_file_output_delimited
Job
Schema of file_input_delimited
Schema of file_lookup_delimited
Schema of file_output_delimited
Schema of file_output_delimited (age<18) = Schema of file_output_delimited
Schema of file_output_delimited (inner join rejects) = Schema of file_output_delimited
Transformer Component
Statistics (ratio compared with TOS 2.4.1):
Number of lines    PDI 3.0.0    DataStage 7.5    DataStage PX 7.5    Informatica 8.1.1
100 000            1,83         4,22             6,34                39,15
1 000 000          2,3          1,77             2,7                 12,65
5 000 000          2,43         1,22             1,92                5,1
20 000 000         3,07         1,28             1,37                3,46
Statistics (ratio compared with TOS 2.4.1):
Number of lines    PDI 3.0.0    DataStage 7.5    DataStage PX 7.5    Informatica 8.1.1
100 000            1,75         6,73             6,73                15,5
1 000 000          2,21         4,06             1,83                9,08
5 000 000          2,38         2,08             1,45                5,78
20 000 000         2,65         1,57             1,24                4,25
Statistics (ratio compared with TOS 2.4.1):
Number of lines    PDI 3.0.0    DataStage 7.5    DataStage PX 7.5    Informatica 8.1.1
100 000            1,21         3,51             1,18                5,8
1 000 000          1,8          5,96             1,25                5,5
5 000 000          2,05         2,81             1,27                4,78
20 000 000         3,27         1,83             1,06                4,47
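The ratios in these tables use a comma as the decimal separator. The short Python sketch below shows one way to load the rows of the last table for further analysis. It assumes that the four values on each row correspond, in order, to the four non-TOS tools named in the header, and that each ratio is the tool's measurement divided by that of TOS 2.4.1; neither point is stated explicitly in the text. The helper names are hypothetical.

```python
def parse_decimal(text):
    """Convert a comma-decimal string such as '1,21' to a float."""
    return float(text.replace(",", "."))


# Assumed column order: the header's tools with the TOS 2.4.1 baseline left out.
TOOLS = ["PDI 3.0.0", "DataStage 7.5", "DataStage PX 7.5", "Informatica 8.1.1"]

# Rows copied verbatim from the last statistics table above.
RAW_ROWS = {
    "100 000": "1,21 3,51 1,18 5,8",
    "1 000 000": "1,8 5,96 1,25 5,5",
    "5 000 000": "2,05 2,81 1,27 4,78",
    "20 000 000": "3,27 1,83 1,06 4,47",
}

# Map each line count to a {tool: ratio} dictionary.
ratios = {
    lines: dict(zip(TOOLS, map(parse_decimal, values.split())))
    for lines, values in RAW_ROWS.items()
}


def largest_ratio(row):
    """Return the tool with the largest ratio to TOS in one table row."""
    return max(row, key=row.get)


def smallest_ratio(row):
    """Return the tool with the smallest ratio to TOS in one table row."""
    return min(row, key=row.get)
```

For example, `largest_ratio(ratios["1 000 000"])` picks the tool furthest from the TOS baseline at one million lines, under the column assumption stated above.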