
© International Business Machines Corporation 2008

IBM Research - Tokyo

Scalable Performance of System S for Extract-Transform-Load Processing

Toyotaro Suzumura, Toshihiro Yasue and Tamiya Onodera
IBM Research - Tokyo


Outline

Background and Motivation

System S and its suitability for ETL

Performance Evaluation of System S as a Distributed ETL Platform

Performance Optimization

Related Work and Conclusions


What is ETL ?

Extraction: extracting data from different, distributed data sources

Transformation: cleansing and customizing the data for the business needs and rules while transforming the data to match the data warehouse schema

Loading: loading the data into the data warehouse

ETL = Extraction + Transformation + Loading

[Diagram] Data Sources -> Extract -> Transform -> Load -> Data Warehouse
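As an illustration only, the three stages map onto a small pipeline. The following is a minimal Python sketch (not System S/SPADE code); the input file orders.csv, its columns, and the warehouse.db table are hypothetical:

import csv
import sqlite3

def extract(path):
    # Extraction: read raw records from a data source (here, a CSV file).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(records):
    # Transformation: cleanse and reshape records to match the warehouse schema.
    for r in records:
        if not r.get("item"):                 # drop incomplete rows
            continue
        yield (r["item"].strip().upper(), int(r["quantity"]))

def load(rows, db_path="warehouse.db"):
    # Loading: append the transformed rows into the data warehouse.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS warehouse (item TEXT, quantity INTEGER)")
    con.executemany("INSERT INTO warehouse VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))    # Extract -> Transform -> Load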


Data Explosion in ETL

Data Explosion

– The amount of data stored in a typical contemporary data warehouse may double every 12 to 18 months

Data source examples:
– Logs for regulatory compliance (e.g. SOX)
– POS (point-of-sale) transactions of retail stores (e.g. Wal-Mart)
– Web data (e.g. internet auction sites such as eBay)
– CDRs (Call Detail Records) for telecom companies to analyze customer behavior
– Trading data


Near-Real Time ETL

Given the data explosion problem, there is a strong need for ETL processing to be as fast as possible so that business analysts can quickly grasp trends in customer activity


Our Motivation:

Assess the applicability of System S, a data stream processing system, to ETL processing, considering both qualitative and quantitative ETL constraints

Thoroughly evaluate the performance of System S as a scalable, distributed ETL platform to achieve "near-real-time ETL" and address the data explosion problem in the ETL domain


Outline

Background and Motivation

System S and its suitability for ETL

Performance Evaluation of System S as a Distributed ETL Platform

Performance Optimization

Related Work and Conclusions


Stream Computing and System S

System S: Stream Computing Middleware developed by IBM Research

System S is now productized as "InfoSphere Streams".

Traditional computing: fact finding with data at rest. Stream computing: insights from data in motion.


InfoSphere Streams Programming Model

[Diagram] Applications are written in SPADE, drawing on source adapters, sink adapters, and an operator repository, and are compiled with platform-optimized compilation.


SPADE: Advantages of Stream Processing as a Parallelization Model

A stream-centric programming language dedicated to data stream processing

Streams as first-class entities
– Explicit task and data parallelism
– An intuitive way to exploit multiple cores and nodes

Operator and data source profiling for better resource management

Reuse of operators across stored and live data

Support for user-customized operators (UDOP)


A simple SPADE example

[Application]
SourceSink trace

[Nodepool]
Nodepool np := ("host1", "host2", "host3")

[Program]
// virtual schema declaration
vstream Sensor (id: id_t, location: Double, light: Float, temperature: Float, timestamp: timestamp_t)

// a source stream is generated by a Source operator - in this case tuples come from an input file
stream SenSource ( schemaof(Sensor) )
  := Source( ) [ "file:///SenSource.dat" ] {}
  -> node(np, 0)

// this intermediate stream is produced by an Aggregate operator, using the SenSource stream as input
stream SenAggregator ( schemaof(Sensor) )
  := Aggregate( SenSource <count(100), count(1)> ) [ id . location ]
     { Any(id), Any(location), Max(light), Min(temperature), Avg(timestamp) }
  -> node(np, 1)

// this intermediate stream is produced by a Functor operator
stream SenFunctor ( id: Integer, location: Double, message: String )
  := Functor( SenAggregator ) [ log(temperature, 2.0) > 6.0 ]
     { id, location, "Node " + toString(id) + " at location " + toString(location) }
  -> node(np, 2)

// result management is done by a Sink operator - in this case produced tuples are sent to a socket
Null := Sink( SenFunctor ) [ "udp://192.168.0.144:5500/" ] {}
  -> node(np, 0)

Data flow: Source -> Aggregate -> Functor -> Sink


InfoSphere Streams Runtime

[Diagram] Processing Element Containers hosted by the InfoSphere Streams runtime run on a heterogeneous cluster (x86 boxes and blades, Cell blades, FPGA blades) connected through the Streams Data Fabric transport.

The optimizing scheduler assigns operators to processing nodes, and continually manages resource allocation.


System S as a Distributed ETL Platform ?

Can we use System S as a distributed ETL processing platform ?



Outline

Background and Motivation

System S and its suitability for ETL

Performance Evaluation of System S as a Distributed ETL Platform

Performance Optimization

Related Work and Conclusions


Target Application for Evaluation

Inventory processing for multiple warehouses, which includes most of the representative ETL primitives (Sort, Join, and Aggregate)


SPADE Program for Distributed Processing

[Diagram] Three Source operators on the data distribution host read the warehouse item files (Warehouse_20090901_1.txt, _2.txt, _3.txt); the bundled source streams carry about 6 million records. A Functor computes the split key and a Split (key = item, around 60 distinct values) routes tuples to the compute hosts, each of which owns an item-key range (e.g. 0100-0300-00 to 0100-0900-00). A separate Source feeds the Item Catalog. Each compute host (1 to N) runs the chain Sort -> Join -> Sort -> Join -> Aggregate -> Sort -> Functor -> UDOP (SplitDuplicatedTuples), followed by Sink and ODBCAppend operators.


SPADE Program (1/2)

[Nodepools]
nodepool np[] := ("s72x336-00", "s72x336-02", "s72x336-03", "s72x336-04")

[Program]
vstream Warehouse1Schema(id: Integer, item: String, Onhand: String, allocated: String,
  hardAllocated: String, fileNameColumn: String)

vstream Warehouse2OutputSchema(id: Integer, item: String, Onhand: String, allocated: String,
  hardAllocated: String, fileNameColumn: String, description: StringList)

vstream ItemSchema(item: String, description: StringList)

##===================================================
## warehouse 1
##===================================================
bundle warehouse1Bundle := ()

for_begin @i 1 to 3
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(np, 0), partition["Sources"]
warehouse1Bundle += Warehouse1Stream@i
for_end

## stream for computing subindex
stream StreamWithSubindex(schemaFor(Warehouse1Schema), subIndex: Integer)
  := Functor(warehouse1Bundle[:])[]
     { subIndex := (toInteger(strSubstring(item, 6, 2)) / (60 / COMPUTE_NODE_NUM)) - 2 }
  -> node(np, 0), partition["Sources"]

for_begin @i 1 to COMPUTE_NODE_NUM
stream ItemStream@i(schemaFor(Warehouse1Schema), subIndex: Integer)
for_end
  := Split(StreamWithSubindex) [ subIndex ]{}
  -> node(np, 0), partition["Sources"]

for_begin @i 1 to COMPUTE_NODE_NUM
stream Warehouse1Sort@i(schemaFor(Warehouse1Schema))
  := Sort(ItemStream@i <count(SOURCE_COUNT@i)>)[item, asc]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream Warehouse1Filter@i(schemaFor(Warehouse1Schema))
  := Functor(Warehouse1Sort@i)[ Onhand = "0001.000000" ] {}
  -> node(np, @i-1), partition["CMP%@i"]
Nil := Sink(Warehouse1Filter@i)["file:///WAREHOUSE1_OUTPUTFILE@i", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP%@i"]
for_end
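The Split above routes each tuple by a sub-index computed from two digits of the item code, so that roughly 60 distinct key values are spread over COMPUTE_NODE_NUM partitions. A Python sketch of that routing rule (not SPADE; 0-based indexing and the sample item strings are assumptions, chosen so the two-digit field falls in the range the "- 2" offset implies):

COMPUTE_NODE_NUM = 4     # number of compute partitions (assumed for the example)
KEY_RANGE = 60           # the item codes span roughly 60 distinct two-digit values

def sub_index(item: str) -> int:
    # Mirror of the SPADE Functor: take 2 characters at offset 6 and bucket them.
    key = int(item[6:8])                                   # strSubstring(item, 6, 2)
    return key // (KEY_RANGE // COMPUTE_NODE_NUM) - 2      # same arithmetic as above

# Route a few hypothetical warehouse item identifiers to compute partitions.
for item in ["0100-0300-00", "0100-0620-00", "0100-0890-00"]:
    print(item, "-> partition", sub_index(item))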


SPADE Program (2/2)

##====================================================
## warehouse 2
##====================================================
stream ItemsSource(schemaFor(ItemSchema))
  := Source()["file:///ITEMS_FILE", nodelays, csvformat]{}
  -> node(np, 1), partition["ITEMCATALOG"]

stream SortedItems(schemaFor(ItemSchema))
  := Sort(ItemsSource <count(ITEM_COUNT)>)[item, asc]{}
  -> node(np, 1), partition["ITEMCATALOG"]

for_begin @i 1 to COMPUTE_NODE_NUM
stream JoinedItem@i(schemaFor(Warehouse2OutputSchema))
  := Join(Warehouse1Sort@i <count(SOURCE_COUNT@i)>; SortedItems <count(ITEM_COUNT)>)
     [ LeftOuterJoin, {item} = {item} ]{}
  -> node(np, @i-1), partition["CMP%@i"]

##=================================================
## warehouse 3
##=================================================
for_begin @i 1 to COMPUTE_NODE_NUM
stream SortedItems@i(schemaFor(Warehouse2OutputSchema))
  := Sort(JoinedItem@i <count(JOIN_COUNT@i)>)[id, asc]{}
  -> node(np, @i-1), partition["CMP%@i"]

stream AggregatedItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Aggregate(SortedItems@i <count(JOIN_COUNT@i)>) [item . id]
     { Any(id), Any(item), Any(Onhand), Any(allocated), Any(hardAllocated),
       Any(fileNameColumn), Any(description), Cnt() }
  -> node(np, @i-1), partition["CMP%@i"]

stream JoinedItem2@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Join(SortedItems@i <count(JOIN_COUNT@i)>; AggregatedItems@i <count(AGGREGATED_ITEM@i)>)
     [ LeftOuterJoin, {id, item} = {id, item} ] {}
  -> node(np, @i-1), partition["CMP%@i"]

stream SortJoinedItem@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Sort(JoinedItem2@i <count(JOIN_COUNT@i)>)[id(asc).fileNameColumn(asc)]{}
  -> node(np, @i-1), partition["CMP%@i"]

stream DuplicatedItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
stream UniqueItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Udop(SortJoinedItem@i)["FilterDuplicatedItems"]{}
  -> node(np, @i-1), partition["CMP%@i"]

Nil := Sink(DuplicatedItems@i)["file:///DUPLICATED_FILE@i", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP%@i"]

stream FilterStream@i(item: String, recorded_indicator: Integer)
  := Functor(UniqueItems@i)[] { item, 1 }
  -> node(np, @i-1), partition["CMP@i"]

stream AggregatedItems2@i(LoadNum: Integer, Item_Load_Count: Integer)
  := Aggregate(FilterStream@i <count(UNIQUE_ITEM@i)>) [ recorded_indicator ]
     { Any(recorded_indicator), Cnt() }
  -> node(np, @i-1), partition["CMP@i"]

stream AddTimeStamp@i(LoadNum: Integer, Item_Load_Count: Integer, LoadTimeStamp: Long)
  := Functor(AggregatedItems2@i)[] { LoadNum, Item_Load_Count, timeStampMicroseconds() }
  -> node(np, @i-1), partition["CMP@i"]

Nil := Sink(AddTimeStamp@i)["file:///final_result.out", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP@i"]
for_end
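Functionally, this second half enriches each warehouse record with its catalog description (left outer join on item), counts how many times each (item, id) pair occurs, and then separates duplicated from unique records. A compact Python sketch of that logic on in-memory lists (not SPADE; the field names follow the vstream schemas above, and the sample data is invented):

from collections import Counter

# Hypothetical inputs shaped like Warehouse1Schema and ItemSchema tuples.
warehouse = [
    {"id": 1, "item": "A01", "Onhand": "0001.000000"},
    {"id": 1, "item": "A01", "Onhand": "0001.000000"},   # duplicate (item, id) pair
    {"id": 2, "item": "B07", "Onhand": "0003.000000"},
]
catalog = {"A01": "widget", "B07": "gadget"}              # item -> description

# LeftOuterJoin on {item}: keep every warehouse row, attach the description when present.
joined = [dict(r, description=catalog.get(r["item"])) for r in warehouse]

# Aggregate by [item . id] with Cnt(), as in the Aggregate operator.
counts = Counter((r["item"], r["id"]) for r in joined)
for r in joined:
    r["count"] = counts[(r["item"], r["id"])]

# UDOP "FilterDuplicatedItems": split the stream into duplicated and unique tuples.
duplicated = [r for r in joined if r["count"] > 1]
unique = [r for r in joined if r["count"] == 1]
print(len(duplicated), "duplicated,", len(unique), "unique")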


Qualitative Evaluation of SPADE

Implementation

– Lines of SPADE: 76 lines

– # of Operators: 19 (1 UDOP Operator)

Evaluation

– With the built-in operators of SPADE, we could develop the given ETL scenario in a highly productive manner

– The functionality of System S for running a SPADE program on distributed nodes was a great help


Performance Evaluation

[Diagram] 14 nodes in total, each with 4 cores: 3 nodes for data distribution, 1 node for item sorting, and 10 compute hosts (40 cores); hostnames follow the pattern e0101b0${n}e1.

Total nodes: 14 nodes, 56 CPU cores. Per-node spec: Intel Xeon X5365 3.0 GHz (4 physical cores with HT), 16 GB memory, RHEL 5.3 64-bit (Linux kernel 2.6.18-164.el5). Network: InfiniBand (DDR, 20 Gbps) or 1 Gbps network.

Software: InfoSphere Streams (beta version). Data: 9 million records (each record is around 100 bytes).


Node Assignment

[Diagram] Node assignment A: of the 14 nodes (4 cores each, hostnames e0101b0${n}e1), 3 are used for data distribution, 1 for item sorting, and 10 as compute hosts (40 cores); the remaining cores are not used.


Throughput for Processing 9 Million Data

[Chart] Throughput and speedup for processing 9M records: throughput (records/s) and speedup against 4 cores, for 4 to 40 cores. Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand. Node assignment: A.

Maximum throughput: around 180,000 records per second (144 Mbps)
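The bandwidth figure follows from the record size stated in the setup (about 100 bytes per record); a quick Python check of the arithmetic:

records_per_sec = 180_000
record_bytes = 100                              # ~100 bytes per record (experimental setup)
mbps = records_per_sec * record_bytes * 8 / 1_000_000
print(f"{mbps:.0f} Mbps")                       # 144 Mbps; 800,000 records/s gives 640 Mbps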


Analysis (I-a): Breakdown of the Total Time

[Chart] Elapsed time (s) for processing 9M records, broken down into time for computation and time for data distribution, for 4 to 40 cores (1 node has 4 cores). Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand. Node assignment: A.

Data distribution is dominant over computation.


Analysis (I-b): Speedup against 4 cores when considering only the computation part

[Chart] Speedup of throughput for the computation time vs. linear speedup, for 4 to 40 cores. Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand. Node assignment: A.

The computation part scales beyond linear speedup.


CPU Utilization at Compute Hosts

[Chart] CPU utilization at the compute hosts, showing idle periods and computation periods.


Outline

Background and Motivation

System S and its suitability for ETL

Performance Evaluation of System S as a Distributed ETL Platform

Performance Optimization

Related Work and Conclusions


Performance Optimization

The previous experiment shows that most of the time is spent in the data distribution or I/O processing

For performance optimization, we implemented the SPADE program so that all the nodes participate in the data distribution, with each source operator responsible only for a chunk of the data records, i.e. the input divided by the number of source operators
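A minimal Python sketch of the idea (not SPADE): divide the 9 million input records into contiguous chunks, one per source operator, so that each source feeds only its own slice; the helper name and record counts are illustrative:

def chunk_bounds(total_records: int, num_sources: int):
    # Yield (start, end) record ranges so each source operator reads one contiguous chunk.
    base, extra = divmod(total_records, num_sources)
    start = 0
    for i in range(num_sources):
        size = base + (1 if i < extra else 0)
        yield (start, start + size)
        start += size

# Example: 9M records split over 9 source operators (the best-performing setting found later).
for i, (lo, hi) in enumerate(chunk_bounds(9_000_000, 9)):
    print(f"source {i}: records [{lo}, {hi})")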


Performance Optimization

[Diagram] Original SPADE program vs. optimized SPADE program. In the original program, a single data distribution host runs three Source operators (Warehouse_20090901_1/2/3.txt, about 6 million records) feeding one Split (key = item, around 60 distinct values). In the optimized program, every data distribution host runs its own Source and Split (key = item), so several hosts feed the compute hosts in parallel. In both versions each compute host (1 to N) owns an item-key range (e.g. 0100-0300-00 to 0100-0900-00) and runs the chain Sort -> Join -> Sort -> Join -> Aggregate -> Sort -> Functor -> UDOP (SplitDuplicatedTuples) -> Sink / ODBCAppend, while a separate Source feeds the Item Catalog.

1. We modified the SPADE data-flow program in such a way that multiple Source operators participate in the data distribution

2. Each data distribution node can read a chunk of the whole data


Node Assignment

Data distribution:
– All 14 nodes participate in the data distribution.
– Each source operator reads an equal share of the input: the 9M total records divided by the number of source operators.
– The node assignment for the compute nodes is the same as in Experiment I.

[Diagram] All 14 nodes (4 cores each, hostnames e0101b0${n}e1) serve as both data distribution and compute hosts; each node reads its chunk from its local disk.


Elapsed time with varying number of compute nodes and source operators

[Chart] Elapsed time (s) for processing 9M records, with the number of compute cores varied over 20, 24, 28, and 32 and the number of source operators varied over 3, 6, 9, 12, 15, 18, 20, 25, 30, and 45. Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand. Node assignment: C.


Throughput: over 800,000 records/sec

[Chart] Throughput (records/s) with varying number of source operators (3 to 45), for 20, 24, 28, and 32 compute cores. Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand. Node assignment: C.


Scalability : Achieved Super-Linear with Data Distribution Optimization

[Chart] Comparison among various optimizations: speedup ratio against 4 cores vs. number of compute cores (4 to 32) for 1 source operator, 3 source operators, the optimization with 9 source operators (for 20, 24, 28, 32 cores), a further optimization, and linear speedup. Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand. Node assignment: C.


Outline

Background and Motivation

System S and its suitability for ETL

Performance Evaluation of System S as a Distributed ETL Platform

Performance Optimization

Related Work and Conclusions


Related Work

Near Real-Time ETL

– Panos et al. reviewed the state of the art of both conventional and near-real-time ETL [2008 Springer]

ETL Benchmarking

– Wyatt et al. identify common characteristics of ETL workflows in an effort to propose a unified evaluation method for ETL [2009 Springer Lecture Notes]

– TPC-ETL: formed in 2008 and still under development by the TPC subcommittee


Conclusions and Future Work

Conclusions

– Demonstrated the software productivity and scalable performance of System S in the ETL domain

– After the data distribution optimization, we achieved super-linear scalability, processing around 800,000 records per second on 14 nodes

Future Work

– Comparison with the existing ETL tools / systems and various application scenarios (TPC-ETL?)

– Automatic Data Distribution Optimization


Future Direction: Automatic Data Distribution Optimization

We were able to identify the appropriate number of source operators through a series of long-running experiments.

However, it is not practical for a distributed system such as System S to force users/developers to find the appropriate number of source nodes experimentally.

We will need an automatic optimization mechanism that maximizes throughput by finding the best number of source nodes automatically and transparently to the user.

[Diagram] Source operators S1..Sn, each reading a data chunk d1..dn, feed compute operators C1..Cm; the placement n(Si, Cj) of operators onto the node pool (1..P) is what the optimizer must decide.
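A conceptual Python sketch of such an optimizer (not part of System S): try candidate numbers of source operators, measure throughput for each, and keep the best, hiding the search from the user. The measure_throughput callback is hypothetical; in System S it would be fed by the runtime's tuple-rate statistics:

from typing import Callable, Iterable

def pick_source_count(candidates: Iterable[int],
                      measure_throughput: Callable[[int], float]) -> int:
    # Run the workload with each candidate number of source operators and keep the fastest.
    best_n, best_tput = None, float("-inf")
    for n in candidates:
        tput = measure_throughput(n)            # e.g. records/s reported by the runtime
        if tput > best_tput:
            best_n, best_tput = n, tput
    return best_n

# Toy usage with a made-up throughput curve that peaks at 9 source operators.
curve = {3: 300e3, 6: 520e3, 9: 560e3, 12: 540e3, 15: 500e3}
print(pick_source_count(curve, curve.get))      # -> 9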


Questions

Thank You


Backup


Towards Adaptive Optimization

[Diagram] Original SPADE program: a single Source operator S reads the data D and feeds the compute operators C1..Cm. Optimized SPADE program: the data distribution optimizer rewrites it so that source operators S1..Sn, each reading a chunk d1..dn, feed the compute operators.

§ The current SPADE compiler has a compile-time optimizer that collects statistics such as tuple/byte rates and the CPU share of each operator.

§ We would like to let users/developers write a SPADE program in the left-hand style, without considering data partitioning and data distribution.

§ By extending the current optimizer, the system could automatically convert the left-hand program into the right-hand program, which achieves the maximum data distribution.


Executive Summary

[Charts] Comparison among various optimizations: speedup ratio against 4 cores vs. number of compute nodes for 1 source operator, 3 source operators, the optimization with 9 source operators (for 20, 24, 28, 32 cores), a further optimization, and linear speedup. Elapsed time for processing 9 million records (time for computation vs. time for data distribution, 4 to 40 cores): baseline vs. optimized versions.

Motivation:
§ Evaluate System S as an ETL platform in a large experimental environment, the Watson cluster
§ Understand the performance characteristics of such a large testbed, such as scalability and performance bottlenecks

Findings:
§ A series of our experiments has shown that the data distribution cost is dominant in ETL processing
§ The optimized version on the right-hand side shows that changing the number of data feed (source) operators dramatically increases throughput and obtains higher speedups than the other configurations
§ Using the InfiniBand network is critical for an ETL workload of this kind, which includes a barrier before aggregating all the data for the sorting operation; we achieved almost double the performance compared with the 1 Gbps network

[Charts] Throughput with varying number of source nodes (records/s, for 20, 24, 28, and 32 compute cores): comparison between the 1 Gbps network and the InfiniBand network for the optimized version.


Node Assignment (B) for Experiment II

[Diagram] Node assignment B: 14 nodes total (4 cores each, hostnames e0101b0${n}e1); 3 source nodes for data distribution, 1 node for item sorting, and 10 compute hosts (40 cores).

The experimental environment comprises 3 source nodes for data distribution, 1 node for item sorting, and 10 nodes for computation. Each compute node has 4 cores, and we manually allocate each operator with the following scheduling policy: the diagram shows the case in which 32 operators are used for computation, with operators allocated to adjacent nodes in order.


SPADE Program with Data Distribution Optimization

[Diagram] SPADE program with the data distribution optimization: three data distribution nodes (c0101b01, c0101b02, c0101b03) each run their own Source, Split, Functor, and Sink for a warehouse item file (Warehouse_20090901_2.txt). They feed the compute hosts (c0101b05, c0101b06, c0101b07, ..., s72x336-14), each of which runs several copies (numbered 1 to 4 in the diagram, one per core) of the chain Sort -> Join -> Sort -> Join -> Aggregate -> Sort -> Functor -> UDOP (SplitDuplicatedTuples) -> Sink / ODBCAppend.

Since 3 nodes participate in the data distribution, the number of communication paths is at most 120 (3 x 40).


New SPADE Program

for_begin @j 1 to COMPUTE_NODE_NUM

bundle warehouse1Bundle@j := ()

for_end

#define SOURCE_NODE_NUM 3

for_begin @i 0 to SOURCE_NODE_NUM-1

stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))

:= Source()["file:///SOURCEFILE", nodelays, csvformat]{}

-> node(SourcePool, @i), partition["Sources@i"]

stream StreamWithSubindex@i(schemaFor(Warehouse1Schema), subIndex: Integer)

:= Functor(Warehouse1Stream@i)[] {

subIndex := (toInteger(strSubstring(item, 6,2)) / (60 / COMPUTE_NODE_NUM)) }

-> node(SourcePool, @i), partition["Sources@i"]

for_begin @j 1 to COMPUTE_NODE_NUM

stream ItemStream@i@j(schemaFor(Warehouse1Schema), subIndex:Integer)

for_end

:= Split(StreamWithSubindex@i) [ subIndex ]{}

-> node(SourcePool, @i), partition["Sources@i"]

for_begin @j 1 to COMPUTE_NODE_NUM

warehouse1Bundle@j += ItemStream@i@j

for_end

for_end

for_begin @j 1 to COMPUTE_NODE_NUM

stream StreamForWarehouse1Sort@j(schemaFor(Warehouse1Schema))

:= Functor(warehouse1Bundle@j[:])[]{}

-> node(np, @j-1), partition["CMP%@j"]

stream Warehouse1Sort@j(schemaFor(Warehouse1Schema))

:= Sort(StreamForWarehouse1Sort@j <count(SOURCE_COUNT@j)>)[item, asc]{}

-> node(np, @j-1), partition["CMP%@j"]

stream Warehouse1Filter@j(schemaFor(Warehouse1Schema))

:= Functor(Warehouse1Sort@j)[ Onhand="0001.000000" ] {}

-> node(np, @j-1), partition["CMP%@j"]

for_end

For comparison, the corresponding source block in Experiment I (before the optimization) was:

bundle warehouse1Bundle := ()
for_begin @i 1 to 3
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(np, 0), partition["Sources"]
warehouse1Bundle += Warehouse1Stream@i
for_end

Warehouse 2, 3, and 4 are omitted in this chart, but we executed them for the experiment.


Node Assignment (C) for Experiment III

Data Distribution

§ All 14 nodes participate in the data distribution, and each Source operator is assigned in the manner described in the following diagram. For instance, with 24 Source operators the operators are allocated to the nodes in order; once 14 source operators have been allocated to the 14 nodes, the next source operator is allocated to the first node again.

§ Each source operator reads an equal share of the input: the 9M total records divided by the number of source operators. This data division is performed in advance using the Linux tool "split".

§ The node assignment for the compute nodes is the same as in Experiment I.
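A Python sketch of that placement rule (not SPADE): source operators are dealt out to the 14 nodes in order and wrap around once every node has one. The node-name pattern follows the e0101b0${n}e1 labels in the diagrams and is illustrative:

NODES = [f"e0101b0{n}e1" for n in range(1, 15)]   # the 14 hosts, named as in the diagrams

def assign_sources(num_sources: int):
    # Round-robin: operators 1..14 go to nodes 1..14, operator 15 wraps back to node 1, etc.
    return {k: NODES[(k - 1) % len(NODES)] for k in range(1, num_sources + 1)}

assignment = assign_sources(24)
print(assignment[1], assignment[15])              # both land on the first node after the wrap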

[Diagram] All 14 nodes (4 cores each, hostnames e0101b0${n}e1) serve as data distribution hosts, each reading its chunk from local disk; the source operators wrap around the nodes in round-robin order.


Performance Result for Experiment II and Comparison with Experiment I

[Chart] Comparison in throughput between the non-optimized version (EXP20091129a) and the I/O-optimized version, for 4 to 32 cores: throughput (data records per second, left axis) and speedup ratio of the optimized version against the non-optimized version (%, right axis).

When 3 nodes participate in the data distribution, the throughput almost doubles compared with the result of Experiment I.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:B


Analysis (II-a) Optimization by changing the number of source operators

Motivation for this experiment: in the previous page, the throughput saturates around 16 cores because the data feeding rate cannot keep up with the computation.

Experimental environment:
§ We changed the number of source operators while keeping the total data volume (9M records) fixed, and measured the throughput.
§ We only tested 9MDATA-32 (32 operators for computation).

Experimental results: 9 source nodes give the best throughput.

[Chart] Throughput (records/s) with varying number of source operators (3, 5, 9, 15); the total data is the same (9M records) and 32 cores are used for computation. The best result is obtained with 9 source operators.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: B

[Diagram] Node assignment for 9 data distribution nodes: the 9 source operators are placed on the first 9 of the 14 nodes (4 cores each, hostnames e0101b0${n}e1), each reading from its local disk.


Analysis (II-b): Increased Throughput by Data Distribution Optimization

The following graph shows the overall results of applying the same optimization approach as in the previous experiment, i.e. increasing the number of source operators.

3 source operators are used for 4, 8, 12, and 16 cores, and 9 source operators are used for 20, 24, 28, and 32 cores.

We achieved a 5.84x speedup over 4 cores at 32 cores.

[Chart] Increased throughput by data distribution optimization: throughput (data records/s) vs. number of cores (4 to 32) for 1 source operator, 3 source operators, and the optimization with 9 source operators (for 20, 24, 28, 32 cores). Hardware: 14 nodes, Intel Xeon 3.0 GHz, 4 cores, 16 GB RAM, RHEL 5.3 64-bit, InfiniBand. Node assignment: B.


Analysis (II-c) : Increased Throughput by Data Distribution Optimization

[Chart] Speedup against 4 cores vs. number of cores (4 to 32) for 1 source operator, 3 source operators, the optimization with 9 source operators (for 20, 24, 28, 32 cores), and the ideal scalability line.

The yellow line (9 source operators) shows the best performance, since 9 nodes participate in the data distribution for 20, 24, 28, and 32 cores.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:B


Experiment (III): Increasing More Source Operators

Motivation

– In this experiment, we examine the performance characteristics when using more source operators than in the previous experiment (Experiment II).

– We also compare performance between the InfiniBand network and a commodity 1 Gbps network.

Experimental Setting

– We increase the number of source operators from 3 up to 45, and test these configurations with a relatively large number of compute cores: 20, 24, 28, and 32.

– The node assignment for data distribution and computation is the same as in the previous experiment (Experiment II).


Analysis (III-a): Throughput and Elapsed Time

[Chart] Throughput (records/s) with varying number of source operators (3 to 45), for 20, 24, 28, and 32 compute cores.

800,000 tuples/sec (1 tuple = 100 bytes) = 640 Mbps

The maximum total throughput, around 640 Mbps, is below the network bandwidth of both InfiniBand and the 1 Gbps LAN.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C



Analysis (III-c): Performance Without InfiniBand

In this experiment, we measured the throughput without InfiniBand for a varying number of source operators.

Unlike the performance obtained with InfiniBand, the throughput saturates around 12 to 15 source operators.

The result shows that the throughput is around 400,000 data records per second at maximum, which corresponds to around 360 Mbps.

Although the network used in this experiment is nominally 1 Gbps, this appears to be the practical upper limit once the System S overhead is taken into account.

A drastic performance degradation from 15 to 18 source operators can be observed; we assume this is because, once 14 source operators have been allocated to the 14 nodes, two or more operators (processes) access the same 1 Gbps network card simultaneously and resource contention occurs.

[Chart] Throughput comparison (data records/s) with varying number of source operators (2 to 45), for 20, 24, 28, and 32 cores, without InfiniBand.

[Chart] Elapsed time (s) for processing 9M records with varying number of source operators (2 to 45), for 20, 24, 28, and 32 cores, without InfiniBand.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:C



Analysis (III-d) : Comparison between W/O Infiniband and W/ Infiniband

[Chart] Throughput (records/s) with varying number of source operators (3 to 45), for 20, 24, 28, and 32 cores, with InfiniBand.

This chart shows the performance comparison with the InfiniBand network enabled and disabled. The absolute throughput with InfiniBand is double that without it. This result indicates that using InfiniBand is essential to obtaining high throughput for ETL-type workloads.

[Chart] Throughput comparison (data records/s) with varying number of source operators (2 to 45), for 20, 24, 28, and 32 cores, without InfiniBand.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C



Analysis (I-c) Elapsed Time for Distributing 9M Data to Multiple Cores

[Chart] Elapsed time (s) for distributing 9 million records to a varying number of compute cores (4 to 40; 1 node has 4 cores).

The graph demonstrates that the elapsed time for distributing all the data to a varying number of compute cores is nearly constant.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A