Whitepaper Performance Tuning using Upsert and SCD (Task Factory)

Performance Tuning using Upsert and SCD

Written By: Chris Price


Contents

Upserts
  Upserts with SSIS
  Upsert with MERGE
  Upsert with Task Factory Upsert Destination
  Upsert Performance Testing
  Summary
Slowly Changing Dimensions
  Slowly Changing Dimension (SCD) Transform
  Custom SCD with SSIS
  SCD with MERGE
  SCD with Task Factory Dimension Merge
  SCD Performance Testing
  Summary
Wrap-Up

Pragmatic Works White Paper: Performance Tuning using Upsert and SCD

www.pragmaticworks.com

As a SQL Server BI Pro developing SSIS packages, you often encounter scenarios that have a number of different solutions. Choosing the right solution often means balancing tangible performance requirements with more intangible requirements, such as making your packages more maintainable. This white paper focuses on the options for handling two of these scenarios: Upserts and Slowly Changing Dimensions. We will review multiple implementation options for each, discuss how each is accomplished, and review the performance implications and trade-offs of each in terms of complexity, manageability and opportunities for configuring auditing, logging and error handling.

Upserts

Upsert is a portmanteau that blends the distinct actions of an Update and an Insert and describes how both occur in the context of a single execution. Logically speaking, the Upsert process is extremely straightforward. Source rows are compared to a destination; if a match is found based on some specified criteria the row is updated, otherwise the row is considered new and an insert occurs. The process can become more complex if you decide to do conditional updates rather than blind updates, but that is basically it.

To implement an Upsert, you have three primary options in the SQL Server environment. The first and most obvious is to use SSIS and its data flow components to orchestrate the Upsert process, the second is the T-SQL MERGE command, and the third is the Pragmatic Works Task Factory Upsert Destination component.

Upserts with SSIS

Implementing an Upsert using purely SSIS is a trivial task that requires a minimum of four data flow components. Data originating from any source is piped through a Lookup transformation and the output is split in two: one output for rows matched in the lookup and one for rows that were not matched. The no-match output contains new rows that must be inserted using one of the supported destinations in SSIS. The matched rows are those that need to be updated, and an OLE DB Command transformation is used to issue an update for each row.
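The Lookup-and-split logic of the pure SSIS Upsert can be sketched outside of SSIS to make the data flow concrete. The Python below is only an illustration under stated assumptions: the destination is modeled as an in-memory dictionary keyed on a hypothetical "id" column, and the function names are invented for this sketch rather than taken from SSIS.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_upsert(source_rows: List[Row], destination: Dict[object, Row],
                 key: str) -> Tuple[List[Row], List[Row]]:
    """Mimic a Lookup with a no-match output (inserts) and a match output (updates)."""
    inserts, updates = [], []
    for row in source_rows:
        if row[key] in destination:
            updates.append(row)   # matched: candidate for an update
        else:
            inserts.append(row)   # not matched: new row
    return inserts, updates


def apply_upsert(source_rows: List[Row], destination: Dict[object, Row],
                 key: str) -> Tuple[int, int]:
    """Apply blind updates and inserts; returns (insert count, update count)."""
    inserts, updates = split_upsert(source_rows, destination, key)
    for row in inserts:
        destination[row[key]] = dict(row)
    for row in updates:
        destination[row[key]].update(row)   # one update per row, like the OLE DB Command
    return len(inserts), len(updates)


# Hypothetical sample data, not taken from the paper's tests.
destination = {1: {"id": 1, "name": "Ann"}, 2: {"id": 2, "name": "Bob"}}
source = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]
inserted, updated = apply_upsert(source, destination, "id")
```

The update loop touches one row at a time, which is the same shape of work the OLE DB Command performs against the database.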



Figure: Standard SSIS Upsert

As this solution is currently designed, every row from the source will be either inserted or updated. This may or may not be the desired behavior, depending on your business requirements. Most of the time, you will find that you can improve performance by screening out rows that have not changed, which eliminates unnecessary updates. To accomplish this you can use an expression in a Conditional Split, the T-SQL CHECKSUM function (if both your source and destination are SQL Server), or a Script transformation that generates a hash for each row.
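As a rough illustration of the row-hash idea, the sketch below computes a hash over the compared columns and treats a row as changed only when the hashes differ. It is written in Python rather than as an SSIS Script transformation, and the function names, column names and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from typing import Dict, List


def row_hash(row: Dict[str, object], columns: List[str]) -> str:
    """Hash the chosen columns of a row in a fixed column order."""
    # A separator keeps ("ab", "c") distinct from ("a", "bc").
    # Note: in this simple sketch None and "" hash to the same value.
    parts = ["" if row[c] is None else str(row[c]) for c in columns]
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def has_changed(source_row: Dict[str, object], dest_row: Dict[str, object],
                columns: List[str]) -> bool:
    """True when any compared column differs between source and destination."""
    return row_hash(source_row, columns) != row_hash(dest_row, columns)


# Hypothetical rows for illustration.
compare_cols = ["name", "city"]
same = has_changed({"id": 1, "name": "Ann", "city": "Austin"},
                   {"id": 1, "name": "Ann", "city": "Austin"}, compare_cols)
diff = has_changed({"id": 1, "name": "Ann", "city": "Dallas"},
                   {"id": 1, "name": "Ann", "city": "Austin"}, compare_cols)
```

Only rows for which `has_changed` returns True would be routed to the update path; the rest can be dropped from the data flow.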

While this is as simple as an Upsert gets in terms of implementation and maintenance, there are several obvious performance drawbacks to this approach as the volume of data grows. The first is the Lookup transformation. The throughput, in rows per second, that you get through the Lookup transformation is directly correlated to the cache mode you configure on the lookup. Full Cache is the optimal setting, but depending on the size of your destination dataset, the time and amount of memory required may exceed what is available. Partial Cache mode and No Cache mode, on the other hand, are performance killers, and there are limited scenarios in which you should use either option.

The second drawback, and the one most commonly encountered in terms of performance issues, is the OLE DB Command used to handle updates. The update command works row by row, meaning that if you have 10,000 rows to update, 10,000 updates will be issued sequentially. This form of processing is the opposite of the batch processing you may be familiar with and has been termed RBAR, or row-by-agonizing-row, because of the severe effect it has on performance.

Despite these drawbacks, this solution excels when the set of data contains no more than 20,000 rows. If you find that your dataset is larger, there are two workarounds to mitigate the drawbacks, both of which come at the expense of maintainability and ease of use.

When the Lookup transformation is the bottleneck, you can replace it with a Merge Join pattern. The Merge Join pattern facilitates reading both the source and destination in a single pass, which allows for handling large sets of data more efficiently.

To use this pattern, you need an extra source to read in your destination data. Keep in mind that the Merge Join transformation requires two sorted inputs. Allowing the source to handle the sorting is the most efficient option but requires that you configure each source as sorted.

If your source does not support sorting, such as a text file, you must use a Sort transformation. The Sort transformation is a fully blocking transformation, meaning that it must read all rows before it can output anything, which further degrades package performance.

The Merge Join transformation must be configured to use a left join so that both source rows that match the destination and those that do not are passed down the data flow. A Conditional Split is then used to determine whether an Insert or an Update is needed for each row.
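The single-pass behavior of the Merge Join pattern can be illustrated with a small sketch. The Python below is not SSIS code; it assumes both inputs are already sorted on a hypothetical "id" key and that the key is unique in the destination, and it combines the left join and the Conditional Split into one loop.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def merge_join_split(source: List[Row], destination: List[Row],
                     key: str) -> Tuple[List[Row], List[Row]]:
    """Left-join two key-sorted inputs in one pass; returns (inserts, updates)."""
    inserts, updates = [], []
    d = 0  # cursor into the sorted destination
    for row in source:
        # Advance the destination cursor until it reaches or passes the source key.
        while d < len(destination) and destination[d][key] < row[key]:
            d += 1
        if d < len(destination) and destination[d][key] == row[key]:
            updates.append(row)   # join found a match: route to the update path
        else:
            inserts.append(row)   # left side only: route to the insert path
    return inserts, updates


# Hypothetical data; both lists must be sorted by the join key.
source_rows = [{"id": 1}, {"id": 3}, {"id": 4}]
dest_rows = [{"id": 1}, {"id": 2}, {"id": 4}, {"id": 5}]
to_insert, to_update = merge_join_split(source_rows, dest_rows, "id")
```

Because each input is read once and nothing is cached, memory use stays flat as the destination grows, which is the property that makes the pattern attractive for large tables.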

To overcome the row-by-row operation of the OLE DB Command, a staging table is needed to allow a single set-based Update to be called. After you have created the staging table, replace the OLE DB Command with an OLE DB Destination and map the row columns to the columns in the staging table. In the control flow, two Execute SQL Tasks are needed. The first precedes the Data Flow and simply truncates the staging table so that it is empty. The second Execute SQL Task follows the Data Flow and is responsible for issuing the set-based Update.
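The three steps of the staging-table workaround (empty the staging table, bulk load the matched rows, issue one set-based update) can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the SQL dialect differs from T-SQL (for example, DELETE is used where T-SQL would use TRUNCATE TABLE). The table names customer and customer_stage are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_stage (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
""")


def run_batch_update(conn: sqlite3.Connection, matched_rows) -> int:
    """Stage the matched rows, then update the destination in one statement."""
    # Step 1 (first Execute SQL Task): empty the staging table.
    conn.execute("DELETE FROM customer_stage")
    # Step 2 (data flow): bulk load the matched rows into the staging table.
    conn.executemany(
        "INSERT INTO customer_stage (id, name) VALUES (?, ?)", matched_rows)
    # Step 3 (second Execute SQL Task): a single set-based UPDATE.
    cur = conn.execute("""
        UPDATE customer
        SET name = (SELECT s.name FROM customer_stage AS s
                    WHERE s.id = customer.id)
        WHERE id IN (SELECT id FROM customer_stage)
    """)
    conn.commit()
    return cur.rowcount


rows_updated = run_batch_update(conn, [(1, "Anne"), (3, "Cyrus")])
names = conn.execute("SELECT name FROM customer ORDER BY id").fetchall()
```

However many rows are staged, the destination sees exactly one UPDATE statement, in contrast to one statement per row with the OLE DB Command.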

When you combine both of these workarounds, the package handles large sets of data with ease and even rivals the performance of the MERGE statement when working with sets of data that exceed 2 million rows. The trade-off, however, is obvious: supporting and maintaining the package is now an order of magnitude more difficult because of the additional moving pieces and data structures required.



Upsert with MERGE

Unlike the prior solution, which uses SSIS to execute multiple DML statements to perform an Upsert operation, the MERGE feature in SQL Server provides a high-performance and efficient way to perform the Upsert by calling both the Insert and the Update in a single statement.

To implement this solution you must stage all of your source data in a table on the destination database. In the same manner as the prior solution, an SSIS package can be used to orchestrate truncating the staging table, moving the data from the source to the staging table and then executing the MERGE command. The difference lies in the T-SQL MERGE command. While a detailed explanation of the MERGE statement is beyond the scope of this white paper, MERGE combines both inserts and updates into a single pass of the data, using defined criteria to determine when records match and what operations to perform when a match is or is not found.

Figure: T-SQL MERGE statement

The drawback to this method is the complexity of the statement, as the accompanying figure illustrates. Beyond the complexity of the syntax, control is also sacrificed, as the MERGE statement is essentially a black box. When you use the MERGE command you have no control or error handling ability; if a single record fails on either insert or update, the entire transaction is rolled back. It is clear that what the solution provides in terms of performance and efficiency comes at the cost of complexity and loss of control.

A final note on MERGE is also required. If you find yourself working on any version of SQL Server prior to 2008, this solution is not applicable, as the MERGE statement was first introduced in SQL Server 2008.
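The all-or-nothing behavior of a single set-based statement can be demonstrated with a small experiment. The sketch below does not use MERGE or SQL Server; it uses Python's built-in sqlite3 module and a plain INSERT ... SELECT as a stand-in, purely to show that one bad row causes the whole statement to fail with nothing applied. The product and product_stage tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product_stage (id INTEGER, name TEXT);
    INSERT INTO product VALUES (1, 'Widget');
    -- The middle staged row violates the NOT NULL constraint on product.name.
    INSERT INTO product_stage VALUES (2, 'Gadget'), (3, NULL), (4, 'Gizmo');
""")

try:
    # One set-based statement moves every staged row at once.
    conn.execute("""
        INSERT INTO product (id, name)
        SELECT id, name FROM product_stage ORDER BY id
    """)
    conn.commit()
    statement_failed = False
except sqlite3.IntegrityError:
    # A single failing row aborts the statement; roll back the transaction.
    conn.rollback()
    statement_failed = True

product_count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
```

In this sketch the two valid staged rows are not kept either, which mirrors the loss of row-level error handling described above: there is no opportunity to redirect only the failing row.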



Upsert with Task Factory Upsert Destination

The Upsert Destination is a component included in the Pragmatic Works Task Factory library of components and is a balanced alternative when implementing an Upsert operation. Without sacrificing performance, much of the complexity is abstracted away from the developer and is boiled down to configuring settings across three tabs.

To implement the Upsert Destination, drag and drop the Upsert Destination component onto your data flow design surface. The component requires an ADO.NET connection, so you will need to create one if one does not already exist. From there, you simply configure the destination table, map your source columns to destination columns (making sure to identify the key column) and choose your update method, and you are ready to go.

Figure: Task Factory Upsert Destination UI

The Upsert Destination supports four update methods out of the box. The first and fastest is the Bulk Update. This method is similar to the one discussed previously, in that all rows that exist in the destination are updated. You can also fine-tune the update by choosing to do updates based on timestamps, a last-updated column or even a configurable column comparison. Beyond the update method, you can easily configure the component to update a Last Modified column, enable identity inserts, provide insert and update row counts, and take control over the transactional container.

While none of these features are unique to the Task Factory Upsert Destination, the ease with which you can be up and running is huge in terms of a developer's time and effort. When you consider that there are no staging tables required, no special requirements of the source data and no workarounds needed, and that the component works with SQL Server 2005 and up, it is a solid option to consider.



Upsert Performance Testing

To assess each of the methods discussed, a simple test was performed. Each test used the bulk update method, in which all rows are either inserted or updated. The testing methodology required that each test be run three times; the average execution time of the three executions was then used to calculate the throughput in rows per second. The results were then paired with rankings for each method according to complexity, manageability and configurability.

Prior to each test being run, the SQL Server cache and buffers were cleared using DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS. All tests were run on an IBM x220 laptop with an i7 2640M processor and 16GB of RAM. A default install of SQL Server 2012, with the maximum server memory set to 2GB, was used for all database operations.
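The throughput figure described in the methodology is simply the row count divided by the average of the three execution times. A minimal sketch of that calculation follows; the function name is invented for this example and the timings are hypothetical values, not measurements from the tests.

```python
from statistics import mean
from typing import Sequence


def throughput_rows_per_second(row_count: int,
                               run_times_seconds: Sequence[float]) -> float:
    """Rows per second based on the average execution time across runs."""
    return row_count / mean(run_times_seconds)


# Hypothetical timings for three runs of a 10,000-row test case.
example = throughput_rows_per_second(10_000, [1.9, 2.0, 2.1])
```

With these made-up timings the average run takes about 2.0 seconds, so the result is roughly 5,000 rows per second.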

Test Cases

Test Case Size    Rows Inserted    Rows Updated
10,000            6,500            3,500
100,000           65,000           35,000
500,000           325,000          175,000
1,000,000         650,000          350,000



Performance Results

Overall Results (Rows per Second)

Test Case Size    Merge          Upsert Destination    SSIS (Batch)    SSIS
10,000            6917.223887    5169.73979            6609.385327     4144.791379
100,000           28873.91723    19040.36558           28533.38406     1448.862402
500,000           37736.79841    24491.79525           36840.55408     1525.442861
1,000,000         36777.32555    24865.93119           33549.91668     1596.765592

Rankings (1 = best, 4 = worst)

Method                Performance    Complexity    Manageability    Configurability
Merge                 1              4             4                4
Upsert Destination    3              1             2                3
SSIS (Batch)          2              3             3                2
SSIS                  4              2             1                1



Summary

As expected, from a pure performance perspective the Upsert with MERGE outperformed all other methods of implementing an Upsert operation. It also easily topped all others in terms of complexity while being the least manageable and least configurable. The SSIS (Batch) method also performed well, as it is able to take advantage of bulk inserts into a staging table followed by a set-based update. While not as complex as the MERGE method, it does require both sorted sources and staging tables, which ultimately bumps its manageability down. The Upsert Destination performed well and was the only method whose performance did not degrade throughout testing. It also tested out as the least complex and one of the most manageable methods for implementing an Upsert operation. Finally, the pure SSIS implementation, while easy to manage and allowing for the greatest degree of configuration, performed the worst.



Slowly Changing Dimensions

When Slowly Changing Dimensions are discussed, the two primary types considered are Type-1 and Type-2. Although the two types differ in whether history is tracked when the dimension changes, the fundamental implementation of each is the same. In terms of implementation options, you have three available out of the box. You can use the Slowly Changing Dimension transformation, implement custom slowly changing dimension logic or use the Insert over MERGE. A fourth option is available using the Task Factory Dimension Merge transformation. No matter which option you choose, understanding the strengths and weaknesses of each is critical to selecting the best solution for the task at hand.

Slowly Changing Dimension (SCD) Transform

The SCD Transform is a wizard-based component that consists of five steps. The first step in the wizard requires that you select the destination dimension table, map the input columns and identify key columns. The second step allows you to configure the SCD type for each column. The three types, Fixed (Type-0), Changing (Type-1) and Historical (Type-2), allow for mixing Slowly Changing Dimension types within the dimension table. The third, fourth and fifth steps allow for further configuration of the SCD implementation by letting you configure the behavior for Fixed and Changing attributes, define how the Historical versions are implemented and finally set up support for inferred members.

Once the wizard completes, a series of new transformations are added to the data flow of your package to implement the configured solution. While the built-in SCD Transform excels in ease of use, its numerous drawbacks have been thoroughly discussed and dissected in a number of books, blogs and white papers.

Figure: Built-in SCD Transform



Starting with performance, the SCD Transform underachieves both in the way in which source and dimension rows are compared within the transform and in its reliance on the OLE DB Command to handle the expiration of Type-2 rows as well as Type-1 updates. As discussed previously, the OLE DB Command is a row-by-row operation, which is a significant drag on performance.

Manageability is also an issue, since it is not possible to re-enter the wizard to change or update the configuration without the transformation regenerating each of the downstream data flow transformations. This may or may not be a huge issue depending on your requirements, but it can be a headache if you have manually updated the downstream transforms for either performance tuning or functionality reasons.

Despite its numerous issues, the SCD Transform has its place. If your dimension is small and performance is not an issue, this transform may be suitable, as it is the easiest to implement and requires nothing beyond the default installation of SSIS.

Custom SCD with SSIS

Implementing a custom SCD solution is handled in a manner similar to the output of the SCD Transform. Instead of relying on the SCD Transform to look up and then compare rows, you as the developer implement each of those tasks using data flow transformations. In its simplest form, a custom SCD would use a Lookup transformation to look up the dimension rows. New rows that were not matched would be bulk inserted using the OLE DB Destination. Rows that matched would need to be compared using an expression, the T-SQL CHECKSUM function or another of the methods that were previously discussed. A Conditional Split transformation would be used to send each matched row to the appropriate output, whether Type-1, Type-2 or Ignored for rows that have not changed.
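The routing performed by the Lookup and Conditional Split in a custom SCD can be sketched as a single classification function. The Python below is an illustration only; the function name and column names are hypothetical, and the rule that a Type-2 change takes precedence when a row has both kinds of change is an assumption of this sketch rather than something prescribed here.

```python
from typing import Dict, List

Row = Dict[str, object]


def classify_row(source_row: Row, dimension: Dict[object, Row], key: str,
                 type1_cols: List[str], type2_cols: List[str]) -> str:
    """Return the output a source row would be routed to."""
    current = dimension.get(source_row[key])   # the Lookup step
    if current is None:
        return "New"                           # no match: bulk insert
    # Assumption: a Type-2 change wins if both kinds of change are present.
    if any(source_row[c] != current[c] for c in type2_cols):
        return "Type-2"
    if any(source_row[c] != current[c] for c in type1_cols):
        return "Type-1"
    return "Ignored"                           # matched but unchanged


# Hypothetical dimension keyed on a business key, with one current row per key.
dimension = {10: {"id": 10, "phone": "555-0100", "city": "Austin"}}
routes = [
    classify_row({"id": 11, "phone": "555-0111", "city": "Reno"},
                 dimension, "id", ["phone"], ["city"]),
    classify_row({"id": 10, "phone": "555-0199", "city": "Austin"},
                 dimension, "id", ["phone"], ["city"]),
    classify_row({"id": 10, "phone": "555-0100", "city": "Dallas"},
                 dimension, "id", ["phone"], ["city"]),
    classify_row({"id": 10, "phone": "555-0100", "city": "Austin"},
                 dimension, "id", ["phone"], ["city"]),
]
```

Each label corresponds to one output of the Conditional Split, and what happens on each output (row-by-row command or staged set-based update) remains a separate design decision.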

The Custom SCD implementation gives you the most flexibility, as you would expect, since you are responsible for implementing each and every step. While this flexibility can be beneficial, it also adds complexity to the solution, particularly when the SCD is extended to implement additional features such as surrogate key management and inferred member support.

Figure: Custom SCD

Performance is another area of concern. Building the Custom SCD allows you to bypass the lookup and match performance issues associated with the built-in SCD Transform, but if you use OLE DB Commands you will still face the performance penalty of row-by-row operations. Issues could also arise with the lookup as the dimension grows.

Stepping back to the discussion on Upserts with SSIS, two patterns are applicable to help you get around these performance issues. The Merge Join pattern will optimize and facilitate lookups against large dimension tables, while staging tables let you perform set-based updates instead of using the RBAR approach. Both of these patterns will improve performance but add further complexity to the overall solution.

SCD with MERGE

Implementing a Slowly Changing Dimension with the T-SQL MERGE is an almost identical solution to the one discussed in the Upsert with MERGE, with just two key differences. First, a straightforward set-based update is executed to handle all the Type-1 changes. Next, instead of a straight MERGE statement as done with the Upsert, an Insert over MERGE is used to handle the expiration of Type-2 rows as well as inserting the new version of the row.

For the MERGE to work, the matching criterion is configured such that only matching rows with Type-2 changes are affected. The update statement simply expires the current row. The Insert over MERGE statement takes advantage of the OUTPUT clause, which allows you to pass the columns from your source and the merge action, in the form of the $action variable, back out of the merge. Using this functionality you can screen the rows that were updated and pass them back into an insert statement to complete the Type-2 change.
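The control flow of the Insert over MERGE can be modeled in a few lines to show how the output of the inner step feeds the outer insert. This Python sketch is a simulation, not T-SQL: the inner function plays the role of the MERGE with its OUTPUT clause, returning an action label alongside each source row, and the outer function filters on that label. The start_date and end_date columns, the function names and the handling of brand-new rows inside the inner step are assumptions of this example, and source keys are assumed to be unique.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Dict[str, object]


def merge_with_output(dimension: List[Row], source_rows: List[Row], key: str,
                      type2_cols: List[str], today: date) -> List[Tuple[str, Row]]:
    """Inner step: expire changed rows, insert new keys, report each action."""
    output = []
    for src in source_rows:
        current = next((d for d in dimension
                        if d[key] == src[key] and d["end_date"] is None), None)
        if current is None:
            # Not matched: insert the row as the current version.
            dimension.append({**src, "start_date": today, "end_date": None})
            output.append(("INSERT", src))
        elif any(src[c] != current[c] for c in type2_cols):
            # Matched with a Type-2 change: expire the current row only.
            current["end_date"] = today
            output.append(("UPDATE", src))
    return output


def insert_over_merge(dimension: List[Row], source_rows: List[Row], key: str,
                      type2_cols: List[str], today: date) -> List[Tuple[str, Row]]:
    """Outer step: insert a new version for every row the merge expired."""
    output = merge_with_output(dimension, source_rows, key, type2_cols, today)
    for action, src in output:
        if action == "UPDATE":
            dimension.append({**src, "start_date": today, "end_date": None})
    return output


# Hypothetical dimension and source data.
dimension = [{"id": 1, "city": "Austin",
              "start_date": date(2020, 1, 1), "end_date": None}]
source = [{"id": 1, "city": "Dallas"}, {"id": 2, "city": "Reno"}]
output = insert_over_merge(dimension, source, "id", ["city"], date(2024, 1, 1))
```

After the call, the Austin row is expired, and the dimension holds current rows for Dallas (the new Type-2 version) and Reno (a new member).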

The benefits and drawbacks to this solution are exactly the same as with the Upsert using MERGE. This solution performs extremely well at the expense of both complexity and manageability.

Figure: Sample Insert over MERGE



SCD with Task Factory Dimension Merge

Like the built-in SCD Transform, the Task Factory Dimension Merge uses a wizard to allow for easy configuration of slowly changing dimension support. You start by composing the existing dimension, which includes identifying the business and surrogate keys as well as configuring the SCD type for each dimension column. Column mappings between the source input and the destination dimension are then defined and can be tweaked by dragging and dropping the columns to create mappings.

From there, you get into more refined or advanced configuration than is available in other implementations. You can configure the Row Change Detection to ignore case, leading/trailing spaces and nulls during comparisons. Advanced date handling is supported for Type-2 changes to allow both specific date endpoints and flexible flag columns to indicate current rows. Other advanced features include built-in surrogate key handling, inferred member support, input and output row count auditing, advanced component logging so you know what is happening internally, and a performance tab that allows you to suppress warnings, set timeouts, configure threading and select a hashing algorithm to use.

Figure: Task Factory Dimension Merge UI



The Task Factory Dimension Merge does not perform any of the inserts or updates required for the Slowly Changing Dimension. Instead, each row is directed to one or more outputs, and the outputs are then handled by the developer working with the transformation. Standard outputs are available for New, Updated Type-1, Type-2 Expiry/Type-1 Combined, Type-2 New, Invalid, Unchanged and Deleted rows. In addition, outputs are provided for auditing and statistical information. The flexibility this implementation provides allows the developer to choose the level of complexity of the implementation in terms of either a row-by-row or a set-based update approach.

Figure: Task Factory Dimension Merge implementation

Performance-wise, the Task Factory Dimension Merge is comparable to the Custom SCD implementation. While the Custom SCD implementation will outperform the Dimension Merge on smaller sets of data, the Dimension Merge excels as the data set grows. Much like the Task Factory Upsert Destination, the Dimension Merge also benefits from simplicity in set-up and manageability, saving you both time and effort. Unlike the built-in SCD Transform, it lets you edit the transformation configuration at any time without losing anything downstream.



SCD Performance Testing

Continuing the testing methodology used for the Upsert testing, a similar test was constructed for each SCD implementation discussed. Each test consisted of a set of source data that contained both Type-1 and Type-2 changes as well as new rows and rows that were unchanged. Every test was run three times, and the average execution time was used to calculate the throughput in rows per second. The hardware and environment set-up were the same as previously noted.

Test Cases

Source Size     New       Type-1    Type-2    Unchanged
15,000 rows     5,000     500       500       9,000
50,000 rows     20,000    1,000     1,000     23,000
100,000 rows    25,000    5,000     5,000     65,000



Performance Results

Overall Results (Rows per Second)

Source Size     Built-In SCD    Custom SCD    Dimension Merge    Merge
15,000 Rows     297.626921      3669.87441    2543.666271        10804.322
50,000 Rows     205.451308      2560.73203    2095.733087        15166.835
100,000 Rows    170.500949      406.19859     501.1501396        18192.844

Rankings (1 = best, 4 = worst)

Method             Performance    Complexity    Manageability    Configurability
Built-In SCD       4              1             3                3
Custom SCD         2              3             2                2
Dimension Merge    2              2             1                1
Merge              1              4             4                4



Summary

The big winner in terms of performance was the MERGE implementation and, much like the previous test, it was also the most complex, least configurable and least manageable. The Dimension Merge and Custom SCD implementations are the most balanced approaches. Both are similar in performance, with the Dimension Merge gaining an edge in terms of complexity, manageability and configurability. The built-in SCD transformation, as expected, performed the worst, yet it is the simplest solution.



Wrap-Up

When it comes time to implement an Upsert and/or a Slowly Changing Dimension, you clearly have options. Oftentimes, business requirements and your environment will help eliminate one or more possible solutions. What remains requires that you balance performance needs with complexity, manageability and the opportunity for configuration, whether it be to support auditing, logging or error handling.

Integration Services offers you the opportunity to implement each of these tasks with a varying degree of support. When you use the out-of-the-box tools, however, regardless of the implementation selected, performance and complexity are directly correlated. The Task Factory Upsert Destination and Dimension Merge, on the other hand, both represent a balanced implementation. Both components offer tangible performance while limiting the complexity found in other implementations. In addition, both will save you time and effort in implementing either an Upsert or a Slowly Changing Dimension.
