BusinessObjects Data Integrator XI 3.0/3.1: Core Concepts

Learner’s Guide
BOI300


Copyright © 2009 SAP® BusinessObjects™. All rights reserved. SAP BusinessObjects owns the following United States patents, which may cover products that are offered and licensed by SAP BusinessObjects and/or affiliated companies: 5,295,243; 5,339,390; 5,555,403; 5,590,250; 5,619,632; 5,632,009; 5,857,205; 5,880,742; 5,883,635; 6,085,202; 6,108,698; 6,247,008; 6,289,352; 6,300,957; 6,377,259; 6,490,593; 6,578,027; 6,581,068; 6,628,312; 6,654,761; 6,768,986; 6,772,409; 6,831,668; 6,882,998; 6,892,189; 6,901,555; 7,089,238; 7,107,266; 7,139,766; 7,178,099; 7,181,435; 7,181,440; 7,194,465; 7,222,130; 7,299,419; 7,320,122 and 7,356,779. SAP BusinessObjects and its logos, BusinessObjects, Crystal Reports®, Rapid Mart™, Data Insight™, Desktop Intelligence™, Rapid Marts®, Watchlist Security™, Web Intelligence®, and Xcelsius® are trademarks or registered trademarks of Business Objects, an SAP company, and/or affiliated companies in the United States and/or other countries. SAP® is a registered trademark of SAP AG in Germany and/or other countries. All other names mentioned herein may be trademarks of their respective owners.


Contents

About this Course
• Course introduction
• Course description
• Course audience
• Prerequisites
• Additional education
• Level, delivery, and duration
• Course success factors
• Course setup
• Course materials
• Learning process

Lesson 1: Describing Data Services
• Lesson introduction
• Describing the purpose of Data Services
  • Describing Data Services benefits
  • Understanding data integration processes
  • Understanding the Data Services packages
• Describing Data Services architecture
  • Defining Data Services components
  • Describing the Designer
  • Describing the repository
  • Describing the Job Server
  • Describing the engines
  • Describing the Access Server
  • Describing the adapters
  • Describing the real-time services
  • Describing the Address Server
  • Describing the Cleansing Packages, dictionaries, and directories
  • Describing the Management Console
  • Defining other Data Services tools
• Defining Data Services objects
  • Understanding Data Services objects
  • Defining relationship between objects
  • Defining projects and jobs
  • Using work flows
  • Describing the object hierarchy
• Using the Data Services Designer interface
  • Describing the Designer window
  • Using the Designer toolbar
  • Using the Local Object Library
  • Using the project area
  • Using the tool palette
  • Using the workspace
• Quiz: Describing Data Services
• Lesson summary

Lesson 2: Defining Source and Target Metadata
• Lesson introduction
• Using datastores
  • Explaining datastores
  • Using adapters
  • Creating a database datastore
  • Changing a datastore definition
  • Importing metadata from data sources
  • Importing metadata by browsing
  • Activity: Creating source and target datastores
• Using datastore and system configurations
  • Creating multiple configurations in a datastore
  • Activity: Modifying the datastore connection for internal jobs
  • Creating a system configuration
• Defining file formats for flat files
  • Explaining file formats
  • Creating file formats
  • Handling errors in file formats
  • Activity: Creating a file format for a flat file
• Defining file formats for Excel files
  • Using Excel as a native data source
  • Activity: Creating a file format for an Excel file
• Defining file formats for XML files
  • Importing data from XML documents
  • Importing metadata from a DTD file
  • Importing metadata from an XML schema
  • Explaining nested data
  • Unnesting data
• Quiz: Defining source and target metadata
• Lesson summary

Lesson 3: Creating Batch Jobs
• Lesson introduction
• Working with objects
  • Creating a project
  • Creating a job
  • Adding, connecting, and deleting objects in the workspace
  • Creating a work flow
  • Defining the order of execution in work flows
• Creating a data flow
  • Using data flows
  • Using data flows as steps in work flows
  • Changing data flow properties
  • Explaining source and target objects
  • Adding source and target objects
• Using the Query transform
  • Describing the transform editor
  • Explaining the Query transform
• Using target tables
  • Accessing the target table editor
  • Setting target table options
  • Using template tables
• Executing the job
  • Explaining job execution
  • Setting execution properties
  • Executing the job
  • Activity: Creating a basic data flow
• Quiz: Creating batch jobs
• Lesson summary

Lesson 4: Troubleshooting Batch Jobs
• Lesson introduction
• Using descriptions and annotations
  • Using descriptions with objects
  • Using annotations to describe objects
• Validating and tracing jobs
  • Validating jobs
  • Tracing jobs
  • Using log files
  • Examining trace logs
  • Examining monitor logs
  • Examining error logs
  • Using the Monitor tab
  • Using the Log tab
  • Determining the success of the job
  • Activity: Setting traces and adding annotations
• Using View Data and the Interactive Debugger
  • Using View Data with sources and targets
  • Using the Interactive Debugger
  • Setting filters and breakpoints for a debug session
  • Activity: Using the Interactive Debugger
• Setting up auditing
  • Setting up auditing
  • Defining audit points
  • Defining audit labels
  • Defining audit rules
  • Defining audit actions
  • Choosing audit points
  • Activity: Using auditing in a data flow
• Quiz: Troubleshooting batch jobs
• Lesson summary

Lesson 5: Using Functions, Scripts, and Variables
• Lesson introduction
• Defining built-in functions
  • Defining functions
  • Listing the types of operations for functions
  • Defining other types of functions
• Using functions in expressions
  • Defining functions in expressions
  • Activity: Using the search_replace function
• Using the lookup function
  • Using lookup tables
  • Activity: Using the lookup_ext() function
• Using the decode function
  • Explaining the decode function
  • Activity: Using the decode function
• Using scripts, variables, and parameters
  • Defining scripts
  • Defining variables
  • Defining parameters
  • Combining scripts, variables, and parameters
  • Defining global versus local variables
  • Setting global variables using job properties
  • Defining substitution parameters
• Using Data Services scripting language
  • Using basic syntax
  • Using syntax for column and table references in expressions
  • Using operators
  • Reviewing script examples
  • Using strings and variables
  • Using quotation marks
  • Using escape characters
  • Handling nulls, empty strings, and trailing blanks
• Scripting a custom function
  • Creating a custom function
  • Importing a stored procedure as a function
  • Activity: Creating a custom function
• Quiz: Using functions, scripts, and variables
• Lesson summary

Lesson 6: Using Platform Transforms
• Lesson introduction
• Describing platform transforms
  • Explaining transforms
  • Describing platform transforms
• Using the Map Operation transform
  • Describing map operations
  • Explaining the Map Operation transform
  • Activity: Using the Map Operation transform
• Using the Validation transform
  • Explaining the Validation transform
  • Activity: Using the Validation transform
• Using the Merge transform
  • Explaining the Merge transform
  • Activity: Using the Merge transform
• Using the Case transform
  • Explaining the Case transform
  • Activity: Using the Case transform
• Using the SQL transform
  • Explaining the SQL transform
  • Activity: Using the SQL transform
• Quiz: Using platform transforms
• Lesson summary

Lesson 7: Setting up Error Handling
• Lesson introduction
• Using recovery mechanisms
  • Avoiding data recovery situations
  • Describing levels of data recovery strategies
  • Configuring work flows and data flows
  • Using recovery mode
  • Recovering from partially-loaded data
  • Recovering missing values or rows
  • Defining alternative work flows
  • Using try/catch blocks and automatic recovery
  • Activity: Creating an alternative work flow
• Quiz: Setting up error handling
• Lesson summary

Lesson 8: Capturing Changes in Data
• Lesson introduction
• Updating data over time
  • Explaining Slowly Changing Dimensions (SCD)
  • Updating changes to data
  • Explaining history preservation and surrogate keys
  • Comparing source-based and target-based CDC
• Using source-based CDC
  • Using source tables to identify changed data
  • Using CDC with timestamps
  • Managing overlaps
  • Activity: Using source-based CDC
• Using target-based CDC
  • Using target tables to identify changed data
  • Identifying history preserving transforms
  • Explaining the Table Comparison transform
  • Explaining the History Preserving transform
  • Explaining the Key Generation transform
  • Activity: Using target-based CDC
• Quiz: Capturing changes in data
• Lesson summary

Lesson 9: Using Data Integrator Transforms
• Lesson introduction
• Describing Data Integrator transforms
  • Defining Data Integrator transforms
• Using the Pivot transform
  • Explaining the Pivot transform
  • Activity: Using the Pivot transform
• Using the Hierarchy Flattening transform
  • Explaining the Hierarchy Flattening transform
  • Activity: Using the Hierarchy Flattening transform
• Describing performance optimization
  • Describing push-down operations
  • Viewing SQL generated by a data flow
  • Caching data
  • Slicing processes
• Using the Data Transfer transform
  • Explaining the Data Transfer transform
  • Activity: Using the Data Transfer transform
• Using the XML Pipeline transform
  • Explaining the XML Pipeline transform
  • Activity: Using the XML Pipeline transform
• Quiz: Using Data Integrator transforms
• Lesson summary

Answer Key
• Quiz: Describing Data Services
• Quiz: Defining source and target metadata
• Quiz: Creating batch jobs
• Quiz: Troubleshooting batch jobs
• Quiz: Using functions, scripts, and variables
• Quiz: Using platform transforms
• Quiz: Setting up error handling
• Quiz: Capturing changes in data
• Quiz: Using Data Integrator transforms


Agenda
BusinessObjects Data Integrator XI 3.0/3.1: Core Concepts

Introductions, Course Overview (30 minutes)

Lesson 1: Describing Data Services (2 hours)
❒ Describing the purpose of Data Services
❒ Describing Data Services architecture
❒ Defining Data Services objects
❒ Using the Data Services Designer interface

Lesson 2: Defining Source and Target Metadata (1 hour)
❒ Using datastores
❒ Using datastore and system configurations
❒ Defining file formats for flat files
❒ Defining file formats for Excel files
❒ Defining file formats for XML files

Lesson 3: Creating Batch Jobs (1.5 hours)
❒ Working with objects
❒ Creating a data flow
❒ Using the Query transform
❒ Using target tables
❒ Executing the job

Lesson 4: Troubleshooting Batch Jobs (1.5 hours)
❒ Using descriptions and annotations
❒ Validating and tracing jobs
❒ Using View Data and the Interactive Debugger
❒ Setting up auditing

Lesson 5: Using Functions, Scripts, and Variables (2.5 hours)
❒ Defining built-in functions
❒ Using functions in expressions
❒ Using the lookup function
❒ Using the decode function
❒ Using scripts, variables, and parameters
❒ Using Data Services scripting language
❒ Scripting a custom function

Lesson 6: Using Platform Transforms (2.5 hours)
❒ Describing platform transforms
❒ Using the Map Operation transform
❒ Using the Validation transform
❒ Using the Merge transform
❒ Using the Case transform
❒ Using the SQL transform

Lesson 7: Setting up Error Handling (1 hour)
❒ Using recovery mechanisms

Lesson 8: Capturing Changes in Data (3 hours)
❒ Updating data over time
❒ Using source-based CDC
❒ Using target-based CDC

Lesson 9: Using Data Integrator Transforms (3 hours)
❒ Describing Data Integrator transforms
❒ Using the Pivot transform
❒ Using the Hierarchy Flattening transform
❒ Describing performance optimization
❒ Using the Data Transfer transform
❒ Using the XML Pipeline transform


About this Course

Course introduction

This section explains the conventions used in the course and in this training guide.


Course description

BusinessObjects™ Data Integrator XI 3.0/3.1 enables you to integrate disparate data sources to deliver more timely and accurate data that end users in an organization can trust. In this three-day course, you will learn about creating, executing, and troubleshooting batch jobs; using functions, scripts, and transforms to change the structure and formatting of data; handling errors; and capturing changes in data.

By creating efficient data integration projects, you can use the transformed data to improve operational and supply chain efficiencies, enhance customer relationships, create new revenue opportunities, and optimize return on investment from enterprise applications.

Course audience

The target audience for this course is individuals responsible for implementing, administering, and managing data integration projects.

Prerequisites

To be successful, learners who attend this course should have experience with the following:

• Knowledge of data warehousing and ETL concepts
• Experience with MySQL and SQL language
• Experience using functions, elementary procedural programming, and flow-of-control statements such as If-Then-Else and While Loop statements

It is also recommended that you review the following articles, which can be found at http://www.rkimball.com/html/articles.html:
• Data Warehouse Fundamentals: TCO Starts with the End User and Fact Tables and Dimension Tables
• Data Warehouse Architecture and Modeling: There Are No Guarantees
• Advanced Dimension Table Topics: Surrogate Keys, It's Time for Time, and Slowly Changing Dimensions
• Industry- and Application-Specific Issues: Think Globally, Act Locally
• Data Staging and Data Quality: Dealing with Dirty Data

Additional education

To increase your skill level and knowledge of Data Services, the following courses are recommended:
• BusinessObjects Data Quality XI 3.0/3.1: Core Concepts
• BusinessObjects Data Integrator XI R2 Accelerated: Advanced Workshop


Level, delivery, and duration

This instructor-led core offering is a three-day course.

Course success factors

Your learning experience will be enhanced by:
• Activities that build on the life experiences of the learner
• Discussion that connects the training to real working environments
• Learners and instructor working as a team
• Active participation by all learners

Course setup

Refer to the setup guide for details on hardware, software, and course-specific requirements.

Course materials

The materials included with this course are:
• Name card
• Learner’s Guide

  The Learner’s Guide contains an agenda, learner materials, and practice activities. It is designed to assist students who attend the classroom-based course and outlines what learners can expect to achieve by participating in this course.

• Evaluation form

  At the conclusion of this course, you will receive an electronic feedback form as part of our evaluation process. Provide feedback on the course content, instructor, and facility. Your comments will help us improve future courses.

Additional resources include:
• Sample files

  The sample files can include required files for the course activities and/or supplemental content to the training guide.

• Online Help

  Retrieve information and find answers to questions using the online Help and/or user’s guide included with the product.

Learning process

Learning is an interactive process between the learners and the instructor. By facilitating a cooperative environment, the instructor guides the learners through the learning framework.


Introduction

Why am I here? What’s in it for me?

The learners will be clear about what they are getting out of each lesson.

Objectives

How do I achieve the outcome?

The learners will assimilate new concepts and learn how to apply the ideas presented in the lesson. This step sets the groundwork for practice.

Practice

How do I do it?

The learners will demonstrate their knowledge as well as their hands-on skills through the activities.

Review

How did I do?

The learners will have an opportunity to review what they have learned during the lesson. Review reinforces why it is important to learn particular concepts or skills.

Summary

Where have I been and where am I going?

The summary acts as a recap of the learning objectives and as a transition to the next section.


Lesson 1: Describing Data Services

Lesson introduction

Data Services is a graphical interface for creating and staging jobs for data integration and data quality purposes.

After completing this lesson, you will be able to:

• Describe the purpose of Data Services
• Describe Data Services architecture
• Define Data Services objects
• Use the Data Services Designer interface


Describing the purpose of Data Services

Introduction

BusinessObjects Data Services provides a graphical interface that allows you to easily create jobs that extract data from heterogeneous sources, transform that data to meet the business requirements of your organization, and load the data into a single location.

Note: Although Data Services can be used for both real-time and batch jobs, this course covers batch jobs only.

After completing this unit, you will be able to:

• List the benefits of Data Services
• Describe data integration processes
• Describe the functionality available in Data Services packages

Describing Data Services benefits

The Business Objects Data Services platform enables you to perform enterprise-level data integration and data quality functions. With Data Services, your enterprise can:
• Create a single infrastructure for data movement to enable faster and lower cost implementation.
• Manage data as a corporate asset independent of any single system.
• Integrate data across many systems and re-use that data for many purposes.
• Improve performance.
• Reduce burden on enterprise systems.
• Prepackage data solutions for fast deployment and quick return on investment (ROI).
• Cleanse customer and operational data anywhere across the enterprise.
• Enhance customer and operational data by appending additional information to increase the value of the data.
• Match and consolidate data at multiple levels within a single pass for individuals, households, or corporations.

Understanding data integration processes

Data Services combines both batch and real-time data movement and management with intelligent caching to provide a single data integration platform for information management from any information source and for any information use. This unique combination allows you to:
• Stage data in an operational datastore, data warehouse, or data mart.
• Update staged data in batch or real-time modes.
• Create a single environment for developing, testing, and deploying the entire data integration platform.
• Manage a single metadata repository to capture the relationships between different extraction and access methods and provide integrated lineage and impact analysis.

Data Services performs three key functions that can be combined to create a scalable, high-performance data platform. It:
• Loads Enterprise Resource Planning (ERP) or enterprise application data into an operational datastore (ODS) or analytical data warehouse, and updates in batch or real-time modes.
• Creates routing requests to a data warehouse or ERP system using complex rules.
• Applies transactions against ERP systems.

Data mapping and transformation can be defined using the Data Services Designer graphical user interface. Data Services automatically generates the appropriate interface calls to access the data in the source system.

For most ERP applications, Data Services generates SQL optimized for the specific target database (Oracle, DB2, SQL Server, Informix, and so on). Automatically generated, optimized code reduces the cost of maintaining data warehouses and enables you to build data solutions quickly, meeting user requirements faster than other methods (for example, custom coding, direct-connect calls, or PL/SQL).
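To make this concrete, here is a minimal sketch of the kind of single statement Data Services might push down to the database when a data flow filters and aggregates one table into another within the same datastore. The SALES source table, the SALES_SUMMARY target table, and their columns are hypothetical, and the SQL actually generated depends on your data flow design and target database; viewing the SQL that a real data flow generates is covered in the performance optimization topics later in this course.

   -- Hypothetical pushed-down statement: the filter and aggregation run
   -- entirely inside the database rather than row by row in the engine.
   INSERT INTO SALES_SUMMARY (REGION_ID, ORDER_YEAR, TOTAL_REVENUE)
   SELECT REGION_ID,
          EXTRACT(YEAR FROM ORDER_DATE),
          SUM(ORDER_AMOUNT)
   FROM SALES
   WHERE ORDER_STATUS = 'SHIPPED'
   GROUP BY REGION_ID, EXTRACT(YEAR FROM ORDER_DATE);

Executing the work where the data lives means only the summarized rows, if any, leave the database, which is typically where the largest performance gain from generated, optimized SQL comes from.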

Data Services can apply data changes in a variety of data formats, including any custom format using a Data Services adapter. Enterprise users can apply data changes against multiple back-office systems singularly or sequentially. By generating calls native to the system in question, Data Services makes it unnecessary to develop and maintain customized code to manage the process.

You can also design access intelligence into each transaction by adding flow logic that checks values in a data warehouse or in the transaction itself before posting it to the target ERP system.

Understanding the Data Services packages

Data Services provides a wide range of functionality, depending on the package and options selected:
• Data Integrator packages provide platform transforms for core functionality, and Data Integrator transforms to enhance data integration projects.
• Data Quality packages provide platform transforms for core functionality, and Data Quality transforms to parse, standardize, cleanse, enhance, match, and consolidate data.
• Data Services packages provide all of the functionality of both the Data Integrator and Data Quality packages.

When your Data Services projects are based on enterprise applications such as SAP, PeopleSoft, Oracle, JD Edwards, Salesforce.com, and Siebel, BusinessObjects Rapid Marts provide specialized versions of Data Services functionality. Rapid Marts combine domain knowledge with data integration best practices to deliver prebuilt data models, transformation logic, and data extraction. Rapid Marts are packaged, powerful, and flexible data integration solutions that help organizations:

• Jumpstart business intelligence deployments and accelerate time to value
• Deliver best-practice data warehousing solutions
• Develop custom solutions to meet your unique requirements


Describing Data Services architecture

Introduction

Data Services relies on several unique components to accomplish the data integration and data quality activities required to manage your corporate data.

After completing this unit, you will be able to:

• Describe standard Data Services components
• Describe Data Services management tools

Defining Data Services components

Data Services includes the following standard components:
• Designer
• Repository
• Job Server
• Engines
• Access Server
• Adapters
• Real-time Services
• Address Server
• Cleansing Packages, Dictionaries, and Directories
• Management Console

This diagram illustrates the relationships between these components:


Describing the Designer

Data Services Designer is a Windows client application used to create, test, and manually execute jobs that transform data and populate a data warehouse. Using Designer, you create data management applications that consist of data mappings, transformations, and control logic.

You can create objects that represent data sources, and then drag, drop, and configure them in flow diagrams.

Designer allows you to manage metadata stored in a local repository. From the Designer, you can also trigger the Job Server to run your jobs for initial application testing.

To log in to Designer

1. From the Start menu, click Programs ➤ BusinessObjects XI 3.0/3.1 ➤ BusinessObjects Data Services ➤ Data Services Designer to launch Designer.
   The path may be different, depending on how the product was installed.

2. In the BusinessObjects Data Services Repository Login dialog box, enter the connection information for the local repository.

3. Click OK.

4. To verify the Job Server is running in Designer, hover the cursor over the Job Server icon in the bottom right corner of the screen.
   The details for the Job Server display in the status bar in the lower left portion of the screen.


Describing the repository

The Data Services repository is a set of tables that holds user-created and predefined system objects, source and target metadata, and transformation rules. It is set up on an open client/server platform to facilitate sharing metadata with other enterprise tools. Each repository is stored on an existing Relational Database Management System (RDBMS).

There are three types of repositories:
• A local repository (known in Designer as the Local Object Library) is used by an application designer to store definitions of source and target metadata and Data Services objects.
• A central repository (known in Designer as the Central Object Library) is an optional component that can be used to support multi-user development. The Central Object Library provides a shared library that allows developers to check objects in and out for development.
• A profiler repository is used to store information that is used to determine the quality of data.

Each repository is associated with one or more Data Services Job Servers.

To create a local repository

1. From the Start menu, click Programs ➤ BusinessObjects XI 3.0/3.1 ➤ BusinessObjects Data Services ➤ Data Services Repository Manager to launch the Repository Manager.
   The path may be different, depending on how the product was installed.

2. In the BusinessObjects Data Services Repository Manager dialog box, enter the connection information for the local repository.


3. Click Create.
   You may need to confirm that you want to overwrite the existing repository, if it already exists.
   If you select the Show Details check box, you can see the SQL that is applied to create the repository. System messages confirm that the local repository is created.

4. To see the version of the repository, click Get Version.
   The version displays in the pane at the bottom of the dialog box. Note that the version number refers only to the last major point release number.

5. Click Close.

Describing the Job Server

Each repository is associated with at least one Data Services Job Server, which retrieves the job from its associated repository and starts the data movement engine. The data movement engine integrates data from multiple heterogeneous sources, performs complex data transformations, and manages extractions and transactions from ERP systems and other sources. The Job Server can move data in batch or real-time mode and uses distributed query optimization, multithreading, in-memory caching, in-memory data transformations, and parallel processing to deliver high data throughput and scalability.

While designing a job, you can run it from the Designer. In your production environment, the Job Server runs jobs triggered by a scheduler or by a real-time service managed by the Data Services Access Server. In production environments, you can balance job loads by creating a Job Server Group (multiple Job Servers), which executes jobs according to overall system load.

Data Services provides distributed processing capabilities through Server Groups. A Server Group is a collection of Job Servers that each reside on different Data Services server computers. Each Data Services server can contribute one, and only one, Job Server to a specific Server Group. Each Job Server collects resource utilization information for its computer. This information is used by Data Services to determine where a job, data flow, or sub-data flow (depending on the distribution level specified) should be executed.

To verify the connection between repository and Job Server

1. From the Start menu, click Programs ➤ BusinessObjects XI 3.0/3.1 ➤ BusinessObjects Data Services ➤ Data Services Server Manager to launch the Server Manager.
   The path may be different, depending on how the product was installed.


2. In the BusinessObjects Data Services Server Manager dialog box, click Edit Job Server Config.

3. In the Job Server Configuration Editor dialog box, select the Job Server.


4. Click Resync with Repository.

5. In the Job Server Properties dialog box, select the repository.

6. Click Resync.


A system message displays indicating that the Job Server will be resynchronized with the selected repository.

7. Click OK to acknowledge the warning message.

8. In the Password field, enter the password for the repository.

9. Click Apply.

10. Click OK to close the Job Server Properties dialog box.

11. Click OK to close the Job Server Configuration Editor dialog box.

12. In the BusinessObjects Data Services Server Manager dialog box, click Restart to restart the Job Server.
    A system message displays indicating that the Job Server will be restarted.

13. Click OK.

Describing the engines

When Data Services jobs are executed, the Job Server starts Data Services engine processes to perform data extraction, transformation, and movement. Data Services engine processes use parallel processing and in-memory data transformations to deliver high data throughput and scalability.

Describing the Access Server

The Access Server is a real-time, request-reply message broker that collects incoming XML message requests, routes them to a real-time service, and delivers a message reply within a user-specified time frame. The Access Server queues messages and sends them to the next available real-time service across any number of computing resources. This approach provides automatic scalability because the Access Server can initiate additional real-time services on additional computing resources if traffic for a given real-time service is high.

You can configure multiple Access Servers.

Describing the adapters

Adapters are additional Java-based programs that can be installed on the Job Server to provide connectivity to other systems such as Salesforce.com or the Java Messaging Queue. There is also a Software Development Kit (SDK) to allow customers to create adapters for custom applications.

Describing the real-time services

The Data Services real-time client communicates with the Access Server when processing real-time jobs. Real-time services are configured in the Data Services Management Console.


Describing the Address Server

The Address Server is used specifically for processing European addresses using the Data Quality Global Address Cleanse transform. It provides access to detailed address line information for most European countries.

Describing the Cleansing Packages, dictionaries, and directories

The Data Quality Cleansing Packages, dictionaries, and directories provide referential data for the Data Cleanse and Address Cleanse transforms to use when parsing, standardizing, and cleansing name and address data.

Cleansing Packages are packages that enhance the ability of Data Cleanse to accurately process various forms of global data by including language-specific reference data and parsing rules. Directories provide information on addresses from postal authorities; dictionary files are used to identify, parse, and standardize data such as names, titles, and firm data. Dictionaries also contain acronym, match standard, gender, capitalization, and address information.

Describing the Management Console

The Data Services Management Console provides access to the following features:
• Administrator
• Auto Documentation
• Data Validation
• Impact and Lineage Analysis
• Operational Dashboard
• Data Quality Reports

Administrator

Administer Data Services resources, including:
• Scheduling, monitoring, and executing batch jobs
• Configuring, starting, and stopping real-time services
• Configuring Job Server, Access Server, and repository usage
• Configuring and managing adapters
• Managing users
• Publishing batch jobs and real-time services via web services
• Reporting on metadata

Auto Documentation

View, analyze, and print graphical representations of all objects as depicted in Data Services Designer, including their relationships, properties, and more.


Data Validation

Evaluate the reliability of your target data based on the validation rules you create in your Data Services batch jobs in order to quickly review, assess, and identify potential inconsistencies or errors in source data.

Impact and Lineage Analysis

Analyze end-to-end impact and lineage for Data Services tables and columns, and BusinessObjects Enterprise objects such as universes, business views, and reports.

Operational Dashboard

View dashboards of status and performance execution statistics of Data Services jobs for one or more repositories over a given time period.

Data Quality Reports

Use data quality reports to view and export Crystal reports for batch and real-time jobs that include statistics-generating transforms. Report types include job summaries, transform-specific reports, and transform group reports.

To generate reports for Match, US Regulatory Address Cleanse, and Global Address Cleanse transforms, you must enable the Generate report data option in the Transform Editor.

Defining other Data Services tools

There are also several tools to assist you in managing your Data Services installation.

Describing the Repository Manager

The Data Services Repository Manager allows you to create, upgrade, and check the versions of local, central, and profiler repositories.

Describing the Server Manager

The Data Services Server Manager allows you to add, delete, or edit the properties of Job Servers. It is automatically installed on each computer on which you install a Job Server.

Use the Server Manager to define links between Job Servers and repositories. You can link multiple Job Servers on different machines to a single repository (for load balancing) or each Job Server to multiple repositories (with one default) to support individual repositories (for example, separating test and production environments).

Describing the License Manager

The License Manager displays the Data Services components for which you currently have a license.


Describing the Metadata Integrator

The Metadata Integrator allows Data Services to seamlessly share metadata with BusinessObjects Intelligence products. Run the Metadata Integrator to collect metadata into the Data Services repository for Business Views and Universes used by Crystal Reports, Desktop Intelligence documents, and Web Intelligence documents.


Defining Data Services objects

Introduction
Data Services provides you with a variety of objects to use when you are building your data integration and data quality applications.

After completing this unit, you will be able to:

• Define the objects available in Data Services
• Explain relationships between objects

Understanding Data Services objects

In Data Services, all entities you add, define, modify, or work with are objects. Some of the most frequently-used objects are:
• Projects
• Jobs
• Work flows
• Data flows
• Transforms
• Scripts

This diagram shows some common objects.

All objects have options, properties, and classes. Each can be modified to change the behavior of the object.

Options

Options control the object. For example, to set up a connection to a database, the database name is an option for the connection.


Properties

Properties describe the object. For example, the name and creation date describe what the object is used for and when it became active. Attributes are properties used to locate and organize objects.

Classes

Classes define how an object can be used. Every object is either re-usable or single-use.

Single-use objects

Single-use objects appear only as components of other objects. They operate only in the context in which they were created.

Note: You cannot copy single-use objects.

Re-usable objects

A re-usable object has a single definition and all calls to the object refer to that definition. If you change the definition of the object in one place, and then save the object, the change is reflected to all other calls to the object.

Most objects created in Data Services are available for re-use. After you define and save a re-usable object, Data Services stores the definition in the repository. You can then re-use the definition as often as necessary by creating calls to it.

For example, a data flow within a project is a re-usable object. Multiple jobs, such as a weekly load job and a daily load job, can call the same data flow. If this data flow is changed, both jobs call the new version of the data flow.

You can edit re-usable objects at any time independent of the current open project. For example, if you open a new project, you can open a data flow and edit it. However, the changes you make to the data flow are not stored until you save them.

Defining relationships between objects

Jobs are composed of work flows and/or data flows:
• A work flow is the incorporation of several data flows into a sequence.
• A data flow is the process by which source data is transformed into target data.

A work flow orders data flows and the operations that support them. It also defines the interdependencies between data flows.

For example, if one target table depends on values from other tables, you can use the work flow to specify the order in which you want Data Services to populate the tables. You can also use work flows to define strategies for handling errors that occur during project execution, or to define conditions for running sections of a project.

This diagram illustrates a typical work flow.


A data flow defines the basic task that Data Services accomplishes, which involves moving data from one or more sources to one or more target tables or files. You define data flows by identifying the sources from which to extract data, the transformations the data should undergo, and targets.

Defining projects and jobs

A project is the highest-level object in Designer. Projects provide a way to organize the other objects you create in Designer.

A job is the smallest unit of work that you can schedule independently for execution. A project is a single-use object that allows you to group jobs. For example, you can use a project to group jobs that have schedules that depend on one another or that you want to monitor together.

Projects have the following characteristics:
• Projects are listed in the Local Object Library.
• Only one project can be open at a time.
• Projects cannot be shared among multiple users.

The objects in a project appear hierarchically in the project area. If a plus sign (+) appears next to an object, you can expand it to view the lower-level objects contained in the object. Data Services displays the contents as both names and icons in the project area hierarchy and in the workspace.

Note: Jobs must be associated with a project before they can be executed in the project area of Designer.

Using work flows

Jobs with data flows can be developed without using work flows. However, one should consider nesting data flows inside of work flows by default. This practice can provide various benefits.

Always using work flows makes jobs more adaptable to additional development and/or specification changes. For instance, if a job initially consists of four data flows that are to run sequentially, they could be set up without work flows. But what if specification changes require that they be merged into another job instead? The developer would have to replicate their sequence correctly in the other job. If these had been initially added to a work flow, the developer could then have simply copied that work flow into the correct position within the new job.


There would be no need to learn, copy, and verify the previous sequence. The change can be made more quickly with greater accuracy.

Even if there is one data flow per work flow, there are benefits to adaptability. Initially, it may have been decided that recovery units are not important; the expectation being that if the job fails, the whole process could simply be rerun. However, as data volumes tend to increase, it may be determined that a full reprocessing is too time consuming. The job may then be changed to incorporate work flows to benefit from recovery units to bypass reprocessing of successful steps. However, these changes can be complex and can consume more time than allotted for in a project plan. It also opens up the possibility that units of recovery are not properly defined. Setting these up during initial development, when the nature of the processing is being most fully analyzed, is preferred.

Describing the object hierarchy

In the repository, objects are grouped hierarchically from a project, to jobs, to optional work flows, to data flows. In jobs, work flows define a sequence of processing steps, and data flows move data from source tables to target tables.

This illustration shows the hierarchical relationships for the key object types within Data Services:


This course focuses on creating batch jobs using database datastores and file formats.


Using the Data Services Designer interface

Introduction
The Data Services Designer interface allows you to plan and organize your data integration and data quality jobs in a visual way. Most of the components of Data Services can be programmed through this interface.

After completing this unit, you will be able to:

• Explain how Designer is used
• Describe key areas in the Designer window

Describing the Designer window

The Data Services Designer interface consists of a single application window and several embedded supporting windows. The application window contains the menu bar, toolbar, Local Object Library, project area, tool palette, and workspace.

Tip: You can access the Data Services Technical Manuals for reference or help through the Designer interface Help menu. These manuals are also accessible by going through Start ➤ Programs ➤ Business Objects XI 3.0/3.1 ➤ BusinessObjects Data Services ➤ Data Services Documentation ➤ Technical Manuals.


Using the Designer toolbar

In addition to many of the standard Windows toolbar buttons, Data Services provides the following unique toolbar buttons:

• Save All: Saves all new or updated objects.
• Close All Windows: Closes all open windows in the workspace.
• Local Object Library: Opens and closes the Local Object Library window.
• Central Object Library: Opens and closes the Central Object Library window.
• Variables: Opens and closes the Variables and Parameters window.
• Project Area: Opens and closes the project area.
• Output: Opens and closes the Output window.
• View Enabled Descriptions: Enables the system-level setting for viewing object descriptions in the workspace.
• Validate Current View: Validates the object definition open in the active tab of the workspace. Other objects included in the definition are also validated.
• Validate All Objects in View: Validates all object definitions open in the workspace. Objects included in the definition are also validated.
• Audit: Opens the Audit window. You can collect audit statistics on the data that flows out of any Data Services object.
• View Where Used: Opens the Output window, which lists parent objects (such as jobs) of the object currently open in the workspace (such as a data flow).
• Back: Moves back in the list of active workspace windows.
• Forward: Moves forward in the list of active workspace windows.
• Data Services Management Console: Opens and closes the Data Services Management Console, which provides access to Administrator, Auto Documentation, Data Validation, Lineage and Impact Analysis, Operational Dashboard, and Data Quality Reports.
• Assess and Monitor: Opens Data Insight, which allows you to assess and monitor the quality of your data.
• Contents: Opens the Data Services Technical Manuals.

Using the Local Object Library

The Local Object Library gives you access to the following object types. Each object type appears on its own tab of the Local Object Library; the descriptions indicate the Data Services context in which you can use each type of object.

• Projects are sets of jobs available at a given time.
• Jobs are executable work flows. There are two job types: batch jobs and real-time jobs.
• Work flows order data flows and the operations that support data flows, defining the interdependencies between them.
• Data flows describe how to process a task.
• Transforms operate on data, producing output data sets from the sources you specify. The Local Object Library lists platform, Data Integrator, and Data Quality transforms.
• Datastores represent connections to databases and applications used in your project. Under each datastore is a list of the tables, documents, and functions imported into Data Services.
• Formats describe the structure of a flat file, Excel file, XML file, or XML message.
• Custom functions are functions written in the Data Services Scripting Language.
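Custom functions and script objects both use this scripting language. As a very small illustration only (not taken from the course files; the variable name is invented), a script might assign a value to a global variable and print it:

# $G_StartDate is a hypothetical global variable defined at the job level
$G_StartDate = sysdate();
print('Load started on [$G_StartDate]');

Statements end with a semicolon, variable names begin with a dollar sign, and square brackets inside a quoted string substitute the variable's value into the printed text.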

You can import objects to and export objects from your Local Object Library as a file. Importing objects from a file overwrites existing objects with the same names in the destination Local Object Library.

Whole repositories can be exported in either .atl or .xml format. Using the .xml file format can make repository content easier for you to read. It also allows you to export Data Services to other products.

To import a repository from a file

1. On any tab of the Local Object Library, right-click the white space and select Repository ➤ Import from File from the menu.
The Open Import File dialog box displays.

2. Browse to the destination for the file.

3. Click Open.
A warning message displays to let you know that it takes a long time to create new versions of existing objects.

4. Click OK.
You must restart Data Services after the import process completes.

To export a repository to a file

1. On any tab of the Local Object Library, right-click the white space and select Repository ➤ Export To File.
The Write Repository Export File dialog box displays.

2. Browse to the destination for the export file.

3. In the File name field, enter the name of the export file.

4. In the Save as type list, select the file type for your export file.

5. Click Save.
The repository is exported to the file.

Using the project area

The project area provides a hierarchical view of the objects used in each project. Tabs on the bottom of the project area support different tasks. Tabs include:


• Create, view, and manage projects. This tab provides a hierarchical view of all objects used in each project.
• View the status of currently executing jobs. Selecting a specific job execution displays its status, including which steps are complete and which steps are executing. These tasks can also be done using the Data Services Management Console.
• View the history of completed jobs. Logs can also be viewed with the Data Services Management Console.

To change the docked position of the project area

1. Right-click the border of the project area.

2. From the menu, select Floating.

3. Click and drag the project area to dock and undock at any edge within Designer.
When you drag the project area away from a window edge, it stays undocked. When you position the project area where one of the directional arrows highlights a portion of the window, this signifies a placement option. The project area does not dock inside the workspace area.

4. To switch between the last docked and undocked locations, double-click the gray border.

To change the undocked position of the project area

1. Right-click the border of the project area.

2. From the menu, select Floating to remove the check mark and clear the docking option.

3. Click and drag the project area to any location on your screen.

To lock and unlock the project area

1. Click the pin icon ( ) on the border to unlock the project area.
The project area hides.

2. Move the mouse over the docked pane.
The project area re-appears.

3. Click the pin icon to lock the pane in place again.


To hide/show the project area

1. Right-click the border of the project area.

2. From the menu, select Hide.
The project area disappears from the Designer window.

3. To show the project area, click Project Area in the toolbar.

Using the tool palette

The tool palette is a separate window that appears by default on the right edge of the Designer workspace. You can move the tool palette anywhere on your screen or dock it on any edge of the Designer window.

The icons in the tool palette allow you to create new objects in the workspace. The icons are disabled when they are invalid entries to the diagram open in the workspace.

To show the name of each icon, hold the cursor over the icon until the tool tip for the icon appears.

When you create an object from the tool palette, you are creating a new definition of an object. If a new object is re-usable, it is automatically available in the Local Object Library after you create it.

For example, if you select the data flow icon from the tool palette and define a new data flow called DF1, you can later drag that existing data flow from the Local Object Library and add it to another data flow called DF2.

The tool palette contains these objects:

• Pointer: Returns the tool pointer to a selection pointer for selecting and moving objects in a diagram. (Available in: all objects)
• Work flow: Creates a new work flow. (Available in: jobs and work flows)
• Data flow: Creates a new data flow. (Available in: jobs and work flows)
• R/3 data flow: Creates a new data flow with the SAP licensed extension only. (Available in: SAP licensed extension)
• Query transform: Creates a query to define column mappings and row selections. (Available in: data flows)
• Template table: Creates a new table for a target. (Available in: data flows)
• Template XML: Creates a new XML file for a target. (Available in: data flows)
• Data transport: Creates a data transport flow for the SAP licensed extension. (Available in: SAP licensed extension)
• Script: Creates a new script object. (Available in: jobs and work flows)
• Conditional: Creates a new conditional object. (Available in: jobs and work flows)
• While Loop: Repeats a sequence of steps in a work flow as long as a condition is true. (Available in: work flows)
• Try: Creates a new try object that tries an alternate work flow if an error occurs in a job. (Available in: jobs and work flows)
• Catch: Creates a new catch object that catches errors in a job. (Available in: jobs and work flows)
• Annotation: Creates an annotation used to describe objects. (Available in: jobs, work flows, and data flows)

Using the workspace

When you open a job or any object within a job hierarchy, the workspace becomes active with your selection. The workspace provides a place to manipulate objects and graphically assemble data movement processes.

These processes are represented by icons that you drag and drop into a workspace to create a diagram. This diagram is a visual representation of an entire data movement application or some part of a data movement application.

You specify the flow of data by connecting objects in the workspace from left to right in the order you want the data to be moved.


Quiz: Describing Data Services

1. List two benefits of using Data Services.

2. Which of these objects is single-use?

a. Job

b. Project

c. Data flow

d. Work flow

3. Place these objects in order by their hierarchy: data flows, jobs, projects, and work flows.

4. Which tool do you use to associate a job server with a repository?

5. Which tool allows you to create a repository?

6. What is the purpose of the Access Server?


Lesson summary
After completing this lesson, you are now able to:

• Describe the purpose of Data Services
• Describe Data Services architecture
• Define Data Services objects
• Use the Data Services Designer interface


Lesson 2
Defining Source and Target Metadata

Lesson introduction
To define data movement requirements in Data Services, you must import source and target metadata.

After completing this lesson, you will be able to:

• Use datastores
• Use datastore and system configurations
• Define file formats for flat files
• Define file formats for Excel files
• Define file formats for XML files


Using datastores

Introduction
Datastores represent connections between Data Services and databases or applications.

After completing this unit, you will be able to:

• Explain datastores
• Create a database datastore
• Change a datastore definition
• Import metadata

Explaining datastores

A datastore provides a connection or multiple connections to data sources such as a database. Through the datastore connection, Data Services can import the metadata that describes the data from the data source.

Data Services uses these datastores to read data from source tables or load data to target tables. Each source or target must be defined individually, and the datastore options available depend on which Relational Database Management System (RDBMS) or application is used for the datastore. Database datastores can be created for the following sources:
• IBM DB2, Microsoft SQL Server, Oracle, Sybase, and Teradata databases (using native connections)
• Other databases (through ODBC)
• A simple memory storage mechanism using a memory datastore
• IMS, VSAM, and various additional legacy systems using BusinessObjects Data Services Mainframe Interfaces such as Attunity and IBM Connectors

The specific information that a datastore contains depends on the connection. When your database or application changes, you must make corresponding changes in the datastore information in Data Services. Data Services does not automatically detect structural changes to the datastore.

There are three kinds of datastores:
• Database datastores: provide a simple way to import metadata directly from an RDBMS.
• Application datastores: let users easily import metadata from most Enterprise Resource Planning (ERP) systems.
• Adapter datastores: can provide access to an application's data and metadata or just metadata. For example, if the data source is SQL-compatible, the adapter might be designed to access metadata, while Data Services extracts data from or loads data directly to the application.


Using adapters

Adapters provide access to a third-party application's data and metadata. Depending on the adapter implementation, adapters can provide:
• Application metadata browsing
• Application metadata importing into the Data Services repository

For batch and real-time data movement between Data Services and applications, BusinessObjects offers an Adapter Software Development Kit (SDK) to develop your own custom adapters. You can also buy Data Services prepackaged adapters to access application data and metadata in any application.

For more information on these adapters, see Chapter 5 in the Data Services Designer Guide.

You can use the Data Mart Accelerator for Crystal Reports adapter to import metadata from BusinessObjects Enterprise. See the documentation folder under Adapters located in your Data Services installation for more information on the Data Mart Accelerator for Crystal Reports.

Creating a database datastore

You need to create at least one datastore for each database or file system with which you are exchanging data. To create a datastore, you must have appropriate access privileges to the database or file system that the datastore describes. If you do not have access, ask your database administrator to create an account for you.

To create a database datastore

1. On the Datastores tab of the Local Object Library, right-click the white space and select New from the menu.

The Create New Datastore dialog box displays.

2. In the Datastore name field, enter the name of the new datastore.
The name can contain any alphanumeric characters or underscores (_). It cannot contain spaces.

3. In the Datastore Type drop-down list, ensure that the default value of Database is selected.

4. In the Database type drop-down list, select the RDBMS for the data source.

5. Enter the other connection details, as required.
The values you select for the datastore type and database type determine the options available when you create a database datastore. The entries that you must make to create a datastore depend on the selections you make for these two options. Note that if you are using MySQL, any ODBC connection provides access to all of the available MySQL schemas.

6. Leave the Enable automatic data transfer check box selected.


7. Click OK.

Changing a datastore definition

Like all Data Services objects, datastores are defined by both options and properties:

• Options control the operation of objects. These include the database server name, database name, user name, and password for the specific database.
The Edit Datastore dialog box allows you to edit all connection properties except datastore name and datastore type for adapter and application datastores. For database datastores, you can edit all connection properties except datastore name, datastore type, database type, and database version.
• Properties document the object. For example, the name of the datastore and the date on which it is created are datastore properties. Properties are descriptive of the object and do not affect its operation.

The datastore properties are grouped on the following tabs:

• General: Contains the name and description of the datastore, if available. The datastore name appears on the object in the Local Object Library and in calls to the object. You cannot change the name of a datastore after creation.
• Attributes: Includes the date you created the datastore. This value cannot be changed.
• Class Attributes: Includes overall datastore information such as description and date created.


To change datastore options

1. On the Datastores tab of the Local Object Library, right-click the datastore name and select Edit from the menu.
The Edit Datastore dialog box displays the connection information.

2. Change the database server name, database name, username, and password options, as required.

3. Click OK.
The changes take effect immediately.

To change datastore properties

1. On the Datastores tab of the Local Object Library, right-click the datastore name and select Properties from the menu.
The Properties dialog box lists the datastore's description, attributes, and class attributes.

2. Change the datastore properties, as required.

3. Click OK.

Importing metadata from data sources

Data Services determines and stores a specific set of metadata information for tables. You can import metadata by name, searching, and browsing. After importing metadata, you can edit column names, descriptions, and datatypes. The edits are propagated to all objects that call these objects.

• Table name: The name of the table as it appears in the database.
• Table description: The description of the table.
• Column name: The name of the table column.
• Column description: The description of the column.
• Column datatype: The datatype for each column. If a column is defined as an unsupported datatype, Data Services converts the datatype to one that is supported. In some cases, if Data Services cannot convert the datatype, it ignores the column entirely. The following datatypes are supported: BLOB, CLOB, date, datetime, decimal, double, int, interval, long, numeric, real, time, timestamp, and varchar.
• Primary key column: The column that comprises the primary key for the table. After a table has been added to a data flow diagram, this column is indicated in the column list by a key icon next to the column name.
• Table attribute: Information Data Services records about the table, such as the date created and date modified, if these values are available.
• Owner name: Name of the table owner.

You can also import stored procedures from DB2, MS SQL Server, Oracle, and Sybase databases, and stored functions and packages from Oracle. You can use these functions and procedures in the extraction specifications you give Data Services.

Information that is imported for functions includes:
• Function parameters
• Return type
• Name
• Owner

Imported functions and procedures appear in the Function branch of each datastore tree on the Datastores tab of the Local Object Library.

You can configure imported functions and procedures through the Function Wizard and the Smart Editor in a category identified by the datastore name.

Importing metadata by browsing

The easiest way to import metadata is by browsing. Note that functions cannot be imported using this method.

For more information on importing by searching and importing by name, see "Ways of importing metadata", Chapter 5 in the Data Services Designer Guide.

To import metadata by browsing

1. On the Datastores tab of the Local Object Library, right-click the datastore and select Open from the menu.
The items available to import appear in the workspace.

2. Navigate to and select the tables for which you want to import metadata.


You can hold down the Ctrl or Shift keys and click to select multiple tables.

3. Right-click the selected items and select Import from the menu.
The workspace contains columns that indicate whether the table has already been imported into Data Services (Imported) and if the table schema has changed since it was imported (Changed). To verify whether the repository contains the most recent metadata for an object, right-click the object and select Reconcile.

4. In the Local Object Library, expand the datastore to display the list of imported objects, organized into Functions, Tables, and Template Tables.

5. To view data for an imported datastore, right-click a table and select View Data from the menu.

Activity: Creating source and target datastores

You have been hired as a Data Services designer for Alpha Acquisitions. Alpha has recently acquired Beta Businesses, an organization that develops and sells software products and related services.

In an effort to consolidate and organize the data, and simplify the reporting process for the growing company, the Omega data warehouse is being constructed to merge the data for both organizations, and a separate data mart is being developed for reporting on Human Resources data. You also have access to a database for staging purposes called Delta. To start the development process, you must create datastores and import the metadata for all of these data sources.

Objective

• Create datastores and import metadata for the Alpha Acquisitions, Beta Businesses, Delta, HR Data Mart, and Omega databases.

Instructions

1. In your Local Object Library, create a new source datastore for the Alpha Acquisitions data with the following options:

Datastore name: Alpha
Datastore type: Database
Database type: Microsoft SQL Server
Database version: Microsoft SQL Server 2005
Database server name: To be provided by instructor
Database name: ALPHA
User name: sourceuser
Password: sourcepass

2. Import the metadata for the following source tables:
• source.category
• source.city
• source.country
• source.customer
• source.department
• source.employee
• source.hr_comp_update
• source.order_details
• source.orders
• source.product
• source.region

3. View the data for the category table and confirm that there are four records.

4. Create a second source datastore for the Beta Businesses data with the following options:

Datastore name: Beta
Datastore type: Database
Database type: Microsoft SQL Server
Database version: Microsoft SQL Server 2005
Database server name: To be provided by instructor
Database name: BETA
User name: sourceuser
Password: sourcepass

5. Import the metadata for the following source tables:
• source.addrcodes
• source.categories
• source.city
• source.country
• source.customers
• source.employees
• source.orderdetails
• source.orders
• source.products
• source.region
• source.shippers
• source.suppliers
• source.usa_customers

6. View the data for the usa_customers table and confirm that Jane Hartley from Planview Inc. is the first customer record.

7. Create a datastore for the Delta staging database with the following options:

Datastore name: Delta
Datastore type: Database
Database type: Microsoft SQL Server
Database version: Microsoft SQL Server 2005
Database server name: To be provided by instructor
Database name: DELTA
User name: To be provided by instructor
Password: To be provided by instructor

You do not need to import any metadata.

8. Create a target datastore for the HR data mart with the following options:

Datastore name: HR_datamart
Datastore type: Database
Database type: Microsoft SQL Server
Database version: Microsoft SQL Server 2005
Database server name: To be provided by instructor
Database name: HR_DATAMART
User name: To be provided by instructor
Password: To be provided by instructor

9. Import the metadata for the following target tables:
• dbo.emp_dept
• dbo.employee
• dbo.hr_comp_update
• dbo.recovery_status

10. Create a target datastore for the Omega data warehouse with the following options:

Datastore name: Omega
Datastore type: Database
Database type: Microsoft SQL Server
Database version: Microsoft SQL Server 2005
Database server name: To be provided by instructor
Database name: OMEGA
User name: To be provided by instructor
Password: To be provided by instructor

11. Import the metadata for the following target tables:
• dbo.emp_dim
• dbo.product_dim
• dbo.product_target
• dbo.time_dim


Using datastore and system configurations

Introduction
Data Services supports multiple datastore configurations, which allow you to change your datastores depending on the environment in which you are working.

After completing this unit, you will be able to:

• Create multiple configurations in a datastore
• Create a system configuration

Creating multiple configurations in a datastore

A configuration is a property of a datastore that refers to a set of configurable options (such as database connection name, database type, user name, password, and locale) and their values. When you create a datastore, you can specify one datastore configuration at a time and specify one as the default. Data Services uses the default configuration to import metadata and execute jobs. You can create additional datastore configurations using the Advanced option in the datastore editor. You can combine multiple configurations into a system configuration that is selectable when executing or scheduling a job. Multiple configurations and system configurations make portability of your job much easier (for example, different connections for development, test, and production environments).

When you add a new configuration, Data Services modifies the language of data flows that contain table targets and SQL transforms in the datastore based on what you defined in the new configuration.

To create multiple datastore configurations in an existing datastore

1. On the Datastores tab of the Local Object Library, right-click a datastore and select Edit from the menu.
The Edit Datastore dialog box displays.

2. Click Advanced >>.


A grid of additional datastore properties and the multiple configuration controls displays.

3. Click Edit next to the Configurations count at the bottom of the dialog box.


The Configurations for Datastore dialog box displays. The default configuration displays. Each subsequent configuration displays as an additional column.

4. Double-click the header for the default configuration to change the name, and then click outside of the header to commit the change.

5. Click Create New Configuration in the toolbar.

The Create New Configuration dialog box displays.

6. In the Name field, enter the name for your new configuration.
Do not include spaces when assigning names for your datastore configurations.

7. Select the database type and version.

8. Click OK.


A second configuration is added to the Configurations for Datastore window.

9. Adjust the other properties of the new configuration to correspond with the existing configuration, as required.
If a property does not apply to a configuration, the cell does not accept input. Cells that correspond to a group header also do not accept input, and are marked with hatched gray lines.

10. If required, click Create New Alias to create an alias for the configuration, enter a value for the alias at the bottom of the page, and click OK to return to the Edit Datastore dialog box.

11. Click OK to complete the datastore configuration.

12. Click OK to close the Edit Datastore dialog box.

Activity: Modifying the datastore connection for internal jobs

The CD_DS_d0cafae2 datastore supports two internal jobs. The first calculates usage dependencies on repository tables and the second updates server group configurations. If you change your repository password, user name, or other connection information, set the DisplayDIInternalJobs option to TRUE, close and reopen the Designer, then update the CD_DS_d0cafae2 datastore configuration to match your new repository configuration. This enables the calculate usage dependency job (CD_JOBd0cafae2) and the server group job (di_job_al_mach_info) to run without a connection error. In some training environments the repository database may have been mirrored or moved, resulting in obsolete connection information.


Objective

• Update the repository connection information for a hidden datastore.

Instructions

1. Start Designer.

2. From the Tools menu, select Options.

3. On the Options page, select Job Server.

4. On the Job Server page, select General.
Make sure the default Job Server is set in Tools/Options/Designer/Environment.

5. Add the following attribute values to unhide the hidden objects in the repository:

Section: string
Key: DisplayDIInternalJobs
Value: True

6. Click OK to close the Options window.

7. Restart Designer to enforce the new options.
A new datastore named CD_DS_d0cafae2 displays.

8. Edit the datastore with the correct connection parameters for your repository, including Database server name, Database name, User name, and Password.

9. To hide the datastore again, repeat Steps 1-7, but enter a Value of False.

Creating a system configuration

System configurations define a set of datastore configurations that you want to use together when running a job. In many organizations, a Data Services designer defines the required datastore and system configurations, and a system administrator determines which system configuration to use when scheduling or starting a job in the Administrator.

When designing jobs, determine and create datastore configurations and system configurations depending on your business environment and rules. Create datastore configurations for the datastores in your repository before you create the system configurations for them.

Data Services maintains system configurations separately. You cannot check in or check out system configurations. However, you can export system configurations to a separate flat file which you can later import. By maintaining system configurations in a separate file, you avoid modifying your datastore each time you import or export a job, or each time you check in and check out the datastore.

You cannot define a system configuration if your repository does not contain at least one datastore with multiple configurations.

To create a system configuration

1. From the Tools menu, select System Configurations.

The System Configuration Editor dialog box displays columns for each datastore.

2. In the Configuration name column, enter the system configuration name.
Use the SC_ prefix in the system configuration name so that you can easily identify this file as a system configuration, particularly when exporting.

3. In the drop-down list for each datastore column, select the appropriate datastore configuration that you want to use when you run a job using this system configuration.

4. Click OK.


Defining file formats for flat files

Introduction
File formats are connections to flat files in the same way that datastores are connections to databases.

After completing this unit, you will be able to:

• Explain file formats
• Create a file format for a flat file

Explaining file formats

A file format is a generic description that can be used to describe one file or multiple data files if they share the same format. It is a set of properties describing the structure of a flat file (ASCII). File formats are used to connect to source or target data when the data is stored in a flat file. The Local Object Library stores file format templates that you use to define specific file formats as sources and targets in data flows.

File format objects can describe files in:
• Delimited format — delimiter characters such as commas or tabs separate each field.
• Fixed width format — the fixed column width is specified by the user.
• SAP R/3 format — this is used with the predefined Transport_Format or with a custom SAP R/3 format.
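To illustrate the difference between the first two formats, the same record could be stored either way. The values below are invented for illustration only:

Delimited (semicolon-separated):  1001;Jane Hartley;Planview Inc.
Fixed width (6, 15, and 20 characters):  1001  Jane Hartley   Planview Inc.

In the delimited version, the delimiter characters mark the field boundaries; in the fixed-width version, each field always occupies the same number of character positions.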

Creating file formats

Use the file format editor to set properties for file format templates and source and target file formats. The file format editor has three work areas:
• Property Value: Edit file format property values. Expand and collapse the property groups by clicking the leading plus or minus.
• Column Attributes: Edit and define columns or fields in the file. Field-specific formats override the default format set in the Property Values area.
• Data Preview: View how the settings affect sample data.

The properties and appearance of the work areas vary with the format of the file.

Date formats

In the Property Values work area, you can override default date formats for files at the field level. The following date format codes can be used:

• DD: 2-digit day of the month
• MM: 2-digit month
• MONTH: Full name of the month
• MON: 3-character name of the month
• YY: 2-digit year
• YYYY: 4-digit year
• HH24: 2-digit hour of the day (0-23)
• MI: 2-digit minute (0-59)
• SS: 2-digit second (0-59)
• FF: Up to 9-digit sub-seconds
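For example, a date that appears in a flat file as 21-Dec-2006 (the form used in the orders file later in this lesson) could be described with the format:

DD-MON-YYYY

Here DD matches the day 21, MON matches the three-character month name Dec, and YYYY matches the four-digit year 2006.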

To create a new file format

1. On the Formats tab of the Local Object Library, right-click Flat Files and select New from the menu to open the File Format Editor.
To make sure your file format definition works properly, it is important to finish inputting the values for the file properties before moving on to the Column Attributes work area.


2. In the Type field, specify the file type:
• Delimited: select this file type if the file uses a character sequence to separate columns.
• Fixed width: select this file type if the file uses specified widths for each column.
If a fixed-width file format uses a multi-byte code page, then no data is displayed in the Data Preview section of the file format editor for its files.

3. In the Name field, enter a name that describes this file format template.
Once the name has been created, it cannot be changed. If an error is made, the file format must be deleted and a new format created.

4. Specify the location information of the data file, including Location, Root directory, and File name.
The Group File Read can read multiple flat files with identical formats through a single file format. By substituting a wild card character or list of file names for the single file name, multiple files can be read.

5. Click Yes to overwrite the existing schema.
This happens automatically when you open a file.

6. Complete the other properties to describe files that this template represents. Overwrite the existing schema as required.


7. For source files, specify the structure of each column in the Column Attributes work area as follows:

• Field Name: Enter the name of the column.
• Data Type: Select the appropriate datatype from the drop-down list.
• Field Size: For columns with a datatype of varchar, specify the length of the field.
• Precision: For columns with a datatype of decimal or numeric, specify the precision of the field.
• Scale: For columns with a datatype of decimal or numeric, specify the scale of the field.
• Format: For columns with any datatype but varchar, select a format for the field, if desired. This information overrides the default format set in the Property Values work area for that datatype.

You do not need to specify columns for files used as targets. If you do specify columns and they do not match the output schema from the preceding transform, Data Services writes to the target file using the transform's output schema.

For a decimal or real datatype, if you only specify a source column format and the column names and datatypes in the target schema do not match those in the source schema, Data Services cannot use the source column format specified. Instead, it defaults to the format used by the code page on the computer where the Job Server is installed.

8. Click Save & Close to save the file format and close the file format editor.

9. In the Local Object Library, right-click the file format and select View Data from the menu to see the data.

To create a file format from an existing file format

1. On the Formats tab of the Local Object Library, right-click an existing file format and select Replicate.
The File Format Editor opens, displaying the schema of the copied file format.

2. In the Name field, enter a unique name for the replicated file format.


Data Services does not allow you to save the replicated file with the same name as the original (or any other existing File Format object). After it is saved, you cannot modify the name again.

3. Edit the other properties as desired.

4. Click Save & Close to save the file format and close the file format editor.

To read multiple flat files with identical formats through a single file format

1. On the Formats tab of the Local Object Library, right-click an existing file format and select Edit from the menu.
The format must be based on one single file that shares the same schema as the other files.

2. In the location field of the format wizard, enter one of the following:
• Root directory (optional to avoid retyping)
• List of file names, separated by commas
• File name containing a wild character (*)

When you use the (*) to call the names of several files, Data Services reads one file, closes it, and then proceeds to read the next one. For example, if you specify the file name revenue*.txt, Data Services reads all flat files starting with revenue in the file name.

Handling errors in file formats

One of the features available in the File Format Editor is error handling. When you enable error handling for a file format, Data Services:
• Checks for the two types of flat-file source errors:

○ Datatype conversion errors. For example, a field might be defined in the File Format Editor as having a datatype of integer but the data encountered is actually varchar.

○ Row-format errors. For example, in the case of a fixed-width file, Data Services identifies a row that does not match the expected width value.

• Stops processing the source file after reaching a specified number of invalid rows.
• Logs errors to the Data Services error log. You can limit the number of log entries allowed without stopping the job.

You can choose to write rows with errors to an error file, which is a semicolon-delimited text file that you create on the same machine as the Job Server.

Entries in an error file have this syntax:

source file path and name; row number in source file; Data Services error; column number where the error occurred; all columns from the invalid row
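As an illustration only, this Python sketch splits one such semicolon-delimited entry into its parts; the sample entry and error text are hypothetical, not captured Data Services output:

# Hypothetical error-file entry following the syntax described above.
entry = "C:/flat_files/orders_12_21_06.txt;8;data conversion error;3;11196;2Lis5;21-dec-2006"

parts = entry.split(";")
source_file   = parts[0]   # source file path and name
row_number    = parts[1]   # row number in the source file
error_text    = parts[2]   # Data Services error
column_number = parts[3]   # column number where the error occurred
invalid_row   = parts[4:]  # all columns from the invalid row

print(source_file, row_number, error_text, column_number, invalid_row)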


To enable flat file error handling in the File Format Editor

1. On the Formats tab of the Local Object Library, right-click the file format and select Edit from the menu.

2. Under the Error handling section, in the Capture data conversion errors drop-down list, select Yes.

3. In the Capture row format errors drop-down list, select Yes.

4. In the Write error rows to file drop-down list, select Yes. You can also specify the maximum warnings to log and the maximum errors before a job is stopped.

5. In the Error file root directory field, click the folder icon to browse to the directory in which you have stored the error handling text file you created.

6. In the Error file name field, enter the name for the text file you created to capture the flat file error logs in that directory.

7. Click Save & Close.

Activity: Creating a file format for a flat file

In addition to the main databases for source information, records for some of the orders for Alpha Acquisitions are stored in flat files.

Objective

• Create a file format for the orders flat files so you can use them as source objects.

Instructions

1. In the Local Object Library, create a new delimited file format called Orders_Format for the orders_12_21_06.txt flat file in the Activity_Source folder. The path depends on where the folder has been copied from the Learner Resources.

2. Adjust the format so that it reflects the source file.

Consider the following:
• The column delimiter is a semicolon (;).
• The row delimiter is {Windows new line}.


• The date format is dd-mon-yyyy.

Note: Write over an existing date format to create a new date format.

• The row header should be skipped.

3. In the Column Attributes pane, adjust the datatypes for the columns based on their content.

• ORDERID: int
• EMPLOYEEID: varchar(15)
• ORDERDATE: date
• CUSTOMERID: int
• COMPANYNAME: varchar(50)
• CITY: varchar(50)
• COUNTRY: varchar(50)

4. Save your changes and view the data to confirm that order 11196 was placed on December 21, 2006.


Defining file formats for Excel files

Introduction
You can create file formats for Excel files in the same way that you would for flat files.

After completing this unit, you will be able to:

• Create a file format for an Excel file

Using Excel as a native data source

It is possible to connect to Excel workbooks natively as a source, with no ODBC connection setup and configuration needed. You can select specific data in the workbook using custom ranges or auto-detect, and you can specify variables for file and sheet names for more flexibility.

As with file formats and datastores, these Excel formats show up as sources in impact and lineage analysis reports.

To import and configure an Excel source

1. On the Formats tab of the Local Object Library, right-click Excel Workbooks and select New from the menu.


The Import Excel Workbook dialog box displays.

2. In the Format name field, enter a name for the format. The name may contain underscores but not spaces.

3. On the Format tab, click the drop-down button beside the Directory field and select <Select folder...>.

4. Navigate to and select a new directory, and then click OK.

5. Click the drop-down button beside the File name field and select <Select file...>.

6. Navigate to and select an Excel file, and then click Open.

7. Do one of the following:
• To reference a named range for the Excel file, select the Named range radio button and enter a value in the field provided.


• To reference an entire worksheet, select the Worksheet radio button and then select the All fields radio button.

• To reference a custom range, select the Worksheet radio button and the Custom range radio button, click the ellipses (...) button, select the cells, and close the Excel file by clicking X in the top right corner of the worksheet.

8. If required, select the Extend range checkbox. The Extend range checkbox provides a means to extend the spreadsheet in the event that additional rows of data are added at a later time. If this checkbox is checked, at execution time, Data Services searches row by row until a null value row is reached. All rows above the null value row are included. (A minimal sketch of this stopping rule appears after this procedure.)

9. If applicable, select the Use first row values as column names option. If this option is selected, field names are based on the first row of the imported Excel sheet.

10. Click Import schema. The schema is displayed at the top of the dialog box.

11. Specify the structure of each column as follows:

• Field Name: Enter the name of the column.
• Data Type: Select the appropriate datatype from the drop-down list.
• Field Size: For columns with a datatype of varchar, specify the length of the field.
• Precision: For columns with a datatype of decimal or numeric, specify the precision of the field.
• Scale: For columns with a datatype of decimal or numeric, specify the scale of the field.
• Description: If desired, enter a description of the column.

12. If required, on the Data Access tab, make any changes that are needed. The Data Access tab provides options to retrieve the file via FTP or execute a custom application (such as unzipping a file) before reading the file.

13. Click OK. The newly imported file format appears in the Local Object Library with the other Excel workbooks. The sheet is now available to be selected for use as a native data source.
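The Extend range behaviour described in step 8 can be pictured with this minimal Python sketch; the worksheet rows are hypothetical and the logic is an illustration of the stopping rule, not the Data Services implementation:

# Hypothetical worksheet rows; None marks an empty cell.
rows = [
    ["EmployeeID", "Emp_Salary"],
    ["2Lis5", 50000],
    ["7Xyz2", 61000],
    [None, None],        # first all-null row: the search stops here
    ["ignored", 99999],  # rows below the null row are not included
]

def extend_range(rows):
    """Collect rows from the top down until the first all-null row."""
    included = []
    for row in rows:
        if all(cell is None for cell in row):
            break
        included.append(row)
    return included

print(extend_range(rows))  # the header row plus the two data rows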


Activity: Creating a file format for an Excel file

Compensation information for Alpha Acquisitions is stored in an Excel spreadsheet. To use this information in data flows, you must create a file format.

Objective

• Create a file format to enable you to use the compensation spreadsheet as a source object

Instructions

1. In the Local Object Library, create a new file format for an Excel Workbook called Comp_HR.

2. Navigate to the Comp_HR.xls file in the Activity_Source folder located in the Activities folder. The path depends on where the folder has been copied from the Learner Resources.

3. Select the Worksheet radio button.

4. From the Worksheet drop-down list, select the Comp_HR worksheet.

5. Click the ellipses (...) button.

6. Select all the cells that contain data, including the first row.

7. Close the spreadsheet.

8. Specify that you want to be able to extend the range.

9. Use the first row for the column names.

10. Import the schema and adjust the datatypes for the columns as follows:

• EmployeeID: varchar(10)
• Emp_Salary: int
• Emp_Bonus: int
• Emp_VacationDays: int
• date_updated: datetime

11. Save your changes and view the data to confirm that employee 2Lis5 has 16 vacation days accrued.
When creating a file format for an Excel file, the Excel file needs to be local to the Designer machine. When the job executes, the Excel file must be present on the Job Server machine. If the directory paths are the same on both machines, there is no issue. If they differ, the path can be modified after the format is defined, either in the data flow where the format


is used or in the Local Object Library. After defining the format, the path to the Excel file can be edited on the Format tab.


Defining file formats for XML files

Introduction
Data Services allows you to import and export metadata for XML documents that you can use as sources or targets in jobs.

After completing this unit, you will be able to:

• Import data from XML documents
• Explain nested data

Importing data from XML documents

XML documents are hierarchical, and the set of properties describing their structure is stored in separate format files. These format files describe the data contained in the XML document and the relationships among the data elements, that is, the schema. The format of an XML file or message (.xml) can be specified using either a document type definition (.dtd) or XML Schema (.xsd).

Data flows can read and write data to messages or files based on a specified DTD format or XML Schema. You can use the same DTD format or XML Schema to describe multiple XML sources or targets.

Data Services uses Nested Relational Data Modeling (NRDM) to structure imported metadata from format documents, such as .xsd or .dtd files, into an internal schema to use for hierarchical documents.

Importing metadata from a DTD file

For example, for an XML document that contains information to place a sales order, such as order header, customer, and line items, the corresponding DTD includes the order structure and the relationships between the data elements.


You can import metadata from either an existing XML file (with a reference to a DTD) or a DTD file. If you import the metadata from an XML file, Data Services automatically retrieves the DTD for that XML file.

When importing a DTD format, Data Services reads the defined elements and attributes, and ignores other parts, such as text and comments, from the file definition. This allows you to modify imported XML data and edit the datatype as needed.

To import a DTD format

1. On the Formats tab of the Local Object Library, right-click DTDs, and select New.

The Import DTD Format dialog box appears.


2. In the DTD definition name field, enter the name you want to give the imported DTD format.

3. Beside the File name field, click Browse, locate the file path that specifies the DTD you want to import, and open the DTD.

4. In the File type area, select a file type. The default file type is DTD. Use the XML option if the DTD file is embedded within the XML data.

5. In the Root element name field, select the name of the primary node of the XML that the DTD format is defining. Data Services only imports elements of the format that belong to this node or any sub-nodes. This option is not available when you select the XML file option type.

6. In the Circular level field, specify the number of levels in the DTD, if applicable. If the DTD format contains recursive elements, for example, element A contains B and element B contains A, this value must match the number of recursive levels in the DTD format's content. Otherwise, the job that uses this DTD format will fail.

7. In the Default varchar size field, set the varchar size to import strings into Data Services. The default varchar size is 1024.

8. Click OK. After you import the DTD format, you can view the DTD format's column properties, and edit the nested table and column attributes in the DTD - XML Format editor. For more information on DTD attributes, see Chapter 2 in the Data Services Reference Guide.

To edit column attributes of nested schemas

1. On the Formats tab of the Local Object Library, expand DTDs and double-click the DTD name to open it in the workspace.

2. In the workspace, right-click a nested column or column and select Properties.

3. In the Column Properties window, click the Attributes tab.

4. To change an attribute, click the attribute name and enter the appropriate value in the Value field.


5. Click OK.

Importing metadata from an XML schema

For an XML document that contains, for example, information to place a sales order, such as order header, customer, and line items, the corresponding XML schema includes the order structure and the relationships between the data elements.


When importing an XML Schema, Data Services reads the defined elements and attributes, and imports:
• Document structure
• Table and column names
• Datatype of each column
• Nested table and column attributes

Note: While XML Schemas make a distinction between elements and attributes, Data Services imports and converts them all to nested table and column attributes. For more information on Data Services attributes, see Chapter 2 in the Data Services Reference Guide.

To import an XML schema

1. On the Formats tab of the Local Object Library, right-click XML Schemas, and select New.


The Import XML Schema Format editor appears.

2. In the Format name field, enter the name you want to give the format.

3. In the File name/URL field, enter the file name or URL address of the source file, or click Browse, locate the file path that specifies the XML Schema you want to import, and open the file.

4. In the Root element name drop-down list, select the name of the primary node you want to import. Data Services only imports elements of the XML Schema that belong to this node or any subnodes. If the root element name is not unique within the XML Schema, select a namespace to identify the imported XML Schema.

5. In the Circular level field, specify the number of levels the XML Schema has, if applicable. If the XML Schema contains recursive elements, for example, element A contains B and element B contains A, this value must match the number of recursive levels in the XML Schema's content. Otherwise, the job that uses this XML Schema will fail.

6. In the Default varchar size field, set the varchar size to import strings into Data Services. The default varchar size is 1024.

7. Click OK. After you import an XML Schema, you can view the XML schema's column properties, and edit the nested table and column attributes in the workspace.

Explaining nested data

Sales orders are often presented using nested data. For example, the line items in a sales order are related to a single header and are represented using a nested schema. Each row of the sales order data set contains a nested line item schema.


Using the nested data method can be more concise (no repeated information), and can scale to present a deeper level of hierarchical complexity.

To expand on the example above, columns inside a nested schema can also contain columns.

There is a unique instance of each nested schema for each row at each level of the relationship.

Generalizing further with nested data, each row at each level can have any number of columns containing nested schemas.
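For illustration, a sales order with nested line items can be pictured as the following Python structure; the field names are hypothetical rather than a schema imported by Data Services:

# One entry per order; each order row carries its own nested line-item schema.
orders = [
    {
        "OrderNo": 9999,
        "CustName": "ABC Co",
        "LineItems": [
            {"Item": "001", "Qty": 2, "ItemPrice": 10},
            {"Item": "002", "Qty": 1, "ItemPrice": 4},
        ],
    },
    {
        "OrderNo": 10000,
        "CustName": "XYZ Inc",
        "LineItems": [
            {"Item": "003", "Qty": 5, "ItemPrice": 7},
        ],
    },
]
# The header values are stored once per order, not repeated on every line item,
# and each level could itself contain further nested schemas.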


Data Services maps nested data to a separate schema implicitly related to a single row and column of the parent schema. This mechanism is called Nested Relational Data Modeling (NRDM). NRDM provides a way to view and manipulate hierarchical relationships within data flow sources, targets, and transforms.

In Data Services, you can see the structure of nested data in the input and output schemas of sources, targets, and transforms in data flows.

Unnesting data

Loading a data set that contains nested schemas into a relational target requires that the nested rows be unnested.

For example, a sales order may use a nested schema to define the relationship between the order header and the order line items. To load the data into relational schemas, the multi-level data must be unnested.

Unnesting a schema produces a cross-product of the top-level schema (parent) and the nested schema (child).

You can also load different columns from different nesting levels into different schemas. For example, a sales order can be flattened so that the order number is maintained separately with each line-item and the header and line-item information are loaded into separate schemas.

Data Services allows you to unnest any number of nested schemas at any depth. No matter how many levels are involved, the result of unnesting schemas is a cross product of the parent and child schemas.
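A minimal Python sketch of that cross product, using the same hypothetical order structure as above (an illustration only, not generated Data Services output):

orders = [
    {"OrderNo": 9999, "CustName": "ABC Co",
     "LineItems": [{"Item": "001", "Qty": 2}, {"Item": "002", "Qty": 1}]},
    {"OrderNo": 10000, "CustName": "XYZ Inc",
     "LineItems": [{"Item": "003", "Qty": 5}]},
]

def unnest(orders):
    """Flatten each order's nested line items into parent-child rows."""
    flat_rows = []
    for order in orders:
        for item in order["LineItems"]:
            # The parent columns repeat once for every nested (child) row.
            flat_rows.append({
                "OrderNo": order["OrderNo"],
                "CustName": order["CustName"],
                "Item": item["Item"],
                "Qty": item["Qty"],
            })
    return flat_rows

for row in unnest(orders):
    print(row)  # one flattened row per parent/child combination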


When more than one level of unnesting occurs, the inner-most child is unnested first; the result (the cross product of the parent and the inner-most child) is then unnested from its parent, and so on up to the top-level schema.

Keep in mind that unnesting all schemas to create a cross product of all data might not produce the results you intend. For example, if an order includes multiple customer values such as ship-to and bill-to addresses, flattening a sales order by unnesting customer and line-item schemas produces rows of data that might not be useful for processing the order.


Quiz: Defining source and target metadata
1. What is the difference between a datastore and a database?

2. What are the two methods in which metadata can be manipulated in Data Services objects? What does each of these do?

3. Which of the following is NOT a datastore type?

a. Database

b. Application

c. Adapter

d. File Format

4. What is the difference between a repository and a datastore?


Lesson summary
After completing this lesson, you are now able to:

• Use datastores
• Use datastore and system configurations
• Define file formats for flat files
• Define file formats for Excel files
• Define file formats for XML files


Lesson 3
Creating Batch Jobs

Lesson introduction
Once metadata has been imported for your datastores, you can create data flows to define data movement requirements.

After completing this lesson, you will be able to:

• Work with objects
• Create a data flow
• Use the Query transform
• Use target tables
• Execute the job


Working with objects

Introduction
Data flows define how information is moved from source to target. These data flows are organized into executable jobs, which are grouped into projects.

After completing this unit, you will be able to:

• Create a project
• Create a job
• Add, connect, and delete objects in the workspace
• Create a work flow

Creating a project

A project is a single-use object that allows you to group jobs. It is the highest level of organization offered by Data Services. Opening a project makes one group of objects easily accessible in the user interface. Only one project can be open at a time.

A project is used solely for organizational purposes. For example, you can use a project to group jobs that have schedules that depend on one another or that you want to monitor together.

The objects in a project appear hierarchically in the project area in Designer. If a plus sign (+) appears next to an object, you can expand it to view the lower-level objects.

The objects in the project area also display in the workspace, where you can drill down into additional levels.

To create a new project

1. From the Project menu, select New ➤ Project.


You can also right-click the white space on the Projects tab of the Local Object Library and select New from the menu.

The Project - New dialog box displays.

2. Enter a unique name in the Project name field. The name can include alphanumeric characters and underscores (_). It cannot contain blank spaces.

3. Click Create. The new project appears in the project area. As you add jobs and other lower-level objects to the project, they also appear in the project area.

To open an existing project

1. From the Project menu, select Open.

The Project - Open dialog box displays.

2. Select the name of an existing project from the list.

3. Click Open. If another project is already open, Data Services closes that project and opens the new one in the project area.

To save a project

1. From the Project menu, select Save All.


The Save all changes dialog box lists the jobs, work flows, and data flows that you edited since the last save.

2. Deselect any listed object to avoid saving it.

3. Click OK. You are also prompted to save all changes made in a job when you execute the job or exit the Designer.

Creating a job

A job is the only executable object in Data Services. When you are developing your data flows, you can manually execute and test jobs directly in Data Services. In production, you can schedule batch jobs and set up real-time jobs as services that execute a process when Data Services receives a message request.

A job is made up of steps that are executed together. Each step is represented by an object icon that you place in the workspace to create a job diagram. A job diagram is made up of two or more objects connected together. You can include any of the following objects in a job definition:
• Work flows
• Scripts
• Conditionals
• While loops
• Try/catch blocks
• Data flows
○ Source objects
○ Target objects
○ Transforms

If a job becomes complex, you can organize its content into individual work flows, and then create a single job that calls those work flows.


Tip: It is recommended that you follow consistent naming conventions to facilitate object identification across all systems in your enterprise.

To create a job in the project area

1. In the project area, right-click the project name and select New Batch Job from the menu.

A new batch job is created in the project area.

2. Edit the name of the job. The name can include alphanumeric characters and underscores (_). It cannot contain blank spaces. Data Services opens a new workspace for you to define the job.

3. Click the cursor outside of the job name or press Enter to commit the changes. You can also create a job and related objects from the Local Object Library. When you create a job in the Local Object Library, you must associate the job and all related objects to a project before you can execute the job.

Adding, connecting, and deleting objects in the workspace

After creating a job, you can add objects to the job workspace area using either the Local Object Library or the tool palette.

To add objects from the Local Object Library to the workspace

1. In the Local Object Library, click the tab for the type of object you want to add.

2. Click and drag the selected object on to the workspace.

To add objects from the tool palette to the workspace

• In the tool palette, click the desired object, move the cursor to the workspace, and then click the workspace to add the object.


Creating a work flow

A work flow is an optional object that defines the decision-making process for executing other objects.

For example, elements in a work flow can determine the path of execution based on a value set by a previous job or can indicate an alternative path if something goes wrong in the primary path. Ultimately, the purpose of a work flow is to prepare for executing data flows and to set the state of the system after the data flows are complete.

Note: In essence, jobs are just work flows that can be executed. Almost all of the features documented for work flows also apply to jobs.

Work flows can contain data flows, conditionals, while loops, try/catch blocks, and scripts. They can also call other work flows, and you can nest calls to any depth. A work flow can even call itself.

To create a work flow

1. Open the job or work flow to which you want to add the work flow.

2. Select the Work Flow icon in the tool palette.

3. Click the workspace where you want to place the work flow.

4. Enter a unique name for the work flow.

5. Click the cursor outside of the work flow name or press Enter to commit the changes.

To connect objects in the workspace area

• Click and drag from the triangle or square of an object to the triangle or square of the next object in the flow to connect the objects.

To disconnect objects in the workspace area

• Select the connecting line between the objects and press Delete.

Defining the order of execution in work flows

The connections you make between the icons in the workspace determine the order in which work flows execute, unless the jobs containing those work flows execute in parallel. Steps in a work flow execute in a sequence from left to right. You must connect the objects in a work flow when there is a dependency between the steps.


To execute more complex work flows in parallel, you can define each sequence as a separate work flow, and then call each of the work flows from another work flow, as in this example:

First, you must define Work Flow A:

Next, define Work Flow B:

Finally, create Work Flow C to call Work Flows A and B:

You can specify a job to execute a particular work flow or data flow once only. If you specify that it should be executed only once, Data Services only executes the first occurrence of the work flow or data flow, and skips subsequent occurrences in the job. You might use this feature when developing complex jobs with multiple paths, such as jobs with try/catch blocks or conditionals, and you want to ensure that Data Services only executes a particular work flow or data flow one time.


Creating a data flow

Introduction
Data flows contain the source, transform, and target objects that represent the key activities in data integration and data quality processes.

After completing this unit, you will be able to:

• Create a data flow
• Explain source and target objects
• Add source and target objects to a data flow

Using data flows

Data flows determine how information is extracted from sources, transformed, and loaded into targets. The lines connecting objects in a data flow represent the flow of data through data integration and data quality processes.

Each icon you place in the data flow diagram becomes a step in the data flow. The objects that you can use as steps in a data flow are:
• Source and target objects
• Transforms

The connections you make between the icons determine the order in which Data Services completes the steps.

Using data flows as steps in work flows

Each step in a data flow, up to the target definition, produces an intermediate result. For example, the results of a SQL statement containing a WHERE clause flow to the next step in the data flow. The intermediate result consists of a set of rows from the previous operation and the schema in which the rows are arranged. This result is called a data set. This data set may, in turn, be further filtered and directed into yet another data set.

Data flows are closed operations, even when they are steps in a work flow. Any data set created within a data flow is not available to other steps in the work flow.

A work flow does not operate on data sets and cannot provide more data to a data flow; however, a work flow can:
• Call data flows to perform data movement operations.
• Define the conditions appropriate to run data flows.
• Pass parameters to and from data flows.


To create a new data flow

1. Open the job or work flow in which you want to add the data flow.

2. Select the Data Flow icon in the tool palette.

3. Click the workspace where you want to add the data flow.

4. Enter a unique name for your data flow. Data flow names can include alphanumeric characters and underscores (_). They cannot contain blank spaces.

5. Click the cursor outside of the data flow or press Enter to commit the changes.

6. Double-click the data flow to open the data flow workspace.

Changing data flow properties

You can specify the following advanced data properties for a data flow:

• Execute only once: When you specify that a data flow should only execute once, a batch job will never re-execute that data flow after the data flow completes successfully, even if the data flow is contained in a work flow that is a recovery unit that re-executes. You should not select this option if the parent work flow is a recovery unit.

• Use database links: Database links are communication paths between one database server and another. Database links allow local users to access data on a remote database, which can be on the local or a remote computer of the same or different database type. For more information see "Database link support for push-down operations across datastores" in the Data Services Performance Optimization Guide.

• Degree of parallelism: Degree of parallelism (DOP) is a property of a data flow that defines how many times each transform within a data flow replicates to process a parallel subset of data. For more information see "Degree of parallelism" in the Data Services Performance Optimization Guide.

• Cache type: You can cache data to improve performance of operations such as joins, groups, sorts, filtering, lookups, and table comparisons. Select one of the following values:


○ In Memory: Choose this value if your data flow processes a small amount of data that can fit in the available memory.
○ Pageable: Choose this value if you want to return only a subset of data at a time to limit the resources required. This is the default.

For more information, see "Tuning Caches" in the Data Services Performance Optimization Guide.

To change data flow properties

1. Right-click the data flow and select Properties from the menu.

The Properties window opens for the data flow.

2. Change the properties of the data flow as required.

3. Click OK. For more information about how Data Integrator processes data flows with multiple properties, see "Data Flow" in the Data Services Resource Guide.


Explaining source and target objects

A data flow directly reads data from source objects and loads data to target objects.

• Table: A file formatted with columns and rows as used in relational databases. (Source and target)
• Template table: A template table that has been created and saved in another data flow (used in development). (Source and target)
• File: A delimited or fixed-width flat file. (Source and target)
• Document: A file with an application-specific format (not readable by SQL or XML parser). (Source and target)
• XML file: A file formatted with XML tags. (Source and target)
• XML message: A source in real-time jobs. (Source only)
• XML template file: An XML file whose format is based on the preceding transform output (used in development, primarily for debugging data flows). (Target only)
• Transform: A pre-built set of operations that can create new data, such as the Date Generation transform. (Source only)


Adding source and target objects

Before you can add source and target objects to a data flow, you must first create the datastore and import the table metadata for any databases, or create the file format for flat files.

To add a source or target object to a data flow

1. In the workspace, open the data flow in which you want to place the object.

2. Do one of the following:
• To add a database table, in the Datastores tab of the Local Object Library, select the table.
• To add a flat file, in the Formats tab of the Local Object Library, select the file format.

3. Click and drag the object to the workspace.

A pop-up menu appears for the source or target object.

4. Select Make Source or Make Target from the menu, depending on whether the object is a source or target object.

5. Add and connect objects in the data flow as appropriate.


Using the Query transform

Introduction
The Query transform is the most commonly-used transform, and is included in most data flows. It enables you to select data from a source and filter or reformat it as it moves to the target.

After completing this unit, you will be able to:

• Describe the transform editor
• Use the Query transform

Describing the transform editor

The transform editor is a graphical interface for defining the properties of transforms. The workspace can contain these areas:
• Input schema area
• Output schema area
• Parameters area


The input schema area displays the schema of the input data set. For source objects and some transforms, this area is not available.

The output schema area displays the schema of the output data set, including any functions. For template tables, the output schema can be defined based on your preferences.

For any data that needs to move from source to target, a relationship must be defined between the input and output schemas. To create this relationship, you must map each input column to the corresponding output column.

Below the input and output schema areas is the parameters area. The options available on this tab differ based on which transform or object you are modifying. The I icon indicates tabs containing user-defined entries.

Explaining the Query transform

The Query transform is used so frequently that it is included in the tool palette with other standard objects. It retrieves a data set that satisfies conditions that you specify, similar to a SQL SELECT statement.


The Query transform can perform the following operations:
• Filter the data extracted from sources.
• Join data from multiple sources.
• Map columns from input to output schemas.
• Perform transformations and functions on the data.
• Perform data nesting and unnesting.
• Add new columns, nested schemas, and function results to the output schema.
• Assign primary keys to output columns.

For example, you could use the Query transform to select a subset of the data in a table to show only those records from a specific region.
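Conceptually, that selection is comparable to the following Python sketch, which filters a small row set and maps input columns to renamed output columns; the column and region values are hypothetical, and in Data Services you define this in the transform editor rather than in code:

source_rows = [
    {"CUSTOMERID": 1, "COMPANYNAME": "ABC Co", "REGIONID": "WEST"},
    {"CUSTOMERID": 2, "COMPANYNAME": "XYZ Inc", "REGIONID": "EAST"},
]

def query_transform(rows, region):
    """Filter rows on a condition (WHERE) and map input columns to output columns."""
    output = []
    for row in rows:
        if row["REGIONID"] == region:              # WHERE-style filter
            output.append({
                "CustomerID": row["CUSTOMERID"],   # column mappings
                "Firm": row["COMPANYNAME"],
                "Region": row["REGIONID"],
            })
    return output

print(query_transform(source_rows, "WEST"))  # only the WEST records remain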

The next section gives a brief description of the function, data input requirements, options, and data output results for the Query transform. For more information on the Query transform, see "Transforms," Chapter 5 in the Data Services Reference Guide.

Input/Output

The data input is a data set from one or more sources with rows flagged with a NORMAL operation code.

The NORMAL operation code creates a new row in the target. All the rows in a data set are flagged as NORMAL when they are extracted from a source table or file. If a row is flagged as NORMAL when loaded into a target table or file, it is inserted as a new row in the target.

The data output is a data set based on the conditions you specify and using the schema specified in the output schema area.

Note: When working with nested data from an XML file, you can use the Query transform to unnest the data using the right-click menu for the output schema, which provides options for unnesting.

Options

The input schema area displays all schemas input to the Query transform as a hierarchical tree. Each input schema can contain multiple columns.

The output schema area displays the schema output from the Query transform as a hierarchical tree. The output schema can contain multiple columns and functions.


Icons preceding columns are combinations of these graphics:

• This indicates that the column is a primary key.
• This indicates that the column has a simple mapping. A simple mapping is either a single column or an expression with no input column.
• This indicates that the column has a complex mapping, such as a transformation or a merge between two source columns.
• This indicates that the column mapping is incorrect. Data Integrator does not perform a complete validation during design, so not all incorrect mappings will necessarily be flagged.

The parameters area of the Query transform includes the following tabs:

• Mapping: Specify how the selected output column is derived.
• SELECT: Select only distinct rows (discarding any duplicate rows).
• FROM: Specify the input schemas used in the current output schema.
• OUTER JOIN: Specify an inner table and an outer table for joins that you want treated as outer joins.
• WHERE: Set conditions that determine which rows are output.
• GROUP BY: Specify a list of columns for which you want to combine output. For each unique set of values in the group by list, Data Services combines or aggregates the values in the remaining columns.
• ORDER BY: Specify the columns you want used to sort the output data set.
• Advanced: Create separate sub data flows to process any of the following resource-intensive query clauses:


○ DISTINCT
○ GROUP BY
○ JOIN
○ ORDER BY

For more information, see "Distributed Data Flow execution" in the Data Services Designer Guide.
• Find: Search for a specific word or item in the input schema or the output schema.

To map input columns to output columns

• In the transform editor, do any of the following:
○ Drag and drop a single column from the input schema area into the output schema area.
○ Drag a single input column over the corresponding output column, release the cursor, and select Remap Column from the menu.
○ Select multiple input columns (using Ctrl+click or Shift+click) and drag them onto the Query output schema for automatic mapping.
○ Select the output column and manually enter the mapping on the Mapping tab in the parameters area. You can either type the column name in the parameters area or click and drag the column from the input schema pane.
○ Select the output column, then highlight and manually delete the mapping on the Mapping tab in the parameters area.


Using target tables

Introduction
The target object for your data flow can be either a physical table or file, or a template table.

After completing this unit, you will be able to:

• Access the target table editor
• Set target table options
• Use template tables

Accessing the target table editor

The target table editor provides a single location to change settings for your target tables.

To access the target table editor

1. In a data flow, double-click the target table.

The target table editor opens in the workspace.

2. Change the values as required. Changes are automatically committed.

3. Click Back to return to the data flow.


Setting target table options

When your target object is a physical table in a database, the target table editor opens in the workspace with different tabs where you can set database type properties, table loading options, and tuning techniques for loading a job.

Note: Most of the tabs in the target table editor focus on migration or performance-tuning techniques, which are outside the scope of this course.

You can set the following table loading options in the Options tab of the target table editor:

• Rows per commit: Specifies the transaction size in number of rows.

• Column comparison: Specifies how the input columns are mapped to output columns. There are two options:


○ Compare_by_position: disregards the column names and maps source columns to target columns by position.
○ Compare_by_name: maps source columns to target columns by name.
Validation errors occur if the datatypes of the columns do not match.

• Delete data from table before loading: Sends a TRUNCATE statement to clear the contents of the table before loading during batch jobs. Defaults to not selected.

• Number of loaders: Specifies the number of loaders (to a maximum of five) and the number of rows per commit that each loader receives during parallel loading. For example, if you choose a Rows per commit of 1000 and set the number of loaders to three, the first 1000 rows are sent to the first loader. The second 1000 rows are sent to the second loader, the third 1000 rows to the third loader, and the next 1000 rows back to the first loader.

• Use overflow file: Writes rows that cannot be loaded to the overflow file for recovery purposes. Options are enabled for the file name and file format. The overflow format can include the data rejected and the operation being performed (write_data) or the SQL command used to produce the rejected operation (write_sql).

• Ignore columns with value: Specifies a value that might appear in a source column that you do not want updated in the target table. When this value appears in the source column, the corresponding target column is not updated during auto correct loading. You can enter spaces.

• Ignore columns with null: Ensures that NULL source columns are not updated in the target table during auto correct loading.


• Use input keys: Enables Data Integrator to use the primary keys from the source table. By default, Data Integrator uses the primary key of the target table.

• Update key columns: Updates key column values when it loads data to the target.

• Auto correct load: Ensures that the same row is not duplicated in a target table. This is particularly useful for data recovery operations. When Auto correct load is selected, Data Integrator reads a row from the source and checks if a row exists in the target table with the same values in the primary key. If a matching row does not exist, it inserts the new row regardless of other options. If a matching row exists, it updates the row depending on the values of Ignore columns with value and Ignore columns with null. (A minimal sketch of this insert-or-update logic appears after this table.)

• Include in transaction: Indicates that this target is included in the transaction processed by a batch or real-time job. This option allows you to commit data to multiple tables as part of the same transaction. If loading fails for any one of the tables, no data is committed to any of the tables.
Transactional loading can require rows to be buffered to ensure the correct load order. If the data being buffered is larger than the virtual memory available, Data Integrator reports a memory error.
The tables must be from the same datastore.
If you choose to enable transactional loading, these options are not available: Rows per commit, Use overflow file and overflow file specification, Number of loaders, Enable partitioning, and Delete data from table before loading.
Data Integrator also does not parameterize SQL or push operations to the database if transactional loading is enabled.

• Transaction order: Indicates where this table falls in the loading order of the tables being loaded. By default, there is no ordering.


All loaders have a transaction order of zero. If you specify orders among the tables, the loading operations are applied according to the order. Tables with the same transaction order are loaded together. Tables with a transaction order of zero are loaded at the discretion of the data flow process.

See the Data Services Performance Optimization Guide and "Description of objects" in the Data Services Reference Guide for more information.
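As a rough illustration of the Auto correct load decision described in the table above, the following Python sketch applies the same insert-or-update rule to an in-memory target keyed on the primary key; the data is hypothetical, and the real option also honours Ignore columns with value and Ignore columns with null:

# Target table represented as {primary_key: row}.
target = {101: {"CustomerID": 101, "City": "Boston"}}

source_rows = [
    {"CustomerID": 101, "City": "Cambridge"},  # key already exists: update
    {"CustomerID": 102, "City": "Denver"},     # key missing: insert
]

def auto_correct_load(target, source_rows, key="CustomerID"):
    """Insert new rows and update existing rows based on the primary key."""
    for row in source_rows:
        pk = row[key]
        if pk in target:
            target[pk].update(row)  # matching row exists: update it
        else:
            target[pk] = dict(row)  # no matching row: insert the new row
    return target

print(auto_correct_load(target, source_rows))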

Using template tables

During the initial design of an application, you might find it convenient to use template tables to represent database tables. Template tables are particularly useful in early application development when you are designing and testing a project.

With template tables, you do not have to initially create a new table in your RDBMS and import the metadata into Data Services. Instead, Data Services automatically creates the table in the database with the schema defined by the data flow when you execute a job.

After creating a template table as a target in one data flow, you can use it as a source in other data flows. Although a template table can be used as a source table in multiple data flows, it can be used only as a target in one data flow.

You can modify the schema of the template table in the data flow where the table is used as a target. Any changes are automatically applied to any other instances of the template table.

After a template table is created in the database, you can convert the template table in the repository to a regular table. You must convert template tables so that you can use the new table in expressions, functions, and transform options. After a template table is converted, you can no longer alter the schema.

To create a template table

1. Open a data flow in the workspace.

2. In the tool palette, click the Template Table icon and click the workspace to add a new template table to the data flow.


The Create Template dialog box displays.

3. In the Table name field, enter the name for the template table.

4. In the In datastore drop-down list, select the datastore for the template table.

5. Click OK. You can also create a new template table on the Datastores tab of the Local Object Library by expanding a datastore and right-clicking Templates.

To convert a template table into a regular table from the Local Object Library

1. On the Datastores tab of the Local Object Library, expand the branch for the datastore to view the template table.

2. Right-click a template table you want to convert and select Import Table from the menu.


Data Services converts the template table in the repository into a regular table by importing it from the database.

3. To update the icon in all data flows, from the View menu, select Refresh. On the Datastores tab of the Local Object Library, the table is listed under Tables rather than Template Tables.

To convert a template table into a regular table from a data flow

1. Open the data flow containing the template table.

2. Right-click the template table you want to convert and select Import Table from the menu.


Executing the job

Introduction
Once you have created a data flow, you can execute the job in Data Services to see how the data moves from source to target.

After completing this unit, you will be able to:

• Understand job execution
• Execute the job

Explaining job execution

After you create your project, jobs, and associated data flows, you can then execute the job. You can run jobs two ways:
• Immediate jobs

Data Services initiates both batch and real-time jobs and runs them immediately from within the Designer. For these jobs, both the Designer and designated Job Server (where the job executes, usually on the same machine) must be running. You will likely run immediate jobs only during the development cycle.

• Scheduled jobs

Batch jobs are scheduled. To schedule a job, use the Data Services Management Console or use a third-party scheduler. The Job Server must be running.

If a job has syntax errors, it does not execute.

Setting execution properties

When you execute a job, the following options are available in the Execution Properties window:

• Print all trace messages: Records all trace messages in the log.

• Disable data validation statistics collection: Does not collect audit statistics for this specific job execution.

• Enable auditing: Collects audit statistics for this specific job execution.


• Enable recovery: Enables the automatic recovery feature. When enabled, Data Services saves the results from completed steps and allows you to resume failed jobs.

• Recover from last failed execution: Resumes a failed job. Data Services retrieves the results from any steps that were previously executed successfully and re-executes any other steps. This option is a run-time property. This option is not available when a job has not yet been executed or when recovery mode was disabled during the previous run.

• Collect statistics for optimization: Collects statistics that the Data Services optimizer will use to choose an optimal cache type (in-memory or pageable).

• Collect statistics for monitoring: Displays cache statistics in the Performance Monitor in Administrator.

• Use collected statistics: Optimizes Data Services to use the cache statistics collected on a previous execution of the job.

• System configuration: Specifies the system configuration to use when executing this job. A system configuration defines a set of datastore configurations, which define the datastore connections. If a system configuration is not specified, Data Services uses the default datastore configuration for each datastore. This option is a run-time property that is only available if there are system configurations defined in the repository.

• Job Server or Server Group: Specifies the Job Server or server group to execute this job.

• Distribution level: Allows a job to be distributed to multiple Job Servers for processing. The options are:
○ Job: The entire job will execute on one server.
○ Data flow: Each data flow within the job will execute on a separate server.
○ Sub-data flow: Each sub-data flow (can be a separate transform or function) within a data flow will execute on a separate Job Server.


Executing the job

Immediate or on demand tasks are initiated from the Designer. Both the Designer and Job Server must be running for the job to execute.

To execute a job as an immediate task

1. In the project area, right-click the job name and select Execute from the menu.

Data Services prompts you to save any objects that have not been saved.

2. Click OK.


The Execution Properties dialog box displays.

3. Select the required job execution parameters.

4. Click OK.

Activity: Creating a basic data flow

After analyzing the source data, you have determined that the structure of the customer data for Beta Businesses is the appropriate structure for the customer data in the Omega data warehouse, and you must therefore change the structure of the Alpha Acquisitions customer data to use the same structure in preparation for merging customer data from both datastores at a later date. Since the target table may later be processed by a Data Quality Transform, you will also define Content Types for the appropriate columns in the target table.

Objective

• Use the Query transform to change the schema of the Alpha Acquisitions Customer table and move the data into the Delta staging database.

Instructions

1. Create a new project called Omega.

2. In the Omega project, create a new batch job called Alpha_Customers_Job with a new data flow called Alpha_Customers_DF.


3. In the workspace for Alpha_Customers_DF, add the customer table from the Alpha datastore as the source object.

4. Create a new template table called alpha_customers in the Delta datastore as the target object.

5. Add the Query transform to the workspace between the source and target.

6. Connect the objects from source to transform and from transform to target.

7. In the transform editor for the Query transform, create the following output columns:

Name          Data type     Content type
CustomerID    int
Firm          varchar(50)   Firm
ContactName   varchar(50)   Name
Title         varchar(30)   Title
Address1      varchar(50)   Address
City          varchar(50)   Locality
Region        varchar(25)   Region
PostalCode    varchar(25)   Postcode
Country       varchar(50)   Country
Phone         varchar(25)   Phone
Fax           varchar(25)   Phone

8. Map the columns as follows:

Schema In       Schema Out
CUSTOMERID      CustomerID
COMPANYNAME     Firm
CONTACTNAME     ContactName
CONTACTTITLE    Title
ADDRESS         Address1
CITY            City
REGIONID        Region
POSTALCODE      PostalCode
COUNTRYID       Country
PHONE           Phone
FAX             Fax

9. Set the CustomerID column as the Primary Key.

10. Execute Alpha_Customers_Job with the default execution properties and save all objects you have created.

11. Return to the data flow workspace and view data for the target table to confirm that 25 records were loaded.

A solution file called SOLUTION_Basic.atl is included in Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may override the results in your target table.


Quiz: Creating batch jobs

1. Does a job have to be part of a project to be executed in the Designer?

2. How do you add a new template table?

3. Name the objects contained within a project.

4. What factors might you consider when determining whether to run work flows or data flows serially or in parallel?


Lesson summary

After completing this lesson, you are now able to:

• Work with objects
• Create a data flow
• Use the Query transform
• Use target tables
• Execute the job


Lesson 4
Troubleshooting Batch Jobs

Lesson introduction

To document decisions and troubleshoot any issues that arise when executing your jobs, you can validate and add annotations to jobs, work flows, and data flows, set trace options, and debug your jobs. You can also set up audit rules to ensure the correct data is loaded to the target.

After completing this lesson, you will be able to:

• Use descriptions and annotations
• Validate and trace jobs
• Use View Data and the Interactive Debugger
• Use auditing in data flows


Using descriptions and annotations

Introduction

Descriptions and annotations are a convenient way to add comments to objects and workspace diagrams.

After completing this unit, you will be able to:

• Use descriptions with objects
• Use annotations to describe flows

Using descriptions with objects

A description is associated with a particular object. When you import or export a repository object, you also import or export its description.

Designer determines when to show object descriptions based on a system-level setting and an object-level setting. Both settings must be activated to view the description for a particular object.

Note: The system-level setting is unique to your setup.

There are three requirements for displaying descriptions:
• A description has been entered into the properties of the object.
• The description is enabled on the properties of that object.
• The global View Enabled Object Descriptions option is enabled.

To show object descriptions at the system level

• From the View menu, select Enabled Descriptions. This is a global setting.

Note: The Enabled Descriptions option is only available if it is a viable option.

To add a description to an object

1. In the project area or the workspace, right-click an object and select Properties from the menu. The Properties dialog box displays.

2. In the Description text box, enter your comments.

3. Click OK. If you are modifying the description of a re-usable object, Data Services provides a warning message that all instances of the re-usable object will be affected by the change.

4. Click Yes.


The description for the object displays in the Local Object Library.

To display a description in the workspace

• In the workspace, right-click the object and select Enable Object Description from the menu. The description displays in the workspace under the object.

Using annotations to describe objects

An annotation is an object in the workspace that describes a flow, part of a flow, or a diagram. An annotation is associated with the object where it appears. When you import or export a job, work flow, or data flow that includes annotations, you also import or export the associated annotations.

To add an annotation to the workspace

1. In the workspace, from the tool palette, click the Annotation icon and then click the workspace. An annotation appears on the diagram.

2. Double-click the annotation.

3. Add text to the annotation.

4. Click the cursor outside of the annotation to commit the changes.

You can resize and move the annotation by clicking and dragging.

You cannot hide annotations that you have added to the workspace. However, you can move them out of the way or delete them.


Validating and tracing jobs

Introduction

It is a good idea to validate your jobs when you are ready for job execution to ensure there are no errors. You can also select and set specific trace properties, which allow you to use the various log files to help you read job execution status or troubleshoot job errors.

After completing this unit, you will be able to:

• Validate jobs
• Trace jobs
• Use log files
• Determine the success of a job

Validating jobs

As a best practice, you want to validate your work as you build objects so that you are not confronted with too many warnings and errors at one time. You can validate your objects as you create a job or you can automatically validate all your jobs before executing them.

To validate jobs automatically before job execution

1. From the Tools menu, select Options. The Options dialog box displays.

2. In the Category pane, expand the Designer branch and click General.

3. Select the Perform complete validation before job execution option.


4. Click OK.

To validate objects on demand

1. From the Validation menu, select Validate ➤ Current View or All Objects in View.

The Output dialog box displays.

2. To navigate to the object where an error occurred, right-click the validation error message and select Go To Error from the menu.

Tracing jobs

Use trace properties to select the information that Data Services monitors and writes to the trace log file during a job. Data Services writes trace messages to the trace log associated with the current Job Server and writes error messages to the error log associated with the current Job Server.


The following trace options are available.

• Row: Writes a message when a transform imports or exports a row.

• Session: Writes a message when the job description is read from the repository, when the job is optimized, and when the job runs.

• Work flow: Writes a message when the work flow description is read from the repository, when the work flow is optimized, when the work flow runs, and when the work flow ends.

• Data flow: Writes a message when the data flow starts and when the data flow successfully finishes or terminates due to error.

• Transform: Writes a message when a transform starts and completes or terminates.

• Custom Transform: Writes a message when a custom transform starts and completes successfully.

• Custom Function: Writes a message for all user invocations of the AE_LogMessage function from custom C code.

• SQL Functions: Writes data retrieved before SQL functions:
  • Every row retrieved by the named query before the SQL is submitted in the key_generation function.
  • Every row retrieved by the named query before the SQL is submitted in the lookup function (but only if PRE_LOAD_CACHE is not specified).
  • When mail is sent using the mail_to function.

• SQL Transforms: Writes a message (using the Table Comparison transform) about whether a row exists in the target table that corresponds to an input row from the source table.

• SQL Readers: Writes the SQL query block that a script, query transform, or SQL function submits to the system. Also writes the SQL results.

• SQL Loaders: Writes a message when the bulk loader starts, submits a warning message, or completes successfully or unsuccessfully.

• Memory Source: Writes a message for every row retrieved from the memory table.

• Memory Target: Writes a message for every row inserted into the memory table.

• Optimized Data Flow: For Business Objects consulting and technical support use.

• Tables: Writes a message when a table is created or dropped.

• Scripts and Script Functions: Writes a message when a script is called, a function is called by a script, and a script successfully completes.

• Trace Parallel Execution: Writes messages describing how data in a data flow is processed in parallel.

• Access Server Communication: Writes messages exchanged between the Access Server and a service provider.

• Stored Procedure: Writes a message when a stored procedure starts and finishes, and includes key values.

• Audit Data: Writes a message that collects a statistic at an audit point and determines if an audit rule passes or fails.

To set trace options

1. From the project area, right-click the job name and do one of the following:
• To set trace options for a single instance of the job, select Execute from the menu.
• To set trace options for every execution of the job, select Properties from the menu.

Save all files. Depending on which option you selected, the Execution Properties dialog box or the Properties dialog box displays.

2. Click the Trace tab.


3. Under the Name column, click a trace object name. The Value drop-down list is enabled when you click a trace object name.

4. From the Value drop-down list, select Yes to turn the trace on.

5. Click OK.

Using log files

As a job executes, Data Services produces three log files. You can view these from the project area. The log files are, by default, also set to display automatically in the workspace when you execute a job.

You can click the Trace, Monitor, and Error icons to view the following log files, which are created during job execution.

Examining trace logs

Use the trace logs to determine where an execution failed, whether the execution steps occur in the order you expect, and which parts of the execution are the most time consuming.


Examining monitor logs

Use the monitor logs to quantify the activities of the components of the job. The monitor log lists the time spent in a given component of a job and the number of data rows that streamed through the component.

Examining error logs

Use the error logs to determine how an execution failed. If the execution completed without error, the error log is blank.


Using the Monitor tab

The Monitor tab lists the trace logs of all current or most recent executions of a job.

The traffic-light icons in the Monitor tab indicate the following:

• Green light indicates that the job is running.

You can right-click and select Kill Job to stop a job that is still running.

• Red light indicates that the job has stopped.

You can right-click and select Properties to add a description for a specific trace log. Thisdescription is saved with the log which can be accessed later from the Log tab.

• Red cross indicates that the job encountered an error.

Using the Log tab

You can also select the Log tab to view a job’s log history.


You may find these job log indicators:

• Indicates that the job executed successfully on this explicitly selected Job Server.

• Indicates that the job encountered an error on this explicitly selected Job Server.

• Indicates that the job executed successfully by a server group. The Job Server listed executed the job.

• Indicates that the job encountered an error while being executed by a server group. The Job Server listed executed the job.

To view log files from the project area

1. In the project area, click the Log tab.

2. Select the job for which you want to view the logs.

3. In the workspace, in the Filter drop-down list, select the type of log you want to view.

4. In the list of logs, double-click the log to view details.

5. To copy log content from an open log, select one or more lines and use the key commands [Ctrl+C].

Determining the success of the job

The best measure of the success of a job is the state of the target data. Always examine your data to make sure the data movement operation produced the results you expect. Be sure that:
• Data was not converted to incompatible types or truncated.


• Data was not duplicated in the target.
• Data was not lost between updates of the target.
• Generated keys have been properly incremented.
• Updated values were handled properly.

If a job fails to execute, check the Job Server icon in the status bar to verify that the Job Service is running. Also check that the port number in Designer matches the number specified in the Server Manager; if necessary, you can use the Server Manager resync button to reset the port number in the Local Object Library.

Activity: Setting traces and adding annotations

You will be sharing your jobs with other developers during the project, so you want to make sure that you identify the purpose of the job you just created. You also want to ensure that the job is handling the movement of each row appropriately.

Objectives

• Add an annotation to a job so that other designers who reference this information will be able to identify its purpose.

• Execute the job in trace mode to determine when a transform imports and exports from source to target.

Instructions

1. Open the workspace for Alpha_Customers_Job.

2. Add an annotation to the workspace beside the data flow with an explanation of the purpose of the job.

3. Save all objects you have created.

4. Execute Alpha_Customers_Job and enable the Trace rows option on the Trace tab of the Execution Properties dialog box. An entry for each row is added to the log to indicate how it is being handled by the data flow.


Using View Data and the Interactive Debugger

Introduction

You can debug jobs in Data Services using the View Data and Interactive Debugger features. With View Data, you can view samples of source and target data for your jobs. Using the Interactive Debugger, you can examine what happens to the data after each transform or object in the flow.

After completing this unit, you will be able to:

• Use View Data with sources and targets
• Use the Interactive Debugger
• Set filters and breakpoints for a debug session

Using View Data with sources and targets

With the View Data feature, you can check the status of data at any point after you import the metadata for a data source, and before or after you process your data flows. You can check the data when you design and test jobs to ensure that your design returns the results you expect.

View Data allows you to see source data before you execute a job. Using data details you can:
• Create higher quality job designs.
• Scan and analyze imported table and file data from the Local Object Library.
• See the data for those same objects within existing jobs.
• Refer back to the source data after you execute the job.


View Data also allows you to check your target data before executing your job, then look at the changed data after the job executes. In a data flow, you can use one or more View Data panels to compare data between transforms and within source and target objects.

View Data displays your data in the rows and columns of a data grid. The path for the selected object displays at the top of the pane. The number of rows displayed is determined by a combination of several conditions:
• Sample size: the number of rows sampled in memory. The default sample size is 1000 rows for imported sources, targets, and transforms.
• Filtering: the filtering options that are selected. If your original data set is smaller or if you use filters, the number of returned rows could be less than the default.

Keep in mind that you can have only two View Data windows open at any time. If you already have two windows open and try to open a third, you are prompted to select which one to close.

To use View Data in source and target tables

• On the Datastore tab of the Local Object Library, right-click a table and select View Data from the menu. The View Data dialog box displays.

To open a View Data pane in a data flow workspace

1. In the data flow workspace, click the magnifying glass button on a data flow object. A large View Data pane appears beneath the current workspace area.

2. To compare data, click the magnifying glass button for another object.


A second pane appears below the workspace area, and the first pane area shrinks to accommodate it.

When both panes are filled and you click another View Data button, a small menu appears containing window placement icons. The black area in each icon indicates the pane you want to replace with a new set of data. When you select a menu option, the data from the latest selected object replaces the data in the corresponding pane.

Using the Interactive Debugger

Designer includes an Interactive Debugger that allows you to troubleshoot your jobs by placing filters and breakpoints on lines in a data flow diagram. This enables you to examine and modify data row by row during a debug mode job execution.

The Interactive Debugger can also be used without filters and breakpoints. Running the job in debug mode and then navigating to the data flow while remaining in debug mode enables you to drill into each step of the data flow and view the data.

When you execute a job in debug mode, Designer displays several additional windows that make up the Interactive Debugger: the Call Stack, Trace, Variables, and View Data panes.


The left View Data pane shows the data in the source table, and the right pane shows the rows that have been passed to the query up to the breakpoint.

To start the Interactive Debugger

1. In the project area, right-click the job and select Start debug from the menu.


The Debug Properties dialog box displays.

2. Set properties for the execution. You can specify many of the same properties as you can when executing a job without debugging. In addition, you can specify the number of rows to sample in the Data sample rate field.

3. Click OK.

The debug mode begins.

While in debug mode, all other Designer features are set to read-only. A Debug icon is visible in the task bar while the debug is in progress.

4. If you have set breakpoints, in the Interactive Debugger toolbar, click Get next row to move to the next breakpoint.

5. To exit the debug mode, from the Debug menu, select Stop Debug.

Setting filters and breakpoints for a debug session

You can set filters and breakpoints on lines in a data flow diagram before you start a debugging session, which allows you to examine and modify data row by row during a debug mode job execution.


A debug filter functions the same as a simple Query transform with a WHERE clause. You can use a filter if you want to reduce a data set in a debug job execution. The debug filter does not support complex expressions.

A breakpoint is the location where a debug job execution pauses and returns control to you. A breakpoint can be based on a condition, or it can be set to break after a specific number of rows.

You can place a filter or breakpoint on the line between a source and a transform or between two transforms. If you set a filter and a breakpoint on the same line, Data Services applies the filter first, which means that the breakpoint applies to the filtered rows only.

To set filters and breakpoints

1. In the data flow workspace, right-click the line that connects two objects and select Set Filter/Breakpoint from the menu.

2. In the Breakpoint window, in the Column drop-down list, select the column to which the filter or breakpoint applies.

3. In the Operator drop-down list, select the operator for the expression.

4. In the Value field, enter the value to complete the expression. The condition for filters and breakpoints does not use a delimiter for strings.

5. If you are using multiple conditions, repeat steps 2 to 4 for all conditions and select the appropriate operator from the Concatenate all conditions using drop-down list.


6. Click OK.
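For example (the column names and values here are illustrative, not taken from a specific course table): a breakpoint that pauses only on rows for US customers could use the column COUNTRYID, the operator =, and the value 1, while a condition on a string column is entered without quotes, such as CITY with the value San Diego.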

Activity: Using the Interactive Debugger

To ensure that your job is processing the data correctly, you want to run the job in debug mode. To minimize the data you have to review in the Interactive Debugger, you will set the debug process to show only records from the USA (represented by a CountryID value of 1). Once you have confirmed that the structure appears correct, you will run another debug session with all records, breaking after every row.

Objectives

• View the data in debug mode with a filter to limit records to those with a CountryID of 1 (USA).

• View the data in debug mode with a breakpoint to stop the debug process after each row.

Instructions

1. In the workspace for Alpha_Customers_DF, add a filter between the source and the Query transform to filter the records so that only customers from the USA are included in the debug session.

2. Execute Alpha_Customers_Job in debug mode.

3. Return to the data flow workspace and view data for the target table. Only five rows were returned.

4. Remove the filter and add a breakpoint to break the debug session after every row.

5. Execute Alpha_Customers_Job in debug mode again.

6. Discard the first row, and then step through the rest of the records.


7. Exit the debugger, return to the data flow workspace, and view data for the target table. Note that only 24 of 25 rows were returned.

8. Remove the breakpoint from the data flow.


Setting up auditing

Introduction

You can collect audit statistics on the data that flows out of any Data Services object, such as a source, transform, or target. If a transform has multiple distinct or different outputs (such as Validation or Case), you can audit each output independently.

After completing this unit, you will be able to:

• Define audit points and rules
• Explain guidelines for choosing audit points

Setting up auditing

When you audit data flows, you:

1. Define audit points to collect run-time statistics about the data that flows out of objects. These audit statistics are stored in the Data Services repository.

2. Define rules with these audit statistics to ensure that the data extracted from sources, processed by transforms, and loaded into targets is what you expect.

3. Generate a run-time notification that includes the audit rule that failed and the values of the audit statistics at the time of failure.

4. Display the audit statistics after the job execution to help identify the object in the data flow that might have produced incorrect data.

Defining audit points

An audit point represents the object in a data flow where you collect statistics. You can audit a source, a transform, or a target in a data flow.

When you define audit points on objects in a data flow, you specify an audit function. An audit function represents the audit statistic that Data Services collects for a table, output schema, or column. You can choose from these audit functions:

• Count (table or output schema): This function collects two statistics: the Good count for rows that were successfully processed, and the Error count for rows that generated some type of error if you enabled error handling. The datatype for this function is integer.

• Sum (column): Sum of the numeric values in the column. This function only includes the good rows. It applies only to columns with a datatype of integer, decimal, double, and real.

• Average (column): Average of the numeric values in the column. This function only includes the good rows. It applies only to columns with a datatype of integer, decimal, double, and real.

• Checksum (column): Detects errors in the values in the column by using the checksum value. This function applies only to columns with a datatype of varchar.

Defining audit labels

An audit label represents the unique name in the data flow that Data Services generates for the audit statistics collected for each audit function that you define. You use these labels to define audit rules for the data flow.

If the audit point is on a table or output schema, these two labels are generated for the Count audit function:

$Count_objectname

$CountError_objectname

If the audit point is on a column, the audit label is generated with this format:

$auditfunction_objectname
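For example, if a Count audit point is defined on a source table named ODS_CUSTOMER and an Average audit point on its ORDER_TOTAL column (both names are purely illustrative), the generated labels would be $Count_ODS_CUSTOMER, $CountError_ODS_CUSTOMER, and $Avg_ORDER_TOTAL.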

Note: An audit label can become invalid if you delete or rename an object that had an audit point defined on it. Invalid labels are listed as a separate node on the Labels tab. To resolve the issue, you must re-create the labels and then delete the invalid items.

Defining audit rules

Use auditing rules if you want to compare audit statistics for one object against another object. For example, you can use an audit rule if you want to verify that the count of rows from the source table is equal to the count of rows in the target table.

An audit rule is a Boolean expression which consists of a left-hand side (LHS), a Boolean operator, and a right-hand side (RHS). The LHS can be a single audit label, multiple audit labels that form an expression with one or more mathematical operators, or a function with audit labels as parameters. In addition to these, the RHS can also be a constant.

These are examples of audit rules:

$Count_CUSTOMER = $Count_CUSTDW

$Sum_ORDER_US + $Sum_ORDER_EUROPE = $Sum_ORDER_DW

round($Avg_ORDER_TOTAL) >= 10000

Defining audit actions

You can choose any combination of the actions listed for notification of an audit failure:

• Email to list: Data Services sends a notification of which audit rule failed to the email addresses that you list in this option. Use a comma to separate the list of email addresses. You can specify a variable for the email list. This option uses the smtp_to function to send email, so you must define the server and sender for the Simple Mail Transfer Protocol (SMTP) in the Data Services Server Manager.

• Script: Data Services executes the custom script that you create in this option.

• Raise exception: When an audit rule fails, the Error Log shows the rule that failed. The job stops at the first audit rule that fails. This is an example of a message in the Error Log:

Audit rule failed <($Checksum_ODS_CUSTOMER = $Count_CUST_DIM)> for <Data Flow Demo_DF>.

This action is the default. If you clear this action and an audit rule fails, the job completes successfully and the audit does not write messages to the job log.

If you choose all three actions, Data Services executes them in the order presented.

You can see the audit status in one of these places:

• Raise an exception: Job Error Log, Metadata Reports
• Email to list: Email message, Metadata Reports
• Script: Wherever the custom script sends the audit messages, Metadata Reports

To define audit points and rules in a data flow

1. On the Data Flow tab of the Local Object Library, right-click a data flow and select Audit from the menu.


The Audit dialog box displays with a list of the objects you can audit, along with any audit functions and labels for those objects.

2. On the Label tab, right-click the object you want to audit and select Properties from the menu. The Schema Properties dialog box displays.

3. On the Audit tab of the Schema Properties dialog box, in the Audit function drop-down list, select the audit function you want to use against this data object type. The audit functions displayed in the drop-down menu depend on the data object type that you have selected.


Default values are assigned for the audit labels, which can be changed if required.

4. Click OK.

5. Repeat step 2 to step 4 for all audit points.

6. On the Rule tab, under Auditing Rules, click Add.


The expression editor activates and the Custom options become available for use. The expression editor contains three drop-down lists where you specify the audit labels for the objects you want to audit and choose the Boolean expression to use between these labels.

7. In the left-hand-side drop-down list in the expression editor, select the audit label for the object you want to audit.

8. In the operator drop-down list in the expression editor, select a Boolean operator.

9. In the right-hand-side drop-down list in the expression editor, select the audit label for the second object you want to audit. If you want to compare audit statistics for one or more objects against statistics for multiple other objects or a constant, select the Custom radio button, and click the ellipsis button beside Functions. This opens the full-size smart editor, where you can drag different functions and labels to use for auditing.

10. Repeat steps 6 to 9 for all audit rules.

11. Under Action on Failure, select the action you want.

12. Click Close.

To trace audit data

1. In the project area, right-click the job and select Execute from the menu.

2. In the Execution Properties window, click the Trace tab.

3. Select Trace Audit Data.

4. In the Value drop-down list, select Yes.

5. Click OK.


The job executes and the job log displays the Audit messages based on the audit function that is used for the audit object.

Choosing audit points

When you choose audit points, consider the following:

• The Data Services optimizer cannot push down operations after the audit point. Therefore, if the performance of a query that is pushed to the database server is more important than gathering audit statistics from the source, define the first audit point on the query or later in the data flow.

For example, suppose your data flow has a source, a Query transform, and a target, and the Query has a WHERE clause that is pushed to the database server and significantly reduces the amount of data that returns to Data Services. Define the first audit point on the Query, rather than on the source, to obtain audit statistics on the results.

• If a pushdown_sql function is after an audit point, Data Services cannot execute it.

• The auditing feature is disabled when you run a job with the debugger.

• If you use the CHECKSUM audit function in a job that normally executes in parallel, Data Services disables the Degrees of Parallelism (DOP) for the whole data flow. The order of rows is important for the result of CHECKSUM, and DOP processes the rows in a different order than in the source. For more information on DOP, see "Using Parallel Execution" and "Maximizing the number of push-down operations" in the Data Services Performance Optimization Guide.

Activity: Using auditing in a data flow

You must ensure that all records from the Customer table in the Alpha database are being moved to the Delta staging database using the audit logs.

Objectives

• Add audit points to the source and target tables.
• Create an audit rule to ensure that the count of both tables is the same.
• Execute the job with auditing enabled.

Instructions

1. In the Local Object Library, set up auditing for Alpha_Customers_DF by adding an audit point to count the total number of records in the source table.

2. Add another audit point to count the total number of records in the target table.

3. Construct an audit rule that states that, if the count from both tables is not the same, the audit must raise an exception in the log (a sketch of such a rule appears after these instructions).

4. Execute Alpha_Customers_Job. Ensure that the Enable auditing option is selected on the Parameters tab of the Execution Properties dialog box, and that the Trace Audit Data option is enabled on the Trace tab.


Note that the audit rule passes validation.
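As a rough sketch of the rule from step 3: with the default labels generated for the two Count audit points, the expression takes a form such as $Count_CUSTOMER = $Count_ALPHA_CUSTOMERS. The exact label names depend on the object names shown on the Label tab of the Audit dialog box, so treat these as illustrative.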

A solution file called SOLUTION_Audit.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may override the results in your target table.


Quiz: Troubleshooting batch jobs

1. List some reasons why a job might fail to execute.

2. Explain the View Data feature.

3. What must you define in order to audit a data flow?

4. True or false? The auditing feature is disabled when you run a job with the debugger.


Lesson summary

After completing this lesson, you are now able to:

• Use descriptions and annotations
• Validate and trace jobs
• Use View Data and the Interactive Debugger
• Use auditing in data flows


Lesson 5
Using Functions, Scripts, and Variables

Lesson introduction

Data Services gives you the ability to perform complex operations using functions and to extend the flexibility and re-usability of objects by writing scripts, custom functions, and expressions using the Data Services scripting language and variables.

After completing this lesson, you will be able to:

• Define built-in functions
• Use functions in expressions
• Use the lookup function
• Use the decode function
• Use variables and parameters
• Use Data Services scripting language
• Script a custom function


Defining built-in functions

Introduction

Data Services supports built-in and custom functions.

After completing this unit, you will be able to:

• Define functions
• List the types of operations available for functions
• Describe other types of functions

Defining functions

Functions take input values and produce a return value. Functions also operate on individual values passed to them. Input values can be parameters passed into a data flow, values from a column of data, or variables defined inside a script.

You can use functions in expressions that include scripts and conditional statements.

Note: Data Services does not support functions that include tables as input or output parameters, except functions imported from SAP R/3.

Listing the types of operations for functions

Functions are grouped into different categories:

• Aggregate Functions: Perform calculations on numeric values. Functions: avg, count, count_distinct, max, min, sum.

• Conversion Functions: Convert values to specific datatypes. Functions: cast, interval_to_char, julian_to_date, load_to_xml, long_to_varchar, num_to_interval, to_char, to_date, to_decimal, to_decimal_ext, varchar_to_long.

• Custom Functions: Perform functions defined by the user.

• Database Functions: Perform operations specific to databases. Functions: key_generation, sql, total_rows.

• Date Functions: Perform calculations and conversions on date values. Functions: add_months, concat_date_time, date_diff, date_part, day_in_month, day_in_week, day_in_year, fiscal_day, isweekend, julian, last_date, month, quarter, sysdate, systime, week_in_month, week_in_year, year.

• Environment Functions: Perform operations specific to your Data Services environment. Functions: get_env, get_error_filename, get_monitor_filename, get_trace_filename, is_set_env, set_env.

• Lookup Functions: Look up data in other tables. Functions: lookup, lookup_ext, lookup_seq.

• Math Functions: Perform complex mathematical operations on numeric values. Functions: abs, ceil, floor, ln, log, mod, power, rand, rand_ext, round, sqrt, trunc.

• Miscellaneous Functions: Perform various operations. Functions: base64_decode, base64_encode, current_configuration, current_system_configuration, dataflow_name, datastore_field_value, db_database_name, db_owner, db_type, db_version, decode, file_exists, gen_row_num, gen_row_num_by_group, get_domain_description, get_file_attribute, greatest, host_name, ifthenelse, is_group_changed, isempty, job_name, least, nvl, previous_row_value, pushdown_sql, raise_exception, raise_exception_ext, repository_name, sleep, system_user_name, table_attribute, truncate_table, wait_for_file, workflow_name.

• String Functions: Perform operations on alphanumeric strings of data. Functions: ascii, chr, double_metaphone, index, init_cap, length, literal, lower, lpad, lpad_ext, ltrim, ltrim_blanks, ltrim_blanks_ext, match_pattern, match_regex, print, replace_substr, replace_substr_ext, rpad, rpad_ext, rtrim, rtrim_blanks, rtrim_blanks_ext, search_replace, soundex, substr, upper, word, word_ext.

• System Functions: Perform system operations. Functions: exec, mail_to, smtp_to.

• Validation Functions: Validate specific types of values. Functions: is_valid_date, is_valid_datetime, is_valid_decimal, is_valid_double, is_valid_int, is_valid_real, is_valid_time.

Defining other types of functions

In addition to built-in functions, you can also use these functions:

• Database and application functions:

These functions are specific to your RDBMS. You can import the metadata for database and application functions and use them in Data Services applications. At run time, Data Services passes the appropriate information to the database or application from which the function was imported.

The metadata for a function includes the input, output, and their datatypes. If there are restrictions on data passed to the function, such as requiring uppercase values or limiting data to a specific range, you must enforce these restrictions in the input. You can either test the data before extraction or include logic in the data flow that calls the function.


You can import stored procedures from DB2, Microsoft SQL Server, Oracle, and Sybase databases. You can also import stored packages from Oracle. Stored functions from SQL Server can also be imported. For more information on importing functions, see "Custom Datastores" in Chapter 5 of the Data Services Reference Guide.

• Custom functions:

These are functions that you define. You can create your own functions by writing script functions in the Data Services scripting language.


Using functions in expressions

Introduction

Functions can be used in expressions to map return values as new columns, which allows columns that are not in the initial input data set to be specified in the output data set.

After completing this unit, you will be able to:

• Use functions in expressions

Defining functions in expressions

Functions are typically used to add columns based on some other value (lookup function) or generated key fields. You can use functions in:

• Transforms: The Query, Case, and SQL transforms support functions.
• Scripts: These are single-use objects used to call functions and assign values to variables in a work flow.
• Conditionals: These are single-use objects used to implement branch logic in a work flow.
• Other custom functions: These are functions that you create as required.

Before you use a function, you need to know if the function's operation makes sense in the expression you are creating. For example, the max function cannot be used in a script or conditional where there is no collection of values on which to operate.
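As a quick illustration (the column and variable names here are assumptions, not objects from the course database): a mapping such as upper(Customers.ContactName) is valid in a Query transform because it operates on one value per row; a script line such as $G_Today = sysdate(); is valid because sysdate() needs no input rows at all; but max(Orders.Amount) only makes sense where there is a set of rows to aggregate, such as in a Query transform, and not in a script.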

You can add existing functions in an expression by using the Smart Editor or the Function wizard. The Smart Editor offers you many options, including variables, datatypes, keyboard shortcuts, and so on. The Function wizard allows you to define parameters for an existing function and is recommended for defining complex functions.

To use the Smart Editor

1. Open the object in which you want to use an expression.

2. Click the ellipses (...) button.


The Smart Editor appears.

3. Click the Functions tab and expand a function category.

4. Click and drag the specific function onto the workspace.

5. Enter the input parameters based on the syntax of your formula.

6. Click OK.

To use the Function wizard

1. Open the object in which you want to use an expression.

2. Click Functions.

The Select Function dialog box opens.


3. In the Function list, select a category.

4. In the Function name list, select a specific function. The functions shown depend on the object you are using. Clicking each function separately also displays a description of the function below the list boxes.

5. Click Next. The Define Input Parameter(s) dialog box displays. The options available depend on the selected function.

6. Click the drop-down arrow next to the input parameters.


The Input Parameter dialog box appears.

7. Double-click to select the source object and column for the function.

8. Repeat steps 6 and 7 for all other input parameters.

9. Click Finish.

Activity: Using the search_replace function

When evaluating the customer data for Alpha Acquisitions, you discover a data entry error where the contact title of Account Manager has been entered as Accounting Manager. You want to clean up this data before it is moved to the data warehouse.

Objective

• Use the search_replace function in an expression to change the contact title from Accounting Manager to Account Manager.


Instructions

1. In the Alpha_Customers_DF workspace, open the transform editor for the Query transform.

2. On the Mapping tab, delete the existing expression for the Title column.

3. Using the Function wizard, create a new expression for the Title column using the search_replace function (under String functions) to replace the full string "Accounting Manager" with "Account Manager". A comparable expression is sketched after these steps.

Note: Be aware that the search_replace function can react unpredictably if you use the external table option.

4. Execute Alpha_Customers_Job with the default execution properties and save all objects you have created.

5. Return to the data flow workspace and view data for the target table. Note that the titles for the affected contacts have been changed.
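For comparison only, and not the wizard-generated search_replace call: a simple whole-string substitution like this one could also be written directly on the Mapping tab with the replace_substr string function, for example replace_substr(customer.CONTACTTITLE, 'Accounting Manager', 'Account Manager'), where the table and column names are illustrative and should match your source schema. The activity uses search_replace through the Function wizard because it scales better when several search-and-replace value pairs are needed.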

A solution file called SOLUTION_SearchReplace.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may override the results in your target table.


Using the lookup function

Introduction

Lookup functions allow you to look up values in other tables to populate columns.

After completing this unit, you will be able to:

• Use the lookup function to look up values in another table

Using lookup tables

Lookup functions allow you to use values from the source table to look up values in other tables to generate the data that populates the target table.

Lookups enable you to store re-usable values in memory to speed up the process. Lookups are useful for values that rarely change.

The lookup, lookup_seq, and lookup_ext functions all provide a specialized type of join, similar to an SQL outer join. While a SQL outer join may return multiple matches for a single record in the outer table, lookup functions always return exactly the same number of records that are in the source table.

While all lookup functions return one row for each row in the source, they differ in how they choose which of several matching rows to return:

• Lookup does not provide additional options for the lookup expression.

• Lookup_ext allows you to specify an Order by column and Return policy (Min, Max) to return the record with the highest/lowest value in a given field (for example, a surrogate key).

• Lookup_seq searches in matching records to return a field from the record where the sequence column (for example, effective_date) is closest to but not greater than a specified sequence value (for example, a transaction date).


lookup_ext

The lookup_ext function is recommended for lookup operations because of its enhanced options.

You can use this function to retrieve a value in a table or file based on the values in a different source table or file. This function also extends functionality by allowing you to:

• Return multiple columns from a single lookup.
• Choose from more operators to specify a lookup condition.
• Specify a return policy for your lookup.
• Perform multiple (including recursive) lookups.
• Call lookup_ext in scripts and custom functions. This also lets you re-use the lookups packaged inside scripts.
• Define custom SQL using the SQL_override parameter to populate the lookup cache, narrowing large quantities of data to only the sections relevant for your lookup(s).
• Use lookup_ext to dynamically execute SQL.
• Call lookup_ext, using the Function wizard, in the query output mapping to return multiple columns in a Query transform.
• Design jobs to use lookup_ext without having to hard code the name of the translation file at design time.
• Use lookup_ext with memory datastore tables.

Tip:

There are two ways to use the lookup_ext function in a Query output schema.

The first is to map to a single output column in the output schema. In this case, lookup_ext is limited to returning values from a single column of the lookup (translate) table.

The second way is to specify a "New Output Function Call" (right mouse click option) in the Query output schema, which opens the Function wizard. You can then configure lookup_ext with multiple columns being returned from the lookup (translate) table from a single lookup. This has performance benefits as well as allowing you to easily modify the function call after the initial definition.

Syntax:

lookup_ext([translate_table, cache_spec, return_policy], [return_column_list], [default_value_list], [condition_list], [orderby_column_list], [output_variable_list], [sql_override])

Return value:

Returns any type of value. The return type is that of the first lookup column in return_column_list.

Where:
• translate_table represents the table, file, or memory datastore that contains the value you are looking up (return_column_list).
• cache_spec represents the caching method the lookup_ext operation uses.
• return_policy specifies whether the return columns should be obtained from the smallest or the largest row, based on values in the order by columns.
• return_column_list is a comma-separated list containing the names of output columns in the translate_table.
• default_value_list is a comma-separated list containing the default expressions for the output columns. When no rows match the lookup condition, the default values are returned for the output columns.
• condition_list is a list of triplets that specify lookup conditions. Each triplet contains a compare_column, a compare operator (<, <=, >, >=, =, IS, IS NOT), and a compare expression.
• orderby_column_list is a comma-separated list of column names from the translate_table.
• output_variable_list is a comma-separated list of output variables.
• sql_override is available in the Function Wizard. It must contain a valid, single-quoted SQL SELECT statement or a $variable of type varchar to populate the lookup cache when the cache specification is PRE_LOAD_CACHE.

Example:

lookup(ds.owner.emp, empname, 'no body', 'NO_CACHE', empno, 1);

lookup_ext([ds.owner.emp, 'NO_CACHE', 'MAX'], [empname], ['no body'], [empno, '=', 1]);

These expressions both retrieve the name of the employee whose empno is equal to 1.
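Because lookup_ext can also be called in scripts and custom functions, the same lookup can run outside a data flow. The lines below are a minimal sketch only; the datastore, table, and variable names are hypothetical and assume $G_EmpName has been declared as a global variable for the job:

# Look up the name of employee 1 and store it in a variable.
# NO_CACHE reads the translate table directly; MAX would return the row
# with the largest order-by value if several rows matched.
$G_EmpName = lookup_ext([ds.owner.emp, 'NO_CACHE', 'MAX'], [empname], ['no body'], [empno, '=', 1]);
print('Employee 1 is: [$G_EmpName]');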

To create a lookup_ext expression

1. Open the Query transform.
The Query transform should have at least one main source table and one lookup table, and it must be connected to a single target object.

2. Select the output schema column for which the lookup function is being performed.

3. In the Mapping tab, click Functions.
The Select Function window opens.

4. In the Function list, select Lookup Functions.

5. In the Function name list, select lookup_ext.


6. Click Next.
The Lookup_ext - Select Parameters dialog box displays.

7. In the Lookup table drop-down list, select the lookup table.

8. Change the caching specification, if required.

9. Under Condition, in the Column in lookup table drop-down list, select the key in the lookup table that corresponds to the source table.

10. In the Op.(&) drop-down list, select an operator.

11. Enter the other logical join from the source table in the Expression column.
You can click and drag the column from the Available parameters pane to the Expression column. For a direct lookup, click and drag the key from the Input Schema (source table) that corresponds to the lookup table.

12. Under Output parameters, in the Column in lookup table drop-down list, select the column with the value that will be returned by the lookup function.

13. Specify default values and order by parameters, if required.

14. Click Finish.


Activity: Using the lookup_ext() function

In the Alpha Acquisitions database, the country for a customer is stored in a separate table and referenced with a foreign key. To speed up access to information in the data warehouse, this lookup should be eliminated.

Objective

• Use the lookup_ext function to swap the ID for the country in the Customers table for Alpha Acquisitions with the actual value from the Countries table.

Instructions

1. In the Alpha_Customers_DF workspace, open the transform editor for the Query transform.

2. On the Mapping tab, delete the current expression for the Country column.

3. Use the Functions wizard to create a new lookup expression using the lookup_ext function with the following parameters:

Field/Option                    Value
Lookup table                    ALPHA.SOURCE.COUNTRY
Condition
  Column in lookup table        COUNTRYID
  Op.(&)                        =
  Expression                    customer.COUNTRYID
Output
  Column in lookup table        COUNTRYNAME

The following code is generated:

lookup_ext([ALPHA.SOURCE.COUNTRY,'PRE_LOAD_CACHE','MAX'],
[COUNTRYNAME],[NULL],[COUNTRYID,'=',CUSTOMER.COUNTRYID]) SET
("run_as_separate_process"='no', "output_cols_info"='<?xml version="1.0"
encoding="UTF-8"?><output_cols_info><col index="1"
expression="no"/></output_cols_info>')

4. Execute Alpha_Customers_Job with the default execution properties and save all objects you have created.


5. Return to the data flow workspace and view data for the target table after the lookup expression is added.

A solution file called SOLUTION_LookupFunction.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using the decode function

Introduction
You can use the decode function as an alternative to nested if/then/else conditions.

After completing this unit, you will be able to:

• Use the decode function

Explaining the decode function

You can use the decode function to return an expression based on the first condition in the specified list of conditions and expressions that evaluates to TRUE. It provides an alternate way to write nested ifthenelse functions.

Use this function to apply multiple conditions when you map columns or select columns in a query. For example, you can use this function to put customers into different groupings.

The syntax of the decode function uses the following format:

decode(condition_and_expression_list, default_expression)

The elements of the syntax break down as follows:

Element: Return value
Description: expression or default_expression. Returns the value associated with the first condition that evaluates to TRUE. The data type of the return value is the data type of the first expression in the condition_and_expression_list.
Note: If the data type of any subsequent expression or the default_expression is not convertible to the data type of the first expression, Data Integrator produces an error at validation. If the data types are convertible but do not match, a warning appears at validation.

Element: condition_and_expression_list
Description: Represents a comma-separated list of one or more pairs that specify a variable number of conditions. Each pair contains one condition and one expression separated by a comma. You must specify at least one condition and expression pair:
• The condition evaluates to TRUE or FALSE.
• The expression is the value that the function returns if the condition evaluates to TRUE.

Element: default_expression
Description: Represents an expression that the function returns if none of the conditions in condition_and_expression_list evaluate to TRUE.
Note: You must specify a default_expression.

The decode function provides an easier way to write nested ifthenelse functions. In nested ifthenelse functions, you must write nested conditions and ensure that the parentheses are in the correct places, as in this example:

ifthenelse((EMPNO = 1),'111',
  ifthenelse((EMPNO = 2),'222',
    ifthenelse((EMPNO = 3),'333',
      ifthenelse((EMPNO = 4),'444',
        'NO_ID'))))

In the decode function, you list the conditions as in this example:

decode((EMPNO = 1),'111',
  (EMPNO = 2),'222',
  (EMPNO = 3),'333',
  (EMPNO = 4),'444',
  'NO_ID')

Therefore, decode is less prone to error than nested ifthenelse functions.

To improve performance, Data Services pushes this function to the database server when possible. Thus, the database server, rather than Data Integrator, evaluates the decode function.

To configure the decode function

1. Open the Query transform.

2. Select the output schema column for which the decode function is being performed.


3. In the Mapping tab, click Functions.
The Select Function window opens.

4. In the Function list, select Miscellaneous Functions.

5. In the Function name list, select decode.

6. Click Next.
The Define Input Parameter(s) dialog box displays.

7. In the Conditional expression field, select or enter the IF clause in the case logic.

8. In the Case expression field, select or enter the THEN clause.

9. In the Default expression field, select or enter the ELSE clause.

10. Click Finish.

11. If required, add any additional THEN clauses in the mapping expression.

Activity: Using the decode function

You need to calculate the total value of all orders, including their discounts, for reporting purposes.

Objective

• Use the sum and decode functions to calculate the total value of orders in the Order_Details table.

Instructions

1. In the Omega project, create a new batch job called Alpha_Order_Sum_Job with a data flow called Alpha_Order_Sum_DF.

2. In the Alpha_Order_Sum_DF workspace, add the Order_Details and Product tables from the Alpha datastore as the source objects.

3. Add a new template table to the Delta datastore called order_sum as the target object.

4. Add a Query transform and connect all objects.

5. In the transform editor for the Query transform, on the WHERE tab, propose a join between the two source tables.

6. Map the ORDERID column from the input schema to the output schema.

7. Create a new output column called TOTAL_VALUE with a data type of decimal(10,2).

8. On the Mapping tab of the new output column, use the Function wizard or the Smart Editor to construct an expression to calculate the total value of the orders using the decode and sum functions.

The discount and order total can be multiplied to determine the total after discount. The decode function allows you to avoid multiplying orders that have a zero discount by zero.

Consider the following:


• The expression must specify that if the value in the DISCOUNT column is not zero (Conditional expression), then the total value of the order is calculated by multiplying the QUANTITY from the order_details table by the COST from the product table, and then multiplying that value by the DISCOUNT (Case expression).
• Otherwise, the total value of the order is calculated by simply multiplying the QUANTITY from the order_details table by the COST from the product table (Default expression).
• Once these values are calculated for each order, a sum must be calculated for the entire collection of orders.

Tip: You can use the Function wizard to construct the decode portion of the mapping, and then use the Smart Editor or the main window in the Mapping tab to wrap the sum function around the expression.

The expression should be:

sum(decode(order_details.DISCOUNT <> 0, (order_details.QUANTITY * product.COST) * order_details.DISCOUNT, order_details.QUANTITY * product.COST))

Note: If you validate the expression, the validation will fail. However, after Step 9 is completed, the validation will pass.

9. On the GROUP BY tab, add the order_details.ORDERID column.

10. Execute Alpha_Order_Sum_Job with the default execution properties and save all objects you have created.

11. Return to the data flow workspace and view data for the target table after the decode expression is added to confirm that order 11146 has a total value of $204,000.

A solution file called SOLUTION_DecodeFunction.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using scripts, variables, and parameters

Introduction
With the Data Services scripting language, you can assign values to variables, call functions, and use standard string and mathematical operators to transform data and manage work flow.

After completing this unit, you will be able to:

• Describe the purpose of scripts, variables, and parameters
• Explain the differences between global and local variables
• Set global variable values using properties
• Describe the purpose of substitution parameters

Defining scripts

To apply decision-making and branch logic to work flows, you will use a combination of scripts, variables, and parameters to calculate and pass information between the objects in your jobs.

A script is a single-use object that is used to call functions and assign values in a work flow.

Typically, a script is executed before data flows for initialization steps and used in conjunction with conditionals to determine execution paths. A script may also be used after work flows or data flows to record execution information such as time, or a change in the number of rows in a data set.

Use a script when you want to calculate values that will be passed on to other parts of the work flow. Use scripts to assign values to variables and execute functions.

A script can contain these statements:
• Function calls
• If statements
• While statements
• Assignment statements
• Operators
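As a minimal sketch of such a script, the lines below assign values to variables, call built-in functions, and branch on the result; the datastore name (alpha), table, and global variable names are illustrative only and assume the variables have been declared for the job:

# Record when this job run started.
$G_StartTime = sysdate();

# Count the rows in a (hypothetical) source table.
$G_RowCount = sql('alpha', 'SELECT count(*) FROM customer');

# Branch on the result and write a message to the trace log.
if ($G_RowCount > 0) print('Customer rows to process: [$G_RowCount]');
else print('No customer rows found at [$G_StartTime]');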

Defining variables

A variable is a common component in scripts that acts as a placeholder to represent values that have the potential to change each time a job is executed. To make them easy to identify in an expression, variable names start with a dollar sign ($). They can be of any datatype supported by Data Services.

You can use variables in expressions in scripts or transforms to facilitate decision making or data manipulation (using arithmetic or character substitution). For example, you can check a variable's value in a LOOP or IF statement to decide which step to perform.


Note that variables can be used to enable the same expression to be used for multiple output files. Variables can be used as file names for:
• Flat file sources and targets
• XML file sources and targets
• XML message targets (executed in the Designer in test mode)
• Document file sources and targets (in an SAP R/3 environment)
• Document message sources and targets (in an SAP R/3 environment)

In addition to scripts, you can also use variables in a catch or a conditional. A catch is part of a serial sequence called a try/catch block. The try/catch block allows you to specify alternative work flows if errors occur while Data Services is executing a job. A conditional is a single-use object available in work flows that allows you to branch the execution logic based on the results of an expression. The conditional takes the form of an if/then/else statement.

Defining parameters

A parameter is another type of placeholder that calls a variable. This call allows the value from the variable in a job or work flow to be passed to the parameter in a dependent work flow or data flow. Parameters are most commonly used in WHERE clauses.

Combining scripts, variables, and parameters

To illustrate how scripts, variables, and parameters are used together, consider an example where you start with a job, work flow, and data flow. You want the data flow to update only those records that have been created since the last time the job executed.

To accomplish this, you would start by creating a variable for the update time at the work flow level, and a parameter at the data flow level that calls the variable.

Next, you would create a script within the work flow that executes before the data flow runs. The script contains an expression that determines the most recent update time for the source table.

The script then assigns that update time value to the variable, which identifies what that value is used for and allows it to be re-used in other expressions.

Finally, in the data flow, you create an expression that uses the parameter to call the variable and find out the update time. This allows the data flow to compare the update time to the creation date of the records and identify which rows to extract from the source.
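A minimal sketch of this pattern follows; the names ($L_LastUpdate, $P_LastUpdate, target_ds, job_status, employee) are hypothetical and assume the local variable is declared on the work flow and the parameter on the data flow:

# Script in the work flow, executed before the data flow:
# read the most recent update time recorded for the source table.
$L_LastUpdate = to_date(sql('target_ds', 'SELECT max(last_update) FROM job_status'), 'YYYY-MM-DD HH24:MI:SS');

# WHERE clause in the data flow's Query transform; on the Calls tab,
# the parameter $P_LastUpdate is set to the value of $L_LastUpdate.
employee.create_date > $P_LastUpdate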

Defining global versus local variables

There are two types of variables: local and global.

Local variables are restricted to the job or work flow in which they are created. You must use parameters to pass local variables to the work flows and data flows in the object.


Global variables are also restricted to the job in which they are created. However, they do not require parameters to be passed to work flows and data flows in that job. Instead, you can reference the global variable directly in expressions in any object in that job.

Global variables can simplify your work. You can set values for global variables in script objects or using external job, execution, or schedule properties. For example, during production, you can change values for default global variables at run time from a job's schedule without having to open the job in the Designer.

Whether you use global variables or local variables and parameters depends on how and where you need to use the variables. If you need to use the variable at multiple levels of a specific job, it is recommended that you create a global variable.

However, there are implications to using global variables in work flows and data flows that are re-used in other jobs. A local variable is included as part of the definition of the work flow or data flow, and so it is portable between jobs. Because a global variable is part of the definition of the job to which the work flow or data flow belongs, it is not included when the object is re-used.

The following summarizes the types of variables and parameters you can create for each type of object.

• Job / Global variable: used by any object in the job.
• Job / Local variable: used by a script or conditional in the job.
• Work flow / Local variable: used by this work flow, or passed down to other work flows or data flows using a parameter.
• Work flow / Parameter: used by parent objects to pass local variables. Work flows may also return variables or parameters to parent objects.
• Data flow / Parameter: used in a WHERE clause, column mapping, or function in the data flow. Data flows cannot return output values.

To ensure consistency across projects and minimize troubleshooting errors, it is a best practice to use a consistent naming convention for your variables and parameters. Keep in mind that names can include any alpha or numeric character or underscores (_), but cannot contain blank spaces. To differentiate between the types of objects, start all names with a dollar sign ($), and use the following prefixes:

• Global variable: $G_
• Local variable: $L_
• Parameter: $P_

To define a global variable, local variable, or parameter

1. Select the object in the project area.
For a global variable, the object must be a job. For a local variable, it can be a job or a work flow. For a parameter, it can be a work flow or a data flow.

2. From the Tools menu, select Variables.

The Variables and Parameters dialog box appears.

3. On the Definitions tab, right-click the type of variable or parameter and select Insert from the menu.

4. Right-click the new variable or parameter and select Properties from the menu.


The Properties dialog box displays. The properties differ depending on the type of variable or parameter.

5. In the Name field, enter a unique name for the variable or parameter.

6. In the Data type drop-down list, select the datatype for the variable or parameter.

7. For parameters, in the Parameter type drop-down list, select whether the parameter is for input, output, or both.
For most applications, parameters are used for input.

8. Click OK.
You can create a relationship between a local variable and a parameter by specifying the name of the local variable as the value in the properties for the parameter on the Calls tab.

To define the relationship between a local variable and a parameter

1. Select the dependent object in the project area.

2. From the Tools menu, select Variables to open the Variables and Parameters dialog box.

3. Click the Calls tab.

Any parameters that exist in dependent objects display on the Calls tab.


4. Right-click the parameter and select Properties from the menu.

The Parameter Value dialog box appears.

5. In the Value field, enter the name of the local variable you want the parameter to call, or a constant value.
If you enter a variable, it must be of the same datatype as the parameter.

6. Click OK.


Setting global variables using job properties

In addition to setting a variable inside a job using a script, you can also set and maintain global variable values outside a job using properties. Values set outside a job are processed the same way as those set in a script. However, if you set a value for the same variable both inside and outside a job, the value from the script overrides the value from the property.

Values for global variables can be set as a job property or as an execution or schedule property.

All values defined as job properties are shown in the Properties window. By setting values outside a job, you can rely on the Properties window for viewing values that have been set for global variables, and easily edit values when testing or scheduling a job.

To set a global variable value as a job property

1. Right-click a job in the Local Object Library or project area and select Properties from the menu.
The Properties dialog box appears.

2. Click the Global Variable tab.

All global variables for the job are listed.

3. In the Value column for the global variable, enter a constant value or an expression, as required.

4. Click OK.
You can also view and edit these default values in the Execution Properties dialog of the Designer. This allows you to override job property values at run time. Data Services saves the values in the repository as job properties.

Defining substitution parameters

Substitution parameters provide a way to define parameters that have a constant value for one environment, but might need to be changed in certain situations. If a change is needed, it can be made in one location to affect all jobs. You can override the parameter for particular job executions.

The typical use case is for file locations (directory files or source/target/error files) that are constant in one environment, but will change when a job is migrated to another environment (for example, from test to production).

As with variables and parameters, the name can include any alpha or numeric character or underscores (_), but cannot contain blank spaces. Follow the same naming convention and always begin the name of a substitution parameter with double dollar signs ($$) and an S_ prefix to differentiate it from out-of-the-box substitution parameters.
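As a small illustration, a substitution parameter following this convention could hold a source directory that differs between environments; the parameter name, values, and script line below are hypothetical:

# $$S_SourceFileDir = 'C:\\test\\inbound'       (value in the test configuration)
# $$S_SourceFileDir = 'D:\\prod\\inbound'       (value in the production configuration)
# A script, file format, or transform can then reference the parameter,
# so nothing inside the job changes when it is migrated:
print('Loading flat files from [$$S_SourceFileDir]');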

Note: When exporting a job (to a file or a repository), the substitution parameter configurations (values) are not exported with it. You need to export substitution parameters via a separate command to a text file and use this text file to import them into another repository.


To create a substitution parameter configuration

1. From the Tools menu, select Substitution Parameter Configurations.

The Substitution Parameter Editor dialog box displays all pre-defined substitution parameters:

2. Double-click the header for the default configuration to change the name, and then click outside of the header to commit the change.

3. Do any of the following:
• To add a new configuration, click Create New Substitution Parameter Configuration to add a new column, enter a name for the new configuration in the header, and click outside of the header to commit the change. Enter the values of the substitution parameters as required for the new configuration.
• To add a new substitution parameter, in the Substitution Parameter column of the last line, enter the name and value for the substitution parameter.

4. Click OK.

To add a substitution parameter configuration to a system configuration

1. From the Tools menu, select System Configurations.


The System Configuration Editor dialog box displays any existing system configurations:

2. For an existing system configuration, in the Substitution Parameter drop-down list, select the substitution parameter configuration.

3. Click OK.


Using Data Services scripting language

Introduction
With Data Services scripting language, you can assign values to variables, call functions, and use standard string and mathematical operators. The syntax can be used in both expressions (such as WHERE clauses) and scripts.

After completing this unit, you will be able to:

• Explain language syntax
• Use strings and variables in Data Services scripting language

Using basic syntax

Expressions are a combination of constants, operators, functions, and variables that evaluate to a value of a given datatype. Expressions can be used inside script statements or added to data flow objects.

Data Services scripting language follows these basic syntax rules when you are creating an expression:
• Each statement ends with a semicolon (;).
• Variable names start with a dollar sign ($).
• String values are enclosed in single quotation marks (').
• Comments start with a pound sign (#).
• Function calls always specify parameters, even if they do not use parameters.
• Square brackets substitute the value of the expression. For example:

Print('The value of the start date is:[sysdate()+5]');

• Curly brackets quote the value of the expression in single quotation marks. For example:

$StartDate = sql('demo_target', 'SELECT ExtractHigh FROM Job_Execution_Status WHERE JobName = {$JobName}');

Using syntax for column and table references in expressions

Because expressions can be used inside data flow objects, they often contain column names.

The Data Services scripting language recognizes column and table names without special syntax. For example, you can indicate the start_date column as the input to a function in the Mapping tab of a query as:

to_char(start_date, 'dd.mm.yyyy')

The column start_date must be in the input schema of the query.


If there is more than one column with the same name in the input schema of a query, indicate which column is included in an expression by qualifying the column name with the table name. For example, indicate the column start_date in the table status as:

status.start_date

Column and table names that are part of SQL strings may require special syntax based on the RDBMS that evaluates the SQL. For example, select all rows from the LAST_NAME column of the CUSTOMER table as:

sql('oracle_ds','select CUSTOMER.LAST_NAME from CUSTOMER')

Using operators

The operators you can use in expressions are listed in the following table in order of precedence. Note that when operations are pushed down to an RDBMS, the precedence is determined by the rules of the RDBMS.

Operator       Description
+              Addition
-              Subtraction
*              Multiplication
/              Division
=              Comparison, equals
<              Comparison, is less than
<=             Comparison, is less than or equal to
>              Comparison, is greater than
>=             Comparison, is greater than or equal to
!=             Comparison, is not equal to
||             Concatenate
AND            Logical AND
OR             Logical OR
NOT            Logical NOT
IS NULL        Comparison, is a NULL value
IS NOT NULL    Comparison, is not a NULL value

Reviewing script examples

Example 1

$language = 'E';

$start_date = '1994.01.01';

$end_date = '1998.01.31';

Example 2

$start_time_str = sql('tutorial_ds', 'select to_char(start_time,\'YYYY-MM-DD HH24:MI:SS\')');
$end_time_str = sql('tutorial_ds', 'select to_char(max(last_update),\'YYYY-MM-DD HH24:MI:SS\')');
$start_time = to_date($start_time_str, 'YYYY-MM-DD HH24:MI:SS');
$end_time = to_date($end_time_str, 'YYYY-MM-DD HH24:MI:SS');

Example 3

$end_time_str = sql('tutorial_ds', 'select to_char(end_time,\'YYYY-MM-DD HH24:MI:SS\')');

if (($end_time_str IS NULL) or ($end_time_str = '')) $recovery_needed = 1;
else $recovery_needed = 0;

Using strings and variables

Special care must be given to the handling of strings. Quotation marks, escape characters, and trailing blanks can all have an adverse effect on your script if used incorrectly.


Using quotation marks

The type of quotation marks to use in strings depends on whether you are using identifiers or constants. An identifier is the name of an object (for example, a table, column, data flow, or function). A constant is a fixed value used in computation. There are two types of constants:
• String constants (for example, 'Hello' or '2007.01.23')
• Numeric constants (for example, 2.14)

Identifiers need quotation marks if they contain special (non-alphanumeric) characters. For example, you need double quotes for the following because it contains blanks:

"compute large numbers"

Use single quotes for string constants.

Using escape characters

If a constant contains a single quote (') or backslash (\) or another special character used by the Data Services scripting language, then those characters must be preceded by an escape character to be evaluated properly in a string. Data Services uses the backslash (\) as the escape character.

Character           Example
Single quote (')    'World\'s Books'
Backslash (\)       'C:\\temp'

Handling nulls, empty strings, and trailing blanks

To conform to the ANSI VARCHAR standard when dealing with NULLs, empty strings, and trailing blanks, Data Services:
• Treats an empty string as a zero-length varchar value, instead of as a NULL value.
• Returns a value of FALSE when you use the operators Equal (=) and Not Equal (<>) to compare to a NULL value.
• Provides IS NULL and IS NOT NULL operators to test for NULL values.
• Treats trailing blanks as regular characters when reading from all sources, instead of trimming them.
• Ignores trailing blanks in comparisons in transforms (Query and Table Comparison) and functions (decode, ifthenelse, lookup, lookup_ext, lookup_seq).

NULL values

To represent NULL values in expressions, type the word NULL. For example, you can check whether a column (COLX) is null or not with the following expressions:

COLX IS NULL


COLX IS NOT NULL

Data Services does not check for NULL values in data columns. Use the function nvl to remove NULL values. For more information on the nvl function, see "Functions and Procedures", Chapter 6 in the Data Services Reference Guide.

NULL values and empty strings

Data Services uses the following two rules with empty strings:
• When you assign an empty string to a variable, Data Services treats the value of the variable as a zero-length string. An error results if you assign an empty string to a variable that is not a varchar. To assign a NULL value to a variable of any type, use the NULL constant.
• As a constant (''), Data Services treats the empty string as a varchar value of zero length.

Use the NULL constant for the null value.
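As a short sketch of the distinction, using a hypothetical variable (assumed to be declared as a varchar):

# Assigning an empty string yields a zero-length varchar, not NULL.
$L_Comment = '';

# Assigning the NULL constant yields a NULL value (works for a variable of any datatype).
$L_Comment = NULL;

# Only IS NULL detects the NULL value; a comparison such as ($L_Comment = '')
# would evaluate to FALSE here.
if ($L_Comment IS NULL) print('Comment is NULL');
else print('Comment is a zero-length string');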

Data Services uses the following three rules with NULLS and empty strings in conditionals:

Rule 1

The Equals (=) and Is Not Equal to (<>) comparison operators against a NULL value always evaluate to FALSE. This FALSE result includes comparing a variable that has a value of NULL against a NULL constant.

The following table shows the comparison results for the variable assignments $var1 = NULL and $var2 = NULL:

Condition               Translates to                           Returns
If (NULL = NULL)        NULL is equal to NULL                   FALSE
If (NULL != NULL)       NULL is not equal to NULL               FALSE
If (NULL = '')          NULL is equal to empty string           FALSE
If (NULL != '')         NULL is not equal to empty string       FALSE
If ('bbb' = NULL)       bbb is equal to NULL                    FALSE
If ('bbb' != NULL)      bbb is not equal to NULL                FALSE
If ('bbb' = '')         bbb is equal to empty string            FALSE
If ('bbb' != '')        bbb is not equal to empty string        TRUE
If ($var1 = NULL)       NULL is equal to NULL                   FALSE
If ($var1 != NULL)      NULL is not equal to NULL               FALSE
If ($var1 = '')         NULL is equal to empty string           FALSE
If ($var1 != '')        NULL is not equal to empty string       FALSE
If ($var1 = $var2)      NULL is equal to NULL                   FALSE
If ($var1 != $var2)     NULL is not equal to NULL               FALSE

The following table shows the comparison results for the variable assignments $var1 = '' and $var2 = '':

Condition               Translates to                                   Returns
If ($var1 = NULL)       Empty string is equal to NULL                   FALSE
If ($var1 != NULL)      Empty string is not equal to NULL               FALSE
If ($var1 = '')         Empty string is equal to empty string           TRUE
If ($var1 != '')        Empty string is not equal to empty string       FALSE
If ($var1 = $var2)      Empty string is equal to empty string           TRUE
If ($var1 != $var2)     Empty string is not equal to empty string       FALSE

Rule 2

Use the IS NULL and IS NOT NULL operators to test the presence of NULL values. For example, assuming a variable assignment $var1 = NULL:

Condition                   Translates to               Returns
If ('bbb' IS NULL)          bbb is NULL                 FALSE
If ('bbb' IS NOT NULL)      bbb is not NULL             TRUE
If ('' IS NULL)             Empty string is NULL        FALSE
If ('' IS NOT NULL)         Empty string is not NULL    TRUE
If ($var1 IS NULL)          NULL is NULL                TRUE
If ($var1 IS NOT NULL)      NULL is not NULL            FALSE

Rule 3

When comparing two variables, always test for NULL. In this scenario, you are not testing a variable with a value of NULL against a NULL constant (as in the first rule). Either test each variable and branch accordingly, or test in the conditional as shown in the second row of the following table.

Condition: If ($var1 = $var2)
Recommendation: Do not compare without explicitly testing for NULLs. This logic is not recommended because any relational comparison to a NULL value returns FALSE.

Condition: If ((($var1 IS NULL) AND ($var2 IS NULL)) OR ($var1 = $var2))
Recommendation: Executes the TRUE branch if both $var1 and $var2 are NULL, or if neither is NULL but they are equal to each other.
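Put together as a script fragment, the recommended pattern might look like this; the variable names are hypothetical and assumed to be declared as local variables:

# Treat the two variables as matching when both are NULL, or when
# neither is NULL and they hold the same value.
if ((($L_OldKey IS NULL) AND ($L_NewKey IS NULL)) OR ($L_OldKey = $L_NewKey)) print('Keys match');
else print('Keys differ');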


Scripting a custom function

Introduction
If the built-in functions that are provided by Data Services do not meet your requirements, you can create your own custom functions using the Data Services scripting language.

After completing this unit, you will be able to:

• Create a custom function
• Import a stored procedure to use as a custom function

Creating a custom function

You can create your own functions by writing script functions in the Data Services scripting language using the Smart Editor. Saved custom functions appear in the Function wizard and the Smart Editor under the Custom Functions category, and are also displayed on the Custom Functions tab of the Local Object Library. You can edit and delete custom functions from the Local Object Library.

Consider these guidelines when you create your own functions:
• Functions can call other functions.
• Functions cannot call themselves.
• Functions cannot participate in a cycle of recursive calls. For example, function A cannot call function B if function B calls function A.
• Functions return a value.
• Functions can have parameters for input, output, or both. However, data flows cannot pass parameters of type output or input/output.

Before creating a custom function, you must know the input, output, and return values and their datatypes. The return value is predefined to be Return.
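As a minimal sketch of a custom function body, once the Return value and an input parameter have been defined in the Smart Editor, the logic might look like the following; the parameter name and rule are illustrative only and follow the Return syntax used in the activity later in this lesson:

# Flag discounts above 50 percent as suspect; otherwise accept them.
if ($P_Discount > 0.5) Return 0;
else Return 1;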

To create a custom function

1. On the Custom Functions tab of the Local Object Library, right-click the white space and select New from the menu.


The Custom Function dialog box displays.

2. In the Function name field, enter a unique name for the new function.

3. In the Description field, enter a description.

4. Click Next.
The Smart Editor enables you to define the return type, parameter list, and any variables to be used in the function.


5. On the Variables tab, expand the Parameters branch.

6. Right-click Return and select Properties from the menu.
The Return value Properties dialog box displays.

7. In the Data type drop-down list, select the datatype you want to return for the custom function.
By default, the return datatype is set to integer.


8. Click OK.

9. To define a new variable or parameter for your custom function, in the Variables tab, right-click the appropriate branch and select Insert from the menu.

10. In the Name field, enter a unique name for the variable or parameter.

11. In the Data type drop-down list, select the datatype for the variable or parameter.

12. For a parameter, in the Parameter type drop-down list, select whether the parameter is for input, output, or both.
Data Services data flows cannot pass variable parameters of type output and input/output.

13. Click OK.

14. Repeat step 9 to step 13 for each variable or parameter required in your function.
When adding subsequent variables or parameters, the right-click menu will include options to Insert Above or Insert Below. Use these menu commands to create, delete, or edit variables or parameters.

15. In the main area of the Smart Editor, enter the expression for your function.
Your expression must include the Return parameter.

16. Click Validate to check the syntax of your function.
If your function contains syntax errors, Data Services displays a list of those errors in an embedded pane below the editor. To see where the error occurs in the text, double-click an error. The Smart Editor redraws to show the location of the error.

17. Click OK.


To edit a custom function

1. On the Custom Functions tab of the Local Object Library, right-click the custom function and select Edit from the menu.

2. In the Smart Editor, change the expression as required.

3. Click OK.

To delete a custom function

1. On the Custom Functions tab of the Local Object Library, right-click the custom function and select Delete from the menu.

2. Click OK to confirm the deletion.

Importing a stored procedure as a function

If you are using Microsoft SQL Server, you can use stored procedures to insert, update, and delete data in your tables. To use stored procedures in Data Services, you must import them as custom functions.

To import a stored procedure

1. On the Datastores tab of the Local Object Library, expand the datastore that contains the stored procedure.

2. Right-click Functions and select Import By Name from the menu.

The Import By Name dialog box displays.

3. In the Type drop-down list, select Function.

4. In the Name field, enter the name of the stored procedure.

5. Click OK.


Activity: Creating a custom function

The Marketing department would like to send special offers to customers who have placed a specified number of orders. This requires creating a custom function that must be able to be called in a real-time job as a customer's order is entered into the system.

Objectives

• Create a custom function to accept the input parameters of the Customer ID and the number of orders required to receive a special offer, check the Orders table, and return a value of 1 or 0.

• Create a batch job using the custom function to create an initial list of customers who have placed more than five orders, and are therefore eligible to receive the special offer.

Instructions

1. In the Local Object Library, create a new custom function called CF_MarketingOffer.

2. In the Smart Editor for the function, create a parameter called $P_CustomerID with a data type of varchar, a parameter type of Input, and a length of 10.

3. Create a second parameter called $P_Orders with a data type of int and a parameter type of Input.

4. Define the custom function as a conditional clause that specifies that, if the number of rows in the Orders table for the Customer ID is greater than or equal to the $P_Orders value, then the function should return 1; otherwise, it should return 0.

The syntax should be as follows:

if ((sql('alpha', 'select count(*) from orders where customerid = [$P_CustomerID]')) >= $P_Orders)
Return 1;
else Return 0;

5. In the Omega project, create a new batch job called Alpha_Marketing_Offer_Job with a data flow called Alpha_Marketing_Offer_DF.

6. Create a new global variable for the job called $G_Num_to_Qual with a datatype of int.

7. In the job workspace, to the left of the data flow, create a new script called CheckOrders and create an expression in the script to define the global variable as five orders to qualify.

The expression should be:

$G_Num_to_Qual = 5;

8. Connect the script to the data flow.

9. In the data flow workspace, add the Customer table from the Alpha datastore as the source object.

10. Add a template table to the Delta datastore called offer_mailing_list as the target object.

11. Add two Query transforms and connect all objects.


12. In the transform editor for the first Query transform, map the following columns:

Schema In         Schema Out
CONTACTNAME       CONTACTNAME
ADDRESS           ADDRESS
CITY              CITY
POSTALCODE        POSTALCODE

13. Create a new output column called OFFER_STATUS with a datatype of int.

14. On the Mapping tab, map the new output column to the custom function using the Function wizard. Specify the CUSTOMERID column for $P_CustomerID and the global variable for $P_Orders.

The expression should be as follows:

CF_MarketingOffer(customer.CUSTOMERID, $G_Num_to_Qual)

15. In the transform editor for the second Query transform, map the following columns:

Schema In         Schema Out
CONTACTNAME       CONTACTNAME
ADDRESS           ADDRESS
CITY              CITY
POSTALCODE        POSTALCODE

16. On the WHERE tab, create an expression to select only those records where the OFFER_STATUS value is 1.

The expression should be:

Query.OFFER_STATUS = 1

17. Execute Alpha_Marketing_Offer_Job with the default execution properties and save all objects you have created.

18. Return to the data flow workspace and view data for the target table.
You should have one output record for contact Lev M. Melton in Quebec.


A solution file called SOLUTION_CustomFunction.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Quiz: Using functions, scripts, and variables

1. Describe the differences between a function and a transform.

2. Why are functions used in expressions?

3. What does a lookup function do? How do the different variations of the lookup function differ?

4. What value would the lookup_ext function return if multiple matching records were found on the translate table?

5. Explain the differences between a variable and a parameter.

6. When would you use a global variable instead of a local variable?

7. What is the recommended naming convention for variables in Data Services?

8. Which object would you use to define a value that is constant in one environment, but may change when a job is migrated to another environment?

a. Global variable

b. Local variable

c. Parameter

d. Substitution parameter


Lesson summary

After completing this lesson, you are now able to:

• Define built-in functions
• Use functions in expressions
• Use the lookup function
• Use the decode function
• Use variables and parameters
• Use Data Services scripting language
• Script a custom function


Lesson 6
Using Platform Transforms

Lesson introduction
A transform enables you to control how data sets change in a data flow.

After completing this lesson, you will be able to:

• Describe platform transforms
• Use the Map Operation transform
• Use the Validation transform
• Use the Merge transform
• Use the Case transform
• Use the SQL transform


Describing platform transforms

Introduction
Transforms are optional objects in a data flow that allow you to transform your data as it moves from source to target.

After completing this unit, you will be able to:

• Explain transforms
• Describe the platform transforms available in Data Services
• Add a transform to a data flow
• Describe the Transform Editor window

Explaining transforms

Transforms are objects in data flows that operate on input data sets by changing them or by generating one or more new data sets. The Query transform is the most commonly used transform.

Transforms are added as components to your data flow in the same way as source and target objects. Each transform provides different options that you can specify based on the transform's function. You can choose to edit the input data, output data, and parameters in a transform.

Some transforms, such as the Date Generation and SQL transforms, can be used as source objects, in which case they do not have input options.

Transforms are often used in combination to create the output data set. For example, the Table Comparison, History Preserving, and Key Generation transforms are used for slowly changing dimensions.

Transforms are similar to functions in that they can produce the same or similar values during processing. However, transforms and functions operate on a different scale:
• Functions operate on single values, such as values in specific columns in a data set.
• Transforms operate on data sets by creating, updating, and deleting rows of data.


Describing platform transforms

The following platform transforms are available on the Transforms tab of the Local Object Library:

Case: Divides the data from an input data set into multiple output data sets based on IF-THEN-ELSE branch logic.

Map Operation: Allows conversions between operation codes.

Merge: Unifies rows from two or more input data sets into a single output data set.

Query: Retrieves a data set that satisfies conditions that you specify. A Query transform is similar to a SQL SELECT statement.

Row Generation: Generates a column filled with integers starting at zero and incrementing by one to the end value you specify.

SQL: Performs the indicated SQL query operation.

Validation: Allows you to specify validation criteria for an input data set. Data that fails validation can be filtered out or replaced. You can have one validation rule per column.


Using the Map Operation transform

Introduction
The Map Operation transform enables you to change the operation code for records.

After completing this unit, you will be able to:

• Describe map operations
• Use the Map Operation transform

Describing map operations

Data Services maintains operation codes that describe the status of each row in each data set described by the inputs to and outputs from objects in data flows. The operation codes indicate how each row in the data set would be applied to a target table if the data set were loaded into a target. The operation codes are as follows:

NORMAL: Creates a new row in the target. All rows in a data set are flagged as NORMAL when they are extracted by a source table or file. If a row is flagged as NORMAL when loaded into a target table or file, it is inserted as a new row in the target. Most transforms operate only on rows flagged as NORMAL.

INSERT: Creates a new row in the target. Only History Preserving and Key Generation transforms can accept data sets with rows flagged as INSERT as input.

DELETE: Is ignored by the target. Rows flagged as DELETE are not loaded. Only the History Preserving transform, with the Preserve delete row(s) as update row(s) option selected, can accept data sets with rows flagged as DELETE.

UPDATE: Overwrites an existing row in the target table. Only History Preserving and Key Generation transforms can accept data sets with rows flagged as UPDATE as input.


Explaining the Map Operation transform

The Map Operation transform allows you to change operation codes on data sets to produce the desired output. For example, if a row in the input data set has been updated in some previous operation in the data flow, you can use this transform to map the UPDATE operation to an INSERT. The result could be to convert UPDATE rows to INSERT rows to preserve the existing row in the target.

Data Services can push Map Operation transforms to the source database.

The next section gives a brief description of the function, data input requirements, options, and data output results for the Map Operation transform. For more information on the Map Operation transform, see "Transforms", Chapter 5 in the Data Services Reference Guide.

Inputs/Outputs

Input for the Map Operation transform is a data set with rows flagged with any operation codes. It can contain hierarchical data.

Use caution when using columns of datatype real in this transform, because comparison results are unpredictable for this datatype.

Output for the Map Operation transform is a data set with rows flagged as specified by the mapping operations.

Options

The Map Operation transform enables you to set the Output row type option to indicate the new operations desired for the input data set. Choose from the following operation codes: INSERT, UPDATE, DELETE, NORMAL, or DISCARD.


Activity: Using the Map Operation transform

End users of employee reports have requested that employee records in the data mart containonly current employees.

Objective

• Use the Map Operation transform to remove any employee records that have a value in thedischarge_date column.

Instructions

1. In the Omega project, create a new batch job called Alpha_Employees_Current_Jobwith adata flow called Alpha_Employees_Current_DF.

2. In the data flowworkspace, add the Employee table from the Alpha datastore as the sourceobject.

3. Add the Employee table from the HR_datamart datastore as the target object.

4. Add the Query transform to the workspace and connect all objects.

5. In the transform editor for the Query transform, map all columns from the input schema tothe same column in the output schema.

6. On the WHERE tab, create an expression to select only those rows where discharge_date isnot empty.

The expression should be:

employee.discharge_date is not null

7. In the data flow workspace, disconnect the Query transform from the target table.

8. Add a Map Operation transform between the Query transform and the target table and connect it to both.

9. In the transform editor for the Map Operation transform, change the settings so that rows with an input operation code of NORMAL have an output operation code of DELETE.

10. Execute Alpha_Employees_Current_Job with the default execution properties and save all objects you have created.

11. Return to the data flow workspace and view data for both the source and target tables. Note that two rows were filtered from the target table.

A solution file called SOLUTION_MapOperation.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using the Validation transform

Introduction
The Validation transform enables you to create validation rules and move data into target objects based on whether they pass or fail validation.

After completing this unit, you will be able to:

• Use the Validation transform

Explaining the Validation transform

Use the Validation transform in your data flows when you want to ensure that the data at any stage in the data flow meets your criteria.

For example, you can set the transform to ensure that all values:
• Are within a specific range
• Have the same format
• Do not contain NULL values

The Validation transform allows you to define a re-usable business rule to validate each record and column. The Validation transform qualifies a data set based on rules for input schema columns. It filters out or replaces data that fails your criteria. The available outputs are pass and fail. You can have one validation rule per column.

For example, if you want to load only sales records for October 2007, you would set up a validation rule that states: Sales Date is between 10/1/2007 and 10/31/2007. Data Services looks at this date field in each record to validate whether the data meets this requirement. If it does not, you can choose to pass the record into a Fail table, correct it in the Pass table, or do both.


Your validation rule consists of a condition and an action on failure:
• Use the condition to describe what you want for your valid data.

For example, specify the condition IS NOT NULL if you do not want any NULLs in data passed to the specified target.

• Use the Action on Failure area to describe what happens to invalid or failed data.

Continuing with the example above, for any NULL values, you may want to select the Send to Fail option to send all NULL values to a specified FAILED target table.
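As a minimal sketch, the October 2007 example earlier in this unit might be configured as follows in the Validation transform editor. The column name SALESDATE is hypothetical, and the date values use the YYYY.MM.DD entry format described in the Options table later in this unit:

Validation column:   SALESDATE
Enable Validation:   selected
Condition:           Between 2007.10.01 and 2007.10.31
Action on Failure:   Send to Fail

With this rule, any record whose SALESDATE falls outside October 2007 is routed to the Fail output rather than the Pass output.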

You can also create a custom Validation function and select it when you create a validation rule. For more information on creating custom Validation functions, see "Validation Transform", Chapter 12 in the Data Services Reference Guide.

The next section gives a brief description of the function, data input requirements, options, and data output results for the Validation transform. For more information on the Validation transform, see "Transforms", Chapter 5 in the Data Services Reference Guide.

Input/Output

Only one source is allowed as a data input for the Validation transform.

The Validation transform outputs up to two different data sets, based on whether the records pass or fail the validation condition you specify. You can load pass and fail data into multiple targets.

The Pass output schema is identical to the input schema. Data Services adds the following two columns to the Fail output schema:
• The DI_ERRORACTION column indicates where failed data was sent:
  ○ The letter B is used for rows sent to both the Pass and Fail outputs.
  ○ The letter F is used for rows sent only to the Fail output.

If you choose to send failed data to the Pass output, Data Services does not track the results. You may want to substitute a value for failed data that you send to the Pass output, because Data Services does not add columns to the Pass output.

• The DI_ERRORCOLUMNS column displays all error messages for columns with failed rules. The names of input columns associated with each message are separated by colons. For example, "<ValidationTransformName> failed rule(s): c1:c2".

If a row has conditions set for multiple columns and the Pass, Fail, and Both actions are specified for the row, then the precedence order is Fail, Both, Pass. For example, if one column's action is Send to Fail and the column fails, then the whole row is sent only to the Fail output. Other actions for other validation columns in the row are ignored.

Options

When you use the Validation transform, you select a column in the input schema and create a validation rule in the Validation transform editor. The Validation transform offers several options for creating this validation rule:


Enable Validation: Turn the validation rule on and off for the column.

Do not validate when NULL: Send all NULL values to the Pass output automatically. Data Services will not apply the validation rule on this column when an incoming value for it is NULL.

Condition: Define the condition for the validation rule:
• Operator: select an operator for a Boolean expression (for example, =, <, >) and enter the associated value.
• In: specify a list of possible values for a column.
• Between/and: specify a range of values for a column.
• Match pattern: enter a pattern of uppercase and lowercase alphanumeric characters to ensure the format of the column is correct.
• Custom validation function: select a function from a list for validation purposes. Data Services supports Validation functions that take one parameter and return an integer datatype. If the return value is not zero, Data Services processes it as TRUE.
• Exists in table: specify that a column's value must exist in a column in another table. This option also uses the LOOKUP_EXT function. You can define the NOT NULL constraint for the column in the LOOKUP table to ensure the Exists in table condition executes properly.
• Custom condition: create more complex expressions using the function and smart editors.
Data Services converts substitute values in the condition to a corresponding column datatype: integer, decimal, varchar, date, datetime, timestamp, or time.


The Validation transform requires that you enter some values in specific formats:
• date (YYYY.MM.DD)
• datetime (YYYY.MM.DD HH24:MI:SS)
• time (HH24:MI:SS)
• timestamp (YYYY.MM.DD HH24:MI:SS.FF)
If, for example, you specify a date as 12-01-2004, Data Services produces an error because you must enter this date as 2004.12.01.

Action on Fail: Define where a record is loaded if it fails the validation rule:
• Send to Fail
• Send to Pass
• Send to both
If you choose Send to Pass or Send to Both, you can choose to substitute a value or expression for the failed values that are sent to the Pass output.

To create a validation rule

1. Open the data flow workspace.

2. Add your source object to the workspace.

3. On the Transforms tab of the Local Object Library, click and drag the Validation transform to the workspace, to the right of your source object.

4. Add your target objects to the workspace. You will require one target object for records that pass validation, and an optional target object for records that fail validation, depending on the options you select.

5. Connect the source object to the transform.

6. Double-click the Validation transform to open the transform editor.


7. In the input schema area, click to select an input schema column.

8. In the parameters area, select the Enable Validation option.

9. In the Condition area, select a condition type and enter any associated value required. All conditions must be Boolean expressions.

10. On the Properties tab, enter a name and description for the validation rule.

11. On the Action On Failure tab, select an action.

12. If desired, select the For pass, substitute with option and enter a substitute value or expression for the failed value that is sent to the Pass output. This option is only available if you select Send to Pass or Send to Both.

13. Click Back to return to the data flow workspace.

14. Click and drag from the transform to the target object.

15. Release the mouse and select the appropriate label for that object from the pop-up menu.


16. Repeat step 14 and step 15 for all target objects.

Activity: Using the Validation transform

Order data is stored in multiple formats with different structures and different information. You will use the Validation transform to validate order data from the flat file sources and the Alpha Orders table before merging it.

Objectives

• Join the data in the Orders flat files with that in the Order_Shippers flat files.
• Create a column on the target table for employee information so that orders taken by employees who are no longer with the company are assigned to a default current employee, using the Validation transform in a new column named order_assigned_to.
• Create a column to hold the employee ID of the employee who originally made the sale.
• Replace NULL values in the shipper fax column with a value of 'No Fax' and send those rows to a separate table for follow-up.

Instructions

1. Create a file format called Order_Shippers_Format for the flat file Order_Shippers_04_20_07.txt. Use the structure of the text file to determine the appropriate settings.

2. In the Column Attributes pane, adjust the datatypes for the columns based on their content:

Column               Datatype
ORDERID              int
SHIPPERNAME          varchar(50)
SHIPPERADDRESS       varchar(50)
SHIPPERCITY          varchar(50)
SHIPPERCOUNTRY       int
SHIPPERPHONE         varchar(20)
SHIPPERFAX           varchar(20)
SHIPPERREGION        int


SHIPPERPOSTALCODE    varchar(15)

3. In the Omega project, create a new batch job called Alpha_Orders_Validated_Job and two data flows, one named Alpha_Orders_Files_DF and the second named Alpha_Orders_DB_DF.

4. Add the file formats Orders_Format and Order_Shippers_Format as source objects to the Alpha_Orders_Files_DF data flow workspace.

5. Edit the source objects so that the Orders_Format source is using all three related orders flat files and the Order_Shippers_Format source is using all three order shippers files.

Tip: You can use a wildcard to replace the dates in the file names.

6. Edit the Orders_Format source object to change the Capture Data Conversion Errors option to Yes.

7. If necessary, edit the source objects to point to the file on the Job Server. If the Job Server is on a different machine than Designer, this step is required.

8. In the Location drop-down list, select Job Server.

9. In the Root directory, enter the correct path. The instructor will provide this information.

10. Add a Query transform to the workspace and connect it to the two source objects.

11. In the transform editor for the Query transform, create a WHERE clause to join the data on the OrderID values.

The expression should be as follows:

Order_Shippers_Format.ORDERID = Orders_Format.ORDERID

12. Add the following mappings in the Query transform:

Schema Out           Mapping
ORDERID              Orders_Format.ORDERID
CUSTOMERID           Orders_Format.CUSTOMERID
ORDERDATE            Orders_Format.ORDERDATE
SHIPPERNAME          Order_Shippers_Format.SHIPPERNAME
SHIPPERADDRESS       Order_Shippers_Format.SHIPPERADDRESS


SHIPPERCITY          Order_Shippers_Format.SHIPPERCITY
SHIPPERCOUNTRY       Order_Shippers_Format.SHIPPERCOUNTRY
SHIPPERPHONE         Order_Shippers_Format.SHIPPERPHONE
SHIPPERFAX           Order_Shippers_Format.SHIPPERFAX
SHIPPERREGION        Order_Shippers_Format.SHIPPERREGION
SHIPPERPOSTALCODE    Order_Shippers_Format.SHIPPERPOSTALCODE

13. Insert a new output column above ORDERDATE called ORDER_TAKEN_BY with a datatype of varchar(15) and map it to Orders_Format.EMPLOYEEID.

14. Insert a new output column above ORDERDATE called ORDER_ASSIGNED_TO with a datatype of varchar(15) and map it to Orders_Format.EMPLOYEEID.

15. Add a Validation transform to the right of the Query transform and connect the transforms.

16. In the transform editor for the Validation transform, enable validation for the ORDER_ASSIGNED_TO column to verify that the value in the column exists in the EMPLOYEEID column of the Employee table in the HR_datamart datastore.

The expression should be as follows:

HR_DATAMART.DBO.EMPLOYEE.EMPLOYEEID

17. Set the action on failure for the ORDER_ASSIGNED_TO column to send to both pass and fail. For pass, substitute '3Cla5' to assign it to the default employee.

18. Enable validation for the SHIPPERFAX column to send NULL values to both pass and fail, substituting 'No Fax' for pass.

19. Add two target tables in the Delta datastore as targets, one called Orders_Files_Work and one called Orders_Files_No_Fax.

20. Connect the pass output from the Validation transform to Orders_Files_Work and the fail output to Orders_Files_No_Fax.

21. In the Alpha_Orders_DB_DF workspace, add the Orders table from the Alpha datastore as the source object.

22. Add a Query transform to the workspace and connect it to the source.

23. In the transform editor for the Query transform, map all of the columns from the input schema to the output schema, except the EMPLOYEEID column.


24. Change the names of the following Schema Out columns:

Old column name      New column name
SHIPPERCITYID        SHIPPERCITY
SHIPPERCOUNTRYID     SHIPPERCOUNTRY
SHIPPERREGIONID      SHIPPERREGION

25. Insert a new output column above ORDERDATE called ORDER_TAKEN_BY with a datatype of varchar(15) and map it to Orders.EMPLOYEEID.

26. Insert a new output column above ORDERDATE called ORDER_ASSIGNED_TO with a datatype of varchar(15) and map it to Orders.EMPLOYEEID.

27. Add a Validation transform to the right of the Query transform and connect the transforms.

28. Enable validation for ORDER_ASSIGNED_TO to verify that the column value exists in the EMPLOYEEID column of the Employee table in the HR_datamart datastore.

29. Set the action on failure for the ORDER_ASSIGNED_TO column to send to both pass and fail. For pass, substitute '3Cla5' to assign it to the default employee.

30. Enable validation for the SHIPPERFAX column to send NULL values to both pass and fail, substituting 'No Fax' for pass.

31. Add two target tables in the Delta datastore as targets, one named Orders_DB_Work and one named Orders_DB_No_Fax.

32. Connect the pass output from the Validation transform to Orders_DB_Work and the fail output to Orders_DB_No_Fax.

33. Execute Alpha_Orders_Validated_Job with the default execution properties and save all objects you have created.

34. View the data in the target tables to see the differences between passing and failing records.

A solution file called SOLUTION_Validation.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using the Merge transform

Introduction
The Merge transform allows you to combine multiple sources with the same schema into a single target.

After completing this unit, you will be able to:

• Use the Merge transform

Explaining the Merge transform

The Merge transform combines incoming data sets with the same schema structure to produce a single output data set with the same schema as the input data sets.

For example, you could use the Merge transform to combine two sets of address data:
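As a simple illustration (the source and column names here are hypothetical), two address sources with identical schemas merge into a single output data set:

Source 1 (NA_Customers):   CUSTOMERID, ADDRESS, CITY, REGION, POSTALCODE
Source 2 (EU_Customers):   CUSTOMERID, ADDRESS, CITY, REGION, POSTALCODE
Merge output:              CUSTOMERID, ADDRESS, CITY, REGION, POSTALCODE, containing every row from both sources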

The next section gives a brief description of the function, data input requirements, options, and data output results for the Merge transform. For more information on the Merge transform, see "Transforms", Chapter 5 in the Data Services Reference Guide.

Input/Output

The Merge transform performs a union of the sources. All sources must have the same schema, including:
• Number of columns
• Column names
• Column datatypes

If the input data set contains hierarchical data, the names and datatypes must match at every level of the hierarchy.


The output data has the same schema as the source data. The output data set contains a row for every row in the source data sets. The transform does not strip out duplicate rows. If columns in the input set contain nested schemas, the nested data is passed through without change.

Tip: If you want to merge tables that do not have the same schema, you can add the Query transform to one of the tables before the Merge transform to redefine the schema to match the other table.

Options

The Merge transform does not offer any options.

Activity: Using the Merge transform

The Orders data has now been validated, but the output is from two different sources: flat files and database tables. The next step in the process is to modify the structure of those data sets so they match, and then merge them into a single data set.

Objectives

• Use the Query transforms to modify any column names and data types and to perform lookups for any columns that reference other tables.

• Use the Merge transform to merge the validated orders data.

Instructions

1. In the Omega project, create a new batch job called Alpha_Orders_Merged_Job with a data flow called Alpha_Orders_Merged_DF.

2. In the data flow workspace, add the orders_files_work and orders_db_work tables from the Delta datastore as the source objects.

3. Add two Query transforms to the data flow, connecting each source object to its own Query transform.

4. In the transform editor for the Query transform connected to the orders_files_work table, map all columns from input to output.

5. Change the datatype for the following Schema Out columns as specified:

Column               Type
ORDERDATE            datetime
SHIPPERADDRESS       varchar(100)
SHIPPERCOUNTRY       varchar(50)
SHIPPERREGION        varchar(50)


SHIPPERPOSTALCODE    varchar(50)

6. For the SHIPPERCOUNTRY column, change the mapping to perform a lookup of CountryName from the Country table in the Alpha datastore.

The expression should be as follows:

lookup_ext([ALPHA.SOURCE.COUNTRY,'PRE_LOAD_CACHE','MAX'],

[COUNTRYNAME],[NULL],[COUNTRYID,'=',ORDERS_FILE_WORK.SHIPPERCOUNTRY]) SET
("run_as_separate_process"='no', "output_cols_info"='<?xml version="1.0"

encoding="UTF-8"?><output_cols_info><col index="1"

expression="no"/></output_cols_info>')

7. For the SHIPPERREGION column, change the mapping to perform a lookup of RegionName from the Region table in the Alpha datastore.

The expression should be as follows:

lookup_ext([ALPHA.SOURCE.REGION,'PRE_LOAD_CACHE','MAX'],

[REGIONNAME],[NULL],[REGIONID,'=',ORDERS_FILE_WORK.SHIPPERREGION]) SET
("run_as_separate_process"='no', "output_cols_info"='<?xml version="1.0"

encoding="UTF-8"?><output_cols_info><col index="1"

expression="no"/></output_cols_info>')

8. In the transform editor for the Query transform connected to the orders_db_work table, map all columns from input to output.

9. Change the datatype for the following Schema Out columns as specified:

Column               Type
ORDER_TAKEN_BY       varchar(15)
ORDER_ASSIGNED_TO    varchar(15)
SHIPPERCITY          varchar(50)
SHIPPERCOUNTRY       varchar(50)
SHIPPERREGION        varchar(50)

10. For the SHIPPERCITY column, change the mapping to perform a lookup of CityName from the City table in the Alpha datastore.


The expression should be as follows:

lookup_ext([ALPHA.SOURCE.CITY,'PRE_LOAD_CACHE','MAX'],

[CITYNAME],[NULL],[CITYID,'=',ORDERS_DB_WORK.SHIPPERCITY]) SET
("run_as_separate_process"='no', "output_cols_info"='<?xml version="1.0"

encoding="UTF-8"?><output_cols_info><col index="1"

expression="no"/></output_cols_info>')

11. For the SHIPPERCOUNTRY column, change the mapping to perform a lookup of CountryName from the Country table in the Alpha datastore.

The expression should be as follows:

lookup_ext([ALPHA.SOURCE.COUNTRY,'PRE_LOAD_CACHE','MAX'],

[COUNTRYNAME],[NULL],[COUNTRYID,'=',ORDERS_DB_WORK.SHIPPERCOUNTRY]) SET
("run_as_separate_process"='no', "output_cols_info"='<?xml version="1.0"

encoding="UTF-8"?><output_cols_info><col index="1"

expression="no"/></output_cols_info>')

12. For the SHIPPERREGION column, change the mapping to perform a lookup of RegionName from the Region table in the Alpha datastore.

The expression should be as follows:

lookup_ext([ALPHA.SOURCE.REGION,'PRE_LOAD_CACHE','MAX'],

[REGIONNAME],[NULL],[REGIONID,'=',ORDERS_DB_WORK.SHIPPERREGION]) SET
("run_as_separate_process"='no', "output_cols_info"='<?xml version="1.0"

encoding="UTF-8"?><output_cols_info><col index="1"

expression="no"/></output_cols_info>')

13. Add a Merge transform to the data flow and connect both Query transforms to the Merge transform.

14. Add a template table called Orders_Merged in the Delta datastore as the target table and connect it to the Merge transform.

15. Execute Alpha_Orders_Merged_Job with the default execution properties and save all objects you have created.

16. View the data in the target table. Note that the SHIPPERCITY, SHIPPERCOUNTRY, and SHIPPERREGION columns for the 363 records in the template table consistently contain names rather than ID values.

A solution file called SOLUTION_Merge.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using the Case transform

Introduction
The Case transform supports separating data from a source into multiple targets based on branch logic.

After completing this unit, you will be able to:

• Use the Case transform

Explaining the Case transform

You use the Case transform to simplify branch logic in data flows by consolidating case or decision-making logic into one transform. The transform allows you to split a data set into smaller sets based on logical branches.

For example, you can use the Case transform to read a table that contains sales revenue facts for different regions and separate the regions into their own tables for more efficient data access.
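As a sketch of that scenario (the table and column names are hypothetical), the Case transform might define one labeled output per region, each with its own expression, and each label would then be connected to its own target table:

Label           Expression
Region_East     SALES_FACT.REGIONID = 1
Region_West     SALES_FACT.REGIONID = 2
Region_Central  SALES_FACT.REGIONID = 3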

The next section gives a brief description of the function, data input requirements, options, and data output results for the Case transform. For more information on the Case transform, see "Transforms", Chapter 5 in the Data Services Reference Guide.

Input/Output

Only one data flow source is allowed as a data input for the Case transform. Depending on the data, only one of multiple branches is executed per row. The input and output schemas are also identical when using the Case transform.


The connections between the Case transform and objects used for a particular case must be labeled. Each output label in the Case transform must be used at least once.

You connect the output of the Case transform with another object in the workspace. Each label represents a case expression (WHERE clause).

Options

The Case transform offers several options:

Label: Define the name of the connection that describes where data will go if the corresponding Case condition is true.

Expression: Define the Case expression for the corresponding label.

Produce default option with label: Specify that the transform must use the expression in this label when all other Case expressions evaluate to false.

Row can be TRUE for one case only: Specify that the transform passes each row to the first case whose expression returns true.

To create a case statement

1. Open the data flow workspace.

2. Add your source object to the workspace.

3. On the Transforms tab of the Local Object Library, click and drag the Case transform to the workspace, to the right of your source object.

4. Add your target objects to the workspace. You will require one target object for each possible condition in the case statement.

5. Connect the source object to the transform.

6. Double-click the Case transform to open the transform editor.


7. In the parameters area of the transform editor, click Add to add a new expression.

8. In the Label field, enter a label for the expression.

9. Click and drag an input schema column to the Expression pane at the bottom of the window.

10. Enter the rest of the expression to define the condition. For example, to specify that you want all Customers with a RegionID of 1, create the following statement: Customer.RegionID = 1

11. Repeat step 7 to step 10 for all expressions.

12. To direct records that do not meet any defined conditions to a separate target object, select the Produce default option with label option and enter the label name in the associated field.

13. To direct records that meet multiple conditions to only one target, select the Row can be TRUE for one case only option. In this case, records are placed in the target associated with the first condition that evaluates as true.

14. Click Back to return to the data flow workspace.


15. Connect the transform to the target object.

16. Release the mouse and select the appropriate label for that object from the pop-up menu.

17. Repeat step 15 and step 16 for all target objects.

Activity: Using the Case transform

Once the orders have been validated and merged, the resulting data set must be split out by quarter for reporting purposes.

Objective

• Use the Case transform to create separate tables for orders from each quarter, from the fourth quarter of 2006 through the fourth quarter of 2007.

Instructions

1. In the Omega project, create a new batch job called Alpha_Orders_By_Quarter_Job with a data flow named Alpha_Orders_By_Quarter_DF.

2. In the data flow workspace, add the Orders_Merged table from the Delta datastore as the source object.

3. Add a Query transform to the data flow and connect it to the source table.

4. In the transform editor for the Query transform, map all columns from input to output.

5. Add the following two output columns:

Column         Type          Mapping
ORDERQUARTER   int           quarter(orders_merged.ORDERDATE)
ORDERYEAR      varchar(4)    to_char(orders_merged.ORDERDATE, 'YYYY')

6. Add a Case transform to the data flow and connect it to the Query transform.

7. In the transform editor for the Case transform, create the following labels and associated expressions:

Label     Expression
Q42006    Query.ORDERYEAR = '2006' and Query.ORDERQUARTER = 4


Q12007    Query.ORDERYEAR = '2007' and Query.ORDERQUARTER = 1
Q22007    Query.ORDERYEAR = '2007' and Query.ORDERQUARTER = 2
Q32007    Query.ORDERYEAR = '2007' and Query.ORDERQUARTER = 3
Q42007    Query.ORDERYEAR = '2007' and Query.ORDERQUARTER = 4

8. Choose the settings to not produce a default output set for the Case transform and to specify that rows can be true for one case only.

9. Add five template tables in the Delta datastore called Orders_Q4_2006, Orders_Q1_2007, Orders_Q2_2007, Orders_Q3_2007, and Orders_Q4_2007.

10. Connect the output from the Case transform to the target tables, selecting the corresponding labels.

11. Execute Alpha_Orders_By_Quarter_Job with the default execution properties and save all objects you have created.

12. View the data in the target tables and confirm that there are 103 orders that were placed in Q1 of 2007.

A solution file called SOLUTION_Case.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using the SQL transform

Introduction
The SQL transform allows you to submit SQL commands to generate data to be moved into target objects.

After completing this unit, you will be able to:

• Use the SQL transform

Explaining the SQL transform

Use this transform to perform standard SQL operations when other built-in transforms cannot perform them.

The SQL transform can be used to extract data using general select statements as well as stored procedures and views.

You can use the SQL transform as a replacement for the Merge transform when you are dealing with database tables only. The SQL transform performs more efficiently because the merge is pushed down to the database. However, you cannot use this functionality if your source objects include file formats.
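For example, a merge of two database tables can be written directly as a union query in the SQL transform; the table names here are hypothetical:

select * from ORDERS_NORTH
union all
select * from ORDERS_SOUTH

Like the Merge transform, UNION ALL does not remove duplicate rows, and because the statement executes in the source database, only the combined result set is returned to Data Services.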

The next section gives a brief description of the function, data input requirements, options, and data output results for the SQL transform. For more information on the SQL transform, see "Transforms", Chapter 5 in the Data Services Reference Guide.

Inputs/Outputs

There is no input data set for the SQL transform.

There are two ways of defining the output schema for a SQL transform if the SQL submitted is expected to return a result set:


• Automatic — After you type the SQL statement, click Update schema to execute a select statement against the database that obtains column information returned by the select statement and populates the output schema.
• Manual — Output columns must be defined in the output portion of the SQL transform if the SQL operation is returning a data set. The number of columns defined in the output of the SQL transform must equal the number of columns returned by the SQL query, but the column names and data types of the output columns do not need to match the column names or data types in the SQL query.

Options

The SQL transform has the following options:

Datastore: Specify the datastore for the tables referred to in the SQL statement.

Database type: Specify the type of database for the datastore where there are multiple datastore configurations.

Join rank: Indicate the weight of the output data set if the data set is used in a join. The highest ranked source is accessed first to construct the join.

Array fetch size: Indicate the number of rows retrieved in a single request to a source database. The default value is 1000.

Cache: Hold the output from this transform in memory for use in subsequent transforms. Use this only if the data set is small enough to fit in memory.

SQL text: Enter the text of the SQL query.

To create a SQL statement

1. Open the data flow workspace.

2. On the Transforms tab of the Local Object Library, click and drag the SQL transform to the workspace.

3. Add your target object to the workspace.

4. Connect the transform to the target object.

5. Double-click the SQL transform to open the transform editor.


6. In the parameters area, select the source datastore from the Datastore drop-down list.

7. If there is more than one datastore configuration, select the appropriate configuration from the Database type drop-down list.

8. Change the other available options, if required.

9. In the SQL text area, enter the SQL statement. For example, to copy the entire contents of a table into the target object, you would use the following statement: Select * from Customers.

10. Click Update Schema to update the output schema with the appropriate values. If required, you can change the names and datatypes of these columns. You can also create the output columns manually.

11. Click Back to return to the data flow workspace.

12. Click and drag from the transform to the target object.

Activity: Using the SQL transform

The contents of the Employee and Department tables must be merged, which can be done using the SQL transform as a shortcut.

Objective

• Use the SQL transform to select employee and department data.


Instructions

1. In the Omega project, create a new batch job called Alpha_Employees_Dept_Job with a data flow called Alpha_Employees_Dept_DF.

2. In the data flow workspace, add the SQL transform as the source object.

3. Add the Emp_Dept table from the HR_datamart datastore as the target object, and connect the transform to it.

4. In the transform editor for the SQL transform, specify the appropriate datastore name and database type for the Alpha datastore.

5. Create a SQL statement to select the last name and first name for the employee from the Employee table and the department in which the employee belongs by looking up the value in the Department table based on the Department ID.

The expression should be as follows:

SELECT EMPLOYEE.EMPLOYEEID, EMPLOYEE.FIRSTNAME, DEPARTMENT.DEPARTMENTNAME FROM

ALPHA.SOURCE.EMPLOYEE, ALPHA.SOURCE.DEPARTMENT WHERE

EMPLOYEE.DEPARTMENTID=DEPARTMENT.DEPARTMENTID

6. Update the output schema based on your SQL statement.

7. Set the EMPLOYEEID column as the primary key.

8. Execute Alpha_Employees_Dept_Job with the default execution properties and save all objects you have created.

9. Return to the data flow workspace and view data for the target table. You should have 40 rows in your target table, because there were 8 employees in the employee table with department IDs that were not defined in the department table.

A solution file called SOLUTION_SQL.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Quiz: Using platform transforms

1. What would you use to change a row type from NORMAL to INSERT?

2. What is the Case transform used for?

3. Name the transform that you would use to combine incoming data sets to produce a single output data set with the same schema as the input data sets.

4. A validation rule consists of a condition and an action on failure. When can you use the action on failure options in the validation rule?

5. When would you use the Merge transform versus the SQL transform to merge records?


Lesson summary
After completing this lesson, you are now able to:

• Describe platform transforms
• Use the Map Operation transform
• Use the Validation transform
• Use the Merge transform
• Use the Case transform
• Use the SQL transform


Lesson 7
Setting up Error Handling

Lesson introduction
For sophisticated error handling, you can use recoverable work flows and try/catch blocks to recover data.

After completing this lesson, you will be able to:

• Set up recoverable work flows


Using recovery mechanisms

Introduction
If a Data Services job does not complete properly, you must resolve the problems that prevented the successful execution of the job.

After completing this unit, you will be able to:

• Explain how to avoid data recovery situations
• Explain the levels of data recovery strategies
• Recover a failed job using automatic recovery
• Recover missing values and rows
• Define alternative work flows

Avoiding data recovery situations

The best solution to data recovery situations is obviously not to get into them in the first place. Some of those situations are unavoidable, such as server failures. Others, however, can easily be sidestepped by constructing your jobs so that they take into account the issues that frequently cause them to fail.

One example is when an external file is required to run a job. In this situation, you could use the wait_for_file function or a while loop and the file_exists function to check that the file exists in a specified location before executing the job.

While loops

The while loop is a single-use object that you can use in a work flow. The while loop repeats a sequence of steps as long as a condition is true.

Typically, the steps done during the while loop result in a change in the condition so that the condition is eventually no longer satisfied and the work flow exits from the while loop. If the condition does not change, the while loop does not end.

For example, you might want a work flow to wait until the system writes a particular file. You can use a while loop to check for the existence of the file using the file_exists function. As long as the file does not exist, you can have the work flow go into sleep mode for a particular length of time before checking again.

Because the system might never write the file, you must add another check to the loop, such as a counter, to ensure that the while loop eventually exits. In other words, change the while loop to check for the existence of the file and the value of the counter. As long as the file does not exist and the counter is less than a particular value, repeat the while loop. In each iteration of the loop, put the work flow in sleep mode and then increment the counter.
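As a minimal sketch, assuming a hypothetical file path, a local variable $L_counter initialized to 0 in a script before the loop, and a limit of ten attempts, the while loop condition might be:

file_exists('C:/source_files/orders.txt') = 0 and $L_counter < 10

and the script inside the loop might be:

sleep(60000);
$L_counter = $L_counter + 1;

Here sleep() pauses execution for the specified number of milliseconds, so this loop checks for the file roughly once a minute and exits after ten unsuccessful checks. Confirm the return values of file_exists and the units expected by sleep against the Data Services Reference Guide for your version.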


Describing levels of data recovery strategies

When a job fails to complete successfully during execution, some data flows may not have completed. When this happens, some tables may have been loaded, partially loaded, or altered.

You need to design your data movement jobs so that you can recover your data by rerunning the job and retrieving all the data without introducing duplicate or missing data.

There are different levels of data recovery and recovery strategies. You can:
• Recover your entire database: Use your standard RDBMS services to restore a crashed data cache to an entire database. This option is outside the scope of this course.
• Recover a partially-loaded job: Use automatic recovery.
• Recover from partially-loaded tables: Use the Table Comparison transform, do a full replacement of the target, use the auto-correct load feature, or include a preload SQL command to avoid duplicate loading of rows when recovering from partially-loaded tables.
• Recover missing values or rows: Use the Validation transform or the Query transform with WHERE clauses to identify missing values, and use overflow files to manage rows that could not be inserted.
• Define alternative work flows: Use conditionals, try/catch blocks, and scripts to ensure all exceptions are managed in a work flow.

Depending on the relationships between data flows in your application, you may use a combination of these techniques to recover from exceptions.

Note: It is important to note that some recovery mechanisms are for use in production systems and are not supported in development environments.

Configuring work flows and data flows

In some cases, steps in a work flow depend on each other and must be executed together. When there is a dependency like this, you should designate the work flow as a recovery unit. This requires the entire work flow to complete successfully. If the work flow does not complete successfully, Data Services executes the entire work flow during recovery, including the steps that executed successfully in prior work flow runs.

Conversely, you may need to specify that a work flow or data flow should only execute once. When this setting is enabled, the job never re-executes that object. It is not recommended to mark a work flow or data flow as "Execute only once" if the parent work flow is a recovery unit.

To specify a work flow as a recovery unit

1. In the project area or on the Work Flows tab of the Local Object Library, right-click the work flow and select Properties from the menu. The Properties dialog box displays.

2. On the General tab, select the Recover as a unit check box.

3. Click OK.


To specify that an object executes only once

1. In the project area or on the appropriate tab of the Local Object Library, right-click the work flow or data flow and select Properties from the menu. The Properties dialog box displays.

2. On the General tab, select the Execute only once check box.

3. Click OK.

Using recovery mode

If a job with automated recovery enabled fails during execution, you can execute the job again in recovery mode. During recovery mode, Data Services retrieves the results for successfully-completed steps and reruns uncompleted or failed steps under the same conditions as the original job.

In recovery mode, Data Services executes the steps or recovery units that did not complete successfully in a previous execution. This includes steps that failed and steps that generated an exception but completed successfully, such as those in a try/catch block. As in normal job execution, Data Services executes the steps in parallel if they are not connected in the work flow diagrams and in serial if they are connected.

For example, suppose a daily update job running overnight successfully loads dimension tables in a warehouse. However, while the job is running, the database log overflows and stops the job from loading fact tables. The next day, you truncate the log file and run the job again in recovery mode. The recovery job does not reload the dimension tables because the original job, even though it failed, successfully loaded the dimension tables.

To ensure that the fact tables are loaded with the data that corresponds properly to the data already loaded in the dimension tables, ensure the following:
• Your recovery job must use the same extraction criteria that your original job used when loading the dimension tables.

If your recovery job uses new extraction criteria, such as basing data extraction on the current system date, the data in the fact tables will not correspond to the data previously extracted into the dimension tables.

If your recovery job uses new values, the job execution may follow a completely different path through conditional steps or try/catch blocks.

• Your recovery job must follow the exact execution path that the original job followed. Data Services records any external inputs to the original job so that your recovery job can use these stored values and follow the same execution path.

To enable automatic recovery in a job

1. In the project area, right-click the job and select Execute from the menu. The Execution Properties dialog box displays.

2. On the Parameters tab, select the Enable recovery check box.


If this check box is not selected, Data Services does not record the results from the steps during the job and cannot recover the job if it fails.

3. Click OK.

To recover from last execution

1. In the project area, right-click the job that failed and select Execute from the menu. The Execution Properties dialog box displays.

2. On the Parameters tab, select the Recover from last execution check box. This option is not available when a job has not yet been executed, the previous job run succeeded, or recovery mode was disabled during the previous run.

3. Click OK.

Recovering from partially-loaded data

Executing a failed job again may result in duplication of rows that were loaded successfully during the first job run.

Within your recoverable work flow, you can use several methods to ensure that you do not insert duplicate rows:
• Include the Table Comparison transform (available in Data Integrator packages only) in your data flow when you have tables with more rows and fewer fields, such as fact tables.
• Change the target table options to completely replace the target table during each execution. This technique can be optimal when the changes to the target table are numerous compared to the size of the table.
• Change the target table options to use the auto-correct load feature when you have tables with fewer rows and more fields, such as dimension tables. The auto-correct load checks the target table for existing rows before adding new rows to the table. Using the auto-correct load option, however, can slow jobs executed in non-recovery mode. Consider this technique when the target table is large and the changes to the table are relatively few.
• Include a SQL command to execute before the table loads. Preload SQL commands can remove partial database updates that occur during incomplete execution of a step in a job. Typically, the preload SQL command deletes rows based on a variable that is set before the partial insertion step began (a sketch follows the note below).

For more information on preloading SQL commands, see "Using preload SQL to allow re-executable Data Flows", Chapter 18 in the Data Services Designer Guide.
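As a minimal sketch, assuming a hypothetical fact table ORDERS_FACT with a LOAD_ID column and a global variable $G_load_id that a script sets before the data flow runs, the preload SQL command might be:

delete from ORDERS_FACT where LOAD_ID = [$G_load_id]

The [$G_load_id] placeholder shows where the variable value would be substituted; verify the exact variable syntax for preload SQL commands in the Designer Guide chapter referenced above. If the job fails partway through the load and is run again, this command first removes any rows already inserted for the current load, so the rerun cannot create duplicates.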

Recovering missing values or rows

Missing values that are introduced into the target data during data integration and data quality processes can be managed using the Validation or Query transforms.


Missing rows are rows that cannot be inserted into the target table. For example, rows may be missing in instances where a primary key constraint is violated. Overflow files help you process this type of data problem.

When you specify an overflow file and Data Services cannot load a row into a table, Data Services writes the row to the overflow file instead. The trace log indicates the data flow in which the load failed and the location of the file.

You can use the overflow information to identify invalid data in your source or problems introduced in the data movement. Every new run will overwrite the existing overflow file.

To use an overflow file in a job

1. Open the target table editor for the target table in your data flow.

2. On the Options tab, under Error handling, select the Use overflow file check box.

3. In the File name field, enter or browse to the full path and file name for the file. When you specify an overflow file, give a full path name to ensure that Data Services creates a unique file when more than one file is created in the same job.

4. In the File format drop-down list, select what you want Data Services to write to the file about the rows that failed to load:
• If you select Write data, you can use Data Services to specify the format of the error-causing records in the overflow file.
• If you select Write sql, you can use the commands to load the target manually when the target is accessible.

Defining alternative work flows

You can set up your jobs to use alternative work flows that cover all possible exceptions and have recovery mechanisms built in. This technique allows you to automate the process of recovering your results.

Alternative work flows consist of several components:
1. A script to determine if recovery is required.

This script reads the value in a status table and populates a global variable with the same value. The initial value in the table is set to indicate that recovery is not required.

2. A conditional that calls the appropriate work flow based on whether recovery is required.


The conditional contains an If/Then/Else statement to specify that work flows that do not require recovery are processed one way, and those that do require recovery are processed another way.

3. A work flow with a try/catch block to execute a data flow without recovery.

The data flow where recovery is not required is set up without the auto-correct load option set. This ensures that, wherever possible, the data flow is executed in a less resource-intensive mode.

4. A script in the catch object to update the status table.

The script specifies that recovery is required if any exceptions are generated.

5. A work flow to execute a data flow with recovery and a script to update the status table.

The data flow is set up for more resource-intensive processing that will resolve the exceptions. The script updates the status table to indicate that recovery is not required.


Conditionals

Conditionals are single-use objects used to implement conditional logic in a work flow. When you define a conditional, you must specify a condition and two logical branches:

If: A Boolean expression that evaluates to TRUE or FALSE. You can use functions, variables, and standard operators to construct the expression.

Then: Work flow element to execute if the If expression evaluates to TRUE.

Else: Work flow element to execute if the If expression evaluates to FALSE.

Both the Then and Else branches of the conditional can contain any object that you can have in a work flow, including other work flows, data flows, nested conditionals, try/catch blocks, scripts, and so on.

Try/Catch Blocks

A try/catch block allows you to specify alternative work flows if errors occur during job execution. Try/catch blocks catch classes of errors, apply solutions that you provide, and continue execution.

For each catch in the try/catch block, you will specify:
• One exception or group of exceptions handled by the catch. To handle more than one exception or group of exceptions, add more catches to the try/catch block.
• The work flow to execute if the indicated exception occurs. Use an existing work flow or define a work flow in the catch editor.

If an exception is thrown during the execution of a try/catch block, and if no catch is looking for that exception, then the exception is handled by normal error logic.

Using try/catch blocks and automatic recovery

Data Services does not save the result of a try/catch block for re-use during recovery. If an exception is thrown inside a try/catch block, during recovery Data Services executes the step that threw the exception and subsequent steps.

Because the execution path through the try/catch block might be different in the recovered job, using variables set in the try/catch block could alter the results during automatic recovery.


For example, suppose you create a job that defines the value of variable $I within a try/catch block. If an exception occurs, you set an alternate value for $I. Subsequent steps are based on the new value of $I.

During the first job execution, the first work flow contains an error that generates an exception, which is caught. However, the job fails in the subsequent work flow.

You fix the error and run the job in recovery mode. During the recovery execution, the first work flow no longer generates the exception. Thus the value of variable $I is different, and the job selects a different subsequent work flow, producing different results.


To ensure proper results with automatic recovery when a job contains a try/catch block, do not use values set inside the try/catch block or reference output variables from a try/catch block in any subsequent steps.

To create an alternative work flow

1. Create a job.

2. Add a global variable to your job called $G_recovery_needed with a datatype of int. The purpose of this global variable is to store a flag that indicates whether or not recovery is needed. This flag is based on the value in a recovery status table, which contains a flag of 1 or 0, depending on whether recovery is needed.

3. In the job workspace, add a work flow using the tool palette.

4. In the work flow workspace, add a script called GetStatus using the tool palette.

5. In the script workspace, construct an expression to update the value of the $G_recovery_needed global variable to the same value as is in the recovery status table.

The script content depends on the RDBMS on which the status table resides. The followingis an example of the expression:

$G_recovery_needed = sql('DEMO_Target', 'select recovery_flag from

recovery_status');

6. Return to the work flow workspace.

7. Add a conditional to the workspace using the tool palette and connect it to the script.

8. Open the conditional.

The transform editor for the conditional allows you to specify the IF expression and Then/Else branches.


9. In the IF field, enter the expression that evaluates whether recovery is required.

The following is an example of the expression:

$G_recovery_needed = 0

This means the objects in the Then pane will run if recovery is not required. If recovery is needed, the objects in the Else pane will run.

10. Add a try object to the Then pane of the transform editor using the tool palette.

11. In the Local Object Library, click and drag a work flow or data flow to the Then pane after the try object.

12. Add a catch object to the Then pane after the work flow or data flow using the tool palette.

13. Connect the objects in the Then pane.

14. Open the workspace for the catch object.

All exception types are listed in the Available exceptions pane.

15. To change which exceptions act as triggers, expand the tree in the Available exceptions pane, select the appropriate exceptions, and click Set to move them to the Trigger on these exceptions pane. By default, Data Services catches all exceptions.

16. Add a script called Fail to the lower pane using the tool palette. This object will be executed if there are any exceptions. If desired, you can add a data flow here instead of a script.

17. In the script workspace, construct an expression to update the flag in the recovery status table to 1, indicating that recovery is needed.


The script content depends on the RDBMS on which the status table resides. The following is an example of the expression:

sql('DEMO_Target','update recovery_status set recovery_flag = 1');

18. Return to the conditional workspace.

19. Connect the objects in the Then pane.

20. In the Local Object Library, click and drag the work flow or data flow that represents the recovery process to the Else pane. This combination means that if recovery is not needed, then the first object will be executed; if recovery is required, the second object will be executed.

21. Add a script called Pass to the lower pane using the tool palette.

22. In the script workspace, construct an expression to update the flag in the recovery status table to 0, indicating that recovery is not needed.

The script content depends on the RDBMS on which the status table resides. The following is an example of the expression:

sql('DEMO_Target','update recovery_status set recovery_flag = 0');

23. Return to the conditional workspace.

24. Connect the objects in the Else pane.

25. Validate and save all objects.

26. Execute the job. The first time this job is executed, the job succeeds because the recovery_flag value in the status table is set to 0 and the target table is empty, so there is no primary key conflict.

27. Execute the job again. The second time this job is executed, the job fails because the target table already contains records, so there is a primary key exception.

28. Check the contents of the status table. The recovery_flag field now contains a value of 1.

29. Execute the job again. The third time this job is executed, the version of the data flow with the Auto correct load option selected runs because the recovery_flag value in the status table is set to 1. The job succeeds because the auto correct load feature checks for existing values before trying to insert new rows.

30. Check the contents of the status table again. The recovery_flag field contains a value of 0.

Activity: Creating an alternative work flow

With the influx of new employees resulting from Alpha's acquisition of new companies, the Employee Department information needs to be updated regularly. Because this information is used for payroll, it is critical that no records are lost if a job is interrupted, so you need to set up the job in such a way that exceptions will always be managed. This involves setting up a conditional that will try to run a less resource-intensive update of the table first; if that generates an exception, the conditional then tries a version of the same data flow that is configured to auto correct the load.

Objective

• Set up a try/catch block with a conditional to catch exceptions.

Instructions

1. In the Local Object Library, replicate Alpha_Employees_Dept_DF and rename the new version Alpha_Employees_Dept_AC_DF.

2. In the target table editor for the Emp_Dept table in Alpha_Employees_Dept_DF, ensure that the Delete data from table before loading and Auto correct load options are not selected.

3. In the target table editor for the Emp_Dept table in Alpha_Employees_Dept_AC_DF, ensure that the Delete data from table before loading option is not selected.

4. Select the Auto correct load option.

5. In the Omega project, create a new batch job called Alpha_Employees_Dept_Recovery_Job.

6. Add a global variable called $G_Recovery_Needed with a datatype of int to your job.

7. Add a work flow to your job called Alpha_Employees_Dept_Recovery_WF.

8. In the work flow workspace, add a script called GetStatus and construct an expression to update the value of the $G_Recovery_Needed global variable to the same value as in the recovery_flag column in the recovery_status table in the HR datamart.

The expression should be:

$G_Recovery_Needed = sql('hr_datamart', 'select recovery_flag from recovery_status');

9. In the work flow workspace, add a conditional called Alpha_Employees_Dept_Con and connect it to the script.

10. In the editor for the conditional, enter an IF expression that states that recovery is not required.

The expression should be:

$G_Recovery_Needed = 0

11. In the Then pane, create a new try object called Alpha_Employees_Dept_Try.

12. Add Alpha_Employees_Dept_DF and connect it to the try object.

13. Create a new catch object called Alpha_Employees_Dept_Catch, and connect it to Alpha_Employees_Dept_DF.


14. In the editor for the catch object, add a script called Recovery_Fail to the lower pane and construct an expression to update the flag in the recovery status table to 1, indicating that recovery is needed.

The expression should be:

sql('hr_datamart','update recovery_status set recovery_flag = 1');

15. In the conditional workspace, add Alpha_Employees_Dept_AC_DF to the Else pane.

16. Add a script called Recovery_Pass to the Else pane next to Alpha_Employees_Dept_AC_DF and connect the objects.

17. In the script, construct an expression to update the flag in the recovery status table to 0, indicating that recovery is not needed.

The expression should be:

sql('hr_datamart','update recovery_status set recovery_flag = 0');

18. Execute Alpha_Employees_Dept_Recovery_Job for the first time with the default execution properties and save all objects you have created. Note that the trace log indicates the data flow generated an error, but the job completed successfully due to the try/catch block. Note that an error log was generated which indicates a primary key conflict in the target table.

19. Execute Alpha_Employees_Dept_Recovery_Job again. In the log, note that the job succeeds and that the data flow used was Alpha_Employees_Dept_AC_DF.

A solution file called SOLUTION_Recovery.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Quiz: Setting up error handling

1. List the different strategies you can use to avoid duplicate rows of data when re-loading a job.

2. True or false? You can only run a job in recovery mode after the initial run of the job has been set to run with automatic recovery enabled.

3. What are the two scripts in a manual recovery work flow used for?

4. Which of the following types of exception can you NOT catch using a try/catch block?

a. Database access errors

b. Syntax errors

c. System exception errors

d. Execution errors

e. File access errors


Lesson summary

After completing this lesson, you are now able to:

• Set up recoverable work flows


Lesson 8
Capturing Changes in Data

Lesson introduction

The design of your data warehouse must take into account how you are going to handle changes in your target system when the respective data in your source system changes. Data Integrator transforms provide you with a mechanism to do this.

After completing this lesson, you will be able to:

• Update data over time
• Use source-based CDC
• Use target-based CDC


Updating data over time

Introduction

Data Integrator transforms provide support for updating changing data in your data warehouse.

After completing this unit, you will be able to:

• Describe the options for updating changes to data
• Explain the purpose of Changed Data Capture (CDC)
• Explain the role of surrogate keys in managing changes to data
• Define the differences between source-based and target-based CDC

Explaining Slowly Changing Dimensions (SCD)

SCDs are dimensions that have data that changes over time. The following methods of handling SCDs are available:

Type 1 (no history preservation)
• Natural consequence of normalization.

Type 2 (unlimited history preservation and new rows)
• New rows generated for significant changes.
• Requires use of a unique key. The key relates to facts/time.
• Optional Effective_Date field.

Type 3 (limited history preservation)
• Two states of data are preserved: current and old.
• New fields are generated to store history data.
• Requires an Effective_Date field.

Because SCD Type 2 resolves most of the issues related to slowly changing dimensions, it is explored last.

SCD Type 1

For an SCD Type 1 change, you find and update the appropriate attributes on a specific dimensional record. For example, to update a record in the SALES_PERSON_DIMENSION table to show a change to an individual's SALES_PERSON_NAME field, you simply update one record in the SALES_PERSON_DIMENSION table. This action would update or correct that record for all fact records across time. In a dimensional model, facts have no meaning until you link them with their dimensions. If you change a dimensional attribute without appropriately accounting for the time dimension, the change becomes global across all fact records.

This is the data before the change:

SALES_PERSON_KEY  SALES_PERSON_ID  NAME         SALES_TEAM
15                00120            Doe, John B  Northwest

This is the same table after the salesperson’s name has been changed:

SALES_PERSON_KEY  SALES_PERSON_ID  NAME           SALES_TEAM
15                00120            Smith, John B  Northwest

However, suppose a salesperson transfers to a new sales team. Updating the salesperson's dimensional record would update all previous facts so that the salesperson would appear to have always belonged to the new sales team. This may cause issues in terms of reporting sales numbers for both teams. If you want to preserve an accurate history of who was on which sales team, Type 1 is not appropriate.

SCD Type 3

To implement a Type 3 change, you change the dimension structure so that it renames the existing attribute and adds two attributes, one to record the new value and one to record the date of the change.

A Type 3 implementation has three disadvantages:
• You can preserve only one change per attribute, such as old and new or first and last.
• Each Type 3 change requires a minimum of one additional field per attribute and another additional field if you want to record the date of the change.
• Although the dimension's structure contains all the data needed, the SQL code required to extract the information can be complex. Extracting a specific value is not difficult, but if you want to obtain a value for a specific point in time or multiple attributes with separate old and new values, the SQL statements become long and have multiple conditions.

In summary, SCD Type 3 can store a change in data, but can neither accommodate multiple changes, nor adequately serve the need for summary reporting.

This is the data before the change:

SALES_PERSON_KEY  SALES_PERSON_ID  NAME         SALES_TEAM
15                00120            Doe, John B  Northwest

This is the same table after the new dimensions have been added and the salesperson's sales team has been changed:


SALES_PERSON_NAME  OLD_TEAM   NEW_TEAM   EFF_TO_DATE  SALES_PERSON_ID
Doe, John B        Northwest  Northeast  Oct_31_2004  00120

SCD Type 2

With a Type 2 change, you do not need to make structural changes to the SALES_PERSON_DIMENSION table. Instead, you add a record.

This is the data before the change:

SALES_PERSON_KEY  SALES_PERSON_ID  NAME         SALES_TEAM
15                00120            Doe, John B  Northwest

After you implement the Type 2 change, two records appear, as in the following table:

SALES_PERSON_KEY  SALES_PERSON_ID  NAME         SALES_TEAM
15                00120            Doe, John B  Northwest
133               00120            Doe, John B  Southeast

Updating changes to data

When you have a large amount of data to update regularly and a small amount of system downtime for scheduled maintenance on a data warehouse, you must choose the most appropriate method for updating your data over time, also known as "delta load". You can choose to do a full refresh of your data or you can choose to extract only new or modified data and update the target system:
• Full refresh: Full refresh is easy to implement and easy to manage. This method ensures that no data is overlooked or left out due to technical or programming errors. For an environment with a manageable amount of source data, full refresh is an easy method you can use to perform a delta load to a target system.
• Capturing only changes: After an initial load is complete, you can choose to extract only new or modified data and update the target system. Identifying and loading only changed data is called Changed Data Capture (CDC). CDC is recommended for large tables. If the tables that you are working with are small, you may want to consider reloading the entire table instead. The benefits of using CDC instead of doing a full refresh are that:
  ○ It improves performance because the job takes less time to process, with less data to extract, transform, and load.
  ○ Change history can be tracked by the target system so that data can be correctly analyzed over time. For example, if a salesperson is assigned a new sales region, simply updating the customer record to reflect the new region negatively affects any analysis by region over time because the purchases made by that customer before the move are attributed to the new region.

Explaining history preservation and surrogate keys

History preservation allows the data warehouse or data mart to maintain the history of data in dimension tables so you can analyze it over time.

For example, if a customer moves from one sales region to another, simply updating the customer record to reflect the new region would give you misleading results in an analysis by region over time, because all purchases made by the customer before the move would incorrectly be attributed to the new region.

The solution to this involves introducing a new record for the same customer that reflects the new sales region so that you can preserve the previous record. In this way, accurate reporting is available for both sales regions. To support this, Data Services is set up to treat all changes to records as INSERT rows by default.

However, you also need to manage the primary key constraint issues in your target tables that arise when you have more than one record in your dimension tables for a single entity, such as a customer or an employee.

For example, with your sales records, the Sales Rep ID is usually the primary key and is used to link that record to all of the rep's sales orders. If you try to add a new record with the same primary key, it will throw an exception. On the other hand, if you assign a new Sales Rep ID to the new record for that rep, you will compromise your ability to report accurately on the rep's total sales.

To address this issue, you will create a surrogate key, which is a new column in the target table that becomes the new primary key for the records. At the same time, you will change the properties of the former primary key so that it is simply a data column.

When a new record is inserted for the same rep, a unique surrogate key is assigned, allowing you to continue to use the Sales Rep ID to maintain the link to the rep's orders.


You can create surrogate keys either by using the gen_row_num or key_generation functions in the Query transform to create a new output column that automatically increments whenever a new record is inserted, or by using the Key Generation transform, which serves the same purpose.
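For illustration, here is a hedged sketch of the two function-based approaches, using a hypothetical customer dimension (the datastore, table, and column names are assumptions, not objects from this course). With the key_generation function, the mapping expression for the surrogate key column might be:

key_generation('Target_DS.dbo.cust_dim', 'CUST_SURR_KEY', 1)

With gen_row_num, you would typically read the current maximum key into a global variable in a script that runs before the data flow, and then offset the generated row number by it in the mapping expression (for an empty table you would also need to handle the NULL that max() returns, for example with nvl):

# script before the data flow (hypothetical names)
$G_Max_Key = sql('Target_DS', 'select max(CUST_SURR_KEY) from cust_dim');

# mapping expression for the surrogate key column
$G_Max_Key + gen_row_num()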

Comparing source-based and target-based CDC

Setting up a full CDC solution within Data Services may not be required. Many databases now have CDC support built into them, such as Oracle, SQL Server, and DB2. Alternatively, you could combine surrogate keys with the Map Operation transform to change all UPDATE row types to INSERT row types to capture changes.

However, if you do want to set up a full CDC solution, there are two general incremental CDC methods to choose from: source-based and target-based CDC.

Source-based CDC evaluates the source tables to determine what has changed and only extracts changed rows to load into the target tables.

Target-based CDC extracts all the data from the source, compares the source and target rows using table comparison, and then loads only the changed rows into the target.

Source-based CDC is almost always preferable to target-based CDC for performance reasons. However, some source systems do not provide enough information to make use of the source-based CDC techniques. You will usually use a combination of the two techniques.


Using source-based CDC

Introduction

Source-based CDC is the preferred method because it improves performance by extracting the fewest rows.

After completing this unit, you will be able to:

• Define the methods of performing source-based CDC
• Explain how to use timestamps in source-based CDC
• Manage issues related to using timestamps for source-based CDC

Using source tables to identify changed data

Source-based CDC, sometimes also referred to as incremental extraction, extracts only the changed rows from the source. To use source-based CDC, your source data must have some indication of the change. There are two methods:
• Timestamps: You can use the timestamps in your source data to determine what rows have been added or changed since the last time data was extracted from the source. To support this type of source-based CDC, your database tables must have at least an update timestamp; it is preferable to have a create timestamp as well.
• Change logs: You can also use the information captured by the RDBMS in the log files for the audit trail to determine what data has been changed.

Log-based data is more complex and is outside the scope of this course. For more information on using logs for CDC, see "Techniques for Capturing Data" in the Data Services Designer Guide.

Using CDC with timestamps

Timestamp-based CDC is an ideal solution to track changes if:
• There are date and time fields in the tables being updated.
• You are updating a large table that has a small percentage of changes between extracts and an index on the date and time fields.
• You are not concerned about capturing intermediate results of each transaction between extracts (for example, if a customer changes regions twice in the same day).

It is not recommended that you use timestamp-based CDC if:
• You have a large table with a large percentage of changes between extracts and there is no index on the timestamps.
• You need to capture physical row deletes.
• You need to capture multiple events occurring on the same row between extracts.

Some systems have timestamps with dates and times, some with just the dates, and some with monotonically-generated increasing numbers. You can treat dates and generated numbers in the same manner. It is important to note that for timestamps based on real time, time zones can become important. If you keep track of timestamps using the nomenclature of the source system (that is, using the source time or source-generated number), you can treat both temporal (specific time) and logical (time relative to another time or event) timestamps in the same way.

The basic technique for using timestamps is to add a column to your source and target tables that tracks the timestamps of rows loaded in a job. When the job executes, this column is updated along with the rest of the data. The next job then reads the latest timestamp from the target table and selects only the rows in the source table for which the timestamp is later.

This example illustrates the technique. Assume that the last load occurred at 2:00 PM on January 1, 2008. At that time, the source table had only one row (key=1) with a timestamp earlier than the previous load. Data Services loads this row into the target table with the original timestamp of 1:10 PM on January 1, 2008. After 2:00 PM, more rows are added to the source table.

At 3:00 PM on January 1, 2008, the job runs again. The job:

1. Reads the Last_Update field from the target table (01/01/2008 01:10 PM).

2. Selects rows from the source table that have timestamps that are later than the value of Last_Update. The SQL command to select these rows is:

SELECT * FROM Source WHERE Last_Update > '01/01/2008 01:10 pm'

This operation returns the second and third rows (key=2 and key=3).

3. Loads these new rows into the target table.


For timestamped CDC, you must create a work flow that contains the following:
• A script that reads the target table and sets the value of a global variable to the latest timestamp.
• A data flow that uses the global variable in a WHERE clause to filter the data.

The data flow contains a source table, a query, and a target table. The query extracts only those rows that have timestamps later than the last update.

To set up a timestamp-based CDC delta job

1. In the Variables and Parameters dialog box, add a global variable called $G_Last_Update with a datatype of datetime to your job. The purpose of this global variable is to store the timestamp for the last time the job executed.

2. In the job workspace, add a script called GetTimestamp using the tool palette.


3. In the script workspace, construct an expression to do the following:
• Select the last time the job was executed from the last update column in the table.
• Assign the actual timestamp value to the $G_Last_Update global variable.

The script content depends on the RDBMS on which the table resides. The following is an example of the expression:

$G_Last_Update = sql('DEMO_Target','select max(last_update) from employee_dim');

4. Return to the job workspace.

5. Add a data flow to the right of the script using the tool palette.

6. In the data flow workspace, add the source, Query transform, and target objects and connect them. The target table for CDC cannot be a template table.

7. In the Query transform, add the columns from the input schema to the output schema as required.

8. If required, in the output schema, right-click the primary key (if it is not already set to the surrogate key) and clear the Primary Key option in the menu.

9. Right-click the surrogate key column and select the Primary Key option in the menu.

10. On the Mapping tab for the surrogate key column, construct an expression to use the key_generation function to generate new keys based on that column in the target table, incrementing by 1.

The script content depends on the RDBMS on which the target table resides. The following is an example of the expression:

key_generation('DEMO_Target.demo_target.employee_dim', 'Emp_Surr_Key', 1)

11. On the WHERE tab, construct an expression to select only those records with a timestamp that is later than the $G_Last_Update global variable.

The following is an example of the expression:

employee_dim.last_update > $G_Last_Update

12. Connect the GetTimestamp script to the data flow.

13. Validate and save all objects.

14. Execute the job.

Managing overlaps

Unless source data is rigorously isolated during the extraction process (which typically is not practical), there is a window of time when changes can be lost between two extraction runs. This overlap period affects source-based CDC because this kind of data capture relies on a static timestamp to determine changed data.


For example, suppose a table has 10,000 rows. If a change is made to one of the rows after it was loaded but before the job ends, the second update can be lost.

There are three techniques for handling this situation:
• Overlap avoidance
• Overlap reconciliation
• Presampling

For more information see "Source-based and target-based CDC" in "Techniques for Capturing Changed Data" in the Data Services Designer Guide.

Overlap avoidance

In some cases, it is possible to set up a system where there is no possibility of an overlap. You can avoid overlaps if there is a processing interval where no updates are occurring on the target system.

For example, if you can guarantee the data extraction from the source system does not last more than one hour, you can run a job at 1:00 AM every night that selects only the data updated the previous day until midnight. While this regular job does not give you up-to-the-minute updates, it guarantees that you never have an overlap and greatly simplifies timestamp management.
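As a hedged sketch of this approach (the table and variable names are illustrative, not from the course exercises), the nightly job could compute a cutoff of "last midnight" with functions already used in this guide and filter on it in the WHERE clause alongside the usual last-update check:

# script: strip the time portion of the current date to get last midnight
$G_Cutoff = to_date(to_char(sysdate(), 'YYYY.MM.DD'), 'YYYY.MM.DD');

# WHERE clause in the delta data flow
employee.last_update > $G_Last_Update and employee.last_update < $G_Cutoff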

Overlap reconciliation

Overlap reconciliation requires a special extraction process that re-applies changes that could have occurred during the overlap period. This extraction can be executed separately from the regular extraction. For example, if the highest timestamp loaded from the previous job was 01/01/2008 10:30 PM and the overlap period is one hour, overlap reconciliation re-applies the data updated between 9:30 PM and 10:30 PM on January 1, 2008.

The overlap period is usually equal to the maximum possible extraction time. If it can take up to N hours to extract the data from the source system, an overlap period of N (or N plus a small increment) hours is recommended. For example, if it takes at most two hours to run the job, an overlap period of at least two hours is recommended.
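One hedged way to express this in the GetTimestamp script shown earlier is to subtract the overlap period when reading the last timestamp, so that the regular WHERE clause automatically re-selects rows from the overlap window. The date arithmetic below is Oracle-style and pushed down to the database; adjust the syntax for your RDBMS, and make sure the target load tolerates the re-applied rows (for example, with the Auto correct load option or a Table Comparison transform):

# re-read the last timestamp minus a one-hour overlap period (Oracle-style date arithmetic)
$G_Last_Update = sql('DEMO_Target', 'select max(last_update) - (1/24) from employee_dim');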

Presampling

Presampling is an extension of the basic timestamp processing technique. The main difference is that the status table contains both a start and an end timestamp, instead of the last update timestamp. The start timestamp for presampling is the same as the end timestamp of the previous job. The end timestamp for presampling is established at the beginning of the job. It is the most recent timestamp from the source table, commonly set as the system date.
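A hedged sketch of presampling, assuming a hypothetical status table cdc_status with an end_time column (the names are illustrative): the job establishes both boundaries before the data flow runs, filters on them, and records the end timestamp only after the load succeeds.

# start of job: read the previous end timestamp and fix this run's end timestamp
$G_Start_Time = sql('DEMO_Target', 'select max(end_time) from cdc_status');
$G_End_Time = sysdate();

# WHERE clause in the delta data flow
employee.last_update > $G_Start_Time and employee.last_update <= $G_End_Time

# a trailing script then writes $G_End_Time back to cdc_status as the new end timestamp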

Activity: Using source-based CDC

You need to set up a job to update employee records in the Omega data warehouse whenever they change. The employee records include timestamps to indicate when they were last updated, so you can use source-based CDC.


Objective

• Use timestamps to enable changed data capture for employee records.

Instructions

1. In the Omega project, create a new batch job called Alpha_Employees_Dim_Job.

2. Add a global variable called $G_LastUpdate with a datatype of datetime to your job.

3. In the job workspace, add a script called GetTimestamp and construct an expression to do the following:
• Select the last time the job was executed from the last update column in the employee dimension table.
• If the last update column is NULL, assign a value of January 1, 1901 to the $G_LastUpdate global variable. When the job executes for the first time for the initial load, this ensures that all records are processed.
• If the last update column is not NULL, assign the actual timestamp value to the $G_LastUpdate global variable.

The expression should be:

$G_LastUpdate = sql('omega','select max(LAST_UPDATE) from emp_dim');
if ($G_LastUpdate is null) $G_LastUpdate = to_date('1901.01.01','YYYY.MM.DD');
else print('Last update was ' || $G_LastUpdate);

4. In the job workspace, add a data flow called Alpha_Employees_Dim_DF and connect it to the script.

5. Add the Employee table from the Alpha datastore as the source object and the Emp_Dim table from the Omega datastore as the target object.

6. Add the Query transform and connect the objects.

7. In the transform editor for the Query transform, map the columns as follows:

Schema In       Schema Out
EMPLOYEEID      EMPLOYEEID
LASTNAME        LASTNAME
FIRSTNAME       FIRSTNAME
BIRTHDATE       BIRTHDATE
HIREDATE        HIREDATE
ADDRESS         ADDRESS
PHONE           PHONE
EMAIL           EMAIL
REPORTSTO       REPORTSTO
LastUpdate      LAST_UPDATE
discharge_date  DISCHARGE_DATE

8. Create a mapping expression for the SURR_KEY column that generates new keys based on the Emp_Dim target table, incrementing by 1.

The expression should be:

key_generation('Omega.dbo.emp_dim', 'SURR_KEY', 1)

9. Create a mapping expression for the CITY column to look up the city name from the City table in the Alpha datastore based on the city ID.

The expression should be:

lookup_ext([Alpha.source.city,'PRE_LOAD_CACHE','MAX'], [CITYNAME],[NULL],[CITYID,'=',employee.CITYID]) SET ("run_as_separate_process"='no')

10. Create a mapping expression for the REGION column to look up the region name from the Region table in the Alpha datastore based on the region ID.

The expression should be:

lookup_ext([Alpha.source.region,'PRE_LOAD_CACHE','MAX'], [REGIONNAME],[NULL],[REGIONID,'=',employee.REGIONID]) SET ("run_as_separate_process"='no')

11. Create a mapping expression for the COUNTRY column to look up the country name from the Country table in the Alpha datastore based on the country ID.

The expression should be:

lookup_ext([Alpha.source.country,'PRE_LOAD_CACHE','MAX'], [COUNTRYNAME],[NULL],[COUNTRYID,'=',employee.COUNTRYID]) SET ("run_as_separate_process"='no')


12. Create a mapping expression for the DEPARTMENT column to look up the department name from the Department table in the Alpha datastore based on the department ID.

The expression should be:

lookup_ext([Alpha.source.department,'PRE_LOAD_CACHE','MAX'], [DEPARTMENTNAME],[NULL],[DEPARTMENTID,'=',employee.DEPARTMENTID]) SET ("run_as_separate_process"='no')

13. On the WHERE tab, construct an expression to select only those records with a timestamp that is later than the $G_LastUpdate global variable.

The expression should be:

employee.LastUpdate > $G_LastUpdate

14. Execute Alpha_Employees_Dim_Job with the default execution properties and save all objects you have created. According to the log, the last update for the table was on 2007.12.27.

15. Return to the data flow workspace and view data for the target table. Sort the records by the LAST_UPDATE column.

A solution file called SOLUTION_SourceCDC.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using target-based CDC

Introduction

Target-based CDC compares the source to the target to determine which records have changed.

After completing this unit, you will be able to:

• Define the Data Integrator transforms involved in target-based CDC

Using target tables to identify changed data

Source-based CDC evaluates the source tables to determine what has changed and only extracts changed rows to load into the target tables. Target-based CDC, by contrast, extracts all the data from the source, compares the source and target rows, and then loads only the changed rows into the target with new surrogate keys.

Source-based changed-data capture is almost always preferable to target-based capture for performance reasons; however, some source systems do not provide enough information to make use of the source-based CDC techniques. Target-based CDC allows you to use the technique when source-based change information is limited.

You can preserve history by creating a data flow that contains the following:
• A source table contains the rows to be evaluated.
• A Query transform maps columns from the source.
• A Table Comparison transform compares the data in the source table with the data in the target table to determine what has changed. It generates a list of INSERT and UPDATE rows based on those changes. This circumvents the default behavior in Data Services of treating all changes as INSERT rows.
• A History Preserving transform converts certain UPDATE rows to INSERT rows based on the columns in which values have changed. This produces a second row in the target instead of overwriting the first row.
• A Key Generation transform generates new keys for the updated rows that are now flagged as INSERT.
• A target table receives the rows. The target table cannot be a template table.


Identifying history preserving transforms

Data Services supports history preservation with three Data Integrator transforms:

• History Preserving: Converts rows flagged as UPDATE to UPDATE plus INSERT, so that the original values are preserved in the target. You specify the column in which to look for updated data.
• Key Generation: Generates new keys for source data, starting from a value based on existing keys in the table you specify.
• Table Comparison: Compares two data sets and produces the difference between them as a data set with rows flagged as INSERT and UPDATE.


Explaining the Table Comparison transform

The Table Comparison transform allows you to detect and forward changes that have occurred since the last time a target was updated. This transform compares two data sets and produces the difference between them as a data set with rows flagged as INSERT or UPDATE.

For example, the transform compares the input and comparison tables and determines that row 10 has a new address, row 40 has a name change, and row 50 is a new record. The output includes all three records, flagged as appropriate.

The next section gives a brief description of the function, data input requirements, options, and data output results for the Table Comparison transform. For more information on the Table Comparison transform see "Transforms" Chapter 5 in the Data Services Reference Guide.

Input/output

The transform compares two data sets, one from the input to the transform (input data set), and one from a database table specified in the transform (the comparison table). The transform selects rows from the comparison table based on the primary key values from the input data set. The transform compares columns that exist in the schemas for both inputs.

The input data set must be flagged as NORMAL.

The output data set contains only the rows that make up the difference between the tables. The schema of the output data set is the same as the schema of the comparison table. No DELETE operations are produced.


If a column has a date datatype in one table and a datetime datatype in the other, the transform compares only the date section of the data. The columns can also be time and datetime datatypes, in which case Data Integrator only compares the time section of the data.

For each row in the input data set, there are three possible outcomes from the transform:
• An INSERT row is added: The primary key value from the input data set does not match a value in the comparison table. The transform produces an INSERT row with the values from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the transform adds these columns to the output schema and fills them with NULL values.
• An UPDATE row is added: The primary key value from the input data set matches a value in the comparison table, and values in the non-key compare columns differ in the corresponding rows from the input data set and the comparison table.
The transform produces an UPDATE row with the values from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the transform adds these columns to the output schema and fills them with values from the comparison table.
• The row is ignored: The primary key value from the input data set matches a value in the comparison table, but the comparison does not indicate any changes to the row values.

Options

The Table Comparison transform offers several options:

• Table name: Specifies the fully qualified name of the source table from which the maximum existing key is determined (key source table). This table must already be imported into the repository. Table name is represented as datastore.owner.table, where datastore is the name of the datastore Data Services uses to access the key source table and owner depends on the database type associated with the table.
• Generated key column: Specifies a column in the comparison table. When there is more than one row in the comparison table with a given primary key value, this transform compares the row with the largest generated key value of these rows and ignores the other rows. This is optional.


• Input contains duplicate keys: Provides support for input rows with duplicate primary key values.
• Detect deleted row(s) from comparison table: Flags the transform to identify rows that have been deleted from the source.
• Comparison method: Allows you to select the method for accessing the comparison table. You can select from Row-by-row select, Cached comparison table, and Sorted input.
• Input primary key column(s): Specifies the columns in the input data set that uniquely identify each row. These columns must be present in the comparison table with the same column names and datatypes.
• Compare columns: Improves performance by comparing only the sub-set of columns you drag into this box from the input schema. If no columns are listed, all columns in the input data set that are also in the comparison table are used as compare columns. This is optional.

Explaining the History Preserving transform

The History Preserving transform ignores everything but rows flagged as UPDATE. For these rows, it compares the values of specified columns and, if the values have changed, flags the row as INSERT. This produces a second row in the target instead of overwriting the first row.

For example, a target table that contains employee information is updated periodically from a source table. In this case, the Table Comparison transform has flagged the name change for row 40 as an update. However, the History Preserving transform is set up to preserve history on the LastName column, so the output changes the operation code for that record from UPDATE to INSERT.


The next section gives a brief description of the function, data input requirements, options, and data output results for the History Preserving transform. For more information on the History Preserving transform see "Transforms" Chapter 5 in the Data Services Reference Guide.

Input/output

The input data set is the result of a comparison between two versions of the same data in which rows with changed data from the newer version are flagged as UPDATE rows and new data from the newer version are flagged as INSERT rows.

The output data set contains rows flagged as INSERT or UPDATE.

Options

The History Preserving transform offers these options:

• Valid from: Specifies a date or datetime column from the source schema. Specify a Valid from date column if the target uses an effective date to track changes in data.
• Valid to: Specifies a date value in the following format: YYYY.MM.DD. The Valid to date cannot be the same as the Valid from date.


• Column: Specifies a column from the source schema that identifies the current valid row from a set of rows with the same primary key. The flag column indicates whether a row is the most current data in the target for a given primary key.
• Set value: Defines an expression that outputs a value with the same datatype as the value in the flag column. This value is used to update the current flag column in the new row added to the target to preserve history of an existing row.
• Reset value: Defines an expression that outputs a value with the same datatype as the value in the flag column. This value is used to update the current flag column in an existing row in the target that included changes in one or more of the compare columns.
• Preserve delete row(s) as update row(s): Converts DELETE rows to UPDATE rows in the target. If you previously set effective date values (Valid from and Valid to), sets the Valid to value to the execution date. This option is used to maintain slowly changing dimensions by feeding a complete data set first through the Table Comparison transform with its Detect deleted row(s) from comparison table option selected.
• Compare columns: Lists the column or columns in the input data set that are to be compared for changes.
  ○ If the values in the specified compare columns in each version match, the transform flags the row as UPDATE. The row from the before version is updated. The date and flag information is also updated.
  ○ If the values in each version do not match, the row from the latest version is flagged as INSERT when output from the transform. This adds a new row to the warehouse with the values from the new row.
  Updates to non-history preserving columns update all versions of the row if the update is performed on the natural key (for example, Customer), but only update the latest version if the update is on the generated key (for example, GKey).

Explaining the Key Generation transform

The Key Generation transform generates new keys before inserting the data set into the target in the same way as the key_generation function does. When it is necessary to generate artificial keys in a table, this transform looks up the maximum existing key value from a table and uses it as the starting value to generate new keys. The transform expects the generated key column to be part of the input schema.

For example, suppose the History Preserving transform produces rows to add to a warehouse, and these rows have the same primary key as rows that already exist in the warehouse. In this case, you can add a generated key to the warehouse table to distinguish these two rows that have the same primary key.

The next section gives a brief description of the function, data input requirements, options, and data output results for the Key Generation transform. For more information on the Key Generation transform see "Transforms" Chapter 5 in the Data Services Reference Guide.

Input/output

The input data set is the result of a comparison between two versions of the same data in which changed data from the newer version are flagged as UPDATE rows and new data from the newer version are flagged as INSERT rows.

The output data set is a duplicate of the input data set, with the addition of key values in the generated key column for input rows flagged as INSERT.

Options

The Key Generation transform offers these options:

• Table name: Specifies the fully qualified name of the source table from which the maximum existing key is determined (key source table). This table must already be imported into the repository. Table name is represented as datastore.owner.table, where datastore is the name of the datastore Data Services uses to access the key source table and owner depends on the database type associated with the table.
• Generated key column: Specifies the column in the key source table containing the existing key values. A column with the same name must exist in the input data set; the new keys are inserted in this column.
• Increment value: Indicates the interval between generated key values.

Activity: Using target-based CDC

You need to set up a job to update product records in the Omega data warehouse whenever they change. The product records do not include timestamps to indicate when they were last updated, so you must use target-based CDC to extract all records from the source and compare them to the target.

Objective

• Use target-based CDC to preserve history for the Product dimension.

Instructions

1. In the Omega project, create a new batch job called Alpha_Product_Dim_Job with a data flow called Alpha_Product_Dim_DF.

2. Add the Product table from the Alpha datastore as the source object and the Prod_Dim table from the Omega datastore as the target object.

3. Add the Query, Table Comparison, History Preserving, and Key Generation transforms.

4. Connect the source table to the Query transform and the Query transform to the target table to set up the schema prior to configuring the rest of the transforms.

5. In the transform editor for the Query transform, map the columns as follows:

Schema In    Schema Out
PRODUCTID    PRODUCTID
PRODUCTNAME  PRODUCTNAME
CATEGORYID   CATEGORYID
COST         COST

6. Until the key can be generated, specify a mapping expression for the SURR_KEY column to populate it with NULL.

7. Specify a mapping expression for the EFFECTIVE_DATE column to indicate the current date as sysdate().

8. Delete the link from the Query transform to the target table.

9. Connect the transforms in the following order: Query, Table Comparison, History Preserving, and Key Generation.

10. Connect the Key Generation transform to the target table.

11. In the transform editor for the Table Comparison transform, use the Prod_Dim table in the Omega datastore as the comparison table and set Surr_Key as the generated key column.

12. Set the input primary key column to PRODUCTID, and compare the PRODUCTNAME, CATEGORYID, and COST columns.

13. Do not configure the History Preserving transform.

14. In the transform editor for the Key Generation transform, set up key generation based on the Surr_Key column of the Prod_Dim table in the Omega datastore, incrementing by 1.

15. In the workspace, before executing the job, display the data in both the source and target tables. Note that the OmegaSoft product has been added in the source, but has not yet been updated in the target.

16. Execute Alpha_Product_Dim_Job with the default execution properties and save all objects you have created.

17. Return to the data flow workspace and view data for the target table. Note that the new records were added for product IDs 2, 3, 6, 8, and 13, and that OmegaSoft has been added to the target.

A solution file called SOLUTION_TargetCDC.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Quiz: Capturing changes in data

1. What are the two most important reasons for using CDC?

2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?

3. What is the difference between an initial load and a delta load?

4. What transforms do you typically use for target-based CDC?


Lesson summary

After completing this lesson, you are now able to:

• Update data over time
• Use source-based CDC
• Use target-based CDC


Lesson 9
Using Data Integrator Transforms

Lesson introduction

Data Integrator transforms are used to enhance your data integration projects beyond the core functionality of the platform transforms.

After completing this lesson, you will be able to:

• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform


Describing Data Integrator transforms

Introduction

Data Integrator transforms perform key operations on data sets to manipulate their structure as they are passed from source to target.

After completing this unit, you will be able to:

• Describe Data Integrator transforms available in Data Services

Defining Data Integrator transforms

The following transforms are available in the Data Integrator branch of the Transforms tab in the Local Object Library:

• Data Transfer: Allows a data flow to split its processing into two sub-data flows and push down resource-consuming operations to the database server.
• Date Generation: Generates a column filled with date values based on the start and end dates and increment you specify.
• Effective Date: Generates an additional effective-to column based on the primary key's effective date.
• Hierarchy Flattening: Flattens hierarchical data into relational tables so that it can participate in a star schema. Hierarchy flattening can be both vertical and horizontal.
• Map CDC Operation: Sorts input data, maps output data, and resolves before and after versions for UPDATE rows. While commonly used to support Oracle or mainframe changed data capture, this transform supports any data stream if its input requirements are met.
• Pivot: Rotates the values in specified columns to rows.
• Reverse Pivot: Rotates the values in specified rows to columns.


• XML Pipeline: Processes large XML inputs in small batches.


Using the Pivot transform

Introduction

The Pivot and Reverse Pivot transforms let you convert columns to rows and rows back into columns.

After completing this unit, you will be able to:

• Use the Pivot transform

Explaining the Pivot transform

The Pivot transform creates a new row for each value in a column that you identify as a pivot column.

It allows you to change how the relationship between rows is displayed. For each value in each pivot column, Data Services produces a row in the output data set. You can create pivot sets to specify more than one pivot column.

For example, you could produce a list of discounts by quantity for certain payment terms so that each type of discount is listed as a separate record, rather than each being displayed in a unique column.

The Reverse Pivot transform reverses the process, converting rows into columns.

The next section gives a brief description of the function, data input requirements, options, and data output results for the Pivot transform. For more information on the Pivot transform, see “Transforms,” Chapter 5 in the Data Services Reference Guide.
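
To make the column-to-row rotation concrete, the following SQL sketch shows roughly what a single pivot set does to a hypothetical compensation table. The table and column names are illustrative only; in Data Services you configure this in the transform editor rather than writing SQL.

-- Source columns:  EMPLOYEEID, SALARY, BONUS, VACATION_DAYS
-- Pivoted output:  EMPLOYEEID, PIVOT_SEQ (sequence column),
--                  COMP_TYPE (header column), COMP (data field column)
SELECT EMPLOYEEID, 1 AS PIVOT_SEQ, 'SALARY' AS COMP_TYPE, SALARY AS COMP
FROM   EMPLOYEE_COMP
UNION ALL
SELECT EMPLOYEEID, 2 AS PIVOT_SEQ, 'BONUS' AS COMP_TYPE, BONUS AS COMP
FROM   EMPLOYEE_COMP
UNION ALL
SELECT EMPLOYEEID, 3 AS PIVOT_SEQ, 'VACATION_DAYS' AS COMP_TYPE, VACATION_DAYS AS COMP
FROM   EMPLOYEE_COMP;

Each source row produces three output rows, one per pivot column, which is the shape the Pivot transform writes to its target.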


Inputs/Outputs

Data inputs include a data set with rows flagged as NORMAL.

Data outputs include a data set with rows flagged as NORMAL. This target includes the non-pivoted columns, a column for the sequence number, the data field column, and the pivot header column.

Options

The Pivot transform offers several options:

Option: Description

Pivot sequence column: Assign a name to the sequence number column. For each row created from a pivot column, Data Services increments and stores a sequence number.

Non-pivot columns: Select the columns in the source that are to appear in the target without modification.

Pivot set: Identify a number for the pivot set. For each pivot set, you define a group of pivot columns, a pivot data field, and a pivot header name.

Data field column: Specify the column that contains the pivoted data. This column contains all of the pivot columns' values.

Header column: Specify the name of the column that contains the pivoted column names. This column lists the names of the columns where the corresponding data originated.

Pivot columns: Select the columns to be rotated into rows. Describe these columns in the Header column. Describe the data in these columns in the Data field column.

To pivot a table

1. Open the data flow workspace.


2. Add your source object to the workspace.

3. On the Transforms tab of the Local Object Library, click and drag the Pivot or Reverse Pivot transform to the workspace to the right of your source object.

4. Add your target object to the workspace.

5. Connect the source object to the transform.

6. Connect the transform to the target object.

7. Double-click the Pivot transform to open the transform editor.

8. Click and drag any columns that will not be changed by the transform from the input schema area to the Non-Pivot Columns area.

9. Click and drag any columns that will be pivoted from the input schema area to the Pivot Columns area. If required, you can create more than one pivot set by clicking Add.

10. If desired, change the values in the Pivot sequence column, Data field column, and Header column fields. These are the new columns that will be added to the target object by the transform.


11. Click Back to return to the data flow workspace.

Activity: Using the Pivot transform

Currently, employee compensation information is loaded into a table with a separate column each for salary, bonus, and vacation days. For reporting purposes, you need each of these items to be a separate record in the HR datamart.

Objective

• Use the Pivot transform to create a separate row for each entry in a new employee compensation table.

Instructions

1. In the Omega project, create a new batch job called Alpha_HR_Comp_Job with a data flow called Alpha_HR_Comp_DF.

2. Add the HR_Comp_Update table from the Alpha datastore to the workspace as the source object.

3. Add the Pivot transform and connect it to the source object.

4. Add the Query transform and connect it to the Pivot transform.

5. Create a new template table called Employee_Comp in the Delta datastore as the target object.

6. Connect the Query transform to the new template table.

7. In the transform editor for the Pivot transform, specify that the EmployeeID and date_updated fields are non-pivot columns.

8. Specify that the Emp_Salary, Emp_Bonus, and Emp_VacationDays fields are pivot columns.

9. Specify that the data field column is called Comp, and the header column is called Comp_Type.

10. In the transform editor for the Query transform, map all fields from the input schema to the output schema.

11. On the WHERE tab, filter out NULL values for the Comp column.

The expression should be as follows:

Pivot.Comp is not null

12. Execute Alpha_HR_Comp_Job with the default execution properties and save all objects you have created.

13. Return to the data flow workspace and view data for the target table.

A solution file called SOLUTION_Pivot.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using the Hierarchy Flattening transform

Introduction

The Hierarchy Flattening transform enables you to break down hierarchical table structures into a single table to speed up data access.

After completing this unit, you will be able to:

• Use the Hierarchy Flattening transform

Explaining the Hierarchy Flattening transform

The Hierarchy Flattening transform constructs a complete hierarchy from parent/child relationships, and then produces a description of the hierarchy in horizontally- or vertically-flattened format.

For horizontally-flattened hierarchies, each row of the output describes a single node in the hierarchy and the path to that node from the root.

For vertically-flattened hierarchies, each row of the output describes a single relationship between ancestor and descendent and the number of nodes the relationship includes. There is a row in the output for each node and all of the descendants of that node. Each node is considered its own descendent and, therefore, is listed one time as both ancestor and descendent.
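
For comparison, the following recursive SQL sketch (ANSI-style syntax; the exact form varies by database, and the table and column names are illustrative) produces the same ancestor/descendent/depth rows that a vertically-flattened output contains. The transform spares you from writing and maintaining this kind of query.

-- EMPLOYEE(EMPLOYEEID, REPORTSTO): each row is one parent-child relationship.
WITH RECURSIVE FLAT (ANCESTOR, DESCENDENT, DEPTH) AS (
    -- every node is its own descendent at depth 0
    SELECT EMPLOYEEID, EMPLOYEEID, 0
    FROM   EMPLOYEE
    UNION ALL
    -- walk one level further down the hierarchy
    SELECT F.ANCESTOR, E.EMPLOYEEID, F.DEPTH + 1
    FROM   FLAT F
    JOIN   EMPLOYEE E ON E.REPORTSTO = F.DESCENDENT
)
SELECT ANCESTOR, DESCENDENT, DEPTH
FROM   FLAT;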


The next section gives a brief description of the function, data input requirements, options, and data output results for the Hierarchy Flattening transform. For more information on the Hierarchy Flattening transform, see “Transforms,” Chapter 5 in the Data Services Reference Guide.

Inputs/Outputs

Data input includes rows describing individual parent-child relationships. Each row must contain two columns that function as the keys of the parent and child in the relationship. The input can also include columns containing attributes describing the parent and/or child.

The input data set cannot include rows with operations other than NORMAL, but can contain hierarchical data.

For a listing of the target columns, consult the Data Services Reference Guide.

Options

The Hierarchy Flattening transform offers several options:

Option: Description

Parent column: Identifies the column of the source data that contains the parent identifier in each parent-child relationship.

Child column: Identifies the column in the source data that contains the child identifier in each parent-child relationship.


Flattening type: Indicates how the hierarchical relationships are described in the output.

Use maximum length paths: Indicates whether longest or shortest paths are used to describe relationships between descendants and ancestors when the descendent has more than one parent.

Maximum depth: Indicates the maximum depth of the hierarchy.

Parent attribute list: Identifies a column or columns that are associated with the parent column.

Child attribute list: Identifies a column or columns that are associated with the child column.

Run as a separate process: Creates a separate sub-data flow process for the Hierarchy Flattening transform when Data Services executes the data flow.

Activity: Using the Hierarchy Flattening transform

The Employee table in the Alpha datastore contains employee data in a recursive hierarchy. Determining all reports, direct or indirect, to a given executive or manager would require complex SQL statements to traverse the hierarchy.

Objective

• Flatten the hierarchy to allow more efficient reporting on data.

Instructions

1. In the Omega project, create a new batch job called Alpha_Employees_Report_Job with a data flow called Alpha_Employees_Report_DF.

2. In the data flow workspace, add the Employee table from the Alpha datastore as the source object.

3. Create a template table called Manager_Emps in the HR_datamart datastore as the target object.

4. Add a Hierarchy Flattening transform to the right of the source table and connect the source table to the transform.


5. In the transform editor for the Hierarchy Flattening transform, select the following options:

Option: Value

Flattening Type: Vertical

Parent Column: REPORTSTO

Child Column: EMPLOYEEID

Child Attribute List: LASTNAME, FIRSTNAME, BIRTHDATE, HIREDATE, ADDRESS, CITYID, REGIONID, COUNTRYID, PHONE, EMAIL, DEPARTMENTID, LastUpdate, discharge_date

6. Add a Query transform to the right of the Hierarchy Flattening transform and connect the transforms.

7. In the transform editor of the Query transform, create the following output columns:

Column: Datatype

MANAGERID: varchar(10)

MANAGER_NAME: varchar(50)

EMPLOYEEID: varchar(10)


EMPLOYEE_NAME: varchar(102)

DEPARTMENT: varchar(50)

HIREDATE: datetime

LASTUPDATE: datetime

PHONE: varchar(20)

EMAIL: varchar(50)

ADDRESS: varchar(200)

CITY: varchar(50)

REGION: varchar(50)

COUNTRY: varchar(50)

DISCHARGE_DATE: datetime

DEPTH: int

ROOT_FLAG: int

LEAF_FLAG: int

8. Map the output columns as follows:

Schema In -> Schema Out

ANCESTOR -> MANAGERID

DESCENDENT -> EMPLOYEEID


DEPTH -> DEPTH

ROOT_FLAG -> ROOT_FLAG

LEAF_FLAG -> LEAF_FLAG

C_ADDRESS -> ADDRESS

C_discharge_date -> DISCHARGE_DATE

C_EMAIL -> EMAIL

C_HIREDATE -> HIREDATE

C_LastUpdate -> LASTUPDATE

C_PHONE -> PHONE

9. Create a mapping expression for the MANAGER_NAME column to look up the manager's last name from the Employee table in the Alpha datastore based on the employee ID in the ANCESTOR column of the Hierarchy Flattening transform.

The expression should be:

lookup_ext([Alpha.source.employee, 'PRE_LOAD_CACHE', 'MAX'], [LASTNAME], [NULL], [EMPLOYEEID, '=', Hierarchy_Flattening.ANCESTOR]) SET ("run_as_separate_process"='no')

10. Create a mapping expression for the EMPLOYEE_NAME column to concatenate the employee's last name and first name, separated by a comma.

The expression should be:

Hierarchy_Flattening.C_LASTNAME || ', ' || Hierarchy_Flattening.C_FIRSTNAME

11. Create a mapping expression for the DEPARTMENT column to look up the name of the employee's department from the Department table in the Alpha datastore based on the C_DEPARTMENTID column of the Hierarchy Flattening transform.

The expression should be:


lookup_ext([Alpha.source.department, 'PRE_LOAD_CACHE', 'MAX'], [DEPARTMENTNAME], [NULL], [DEPARTMENTID, '=', Hierarchy_Flattening.C_DEPARTMENTID]) SET ("run_as_separate_process"='no')

12. Create a mapping expression for the CITY column to look up the name of the employee's city from the City table in the Alpha datastore based on the C_CITYID column of the Hierarchy Flattening transform.

The expression should be:

lookup_ext([Alpha.source.city, 'PRE_LOAD_CACHE', 'MAX'], [CITYNAME], [NULL], [CITYID, '=', Hierarchy_Flattening.C_CITYID]) SET ("run_as_separate_process"='no')

13. Create a mapping expression for the REGION column to look up the name of the employee's region from the Region table in the Alpha datastore based on the C_REGIONID column of the Hierarchy Flattening transform.

The expression should be:

lookup_ext([Alpha.source.region, 'PRE_LOAD_CACHE', 'MAX'], [REGIONNAME], [NULL], [REGIONID, '=', Hierarchy_Flattening.C_REGIONID]) SET ("run_as_separate_process"='no')

14. Create a mapping expression for the COUNTRY column to look up the name of the employee's country from the Country table in the Alpha datastore based on the C_COUNTRYID column of the Hierarchy Flattening transform.

The expression should be:

lookup_ext([Alpha.source.country, 'PRE_LOAD_CACHE', 'MAX'], [COUNTRYNAME], [NULL], [COUNTRYID, '=', Hierarchy_Flattening.C_COUNTRYID]) SET ("run_as_separate_process"='no')

15. Add a WHERE clause to the Query transform to return only rows where the depth is greater than zero.

The expression should be as follows:

Hierarchy_Flattening.DEPTH > 0

16. Execute Alpha_Employees_Report_Job with the default execution properties and save all objects you have created.

17. Return to the data flow workspace and view data for the target table. Note that 179 rows were written to the target table.

A solution file called SOLUTION_HierarchyFlattening.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Describing performance optimization

Introduction

You can improve the performance of your jobs by pushing down operations to the source or target database to reduce the number of rows and operations that the engine must retrieve and process.

After completing this unit, you will be able to:

• List operations that Data Services pushes down to the database
• View SQL generated by a data flow
• Explore data caching options
• Explain process slicing

Describing push-down operations

Data Services examines the database and its environment when determining which operations to push down to the database:

• Full push-down operations

The Data Services optimizer always tries to do a full push-down operation. Full push-down operations are operations that can be pushed down to the databases so that the data streams directly from the source database to the target database. For example, Data Services sends SQL INSERT INTO... SELECT statements to the target database, and it sends SELECT statements to retrieve data from the source.

Data Services can only do full push-down operations to the source and target databases when the following conditions are met:
○ All of the operations between the source table and target table can be pushed down.
○ The source and target tables are from the same datastore, or they are in datastores that have a database link defined between them.

• Partial push-down operations

When a full push-down operation is not possible, Data Services tries to push down the SELECT statement to the source database. Operations within the SELECT statement that can be pushed to the database include:

Operation: Description

Aggregations: Aggregate functions, typically used with a GROUP BY statement, always produce a data set smaller than or the same size as the original data set.


Distinct rows: Data Services will only output unique rows when you use distinct rows.

Filtering: Filtering can produce a data set smaller than or equal to the original data set.

Joins: Joins typically produce a data set smaller than or similar in size to the original tables.

Ordering: Ordering does not affect data set size. Data Services can efficiently sort data sets that fit in memory. Since Data Services does not perform paging (writing out intermediate results to disk), it is recommended that you use a dedicated disk-sorting program such as SyncSort or the DBMS itself to order very large data sets.

Projections: A projection normally produces a smaller data set because it only returns columns referenced by a data flow.

Functions: Most Data Services functions that have equivalents in the underlying database are appropriately translated.

Operations that cannot be pushed down

Data Services cannot push some transform operations to the database. For example:
• Expressions that include Data Services functions that do not have database correspondents.
• Load operations that contain triggers.
• Transforms other than Query.
• Joins between sources that are on different database servers that do not have database links defined between them.

Similarly, not all operations can be combined into single requests. For example, when a stored procedure contains a COMMIT statement or does not return a value, you cannot combine the stored procedure SQL with the SQL for other operations in a query. You can only push operations supported by the RDBMS down to that RDBMS.
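
As an illustration only, consider a data flow that reads an ORDERS table, filters and aggregates it in a Query transform, and loads an ORDER_SUMMARY table in the same datastore. Under those conditions a full push-down lets the optimizer generate a single statement along these lines (the table and column names are hypothetical, and the exact SQL depends on the database):

-- The whole data flow runs inside the database; no rows travel through the engine.
INSERT INTO ORDER_SUMMARY (CUSTOMERID, ORDER_TOTAL)
SELECT   CUSTOMERID, SUM(ORDER_AMOUNT)
FROM     ORDERS
WHERE    ORDER_DATE >= '2009-01-01'   -- filtering pushed down
GROUP BY CUSTOMERID;                  -- aggregation pushed down

If any step in between cannot be pushed down, for example a custom function with no database equivalent, only the SELECT portion is pushed to the source and the remaining work happens in the Data Services engine.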


Note: You cannot push built-in functions or transforms to the source database. For best performance, do not intersperse built-in transforms among operations that can be pushed down to the database. Database-specific functions can only be used in situations where they will be pushed down to the database for execution.

Viewing SQL generated by a data flow

Before running a job, you can view the SQL generated by the data flow and, when necessary, adjust your design to maximize the SQL that is pushed down and improve performance.

Keep in mind that Data Services only shows the SQL generated for table sources. Data Services does not show the SQL generated for SQL sources that are not table sources, such as the lookup function, the Key Generation transform, the key_generation function, the Table Comparison transform, and target tables.

To view SQL

1. In the Data Flows tab of the Local Object Library, right-click the data flow and select Display Optimized SQL from the menu. The Optimized SQL dialog box displays.

2. In the left pane, select the datastore for the data flow.

The optimized SQL for the datastore displays in the right pane.

Caching data

You can improve the performance of data transformations that occur in memory by caching as much data as possible. By caching data, you limit the number of times the system must access the database. Cached data must fit into available memory.


Pageable caching

Data Services allows administrators to select a pageable cache location to save content over the 2 GB RAM limit. The pageable cache location is set up in the Server Manager, and the option to use pageable cache is selected on the Dataflow Properties dialog box.

Persistent caching

Persistent cache datastores can be created through the Create New Datastore dialog box by selecting Persistent Cache as the database type. The newly created persistent cache datastore will appear in the list of datastores, and can be used as a source in jobs.

For more information about advanced caching features, see the Data Services Performance Optimization Guide.

Slicing processes

You can also optimize your jobs through process slicing, which involves splitting data flows into sub-data flows.

Sub-data flows work on smaller data sets and/or fewer transforms, so there is less virtual memory to consume per process. This way, you can leverage more physical memory per data flow, as each sub-data flow can access 2 GB of memory.

This functionality is available through the Advanced tab of the Query transform. You can run each memory-intensive operation as a separate process.

For more information on process slicing, see the Data Services Performance Optimization Guide.


Using the Data Transfer transform

Introduction

The Data Transfer transform allows a data flow to split its processing into two sub-data flows and push down resource-consuming operations to the database server.

After completing this unit, you will be able to:

• Use the Data Transfer transform

Explaining the Data Transfer transform

The Data Transfer transform moves data from a source or the output from another transform into a transfer object, and subsequently reads data from the transfer object. You can use the Data Transfer transform to push down resource-intensive database operations that occur anywhere within the data flow. The transfer type can be a relational database table, persistent cache table, file, or pipeline.

Use the Data Transfer transform to:
• Push down operations to the database server when the transfer type is a database table. You can push down resource-consuming operations such as joins, GROUP BY, and sorts.
• Define points in your data flow where you want to split processing into multiple sub-data flows that each process part of the data. Data Services does not need to process the entire input data in memory at one time. Instead, the Data Transfer transform splits the processing among multiple sub-data flows that each use a portion of memory.

The next section gives a brief description of the function, data input requirements, options, and data output results for the Data Transfer transform. For more information on the Data Transfer transform, see “Transforms,” Chapter 5 in the Data Services Reference Guide.

Inputs/Outputs

When the input data set for the Data Transfer transform is a table or file transfer type, the rows must be flagged with the NORMAL operation code. When the input data set is a pipeline transfer type, the rows can be flagged with any operation code.

The input data set must not contain hierarchical (nested) data.

Output data sets have the same schema and the same operation code as the input data sets. In the push-down scenario, the output rows are in the sort or GROUP BY order.

The sub-data flow names use the following format, where n is the number of the data flow:

dataflowname_n

The execution of the output depends on the temporary transfer type:

For Table or File temporary transfer types, Data Services automatically splits the data flow into sub-data flows and executes them serially.


For Pipeline transfer types, Data Services splits the data flow into sub-data flows if you specify the Run as a separate process option in another operation in the data flow. Data Services executes these sub-data flows that use pipeline in parallel.
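
To see the effect in terms of generated SQL, consider two source tables in different schemas joined in a Query transform. The statements below are only a sketch of what Display Optimized SQL might show; the table, owner, and column names mirror the activity that follows and the exact SQL depends on your database.

-- Without the Data Transfer transform: one SELECT per source,
-- and the join runs in the Data Services engine.
SELECT EMPLOYEEID, COMP_TYPE, COMP FROM EMPLOYEE_COMP;
SELECT EMPLOYEEID, LASTNAME, BIRTHDATE FROM EMPLOYEE;

-- With a table-type Data Transfer transform: the first sub-data flow loads the
-- transfer table (SOURCE.PUSHDOWN_DATA), and the second sub-data flow can push
-- the join, including its WHERE clause, down to the source database.
SELECT E.LASTNAME, E.BIRTHDATE, P.COMP_TYPE, P.COMP
FROM   SOURCE.EMPLOYEE E, SOURCE.PUSHDOWN_DATA P
WHERE  P.EMPLOYEEID = E.EMPLOYEEID;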

Activity: Using the Data Transfer transform

The Data Transfer transform can be used to push data down to a database table so that it can be processed by the database server rather than the Data Services Job Server. In this activity, you will join data from two database schemas. When the Data Transfer transform is not used, the join occurs on the Data Services Job Server. When the Data Transfer transform is added to the data flow, the join can be seen by displaying the optimized SQL for the data flow.

Objective

• Use the Data Transfer transform to optimize performance.

Instructions

1. In the Omega project, create a new batch job called No_Data_Transfer_Job with a data flow called No_Data_Transfer_DF.

2. In the Delta datastore, import the Employee_Comp table and add it to the No_Data_Transfer_DF workspace as a source table.

3. Add the Employee table from the Alpha datastore as a source table.

4. Add a Query transform to the data flow workspace and attach both source tables to the transform.

5. In the transform editor for the Query transform, add the LastName and BirthDate columns from the Employee table and the Comp_Type and Comp columns from the Employee_Comp table to the output schema.

6. Add a WHERE clause to join the tables on the EmployeeID columns.

7. Create a template table called Employee_Temp in the Delta datastore as the target object and connect it to the Query transform.

8. Save the job.

9. In the Local Object Library, use the right-click menu for the No_Data_Transfer_DF data flow to display the optimized SQL. Note that the WHERE clause does not appear in either SQL statement.

10. In the Local Object Library, replicate the No_Data_Transfer_DF data flow and rename the copy Data_Transfer_DF.

11. In the Local Object Library, replicate the No_Data_Transfer_Job job and rename the copy Data_Transfer_Job.

12. Add the Data_Transfer_Job job to the Omega project.


13. Delete the No_Data_Transfer_DF data flow from the Data_Transfer_Job and add the Data_Transfer_DF data flow to the job by dragging it from the Local Object Library to the job's workspace.

14. Delete the connection between the Employee_Comp table and the Query transform.

15. Add a Data Transfer transform between the Employee_Comp table and the Query transform and connect the three objects.

16. In the transform editor for the Data Transfer transform, select the Table option for the Transfer Type field.

17. In the Table Options section, click the ellipses (...) button and select Table Name. Select the Alpha datastore. In the Table Name field, enter PUSHDOWN_DATA. In the Owner field, enter SOURCE.

18. In the transform editor for the Query transform, update the WHERE clause to join the Data_Transfer.employeeid and employee.employeeid fields. Verify that the Comp_Type and Comp columns are mapped from the Data Transfer transform.

19. Save the job.

20. In the Local Object Library, use the right-click menu for the Data_Transfer_DF data flow to display the optimized SQL. Note that the WHERE clause appears in the SQL statements.

A solution file called SOLUTION_DataTransfer.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Using the XML Pipeline transform

Introduction

The XML Pipeline transform is used to process large XML files more efficiently by separating them into small batches.

After completing this unit, you will be able to:

• Use the XML Pipeline transform

Explaining the XML Pipeline transform

The XML Pipeline transform is used to process large XML files, one instance of a specified repeatable structure at a time.

With this transform, Data Services does not need to read the entire XML input into memory and build an internal data structure before performing the transformation.

This means that an NRDM structure is not required to represent the entire XML data input. Instead, this transform uses a portion of memory to process each instance of a repeatable structure, then continually releases and re-uses the memory to continuously flow XML data through the transform.

During execution, Data Services pushes operations of the streaming transform to the XML source. Therefore, you cannot use a breakpoint between your XML source and an XML Pipeline transform.

Note:

You can use the XML Pipeline transform to load into a relational or nested schema target. This course focuses on loading XML data into a relational target.

For more information on constructing nested schemas for your target, refer to the Data Services Designer Guide.

Inputs/Outputs

You can use an XML file or XML message. You can also connect more than one XML Pipeline transform to an XML source.

When connected to an XML source, the transform editor shows the input and output schema structures as a root schema containing repeating and non-repeating sub-schemas represented by these icons:

• Root schema and repeating sub-schema


• Non-repeating sub-schema

Keep in mind these rules when using the XML Pipeline transform:
• You cannot drag and drop the root level schema.
• You can drag and drop the same child object multiple times to the output schema, but only if you give each instance of that object a unique name. Rename the mapped instance before attempting to drag and drop the same object to the output again.
• When you drag and drop a column or sub-schema to the output schema, you cannot then map the parent schema for that column or sub-schema. Similarly, when you drag and drop a parent schema, you cannot then map an individual column or sub-schema from under that parent.
• You cannot map items from two sibling repeating sub-schemas, because the XML Pipeline transform does not support Cartesian products (combining every row from one table with every row in another table) of two repeatable schemas.

To take advantage of the XML Pipeline transform's performance, always select a repeatable column to be mapped. For example, if you map a repeatable schema column, the XML source produces one row after parsing one item.

Avoid selecting non-repeatable columns that occur structurally after the repeatable schema, because the XML source must then assemble the entire structure of items in memory before processing. Selecting non-repeatable columns that occur structurally after the repeatable schema increases the memory consumption needed to process the output into your target.

To map both the repeatable schema and a non-repeatable column that occurs after the repeatable one, use two XML Pipeline transforms, and use a Query transform to combine the outputs of the two XML Pipeline transforms and map the columns into one single target.
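
Viewed relationally, the two XML Pipeline outputs behave like two small tables that the Query transform joins back together. The SQL below is purely illustrative; the names anticipate the activity that follows, and no such tables are actually created by Data Services.

-- Output of the first XML Pipeline: one row per item of each purchase order.
--   PIPELINE_ITEMS(customerName, orderDate, itemName, quantity)
-- Output of the second XML Pipeline: the header columns plus the non-repeating total.
--   PIPELINE_TOTALS(customerName, orderDate, totalPOs)
SELECT I.customerName, I.orderDate, I.itemName, I.quantity, T.totalPOs
FROM   PIPELINE_ITEMS I
JOIN   PIPELINE_TOTALS T ON T.customerName = I.customerName;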

Options

The XML Pipeline is streamlined to support massive throughput of XML data; therefore, it does not contain additional options other than input and output schemas, and the Mapping tab.

Activity: Using the XML Pipeline transform

Purchase order information is stored in XML files that have repeatable purchase orders and items, and a non-repeated Total Purchase Orders column. You must combine the customer name, order date, order items, and the totals into a single relational target table, with one row per customer per item.


Objectives

• Use the XML Pipeline transform to extract XML data.
• Combine the rows required from both XML sources into a single target table joined using a Query transform.

Instructions

1. On the Formats tab of the Local Object Library, create a new file format for an XML schema called purchaseOrders_Format, based on the purchaseOrders.xsd file in the Activity_Source folder. Use a root element of PurchaseOrders.

2. In the Omega project, create a new job called Alpha_Purchase_Orders_Job, with a data flow called Alpha_Purchase_Orders_DF.

3. In the data flow workspace, add the purchaseOrders_Format file format as the XML file source object.

4. In the format editor for the file format, point the file format to the pos.xml file in the Activity_Source folder. Note that when working in a distributed environment, where the Designer and the Job Server are on different machines, it may be necessary to edit the path to the XML file if it is different on the Job Server than on the Designer client. Your instructor will tell you if you need to edit the path to the file for this activity.

5. Add two instances of the XML Pipeline transform to the data flow workspace and connect the source object to each.

6. In the transform editor for the first XML Pipeline transform, map the following columns:

Schema In -> Schema Out

customerName -> customerName

orderDate -> orderDate

7. Map the entire item repeatable schema from the input schema to the output schema.

8. In the transform editor for the second XML Pipeline transform, map the following columns:

Schema In -> Schema Out

customerName -> customerName

orderDate -> orderDate

totalPOs -> totalPOs


9. Add a Query transform to the data flow workspace and connect both XML Pipeline transforms to it.

10. In the transform editor for the Query transform, map both columns and the repeatable schema from the first XML Pipeline transform from the input schema to the output schema. Also map the totalPOs column from the second XML Pipeline transform.

11. Unnest the item repeatable schema.

12. Create a WHERE clause to join the inputs from the two XML Pipeline transforms on the customerName column.

The expression should be as follows:

XML_Pipeline.customerName = XML_Pipeline_1.customerName

13. Add a new template table called Item_POs to the Delta datastore and connect the Query transform to it.

14. Execute Alpha_Purchase_Orders_Job with the default execution properties and save all objects you have created.

15. Return to the data flow workspace and view data for the target table.

A solution file called SOLUTION_XMLPipeline.atl is included in your Course Resources. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.


Quiz: Using Data Integrator transforms

1. What is the Pivot transform used for?

2. What is the purpose of the Hierarchy Flattening transform?

3. What is the difference between the horizontal and vertical flattening hierarchies?

4. List three things you can do to improve job performance.

5. Name three options that can be pushed down to the database.


Lesson summary

After completing this lesson, you are now able to:

• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform


Answer Key

This section contains the answers to the reviews and/or activities for the applicable lessons.



Quiz: Describing Data Services (Page 28)

1. List two benefits of using Data Services.

Answer:
○ Create a single infrastructure for data movement to enable faster and lower cost implementation.
○ Manage data as a corporate asset independent of any single system.
○ Integrate data across many systems and re-use that data for many purposes.
○ Improve performance.
○ Reduce burden on enterprise systems.
○ Prepackage data solutions for fast deployment and quick return on investment (ROI).
○ Cleanse customer and operational data anywhere across the enterprise.
○ Enhance customer and operational data by appending additional information to increase the value of the data.
○ Match and consolidate data at multiple levels within a single pass for individuals, households, or corporations.

2. Which of these objects is single-use?

Answer: b. Project

3. Place these objects in order by their hierarchy: data flows, jobs, projects, and work flows.

Answer: Projects, jobs, work flows, data flows.

4. Which tool do you use to associate a job server with a repository?

Answer: The Data Services Server Manager.

5. Which tool allows you to create a repository?

Answer: The Data Services Repository Manager.

6. What is the purpose of the Access Server?

Answer: The Access Server is a real-time, request-reply message broker that collects incoming XML message requests, routes them to a real-time service, and delivers a message reply within a user-specified time frame.


Quiz: Defining source and target metadata (Page 69)

1. What is the difference between a datastore and a database?

Answer: A datastore is a connection to a database.

2. What are the two methods in which metadata can be manipulated in Data Services objects? What does each of these do?

Answer:

You can use an object’s options and properties settings to manipulate Data Services objects.

Options control the operation of objects. For example, the name of the database to connect to is a datastore option.

Properties document the object. For example, the name of the datastore and the date on which it was created are datastore properties. Properties are merely descriptive of the object and do not affect its operation.

3. Which of the following is NOT a datastore type?

Answer: d. File Format

4. What is the difference between a repository and a datastore?

Answer: A repository is a set of tables that hold system objects, source and target metadata, and transformation rules. A datastore is an actual connection to a database that holds data.


Quiz: Creating batch jobs (Page 101)

1. Does a job have to be part of a project to be executed in the Designer?

Answer: Yes. Jobs can be created separately in the Local Object Library, but they must be associated with a project in order to be executed.

2. How do you add a new template table?

Answer: Click and drag the Template Table icon from the tool palette or from the Datastores tab of the Local Object Library to the workspace.

3. Name the objects contained within a project.

Answer: Examples of objects are: jobs, work flows, and data flows.

4. What factors might you consider when determining whether to run work flows or data flows serially or in parallel?

Answer:

Consider the following:
○ Whether or not the flows are independent of each other
○ Whether or not the server can handle the processing requirements of flows running at the same time (in parallel)


Quiz: Troubleshooting batch jobs (Page 131)

1. List some reasons why a job might fail to execute.

Answer: Incorrect syntax, the Job Server not running, or the port numbers for the Designer and Job Server not matching.

2. Explain the View Data feature.

Answer: View Data allows you to look at the data for a source or target file.

3. What must you define in order to audit a data flow?

Answer: You must define audit points and audit rules when you want to audit a data flow.

4. True or false? The auditing feature is disabled when you run a job with the debugger.

Answer: True.


Quiz: Using functions, scripts, and variables (Page 177)

1. Describe the differences between a function and a transform.

Answer: Functions operate on single values, such as values in specific columns in a data set. Transforms operate on data sets, creating, updating, and deleting rows of data.

2. Why are functions used in expressions?

Answer: Functions can be used in expressions to map return values as new output columns. Adding output columns allows columns that are not in an input data set to be specified in an output data set.

3. What does a lookup function do? How do the different variations of the lookup function differ?

Answer: All lookup functions return one row for each row in the source. They differ in how they choose which of several matching rows to return.

4. What value would the Lookup_ext function return if multiple matching records were found on the translate table?

Answer: Depends on Return Policy (Min or Max)

5. Explain the differences between a variable and a parameter.

Answer: A parameter is an expression that passes a piece of information to a work flow, data flow, or custom function when it is called in a job. A variable is a symbolic placeholder for values.

6. When would you use a global variable instead of a local variable?

Answer:
○ When the variable will need to be used multiple times within a job.
○ When you want to reduce the development time required for passing values between job components.
○ When you need to create a dependency between a job-level global variable name and job components.

7. What is the recommended naming convention for variables in Data Services?

Answer: Variable names must be preceded by a dollar sign ($). Local variables start with $L_, while global variables can be denoted by $G_.

8. Which object would you use to define a value that is constant in one environment, but may change when a job is migrated to another environment?

Answer: d. Substitution parameter


Quiz: Using platform transforms (Page 207)

1. What would you use to change a row type from NORMAL to INSERT?

Answer: The Map Operation transform.

2. What is the Case transform used for?

Answer: The Case transform simplifies branch logic in data flows by consolidating case or decision-making logic into one transform, with multiple paths defined in an expression table.

3. Name the transform that you would use to combine incoming data sets to produce a single output data set with the same schema as the input data sets.

Answer: The Merge transform.

4. A validation rule consists of a condition and an action on failure. When can you use the action on failure options in the validation rule?

Answer:

You can use the action on failure option only if:
○ The column value failed the validation rule.
○ The Send to Pass or Send to Both option is selected.

5. When would you use the Merge transform versus the SQL transform to merge records?

Answer: The SQL transform performs better than the Merge transform, so it should be used whenever possible. However, the SQL transform cannot join records from file formats, so you would need to use the Merge transform for those source objects.


Quiz: Setting up error handling (Page 223)

1. List the different strategies you can use to avoid duplicate rows of data when re-loading a job.

Answer:
○ Using the auto-correct load option in the target table.
○ Including the Table Comparison transform in the data flow.
○ Designing the data flow to completely replace the target table during each execution.
○ Including a preload SQL statement to execute before the table loads.

2. True or false? You can only run a job in recovery mode after the initial run of the job has been set to run with automatic recovery enabled.

Answer: True.

3. What are the two scripts in a manual recovery work flow used for?

Answer: The first script determines if recovery is required, usually by reading the status in a status table. The second script updates the status table to indicate successful job execution.

4. Which of the following types of exception can you NOT catch using a try/catch block?

Answer: b. Syntax errors


Quiz: Capturing changes in data (Page 249)

1. What are the two most important reasons for using CDC?

Answer: Improving performance and preserving history.

2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?

Answer: Source-based CDC.

3. What is the difference between an initial load and a delta load?

Answer:

An initial load is the first population of a database using data acquisition modules for extraction, transformation, and load. The first time you execute a batch job, Designer performs an initial load to create the data tables and populate them.

A delta load incrementally loads data that has been changed or added since the last load iteration. When you execute your job, the delta load may run several times, loading data from the specified number of rows each time until all new data has been written to the target database.

4. What transforms do you typically use for target-based CDC?

Answer: Table Comparison, History Preserving, and Key Generation.


Quiz: Using Data Integrator transforms (Page 276)

1. What is the Pivot transform used for?

Answer: Use the Pivot transform when you want to group data from multiple columns into one column while at the same time maintaining information linked to the columns.

2. What is the purpose of the Hierarchy Flattening transform?

Answer: The Hierarchy Flattening transform enables you to break down hierarchical table structures into a single table to speed data access.

3. What is the difference between the horizontal and vertical flattening hierarchies?

Answer:

With horizontally-flattened hierarchies, each row of the output describes a single node in the hierarchy and the path to that node from the root.

With vertically-flattened hierarchies, each row of the output describes a single relationship between ancestor and descendent and the number of nodes the relationship includes. There is a row in the output for each node and all of the descendants of that node. Each node is considered its own descendent and, therefore, is listed one time as both ancestor and descendent.

4. List three things you can do to improve job performance.

Answer:

Choose from the following:
○ Utilize the push-down operations.
○ View the SQL generated by a data flow and adjust your design to maximize the SQL that is pushed down to improve performance.
○ Use data caching.
○ Use process slicing.

5. Name three options that can be pushed down to the database.

Answer:

Choose from the following:
○ Aggregations (typically performed with a GROUP BY)
○ Distinct rows
○ Filtering
○ Joins
○ Ordering
○ Projections
○ Functions that have equivalents in the underlying database

