The Data Warehouse ETL Toolkit, by Ralph Kimball and Joe Caserta
VSV Training, Chapter 01: Surrounding the Requirements
Prepared by: Khoi NGUYEN and Hien NGUYEN. Date: 09/02/2008
ETL Book Overview
PART I: Requirements, Realities, and Architecture
- Chapter 01: Surrounding the Requirements
- Chapter 02: ETL Data Structures
PART II: Data Flow
- Chapter 03: Extracting
- Chapter 04: Cleaning and Conforming
- Chapter 05: Delivering Dimension Tables
- Chapter 06: Delivering Fact Tables
PART III: Implementation and Operations
- Chapter 07: Development
- Chapter 08: Operations
- Chapter 09: Metadata
- Chapter 10: Responsibilities
PART IV: Real Time Streaming ETL Systems
- Chapter 11: Real-Time ETL Systems
- Chapter 12: Conclusions
Chapter 1: Surrounding the Requirements
- Introduction
- Requirements
- Architecture
- The Mission of the Data Warehouse
- The Mission of the ETL Team
1.0 Introduction
The ETL system is the foundation of the data warehouse. It:
- Extracts data from source systems
- Enforces data quality and consistency
- Conforms data from different sources
- Delivers data in a presentation-ready format
ETL adds significant value to data:
- Removes mistakes and corrects missing data
- Provides documented measures of confidence in data
- Captures the flow of transactional data for safekeeping
- Adjusts data from multiple sources so they can be used together
- Structures data to be usable by end-user tools
1.0 Introduction (What You Will Learn in This Book)
- Plan and design your ETL system
- Choose the appropriate architecture from the many possible choices
- Manage the implementation
- Manage the day-to-day operations
- Build the development/test/production suite of ETL processes
- Understand the tradeoffs of various back-room data structures, including flat files, normalized schemas, XML schemas, and star join (dimensional) schemas
- Analyze and extract source data
- Build a comprehensive data-cleaning subsystem
- Structure data into dimensional schemas for the most effective delivery to end users, business-intelligence tools, data-mining tools, OLAP cubes, and analytic applications
- Deliver data effectively both to highly centralized and profoundly distributed data warehouses using the same techniques
- Tune the overall ETL process for optimum performance
1.0 Introduction (Two Simultaneous Threads)
The ETL effort proceeds along two simultaneous threads: the planning and design thread and the data flow thread, detailed in the following slides.
1.0 Introduction (The Planning and Design Thread – Step 1: Requirements and Realities)
- Business needs
- Data profiling and other data-source realities
- Compliance requirements
- Security requirements
- Data integration
- Data latency
- Archiving and lineage
- End user delivery interfaces
- Available development skills
- Available management skills
- Legacy licenses
1.0 Introduction (The Planning and Design Thread – Step 2: Architecture)
- Hand-coded versus ETL vendor tool
- Batch versus streaming data flow
- Horizontal versus vertical task dependency
- Scheduler automation
- Exception handling
- Quality handling
- Recovery and restart
- Metadata
- Security
1.0 Introduction (The Planning and Design Thread – Step 3: System Implementation)
- Hardware
- Software
- Coding practices
- Documentation practices
- Specific quality checks
1.0 Introduction (The Planning and Design Thread – Final Step: Administration and Maintenance)
- Development systems
- Test systems
- Production systems
- Handoff procedures
- Update propagation approach
- System snapshotting and rollback procedures
- Performance tuning
1.0 Introduction (The Data Flow Thread – Extract Step)
- Reading source-data models
- Connecting to and accessing data
- Scheduling the source system, intercepting notifications and daemons
- Capturing changed data
- Staging the extracted data to disk (see the sketch after this list)
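As a concrete illustration of the last two items, here is a minimal Python sketch of timestamp-based changed-data capture that stages the extracted rows to disk. The table, columns, and file names are hypothetical, not from the book.

```python
import csv
import sqlite3

def extract_changed_rows(conn, last_extract_time, staging_path):
    """Pull only rows modified since the previous run and stage them to disk."""
    cur = conn.execute(
        # "orders" and its columns are hypothetical source-system names
        "SELECT order_id, customer_id, amount, last_modified "
        "FROM orders WHERE last_modified > ?",
        (last_extract_time,),
    )
    rows = cur.fetchall()
    # Stage the raw extract to disk before any transformation so the
    # run can be audited or replayed without touching the source again.
    with open(staging_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount", "last_modified"])
        writer.writerows(rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id, customer_id, amount, last_modified)")
conn.execute("INSERT INTO orders VALUES (1, 'C7', 9.95, '2008-02-08')")
conn.execute("INSERT INTO orders VALUES (2, 'C9', 4.50, '2008-02-09')")
print(extract_changed_rows(conn, "2008-02-08", "orders_stage.csv"))  # -> 1
```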
1.0 Introduction (The Data Flow Thread – Clean Step)
- Enforcing column properties
- Enforcing structure
- Enforcing data and value rules (sketched below)
- Enforcing complex business rules
- Building a metadata foundation to describe data quality
- Staging the cleaned data to disk
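To make the column-property and value-rule items concrete, here is a minimal Python sketch; the field names and the specific rules are hypothetical examples.

```python
def clean_row(row, exceptions):
    """Apply column-property and data/value-rule checks to one staged record."""
    ok = True
    # Column property: customer_id must be present and non-empty.
    if not row.get("customer_id"):
        exceptions.append((row, "missing customer_id"))
        ok = False
    # Data/value rule: amount must parse as a non-negative number.
    try:
        if float(row["amount"]) < 0:
            exceptions.append((row, "negative amount"))
            ok = False
    except (KeyError, ValueError):
        exceptions.append((row, "non-numeric amount"))
        ok = False
    return ok

rows = [{"customer_id": "C7", "amount": "9.95"},
        {"customer_id": "", "amount": "-1"}]
exceptions = []
cleaned = [r for r in rows if clean_row(r, exceptions)]
print(len(cleaned), len(exceptions))  # 1 clean row, 2 recorded violations
```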
1.0 Introduction (The Data Flow Thread – Conform Step)
- Conforming business labels (in dimensions)
- Conforming business metrics and performance indicators (in fact tables)
- Deduplicating (see the sketch after this list)
- Householding
- Internationalizing
- Staging the conformed data to disk
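Here is a minimal Python sketch of conforming a business label across two sources and then deduplicating. The label mapping and match key are hypothetical; real deduplication and householding use fuzzy matching on names and addresses.

```python
# Map each source's local codes onto one conformed label set.
CONFORMED_GENDER = {"M": "Male", "MALE": "Male", "F": "Female", "FEMALE": "Female"}

def conform_and_deduplicate(records):
    survivors = {}
    for rec in records:
        rec = dict(rec)
        rec["gender"] = CONFORMED_GENDER.get(rec["gender"].upper(), "Unknown")
        # Deduplicate on a simple natural key; the first record wins.
        key = (rec["name"].strip().lower(), rec["zip"])
        survivors.setdefault(key, rec)
    return list(survivors.values())

crm_row = {"name": "Jane Doe", "zip": "10001", "gender": "F"}
billing_row = {"name": "JANE DOE ", "zip": "10001", "gender": "FEMALE"}
print(conform_and_deduplicate([crm_row, billing_row]))  # one conformed record
```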
1.0 Introduction (The Data Flow Thread – Deliver/Load Step)
- Loading flat and snowflaked dimensions
- Generating time dimensions
- Loading degenerate dimensions
- Loading subdimensions
- Loading types 1, 2, and 3 slowly changing dimensions
- Conforming dimensions and conforming facts
- Handling late-arriving dimensions and late-arriving facts
- Loading multi-valued dimensions
- Loading ragged hierarchy dimensions
- Loading text facts in dimensions
- Running the surrogate key pipeline for fact tables (sketched below)
- Loading three fundamental fact table grains
- Loading and updating aggregations
- Staging the delivered data to disk
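As one illustration, here is a minimal Python sketch of the surrogate key pipeline: incoming fact rows carry natural keys from the source, and each is swapped for the surrogate key of the current dimension row before loading. The keys and the handling of unknown members are hypothetical simplifications.

```python
# Natural key -> surrogate key of the *current* customer dimension row.
customer_key_map = {"C7": 101, "C9": 102}

def run_surrogate_key_pipeline(fact_rows):
    loadable = []
    for row in fact_rows:
        surrogate = customer_key_map.get(row["customer_natural_key"])
        if surrogate is None:
            # Late-arriving dimension member: a real pipeline inserts a
            # placeholder dimension row here rather than dropping the fact.
            continue
        loadable.append({"customer_key": surrogate, "amount": row["amount"]})
    return loadable

facts = [{"customer_natural_key": "C7", "amount": 9.95},
         {"customer_natural_key": "C9", "amount": 4.50}]
print(run_surrogate_key_pipeline(facts))
```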
1.0 Introduction (The Data Flow Thread – Operation Step)
- Scheduling
- Job execution
- Exception handling
- Recovery and restart
- Quality checking
- Release
- Support
1.1 Requirements
- Business needs
- Compliance requirements
- Data profiling
- Security requirements
- Data integration
- Data latency
- Archiving and lineage
- End user delivery interfaces
- Available skills
- Legacy licenses
1.1 Requirements (cont')
These are the major categories of requirements. Although any one of them can be a show-stopper, business needs are the most fundamental and important.
1.1.1 Business Needs
Business needs are the information requirements of the end users of the data warehouse. Business needs directly drive the choice of data sources.
The Data Warehouse Lifecycle Toolkit (another must-read book for system analysts) describes the process for interviewing end users and gathering business requirements.
1.1.2 Compliance Requirements
Typical due diligence requirements for the data warehouse include:
- Archived copies of data sources and subsequent staging of data.
- Proof of the complete transaction flow that changed any data.
- Fully documented algorithms for allocations and adjustments.
- Proof of security of the data copies over time, both on-line and off-line.
1.1.3 Data Profiling
Data profiling is a necessary precursor to designing any kind of system to use that data. It is a systematic examination of the quality, scope, and context of a data source to allow an ETL system to be built, as in the sketch below.
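A minimal sketch of what a first profiling pass might compute for one column of a delimited source file; the file and column names are hypothetical.

```python
import csv
from collections import Counter

def profile_column(path, column):
    """Report row count, null count, cardinality, and top values for a column."""
    values = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            values.append(row.get(column) or None)  # treat "" as null
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

# Example call (hypothetical file): profile_column("customers.csv", "state")
```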
1.1.4 Security Requirements
- The ETL team should not be directly concerned with designing or managing end user security.
- The ETL team needs to work in a special environment.
- Security must be extended to physical backups.
1.1.5 Data Integration
Data integration is a huge topic for IT because ultimately IT aims to make all systems work together seamlessly:
- Conforming dimensions
- Conforming facts (KPIs)
Data integration is a separate step identified in our data flow thread as the conform step.
1.1.6 Data Latency
The data latency requirement describes how quickly the data must be delivered to end users.
- Traditional batch-oriented data flows
- Improved processing algorithms
- Parallel processing
- Better hardware
- Streaming?
1.1.7 Archiving and Lineage
- Stage the data at each point where a major transformation has occurred.
- Each staged/archived data set should have accompanying metadata describing the origins and processing steps that produced the data (sketched below).
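A minimal Python sketch of that second point: each staged data set is written together with a metadata sidecar recording the step that produced it, a row count, a timestamp, and a content hash. The file layout is hypothetical.

```python
import hashlib
import json
import time

def stage_with_lineage(rows, step_name, staging_path):
    """Write a staged data set plus a sidecar describing its lineage."""
    payload = "\n".join(",".join(map(str, r)) for r in rows)
    with open(staging_path, "w") as f:
        f.write(payload)
    sidecar = {
        "step": step_name,  # e.g. "extract", "clean", "conform", "deliver"
        "row_count": len(rows),
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    }
    with open(staging_path + ".meta.json", "w") as f:
        json.dump(sidecar, f, indent=2)

stage_with_lineage([(1, "C7", 9.95)], "clean", "orders_clean.csv")
```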
1.1.8 End User Delivery Interfaces
The ETL team and the data modelers need to work closely with the end user application developers to determine the exact requirements for the final data handoff.
1.1.9 Available Skills
The big design decisions when building an ETL system must be made on the basis of who builds and manages the system.
1.1.10 Legacy Licenses
Major design decisions will be made for you implicitly by senior management's insistence that you use existing legacy licenses:
- ETL tool (Data Integrator) license
- Server licenses
- Other legacy system licenses
1.2 Architecture
The choice of architecture is a fundamental and early decision in the design of the ETL system.
1.2.1 ETL Tool vs. Hand Coding (Buy a Tool Suite or Roll Your Own?)
The answer is, “It depends.”
Hand-Coded ETL Advantages
- Automated unit-testing tools are available in a hand-coded system.
- Object-oriented programming techniques help you make all your transformations consistent for error reporting, validation, and metadata updates.
- File-based vs. database processing.
- Existing legacy routines can be kept unchanged.
- Use of a scripting language is more flexible.
- In-house programmers may be available.
Tool-Based ETL Advantages
- Keeps ETL steps simple: simpler, faster, cheaper development.
- Non-professional programmers can use the ETL tool.
- Integrated metadata repository; metadata is generated automatically.
- Built-in job scheduler.
- Data lineage management.
- Built-in connectors and security features.
- Good performance for large datasets; load balancing.
- Automatic impact analysis.
- Monitoring tools.
1.2.1 ETL Tool vs. Hand Coding (Using Proven Technology – Use an ETL Tool)
Some additional benefits of using proven ETL technology are as follows:
- Define once, apply many.
- Impact analysis.
- Metadata repository.
- Incremental aggregation.
- Managed batch loading.
- Simpler connectivity.
- Parallel pipelined multithreaded operation.
- Vendor experience.
1.2.1 ETL Tool vs. Hand Coding (Batch vs. Streaming Data Flow)
The standard architecture for an ETL system is based on periodic batch extracts from the source data. Changing from a batch to a streaming data flow changes everything.
1.2.1 ETL Tool vs. Hand Coding (Horizontal vs. Vertical Task Dependency)
A horizontally organized task flow allows each final database load to run to completion independently. A vertically oriented task flow synchronizes two or more separate job flows so that, above all, the final database loads occur simultaneously. The sketch below contrasts the two.
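A small Python sketch contrasting the two styles under an assumed upstream failure; the job and mart names are hypothetical.

```python
def stage(mart):
    """Upstream work for one mart; raises on failure."""
    if mart == "shipments":
        raise RuntimeError("upstream stage failed")
    return mart

def load(mart):
    print(f"loaded {mart}")

def run_horizontal(marts):
    # Horizontal: each final load runs to completion independently,
    # so one mart can still publish when another mart's flow fails.
    for mart in marts:
        try:
            load(stage(mart))
        except RuntimeError as exc:
            print(f"{mart} skipped: {exc}")

def run_vertical(marts):
    # Vertical: flows are synchronized; if any upstream stage fails,
    # none of the final loads run, so the loads land together or not at all.
    prepared = [stage(mart) for mart in marts]  # raises before any load
    for mart in prepared:
        load(mart)

run_horizontal(["orders", "shipments"])        # orders still loads
try:
    run_vertical(["orders", "shipments"])      # nothing loads
except RuntimeError:
    print("vertical flow aborted before any load")
```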
1.2.1 ETL Tool vs. Hand Coding (Scheduler Automation)
A related architectural decision is how deeply to control your overall ETL system with automated scheduler technology, as sketched below.
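A minimal sketch of dependency-driven scheduling using Python's standard-library graphlib (Python 3.9+); the job names and dependency graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job lists the jobs that must complete before it may run.
jobs = {
    "extract_orders": set(),
    "clean_orders": {"extract_orders"},
    "conform_customers": {"extract_orders"},
    "load_fact_orders": {"clean_orders", "conform_customers"},
}

def run(job_name):
    print(f"running {job_name}")

# The scheduler runs jobs in an order that respects every dependency.
for job in TopologicalSorter(jobs).static_order():
    run(job)
```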
1.2.1 ETL Tool vs. Hand Coding (Exception Handling)
Every job needs to be architected to write exception-reporting records into the database, as in the sketch below.
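A minimal sketch of such an exception table; the schema is a hypothetical illustration.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_exceptions "
    "(job_name TEXT, record_key TEXT, reason TEXT, logged_at TEXT)"
)

def report_exception(job_name, record_key, reason):
    """Every job calls this for each aberrant condition it meets."""
    conn.execute(
        "INSERT INTO etl_exceptions VALUES (?, ?, ?, ?)",
        (job_name, record_key, reason, time.strftime("%Y-%m-%dT%H:%M:%S")),
    )

report_exception("clean_orders", "order 42", "negative amount")
print(conn.execute("SELECT * FROM etl_exceptions").fetchall())
```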
1.2.1 ETL Tool vs. Hand Coding (Quality Handling)
The architecture must define a common response to quality issues that arise while processing the data, such as halting the process, sending the offending records to a suspense file, or tagging the data and passing it through.
1.2.1 ETL Tool vs. Hand Coding (Recovery and Restart)
You need to build your ETL system around the ability to recover from the abnormal ending of a job and restart, as in the checkpoint sketch below.
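A minimal checkpoint/restart sketch in Python: the pipeline records each completed step, so a rerun after an abnormal ending resumes where it stopped. The checkpoint file and step names are hypothetical.

```python
import json
import os

CHECKPOINT_FILE = "etl_checkpoint.json"
STEPS = ["extract", "clean", "conform", "deliver"]

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["completed"]
    return []

def run_pipeline():
    completed = load_checkpoint()
    for step in STEPS:
        if step in completed:
            print(f"skipping {step} (already done)")
            continue
        print(f"running {step}")  # a real step would read the previous
                                  # step's staged data back from disk
        completed.append(step)
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump({"completed": completed}, f)

run_pipeline()  # rerunning after a crash resumes at the first unfinished step
```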
1.2.1 ETL Tool vs. Hand Coding (Metadata)
The biggest metadata challenge for the ETL team is where and how to store process-flow information.
1.2.1 ETL Tool vs. Hand Coding (Security)
Security in the ETL environment is less granular than in the end user environment.
1.2.2 The Back Room - Preparing the Data
The back room and the front room of the data warehouse are physically, logically, and administratively separate.
1.2.2 The Back Room - Preparing the Data (cont’)
1.2.2 The Back Room - Preparing the Data (cont’)Prohibiting data access in the back room
kitchen relieves the ETL team from:Providing detailed security at a row, column, or
applications level.Building query performance-enhancing indexes
and aggregations.Providing continuous up-time under service-
level agreements.Guaranteeing that all datasets are consistent
with each other.
1.2.2 The Back Room - Preparing the Data (Four Staging Steps of a Data Warehouse)
[Figure: data is staged to disk after each of the four steps: extract, clean, conform, and deliver.]
1.2.3 The Front Room – Data Access
Accessing data in the presentation area of the data warehouse is a client, or follow-on, project that must be closely coordinated with the building and managing of the ETL system. Data marts are an important component of the front room.
1.2.3 The Front Room – Data Access (Data Mart)
A data mart is a set of dimensional tables supporting a business process.
Some strong statements about data marts:
- Data marts are based on the source of data, not on a department's view of data.
- Data marts contain all atomic detail needed to support drilling down to the lowest level.
- Data marts can be centrally controlled or decentralized.
1.2.3 The Front Room – Data Access (Extended Concept of the Front Room)
- Indexing dimensional tables in the presentation area for query performance.
- Choosing front-end tools, including query tools, report writers, and dashboards.
- Writing SQL to solve end user queries.
- Data-mining techniques.
- Forecasting, behavior scoring, and calculating allocations.
- Security on the tables and applications accessible by end users.
- Metadata supporting end user tools.
- End user training and documentation.
1.3 The Mission of the Data Warehouse
The mission of the data warehouse is to publish the organization's data assets to most effectively support decision making.
1.3.1 What the Data Warehouse Is
A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
1.3.2 What the Data Warehouse Is Not
The top five things the data warehouse is mistaken to be are as follows:
- A product.
- A language.
- A project.
- A data model.
- A copy of your transaction system.
1.3.3 Industry Terms Not Used Consistently
Industry terms are given different meanings by different writers.
1.3.3 Industry Terms Not Used Consistently (Data Mart)
Other authors frequently define a data mart as an aggregated set of data pre-built to answer specific business questions for a given department. Our data marts (call them dimensional data marts) look the same to all observers and would be implemented identically by anyone with access to the underlying measurement events.
1.3.3 Industry Terms Not Used Consistently (Enterprise Data Warehouse - EDW)
EDW is sometimes used as the name for a specific design approach (as contrasted with the uncapitalized enterprise data warehouse, which refers generically to the data warehouse assets of a large organization).
1.3.4 Resolving Architectural Conflict: The Hybrid Bus Approach
A really dirty data source benefits from the discipline of enforcing the many-to-one relationships brought to the surface by the process of normalization. The ETL team should then convert any such normalized structures into simple dimensional structures for the conforming and final handoff steps.
1.3.5 How the Data Warehouse Is Changing
The data warehouse is undergoing a significant change, perhaps the most significant change since the beginning of data warehousing.
1.4 The Mission of the ETL Team
The mission of the ETL team, at the highest level, is to build the back room of the data warehouse. More specifically, the ETL system must:
- Deliver data most effectively to end user tools.
- Add value to data in the cleaning and conforming steps.
- Protect and document the lineage of data.
1.4 The Mission of the ETL Team (cont’)Every data warehouse the back room must
support four keys steps:Extracting data from the original sources.Quality assuring and cleaning data.Conforming the labels and measures in the
data to achieve consistency a cross the original sources.
Delivering data in a physical format that can be used by query tools, report writers, and dash boards.