The Data Warehouse ETL Toolkit, by Ralph Kimball and Joe Caserta
VSV Training, Chapter 01: Surrounding the Requirements
Prepared by: Khoi NGUYEN and Hien NGUYEN. Date: 09/02/2008
ETL Book Overview
PART I: Requirements, Realities, and Architecture
- Chapter 01: Surrounding the Requirements
- Chapter 02: ETL Data Structures
PART II: Data Flow
- Chapter 03: Extracting
- Chapter 04: Cleaning and Conforming
- Chapter 05: Delivering Dimension Tables
- Chapter 06: Delivering Fact Tables
PART III: Implementation and Operations
- Chapter 07: Development
- Chapter 08: Operations
- Chapter 09: Metadata
- Chapter 10: Responsibilities
PART IV: Real Time Streaming ETL Systems
- Chapter 11: Real-Time ETL Systems
- Chapter 12: Conclusions
Chapter 1: Surrounding the Requirements
- Introduction
- Requirements
- Architecture
- The Mission of the Data Warehouse
- The Mission of the ETL Team
1.0 Introduction
The ETL system is the foundation of the data warehouse. It:
- Extracts data from source systems
- Enforces data quality and consistency
- Conforms data from different sources
- Delivers data in a presentation-ready format
ETL adds significant value to data:
- Removes mistakes and corrects missing data
- Provides documented measures of confidence in data
- Captures the flow of transactional data for safekeeping
- Adjusts data from multiple sources so they can be used together
- Structures data to be usable by end-user tools
1.0 Introduction (What You Will Learn in This Book)
- Plan and design your ETL system
- Choose the appropriate architecture from the many possible choices
- Manage the implementation
- Manage the day-to-day operations
- Build the development/test/production suite of ETL processes
- Understand the tradeoffs of various back-room data structures, including flat files, normalized schemas, XML schemas, and star join (dimensional) schemas
- Analyze and extract source data
- Build a comprehensive data-cleaning subsystem
- Structure data into dimensional schemas for the most effective delivery to end users, business-intelligence tools, data-mining tools, OLAP cubes, and analytic applications
- Deliver data effectively both to highly centralized and profoundly distributed data warehouses using the same techniques
- Tune the overall ETL process for optimum performance
1.0 Introduction (Two Simultaneous Threads)
The ETL effort proceeds along two simultaneous threads: the planning and design thread and the data flow thread, detailed in the following slides.
1.0 Introduction (The Planning and Design Thread – Step 1: Requirements and Realities)
- Business needs
- Data profiling and other data-source realities
- Compliance requirements
- Security requirements
- Data integration
- Data latency
- Archiving and lineage
- End user delivery interfaces
- Available development skills
- Available management skills
- Legacy licenses
1.0 Introduction (The Planning and Design Thread – Step 2: Architecture)
- Hand-coded versus ETL vendor tool
- Batch versus streaming data flow
- Horizontal versus vertical task dependency
- Scheduler automation
- Exception handling
- Quality handling
- Recovery and restart
- Metadata
- Security
1.0 Introduction (The Planning and Design Thread – Step 3: System Implementation)
- Hardware
- Software
- Coding practices
- Documentation practices
- Specific quality checks
1.0 Introduction (The Planning and Design Thread – Final Step: Administration and Maintenance)
- Development systems
- Test systems
- Production systems
- Handoff procedures
- Update propagation approach
- System snapshotting and rollback procedures
- Performance tuning
1.0 Introduction (The Data Flow Thread – Extract Step)
- Reading source-data models
- Connecting to and accessing data
- Scheduling the source system, intercepting notifications and daemons
- Capturing changed data
- Staging the extracted data to disk (see the sketch after this list)
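As a concrete illustration of the last two items, here is a minimal Python sketch of timestamp-based changed-data capture that stages the extracted rows to disk. The table, columns, and file names are hypothetical, not from the book.

```python
import csv
import sqlite3

def extract_changed_rows(conn, last_extract_time, staging_path):
    """Pull only rows modified since the previous run and stage them to disk."""
    cur = conn.execute(
        # "orders" and its columns are hypothetical source-system names
        "SELECT order_id, customer_id, amount, last_modified "
        "FROM orders WHERE last_modified > ?",
        (last_extract_time,),
    )
    rows = cur.fetchall()
    # Stage the raw extract to disk before any transformation so the
    # run can be audited or replayed without touching the source again.
    with open(staging_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount", "last_modified"])
        writer.writerows(rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id, customer_id, amount, last_modified)")
conn.execute("INSERT INTO orders VALUES (1, 'C7', 9.95, '2008-02-08')")
conn.execute("INSERT INTO orders VALUES (2, 'C9', 4.50, '2008-02-09')")
print(extract_changed_rows(conn, "2008-02-08", "orders_stage.csv"))  # -> 1
```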
1.0 Introduction (The Data Flow Thread – Clean Step)
- Enforcing column properties
- Enforcing structure
- Enforcing data and value rules (sketched below)
- Enforcing complex business rules
- Building a metadata foundation to describe data quality
- Staging the cleaned data to disk
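To make the column-property and value-rule items concrete, here is a minimal Python sketch; the field names and the specific rules are hypothetical examples.

```python
def clean_row(row, exceptions):
    """Apply column-property and data/value-rule checks to one staged record."""
    ok = True
    # Column property: customer_id must be present and non-empty.
    if not row.get("customer_id"):
        exceptions.append((row, "missing customer_id"))
        ok = False
    # Data/value rule: amount must parse as a non-negative number.
    try:
        if float(row["amount"]) < 0:
            exceptions.append((row, "negative amount"))
            ok = False
    except (KeyError, ValueError):
        exceptions.append((row, "non-numeric amount"))
        ok = False
    return ok

rows = [{"customer_id": "C7", "amount": "9.95"},
        {"customer_id": "", "amount": "-1"}]
exceptions = []
cleaned = [r for r in rows if clean_row(r, exceptions)]
print(len(cleaned), len(exceptions))  # 1 clean row, 2 recorded violations
```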
1.0 Introduction (The Data Flow Thread – Conform Step)
- Conforming business labels (in dimensions)
- Conforming business metrics and performance indicators (in fact tables)
- Deduplicating (see the sketch after this list)
- Householding
- Internationalizing
- Staging the conformed data to disk
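Here is a minimal Python sketch of conforming a business label across two sources and then deduplicating. The label mapping and match key are hypothetical; real deduplication and householding use fuzzy matching on names and addresses.

```python
# Map each source's local codes onto one conformed label set.
CONFORMED_GENDER = {"M": "Male", "MALE": "Male", "F": "Female", "FEMALE": "Female"}

def conform_and_deduplicate(records):
    survivors = {}
    for rec in records:
        rec = dict(rec)
        rec["gender"] = CONFORMED_GENDER.get(rec["gender"].upper(), "Unknown")
        # Deduplicate on a simple natural key; the first record wins.
        key = (rec["name"].strip().lower(), rec["zip"])
        survivors.setdefault(key, rec)
    return list(survivors.values())

crm_row = {"name": "Jane Doe", "zip": "10001", "gender": "F"}
billing_row = {"name": "JANE DOE ", "zip": "10001", "gender": "FEMALE"}
print(conform_and_deduplicate([crm_row, billing_row]))  # one conformed record
```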
1.0 Introduction (The Data Flow Thread – Deliver/Load Step)
- Loading flat and snowflaked dimensions
- Generating time dimensions
- Loading degenerate dimensions
- Loading subdimensions
- Loading types 1, 2, and 3 slowly changing dimensions
- Conforming dimensions and conforming facts
- Handling late-arriving dimensions and late-arriving facts
- Loading multi-valued dimensions
- Loading ragged hierarchy dimensions
- Loading text facts in dimensions
- Running the surrogate key pipeline for fact tables (sketched below)
- Loading three fundamental fact table grains
- Loading and updating aggregations
- Staging the delivered data to disk
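As one illustration, here is a minimal Python sketch of the surrogate key pipeline: incoming fact rows carry natural keys from the source, and each is swapped for the surrogate key of the current dimension row before loading. The keys and the handling of unknown members are hypothetical simplifications.

```python
# Natural key -> surrogate key of the *current* customer dimension row.
customer_key_map = {"C7": 101, "C9": 102}

def run_surrogate_key_pipeline(fact_rows):
    loadable = []
    for row in fact_rows:
        surrogate = customer_key_map.get(row["customer_natural_key"])
        if surrogate is None:
            # Late-arriving dimension member: a real pipeline inserts a
            # placeholder dimension row here rather than dropping the fact.
            continue
        loadable.append({"customer_key": surrogate, "amount": row["amount"]})
    return loadable

facts = [{"customer_natural_key": "C7", "amount": 9.95},
         {"customer_natural_key": "C9", "amount": 4.50}]
print(run_surrogate_key_pipeline(facts))
```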
1.0 Introduction (The Data Flow Thread – Operation Step)
- Scheduling
- Job execution
- Exception handling
- Recovery and restart
- Quality checking
- Release
- Support
1.1 Requirements
- Business needs
- Compliance requirements
- Data profiling
- Security requirements
- Data integration
- Data latency
- Archiving and lineage
- End user delivery interfaces
- Available skills
- Legacy licenses
1.1 Requirements (cont')
These are the major categories of requirements. Although any one of them can be a show-stopper, business needs are the most fundamental and important.
1.1.1 Business Needs
Business needs are the information requirements of the end users of the data warehouse. Business needs directly drive the choice of data sources.
The Data Warehouse Lifecycle Toolkit (another must-read book for system analysts) describes the process for interviewing end users and gathering business requirements.
1.1.2 Compliance Requirements
Typical due diligence requirements for the data warehouse include:
- Archived copies of data sources and subsequent staging of data.
- Proof of the complete transaction flow that changed any data.
- Fully documented algorithms for allocations and adjustments.
- Proof of security of the data copies over time, both on-line and off-line.
1.1.3 Data Profiling
Data profiling is a necessary precursor to designing any kind of system to use that data. It is a systematic examination of the quality, scope, and context of a data source to allow an ETL system to be built, as in the sketch below.
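A minimal sketch of what a first profiling pass might compute for one column of a delimited source file; the file and column names are hypothetical.

```python
import csv
from collections import Counter

def profile_column(path, column):
    """Report row count, null count, cardinality, and top values for a column."""
    values = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            values.append(row.get(column) or None)  # treat "" as null
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "top_values": counts.most_common(3),
    }

# Example call (hypothetical file): profile_column("customers.csv", "state")
```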
1.1.4 Security Requirements
- The ETL team should not be directly concerned with designing or managing end user security.
- The ETL team needs to work in a special environment.
- Security must be extended to physical backups.
1.1.5 Data Integration
Data integration is a huge topic for IT because ultimately IT aims to make all systems work together seamlessly:
- Conforming dimensions
- Conforming facts (KPIs)
Data integration is a separate step identified in our data flow thread as the conform step.
1.1.6 Data Latency
The data latency requirement describes how quickly the data must be delivered to end users.
- Traditional batch-oriented data flows
- Improved processing algorithms
- Parallel processing
- Better hardware
- Streaming?
1.1.7 Archiving and Lineage
- Stage the data at each point where a major transformation has occurred.
- Each staged/archived data set should have accompanying metadata describing the origins and processing steps that produced the data (sketched below).
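A minimal Python sketch of that second point: each staged data set is written together with a metadata sidecar recording the step that produced it, a row count, a timestamp, and a content hash. The file layout is hypothetical.

```python
import hashlib
import json
import time

def stage_with_lineage(rows, step_name, staging_path):
    """Write a staged data set plus a sidecar describing its lineage."""
    payload = "\n".join(",".join(map(str, r)) for r in rows)
    with open(staging_path, "w") as f:
        f.write(payload)
    sidecar = {
        "step": step_name,  # e.g. "extract", "clean", "conform", "deliver"
        "row_count": len(rows),
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    }
    with open(staging_path + ".meta.json", "w") as f:
        json.dump(sidecar, f, indent=2)

stage_with_lineage([(1, "C7", 9.95)], "clean", "orders_clean.csv")
```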
1.1.8 End User Delivery Interfaces
The ETL team and the data modelers need to work closely with the end user application developers to determine the exact requirements for the final data handoff.
1.1.9 Available Skills
The big design decisions when building an ETL system must be made on the basis of who builds and manages the system.
1.1.10 Legacy Licenses
Major design decisions will be made for you implicitly by senior management's insistence that you use existing legacy licenses:
- ETL tool (Data Integrator) license
- Server licenses
- Other legacy system licenses
1.2 Architecture
The choice of architecture is a fundamental and early decision in the design of the ETL system.
1.2.1 ETL Tool vs. Hand Coding (Buy a Tool Suite or Roll Your Own?)
The answer is, “It depends.”
Hand-Coded ETL Advantages
- Automated unit-testing tools are available in a hand-coded system.
- Object-oriented programming techniques help you make all your transformations consistent for error reporting, validation, and metadata updates.
- File-based vs. database processing.
- Existing legacy routines can be kept unchanged.
- Use of a scripting language is more flexible.
- In-house programmers may be available.
Tool-Based ETL Advantages
- Keeps ETL steps simple: simpler, faster, cheaper development.
- Non-professional programmers can use the ETL tool.
- Integrated metadata repository; metadata is generated automatically.
- Built-in job scheduler.
- Data lineage management.
- Built-in connectors and security features.
- Good performance for large datasets; load balancing.
- Automatic impact analysis.
- Monitoring tools.
1.2.1 ETL Tool vs. Hand Coding (Using Proven Technology – Use an ETL Tool)
Some additional benefits of using proven ETL technology are as follows:
- Define once, apply many.
- Impact analysis.
- Metadata repository.
- Incremental aggregation.
- Managed batch loading.
- Simpler connectivity.
- Parallel pipelined multithreaded operation.
- Vendor experience.
1.2.1 ETL Tool vs. Hand Coding (Batch vs. Streaming Data Flow)
The standard architecture for an ETL system is based on periodic batch extracts from the source data. Changing from a batch to a streaming data flow changes everything.
1.2.1 ETL Tool vs. Hand Coding (Horizontal vs. Vertical Task Dependency)
A horizontally organized task flow allows each final database load to run to completion independently. A vertically oriented task flow synchronizes two or more separate job flows so that, above all, the final database loads occur simultaneously. The sketch below contrasts the two.
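A small Python sketch contrasting the two styles under an assumed upstream failure; the job and mart names are hypothetical.

```python
def stage(mart):
    """Upstream work for one mart; raises on failure."""
    if mart == "shipments":
        raise RuntimeError("upstream stage failed")
    return mart

def load(mart):
    print(f"loaded {mart}")

def run_horizontal(marts):
    # Horizontal: each final load runs to completion independently,
    # so one mart can still publish when another mart's flow fails.
    for mart in marts:
        try:
            load(stage(mart))
        except RuntimeError as exc:
            print(f"{mart} skipped: {exc}")

def run_vertical(marts):
    # Vertical: flows are synchronized; if any upstream stage fails,
    # none of the final loads run, so the loads land together or not at all.
    prepared = [stage(mart) for mart in marts]  # raises before any load
    for mart in prepared:
        load(mart)

run_horizontal(["orders", "shipments"])        # orders still loads
try:
    run_vertical(["orders", "shipments"])      # nothing loads
except RuntimeError:
    print("vertical flow aborted before any load")
```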
1.2.1 ETL Tool vs. Hand Coding (Scheduler Automation)
A related architectural decision is how deeply to control your overall ETL system with automated scheduler technology, as sketched below.
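A minimal sketch of dependency-driven scheduling using Python's standard-library graphlib (Python 3.9+); the job names and dependency graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job lists the jobs that must complete before it may run.
jobs = {
    "extract_orders": set(),
    "clean_orders": {"extract_orders"},
    "conform_customers": {"extract_orders"},
    "load_fact_orders": {"clean_orders", "conform_customers"},
}

def run(job_name):
    print(f"running {job_name}")

# The scheduler runs jobs in an order that respects every dependency.
for job in TopologicalSorter(jobs).static_order():
    run(job)
```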
1.2.1 ETL Tool vs. Hand Coding (Exception Handling)
Every job needs to be architected to write exception-reporting records into the database, as in the sketch below.
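A minimal sketch of such an exception table; the schema is a hypothetical illustration.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_exceptions "
    "(job_name TEXT, record_key TEXT, reason TEXT, logged_at TEXT)"
)

def report_exception(job_name, record_key, reason):
    """Every job calls this for each aberrant condition it meets."""
    conn.execute(
        "INSERT INTO etl_exceptions VALUES (?, ?, ?, ?)",
        (job_name, record_key, reason, time.strftime("%Y-%m-%dT%H:%M:%S")),
    )

report_exception("clean_orders", "order 42", "negative amount")
print(conn.execute("SELECT * FROM etl_exceptions").fetchall())
```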
1.2.1 ETL Tool vs. Hand Coding (Quality Handling)
The architecture must define a common response to quality issues that arise while processing the data, such as halting the process, sending the offending records to a suspense file, or tagging the data and passing it through.
1.2.1 ETL Tool vs. Hand Coding (Recovery and Restart)
You need to build your ETL system around the ability to recover from the abnormal ending of a job and restart, as in the checkpoint sketch below.
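A minimal checkpoint/restart sketch in Python: the pipeline records each completed step, so a rerun after an abnormal ending resumes where it stopped. The checkpoint file and step names are hypothetical.

```python
import json
import os

CHECKPOINT_FILE = "etl_checkpoint.json"
STEPS = ["extract", "clean", "conform", "deliver"]

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["completed"]
    return []

def run_pipeline():
    completed = load_checkpoint()
    for step in STEPS:
        if step in completed:
            print(f"skipping {step} (already done)")
            continue
        print(f"running {step}")  # a real step would read the previous
                                  # step's staged data back from disk
        completed.append(step)
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump({"completed": completed}, f)

run_pipeline()  # rerunning after a crash resumes at the first unfinished step
```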
1.2.1 ETL Tool vs. Hand Coding (Metadata)
The biggest metadata challenge for the ETL team is where and how to store process-flow information.
1.2.1 ETL Tool vs. Hand Coding (Security)
Security in the ETL environment is less granular than in the end user environment.
1.2.2 The Back Room - Preparing the Data
The back room and the front room of the data warehouse are physically, logically, and administratively separate.
1.2.2 The Back Room - Preparing the Data (cont’)
1.2.2 The Back Room - Preparing the Data (cont’)Prohibiting data access in the back room
kitchen relieves the ETL team from:Providing detailed security at a row, column, or
applications level.Building query performance-enhancing indexes
and aggregations.Providing continuous up-time under service-
level agreements.Guaranteeing that all datasets are consistent
with each other.
1.2.2 The Back Room - Preparing the Data (Four Staging Steps of a Data Warehouse)
[Figure: data is staged to disk after each of the four steps: extract, clean, conform, and deliver.]
1.2.3 The Front Room – Data Access
Accessing data in the presentation area of the data warehouse is a client, or follow-on, project that must be closely coordinated with the building and managing of the ETL system. Data marts are an important component of the front room.
1.2.3 The Front Room – Data Access (Data Mart)
A data mart is a set of dimensional tables supporting a business process.
Some strong statements about data marts:
- Data marts are based on the source of data, not on a department's view of data.
- Data marts contain all atomic detail needed to support drilling down to the lowest level.
- Data marts can be centrally controlled or decentralized.
1.2.3 The Front Room – Data Access (Extended Concept of the Front Room)
- Indexing dimensional tables in the presentation area for query performance.
- Choosing front-end tools, including query tools, report writers, and dashboards.
- Writing SQL to solve end user queries.
- Data-mining techniques.
- Forecasting, behavior scoring, and calculating allocations.
- Security on the tables and applications accessible by end users.
- Metadata supporting end user tools.
- End user training and documentation.
1.3 The Mission of the Data Warehouse
The mission of the data warehouse is to publish the organization's data assets to most effectively support decision making.
1.3.1 What the Data Warehouse Is
A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
1.3.2 What the Data Warehouse Is Not
The top five things the data warehouse is mistaken to be are as follows:
- A product.
- A language.
- A project.
- A data model.
- A copy of your transaction system.
1.3.3 Industry Terms Not Used Consistently
Industry terms are given different meanings by different writers.
1.3.3 Industry Terms Not Used Consistently (Data Mart)
Other authors frequently define a data mart as an aggregated set of data pre-built to answer specific business questions for a given department. Our data marts (call them dimensional data marts) look the same to all observers and would be implemented identically by anyone with access to the underlying measurement events.
1.3.3 Industry Terms Not Used Consistently (Enterprise Data Warehouse - EDW)
EDW is sometimes used as the name for a specific design approach (as contrasted with the uncapitalized enterprise data warehouse, which refers generically to the data warehouse assets of a large organization).
1.3.4 Resolving Architectural Conflict: The Hybrid Bus Approach
A really dirty data source benefits from the discipline of enforcing the many-to-one relationships brought to the surface by the process of normalization. The ETL team should then convert any such normalized structures into simple dimensional structures for the conforming and final handoff steps.
1.3.5 How the Data Warehouse Is Changing
The data warehouse is undergoing a significant change, perhaps the most significant change since the beginning of data warehousing.
1.4 The Mission of the ETL Team
The mission of the ETL team, at the highest level, is to build the back room of the data warehouse. More specifically, the ETL system must:
- Deliver data most effectively to end user tools.
- Add value to data in the cleaning and conforming steps.
- Protect and document the lineage of data.
1.4 The Mission of the ETL Team (cont’)Every data warehouse the back room must
support four keys steps:Extracting data from the original sources.Quality assuring and cleaning data.Conforming the labels and measures in the
data to achieve consistency a cross the original sources.
Delivering data in a physical format that can be used by query tools, report writers, and dash boards.