6Copyright © Oracle Corporation, 2002. All rights reserved.
Building the Data Warehouse: Transforming Data
Objectives
After completing this lesson, you should be able to do the following:
• Define transformation
• Identify possible staging models
• Identify data anomalies and eliminate them
• Explain the importance of quality data
• Describe techniques for transforming data
• Design a transformation process
• List Oracle’s enhanced features and tools that can be used to transform data
Transformation
Transformation eliminates anomalies from operational data:
• Cleans and standardizes
• Presents subject-oriented data
Diagram: Operational systems -> Extract -> Data staging area (Transform: clean up, consolidate, restructure) -> Load -> Warehouse
Possible Staging Models
• Remote staging model
• Onsite staging model
Remote Staging Model
• Data staging area within the warehouse environment: Operational system -> Extract -> Staging area (Transform) -> Load -> Warehouse
• Data staging area in its own environment: Operational system -> Extract -> Staging area (Transform) -> Load -> Warehouse
On-site Staging Model
Data staging area within the operational environment, possibly affecting the operational system
Diagram: Operational system with staging area (Transform) -> Extract -> Load -> Warehouse
Data Anomalies
• No unique key
• Data naming and coding anomalies
• Data meaning anomalies between groups
• Spelling and text inconsistencies
CUSNUM    NAME                ADDRESS
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA
Transformation Routines
• Cleaning data
• Eliminating inconsistencies
• Adding elements
• Merging data
• Integrating data
• Transforming data before load
Transforming Data: Problems and Solutions
• Multipart keys
• Multiple local standards
• Multiple files
• Missing values
• Duplicate values
• Element names
• Element meanings
• Input formats
• Referential integrity constraints
• Name and address
Multipart Keys Problem
Product code = 12 M 654313 45
• Country code: 12
• Sales territory: M
• Product number: 654313
• Salesperson code: 45
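A minimal sketch of how the multipart product code above could be split into separate elements. The field order and the space-delimited layout are assumptions read from the example; `ProductKey` and `split_product_code` are hypothetical names.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    """The four parts of the multipart key (assumed order)."""
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


def split_product_code(code: str) -> ProductKey:
    """Split a space-delimited multipart key into its four parts."""
    parts = code.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {code!r}")
    return ProductKey(*parts)


key = split_product_code("12 M 654313 45")
print(key.product_number)  # 654313
```

Storing each part in its own column lets the warehouse query by territory or salesperson without string parsing at query time.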
Multiple Local Standards Problem
• Multiple local standards
• Tools or filters to preprocess
Examples:
• Units of measure: cm, inches
• Currencies: USD 600, 1,000 GBP, FF 9,990
• Date formats: DD/MM/YY, MM/DD/YY, DD-Mon-YY
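A small sketch of a preprocessing filter for two of the local standards listed above: date formats and units of length. It assumes each source system declares which format it uses; the `DATE_FORMATS` mapping and the function names are hypothetical. Currency conversion is left out because it needs exchange rates that the slide does not give.

```python
from datetime import date, datetime

# Hypothetical mapping from the slide's format labels to strptime patterns.
DATE_FORMATS = {
    "DD/MM/YY": "%d/%m/%y",
    "MM/DD/YY": "%m/%d/%y",
    "DD-Mon-YY": "%d-%b-%y",
}

CM_PER_INCH = 2.54


def normalize_date(text: str, source_format: str) -> date:
    """Parse a date using the format declared for its source system."""
    return datetime.strptime(text, DATE_FORMATS[source_format]).date()


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimeters; only cm and inches are handled."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit!r}")


# The same text means different days under different local standards.
print(normalize_date("03/04/02", "DD/MM/YY"))   # 2002-04-03
print(normalize_date("03/04/02", "MM/DD/YY"))   # 2002-03-04
print(normalize_date("03-Apr-02", "DD-Mon-YY"))  # 2002-04-03
```

The month-name format relies on English month abbreviations, so it assumes the default locale.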
Multiple Files Problem
• Added complexity of multiple source files
• Start simple
Diagram: Multiple source files -> Logic to detect correct source -> Transformed data
Missing Values Problem
Solution:
• Ignore
• Wait
• Mark rows
• Extract when time-stamped
Example: If NULL then field = 'A'
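A minimal sketch of the "mark rows" option, using the marker value 'A' from the example above. The extra `*_was_missing` flag column and the function name are assumptions added for illustration, so that marked rows can be found again later.

```python
MISSING_MARKER = "A"  # marker value taken from the slide's example


def mark_missing(rows, field):
    """Return copies of rows where a missing field gets the marker,
    plus a flag so marked rows can be found and fixed later."""
    marked = []
    for row in rows:
        new_row = dict(row)  # copy, so the source rows stay untouched
        is_missing = new_row.get(field) is None
        if is_missing:
            new_row[field] = MISSING_MARKER
        new_row[f"{field}_was_missing"] = is_missing
        marked.append(new_row)
    return marked


# Hypothetical sample rows.
rows = [{"id": 1, "region": "N"}, {"id": 2, "region": None}]
result = mark_missing(rows, "region")
print(result[1]["region"])  # A
```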
Duplicate Values Problem
Solution:
• SQL self-join techniques
• RDBMS constraint utilities
Example of duplicates: ACME Inc, ACME Inc, ACME Inc
SELECT ...
FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT ...
FROM table_a, table_b
WHERE table_a.key = table_b.key (+);
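As an illustration only, duplicates such as the repeated ACME Inc rows can also be found outside the database by grouping rows on a normalized key. This is an in-memory grouping sketch, not the SQL self-join technique named on the slide; the row layout, the "Other Co" row, and the function name are hypothetical.

```python
from collections import defaultdict


def find_duplicates(rows, key_fields):
    """Group rows by normalized key fields and return groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        # Normalize case and whitespace so trivially different spellings match.
        key = tuple(str(row[f]).strip().upper() for f in key_fields)
        groups[key].append(row)
    return {k: v for k, v in groups.items() if len(v) > 1}


customers = [
    {"id": 1, "name": "ACME Inc"},
    {"id": 2, "name": "ACME Inc"},
    {"id": 3, "name": "ACME Inc"},
    {"id": 4, "name": "Other Co"},
]
dupes = find_duplicates(customers, ["name"])
print(list(dupes))  # [('ACME INC',)]
```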
Element Names Problem
Solution:
Common naming conventions
Example: Customer, Client, Contact, Name -> Customer
Element Meaning Problem
• Avoid misinterpretation
• Complex solution
• Document meaning in metadata
Example: Customer_detail may be read as the customer’s name, all customer details, or all details except name.
Input Format Problem
• Character sets: ASCII, EBCDIC
• Numeric formatting: 12373 versus “123-73”
• Text: ACME Co.
• Multilingual text: ברוכים הבאים Beer (Pack of 8)
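A brief sketch of two input-format conversions: re-encoding text between EBCDIC and ASCII, and stripping formatting from a numeric code. It assumes the EBCDIC source uses code page 037 (`cp037`), one of several EBCDIC variants; the helper name is hypothetical.

```python
# The same text has different bytes in ASCII and in EBCDIC.
text = "ACME Co."
ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # assumed EBCDIC code page

# Decoding with the correct code page recovers the original text.
recovered = ebcdic_bytes.decode("cp037")


def digits_only(value: str) -> str:
    """Keep only the digits so "123-73" and "12373" compare equal."""
    return "".join(ch for ch in value if ch.isdigit())


print(recovered)              # ACME Co.
print(digits_only("123-73"))  # 12373
```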
Referential Integrity Problem
Solution:
• SQL anti-join
• Server constraints
• Dedicated tools
Department table
Department
10
20
30
40
Employee table
Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60
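The anti-join idea can be sketched in memory with the example data above: keep the employee rows whose department has no match in the department table. The function name and tuple layout are assumptions for illustration.

```python
departments = {10, 20, 30, 40}
employees = [
    (1099, "Smith", 10),
    (1289, "Jones", 20),
    (1234, "Doe", 50),
    (6786, "Harris", 60),
]


def find_orphans(child_rows, parent_keys, fk_index):
    """Anti-join: keep child rows whose foreign key has no matching parent key."""
    return [row for row in child_rows if row[fk_index] not in parent_keys]


orphans = find_orphans(employees, departments, fk_index=2)
print(orphans)  # [(1234, 'Doe', 50), (6786, 'Harris', 60)]
```

Rows reported this way can be corrected, rejected, or routed to an exception table before the load.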
Name and Address Problem
• Single-field format
• Multiple-field format
Single-field format:
Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
Multiple-field format:
Name     Mr. J. Smith
Street   100 Main St.
Town     Bigtown
Country  County Luth
Code     23565
Database 1
NAME              LOCATION
DIANNE ZIEFELD    N100
HARRY H. ENFIELD  M300
Database 2
NAME              LOCATION
ZIEFELD, DIANNE   100
ENFIELD, HARRY H  300
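A rough sketch of the two conversions shown above: splitting a single-field address into named fields, and bringing "LAST, FIRST" names into the same shape as "FIRST LAST" names. It assumes the single field always has five comma-separated parts in the order shown; real name and address data usually needs a dedicated tool. The function names are hypothetical.

```python
def split_address(single_field: str) -> dict:
    """Split a comma-separated single-field address into named fields.
    Assumes exactly five parts in the order shown on the slide."""
    parts = [p.strip() for p in single_field.split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 comma-separated parts, got {len(parts)}")
    return dict(zip(("name", "street", "town", "country", "code"), parts))


def normalize_name(name: str) -> str:
    """Rewrite "LAST, FIRST" as "FIRST LAST" and drop periods after initials."""
    if "," in name:
        last, first = (p.strip() for p in name.split(",", 1))
        name = f"{first} {last}"
    return " ".join(word.rstrip(".") for word in name.split())


print(split_address("Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565"))
print(normalize_name("ENFIELD, HARRY H"))  # HARRY H ENFIELD
print(normalize_name("HARRY H. ENFIELD"))  # HARRY H ENFIELD
```

Once both databases use the same name form, records for the same person can be matched on the normalized value.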
Name and Address Processing in Oracle9i Warehouse Builder
Name and address mapping operator supports:
• Parsing
• Standardization
• Postal matching and geocoding
Quality Data: Importance and Benefits
• Quality data:
– Key to a successful warehouse implementation
• Quality data helps you in:
– Targeting the right customers
– Determining buying patterns
– Identifying householders: private and commercial
– Matching customers
– Identifying historical data
Quality: Standards and Improvements
• Setting standards:
– Define a quality strategy
– Decide on optimal data-quality level
• Improving operational data quality:
– Consider modifying rules for operational data
– Document the sources
– Create a data stewardship program
– Design the cleanup process carefully
– Initial cleanup and refresh routines may differ
Data Quality Guidelines
Operational data:
• Should not be used directly in the warehouse
• Must be cleaned for each increment
• Is not simply fixed by modifying applications
Data Quality: Solutions and Management
Solutions:
• COBOL, Java, 4GL
• Specialized tools
• Customized data conversion process:
– Investigation
– Conditioning and standardization
– Integration
Management:
• Take responsibility
• Resolve problems
• Data quality manager
Transformation Techniques
• Merging data
• Adding a Date Stamp
• Adding Keys to Data
Merging Data
• Operational transactions do not usually map one-to-one with warehouse data.
• Data for the warehouse is merged to provide information for analysis.
Pizza sales/returns by day, hour, seconds
Sale 1/2/02 12:00:01 Ham Pizza $10.00
Sale 1/2/02 12:00:02 Cheese Pizza $15.00
Sale 1/2/02 12:00:02 Anchovy Pizza $12.00
Return 1/2/02 12:00:03 Anchovy Pizza - $12.00
Sale 1/2/02 12:00:04 Sausage Pizza $11.00
Merging Data (continued)
Pizza sales after merging (the Anchovy Pizza sale and its return cancel out):
Sale 1/2/02 12:00:01 Ham Pizza $10.00
Sale 1/2/02 12:00:02 Cheese Pizza $15.00
Sale 1/2/02 12:00:04 Sausage Pizza $11.00
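One possible way to merge the sales and returns from the example, sketched under a stated assumption: a return cancels one sale of the same item and amount. That matching rule and the function name are hypothetical; a real process would follow the business rules for returns.

```python
from collections import Counter

transactions = [
    ("Sale", "1/2/02 12:00:01", "Ham Pizza", 10.00),
    ("Sale", "1/2/02 12:00:02", "Cheese Pizza", 15.00),
    ("Sale", "1/2/02 12:00:02", "Anchovy Pizza", 12.00),
    ("Return", "1/2/02 12:00:03", "Anchovy Pizza", -12.00),
    ("Sale", "1/2/02 12:00:04", "Sausage Pizza", 11.00),
]


def merge_sales(rows):
    """Drop each return together with one matching sale (same item and amount)."""
    returned = Counter(
        (item, -amount) for kind, _, item, amount in rows if kind == "Return"
    )
    merged = []
    for kind, when, item, amount in rows:
        if kind == "Return":
            continue
        if returned[(item, amount)] > 0:
            returned[(item, amount)] -= 1  # this sale is cancelled by a return
            continue
        merged.append((kind, when, item, amount))
    return merged


for row in merge_sales(transactions):
    print(row)
```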
Adding a Date Stamp
• Time element can be represented as a:
– Single point in time
– Time span
• Add time element to:
– Fact tables
– Dimension data
Adding a Date Stamp: Fact Tables and Dimensions
• Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
• Item Table: Item_id, Dept_id, Time_key
• Store Table: Store_id, District_id, Time_key
• Time Table: Week_id, Period_id, Year_id, Time_key
• Product Table: Product_id, Time_key, Product_desc
Adding Keys to Data
Operational data:
#1 Sale 1/2/98 12:00:01 Ham Pizza $10.00
#2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00
#3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00
#4 Return 1/2/98 12:00:03 Anchovy Pizza - $12.00
#5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00
Warehouse data:
#dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00
#dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00
#dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00
Keys can be data values or artificial keys.
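A minimal sketch of assigning artificial keys to warehouse rows, producing the #dw1, #dw2, #dw3 keys from the example. The function name and the `prefix` parameter are hypothetical; in a database this would typically come from a sequence.

```python
from itertools import count


def add_warehouse_keys(rows, prefix="dw"):
    """Attach a sequential artificial key to each row, independent of source keys."""
    sequence = count(1)
    return [(f"#{prefix}{next(sequence)}", *row) for row in rows]


sales = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", 10.00),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", 15.00),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", 11.00),
]
keyed = add_warehouse_keys(sales)
print([row[0] for row in keyed])  # ['#dw1', '#dw2', '#dw3']
```

Because the warehouse keys do not reuse the operational numbers, gaps left by merged-out rows (such as #3 and #4) do not carry over.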
Summarizing Data
1. During extraction on staging area
2. After loading to the warehouse server
Diagram: Operational databases -> Staging area -> Warehouse database
Maintaining Transformation Metadata
Transformation metadata contains:
• Transformation rules
• Algorithms and routines
Diagram: Sources -> Extract -> Stage -> Transform (Rules) -> Load -> Publish -> Query
Maintaining Transformation Metadata
• Restructure keys
• Identify and resolve coding differences
• Validate data from multiple sources
• Handle exception rules
• Identify and resolve format differences
• Fix referential integrity inconsistencies
• Identify summary data
Data Ownership and Responsibilities
• Data ownership and responsibilities should be shared by the:
– Operational team
– Data warehouse team
• Business benefit is gained with a “work together” approach
Transformation Timing and Location
• Transformation is performed:
– Before load
– In parallel
• Can be initiated at different points:
– On the operational platform
– In a separate staging area
Choosing a Transformation Point
• Workload
• Impact on environment
• CPU usage
• Disk space
• Network bandwidth
• Parallel execution
• Load window time
• User information needs
Monitoring and Tracking
Transformations should:
• Be self-documenting
• Provide summary statistics
• Handle process exceptions
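A small sketch of a transformation wrapper that reports summary statistics and keeps going when a row fails. The counter names, the reject list, and the choice of exceptions caught are assumptions for illustration.

```python
def run_transformation(rows, transform):
    """Apply transform to every row, collecting statistics and rejected rows
    instead of stopping at the first failure."""
    stats = {"read": 0, "written": 0, "rejected": 0}
    output, rejects = [], []
    for row in rows:
        stats["read"] += 1
        try:
            output.append(transform(row))
            stats["written"] += 1
        except (ValueError, KeyError) as exc:
            # Keep the bad row and the reason so it can be reviewed later.
            rejects.append((row, str(exc)))
            stats["rejected"] += 1
    return output, rejects, stats


output, rejects, stats = run_transformation(["10", "x", "30"], int)
print(stats)  # {'read': 3, 'written': 2, 'rejected': 1}
```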
Designing Transformation Processes
• Analysis:
– Sources and target mappings, business rules
– Key users, metadata, grain
• Design options:
– Third-party tools
– Custom 3GL programs
– 4GLs like SQL or PL/SQL
– Replication
• Design issues:
– Performance
– Size of the staging area
– Exception handling, integrity maintenance
Transformation Tools
• Third-party tools
• SQL*Loader
• In-house developed programs
Oracle’s Enhanced Features for Transformation
Transformation methods: Multistage transformation
Diagram: Flat files are loaded into staging tables; the data is transformed and validated through successive staging tables (Staging table 1, Staging table 2), then merged into the warehouse tables of the data warehouse.
Oracle’s Enhanced Features for Transformation
Transformation methods: Pipelined transformation
Diagram: Flat files -> External table -> Table functions (transform data, validate data) -> Merge into warehouse tables -> Warehouse tables
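The pipelined idea, where rows stream through each step without being written to intermediate staging tables, can be illustrated with Python generators. This is an analogy only, not Oracle external tables or table functions; the column names, sample data, and the order of the validate and transform steps are assumptions.

```python
import csv
import io


def read_external(flat_file):
    """Stream rows from a flat file without loading it into a staging table."""
    yield from csv.DictReader(flat_file)


def validate(rows):
    """Pass through only rows with a non-empty customer id."""
    for row in rows:
        if row["cust_id"].strip():
            yield row


def transform(rows):
    """Convert types and standardize text, one row at a time."""
    for row in rows:
        yield {"cust_id": int(row["cust_id"]), "name": row["name"].strip().upper()}


# Hypothetical flat-file contents.
flat_file = io.StringIO("cust_id,name\n50, acme inc\n,missing id\n130,other co\n")
pipeline = transform(validate(read_external(flat_file)))
loaded = list(pipeline)  # rows are pulled through all steps only here
print(loaded)
```

Each row passes through every step before the next row is read, so no step needs the full data set in memory.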
Oracle’s Enhanced Features for Transformation
Transformation mechanisms
Using SQL:
• CREATE TABLE AS SELECT (CTAS)
• UPDATE
• MERGE
• Multitable INSERT
Merge diagram (Cust into Customer; key values shown: 50, 130 and 50, 60, 80, 130): an existing row is updated and a new row is inserted.
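The update-or-insert behavior of a merge can be sketched with a dictionary keyed by customer id. This illustrates the semantics only and is not Oracle's MERGE statement; the key values echo the diagram, while the row contents and the function name are hypothetical.

```python
def merge_rows(target, source):
    """Upsert source rows into target, keyed by customer id.
    Returns which ids were updated and which were inserted."""
    updated, inserted = [], []
    for cust_id, row in source.items():
        (updated if cust_id in target else inserted).append(cust_id)
        target[cust_id] = row
    return updated, inserted


customer = {50: "old name", 60: "b", 80: "c"}  # hypothetical target contents
cust = {50: "new name", 130: "d"}              # hypothetical source contents
updated, inserted = merge_rows(customer, cust)
print(updated, inserted)  # [50] [130]
```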
Oracle’s Enhanced Features for Transformation
Transformation mechanisms: Multitable INSERT
Diagram: Source table -> Multitable INSERT (condition) -> Target table 1, Target table 2, Target table 3
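A sketch of the routing that a conditional multitable insert performs: each source row goes to every target whose condition it satisfies. The `amount` column, the thresholds, and the target names are hypothetical.

```python
def multitable_insert(source_rows, routes):
    """Send each source row to every target whose condition it satisfies."""
    targets = {name: [] for name, _ in routes}
    for row in source_rows:
        for name, condition in routes:
            if condition(row):
                targets[name].append(row)
    return targets


# Hypothetical conditions on a hypothetical "amount" column.
routes = [
    ("target_1", lambda r: r["amount"] < 100),
    ("target_2", lambda r: 100 <= r["amount"] < 1000),
    ("target_3", lambda r: r["amount"] >= 1000),
]
rows = [{"amount": 50}, {"amount": 500}, {"amount": 5000}]
targets = multitable_insert(rows, routes)
print({name: len(t) for name, t in targets.items()})
```

The source is scanned once even though several targets are filled, which is the main appeal of this mechanism over separate INSERT statements per target.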
Oracle’s Enhanced Features for Transformation
Transformation mechanisms (continued)
• Using PL/SQL:
– Used for complex transformations
• Using table functions. Table functions can:
– Return multiple rows from a function
– Accept results of multiple-row SQL subqueries as input
– Take cursors as input
– Be parallelized
– Support incremental pipelining
Summary
In this lesson, you should have learned how to:
• Define transformation
• Identify possible staging models
• Identify data anomalies and eliminate them
• Explain the importance of quality data
• Describe techniques for transforming data
• Design a transformation process
• List Oracle’s enhanced features and tools that can be used to transform data
Practice 6-1 Overview
This practice covers the following topics:
• Answering a series of questions based on the business scenario for Frontier Airways
• Answering a series of short questions