7/28/2019 ETL Review
http://slidepdf.com/reader/full/etl-review 1/30
ETL Software
Joanna Frazier    Abhishek Sengupta
Chris Kadlec    Erik Shepard
Susan Kost    Brian Strok
Ivan Vasquez
What is ETL?
Short for extract, transform, and load. Three database functions that are combined into one tool to pull data out of one database and place it into another database.
ETL is used to migrate data from one database to another, to form data marts and data warehouses, and also to convert databases from one format or type to another.
http://www.pcwebopedia.com/TERM/E/ETL.html
Side Note:
It should be noted that ETL is not 3 well-
defined steps.
We are breaking them up and presenting a
theoretical view for ease of understanding
before bringing them together and showing
you how this method actually works in the
real business world.
Extraction
Data needs to be taken from some data source
so that it can be put into the data
warehouse. This is done in one of two ways:
1. Some code at the data source exports the
data to be used.
2. Some external program takes the data
from the source.
Extraction (cont)
If the data is exported, it is typically
exported into a text file that can then be
brought into an intermediary database.
If the data is extracted from the source, it is
typically transferred directly into an
intermediary database.
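The export path described above can be sketched roughly as follows. This is a minimal illustration, not the presenters' tooling: it assumes the export is a comma-delimited text file, uses SQLite as a stand-in for the intermediary database, and the table name `staging_customers` and its columns are hypothetical.

```python
import csv
import io
import sqlite3


def load_export_into_staging(export_text: str, conn: sqlite3.Connection) -> int:
    """Copy rows from a delimited text export into a staging table, unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_customers "
        "(customer_id TEXT, name TEXT, state TEXT)"
    )
    reader = csv.DictReader(io.StringIO(export_text))
    rows = [(r["customer_id"], r["name"], r["state"]) for r in reader]
    conn.executemany("INSERT INTO staging_customers VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical export produced by code at the data source.
SAMPLE_EXPORT = "customer_id,name,state\n1,Acme,GA\n2,Globex,NY\n"

staging = sqlite3.connect(":memory:")  # stand-in for the intermediary database
loaded = load_export_into_staging(SAMPLE_EXPORT, staging)
```

Values are kept as text at this stage on purpose: cleansing and type conversion belong to the transformation step that follows.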
Data Transformation
Locates, extracts, conditions, scrubs and
loads data onto the data warehouse platform
Physical database design must be available
before loading can be performed
“Designs the process and develops the utilities and
programming that allow the data warehouse to be
initially loaded and maintained”
Data Transformation
3 major steps
- Data Cleansing
- Data Integration
- Other Transformations (includes replacement of codes, derived values, calculating aggregates)
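As a rough illustration of the "other transformations" listed above, the sketch below replaces a source code with a readable value, computes a derived value, and then calculates an aggregate. The lookup table `STATE_CODES`, the field names, and the sample orders are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical lookup table for replacing source-system codes.
STATE_CODES = {"GA": "Georgia", "NY": "New York"}


def transform(rows):
    """Replace codes and add a derived value (total) to each row."""
    out = []
    for row in rows:
        out.append({
            "state": STATE_CODES.get(row["state"], "Unknown"),  # code replacement
            "quantity": row["quantity"],
            "unit_price": row["unit_price"],
            "total": row["quantity"] * row["unit_price"],       # derived value
        })
    return out


def aggregate_by_state(rows):
    """Calculate an aggregate: the sum of totals per state."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["state"]] += row["total"]
    return dict(totals)


orders = [
    {"state": "GA", "quantity": 2, "unit_price": 10.0},
    {"state": "GA", "quantity": 1, "unit_price": 5.0},
    {"state": "NY", "quantity": 3, "unit_price": 4.0},
]
transformed = transform(orders)
totals = aggregate_by_state(transformed)
```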
Data Cleansing
Dirty Data
Dummy Values
Absence of Data
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Reused Primary Keys
Non-unique Identifiers
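A few of the dirty-data categories above (dummy values, absence of data, non-unique identifiers) can be detected mechanically. The following is a minimal sketch under stated assumptions: the set of placeholder strings in `DUMMY_VALUES`, the field names, and the sample records are hypothetical, and a real cleansing pass would need rules specific to each source.

```python
from collections import Counter

# Assumed placeholder strings that stand in for "no real value".
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", "XXXXX"}


def find_problems(records, key="id", fields=("ssn", "city")):
    """Return (record index, field, problem) tuples for suspect values."""
    problems = []
    key_counts = Counter(r.get(key) for r in records)
    for i, rec in enumerate(records):
        for f in fields:
            value = rec.get(f)
            if value is None or str(value).strip() == "":
                problems.append((i, f, "absent"))
            elif str(value).strip().upper() in DUMMY_VALUES:
                problems.append((i, f, "dummy"))
        if key_counts[rec.get(key)] > 1:
            problems.append((i, key, "non-unique"))
    return problems


records = [
    {"id": 1, "ssn": "123-45-6789", "city": "Athens"},
    {"id": 2, "ssn": "999-99-9999", "city": ""},
    {"id": 2, "ssn": "987-65-4321", "city": "Macon"},
]
problems = find_problems(records)
```

Cryptic or contradicting data generally cannot be caught by simple checks like these and tends to need business rules or manual review.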
Data Integration
2 Major Problems
- Data that should be related but cannot be
(May arise due to non-unique primary keys or, more often, the absence of primary keys)
- Data that is inadvertently related but should
not be
(Occurs when fields or records are reused for
multiple purposes)
Loading
The populating of tables that presentation
applications will use to make data available
to users
One of the most critical operations in any warehouse,
yet often neglected
Loading (cont)
The loading process can be broken down into 2 different types:
– Initial Load
– Continuous Load (loading over time)
Initial Load
Consists of populating tables in warehouse
schema and verifying data readiness
Examples:
– DTS – Data Transformation Services
– bcp utility – bulk copy program
– SQL*Loader
– Native database languages (T-SQL, PL/SQL, etc.)
Continuous Loads
Must be scheduled and processed in a
specific order to maintain integrity,
completeness, and a satisfactory level of
trust
Should be the most carefully planned step in
data warehousing; otherwise it can lead to:
– Error duplication
– Exaggeration of inconsistencies in data
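One common way to keep a continuous load from duplicating rows when a batch is rerun is to key each row and replace on conflict. The sketch below assumes SQLite as a stand-in warehouse; the `sales` table, the batch labels, and the sample rows are hypothetical, and this is only one possible approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL, loaded_batch TEXT)"
)


def load_batch(conn, batch_id, rows):
    """Load one batch atomically; rerunning the same batch does not duplicate rows."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            "INSERT OR REPLACE INTO sales (sale_id, amount, loaded_batch) "
            "VALUES (?, ?, ?)",
            [(sale_id, amount, batch_id) for sale_id, amount in rows],
        )


initial = [(1, 100.0), (2, 250.0)]   # initial load
nightly = [(2, 260.0), (3, 75.0)]    # later batch: one correction, one new row

load_batch(conn, "initial", initial)
load_batch(conn, "night-1", nightly)
load_batch(conn, "night-1", nightly)  # accidental rerun of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Batches still have to be applied in order: loading "night-1" before "initial" would leave the older amount for sale 2 in place.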
Continuous Loads (cont)
Must run during a fixed batch window
(usually overnight)
Must maximize system resources to load
data efficiently in allotted time
– Ex. Red Brick Loader can validate, load, and
index up to 12GB of data per hour on an SMP
system
Additional Aspects of Loader
Should support:
– Aggregations – build on past data (SUM, MODIFY, APPEND, UPDATE, etc.)
– Filtering – additional cleaning and filtering based on user instructions
– Integrity – ensure data to be loaded meets integrity constraints previously established
– Index Building – create indexes associated with the data being loaded
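The integrity, filtering, and index-building aspects above might look roughly like this in a hand-written loader. It is a sketch only: SQLite stands in for the warehouse, and the `fact_sales` table, the set of valid regions, and the rule that amounts must be non-negative are assumptions chosen for illustration.

```python
import sqlite3


def load_with_checks(conn, rows, valid_regions):
    """Load rows that pass integrity checks, then build an index on the loaded data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(sale_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    accepted, rejected = [], []
    for row in rows:
        _sale_id, region, amount = row
        # Integrity rules assumed for this example.
        if region in valid_regions and amount is not None and amount >= 0:
            accepted.append(row)
        else:
            rejected.append(row)
    with conn:  # load and index in one transaction
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", accepted)
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_fact_sales_region ON fact_sales (region)"
        )
    return accepted, rejected


warehouse = sqlite3.connect(":memory:")
incoming = [(1, "GA", 10.0), (2, "ZZ", 5.0), (3, "NY", -1.0), (4, "NY", 7.5)]
accepted, rejected = load_with_checks(warehouse, incoming, {"GA", "NY"})
```

Rejected rows are returned rather than silently dropped so that they can be reported and corrected at the source.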
Questions to Ask
Data Source Connectivity: Oracle, Sybase, Informix, mainframe (CICS), flat files.
Functionality: pre-built Transformations
Metadata: Open Architecture, Reporting Capability, Extensibility
Performance: Engine Driven, Code Generator, Bulk Loading, "Data never touches the ground", Multi-threaded processes.
Administration: Versioning, Debugging, Auditing
More Questions to Ask
Backup and Disaster Recovery: Restart logic, Error detection
Modeling Tool Connectivity: ERwin, PowerDesigner
Ease of Use: GUI, Intuitive design, integrated toolset
Programming Language Supported: VB, C, C++, COBOL
Support: 24x7, Devoted Staff levels
ETL Vendors
Ascential
SAS
NCR Teradata
IBM
Oracle
Vality
Firstlogic
ETL Tool Set
Purchase or Grow Your Own
100s of Vendors: www.dwinfocenter.org/clean.html
Pricing Varies Widely
Trend – Included as part of other initiatives
– CRMs
• NCR’s Teradata
– Data Warehouses
• Oracle, Red Brick, DB2, Prism, Sybase, Teradata, Informix, Microsoft SQL Server
Pricing Trends
Costs
– FireSpout, ETL Engine
• Start at $150K
– MetaRecon Enterprise
• Server Package $250K, Client Package $50K
Pricing Trends
IBM, DB2, 7
– ASPs and other partners – pay with a percentage of revenue received from customers once the solution is running, on a per-subscriber or per-transaction basis.
– Still offer per-user base pricing model. Majority of database purchases are sold with an accompanying application and will still be done this way.
Formation 1.4
– Informix databases, Red Brick Warehouse, Oracle8 Server, Microsoft SQL Server databases.
– $7500 per processor for the Formation Flow Engine
No use of ETL Tools
Start Immediately
Any logic set can be programmed
Disadvantages –
– Many programs to build
– Transformation logic is complex
– Lengthy program build process
– No automatic metadata generation
– Maintenance – constant changes
– Infrastructure is very expensive
www.nyoug.org/dwetl_ny.pdf
Use ETL Tools
Enables rapid application development
(RAD)
Allows easy maintenance
Generates metadata automatically
Reduces development costs
Disadvantages –
– Learning curve
– Some limits to logic capabilities
www.nyoug.org/dwetl_ny.pdf
ETL for Spatial Data Warehousing
What is spatial data warehousing?
Spatial Data Warehousing is the aggregation of discrete spatial databases together in a single repository, along with associated value-added tabular datasets.
The data often come from disparate data sources, e.g. roads from the Department of Transportation, rivers and lakes from the Department of Natural Resources, etc.
Spatial Data Types
Functionally, there are two principal spatial data types:
– Vector – Geometric data such as points, lines, and polygons. Examples would be roads, contour lines, schools, etc.
– Raster – Continuous or image data. An example would be aerial photography.
Demonstration
Georgia 2000 Information System
The Georgia 2000 Information System aggregates spatial and tabular data from a wide variety of sources.
The foundation of Georgia 2000 is the map data: for example, political boundaries, roads, water features, facilities, locations, etc. Tabular data is value-added to the map data with information such as spending patterns per county, etc.
Problems unique to spatial data warehousing
Coordinate System (Projection)
Geometric Errors – Misalignment of geometric features
Geometric Errors – Distortions of photography due to camera angle, height displacement, etc.
Topological Errors – Little pieces of unidentified areas called slivers. Can account in total for large areas.
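One possible way to flag sliver polygons is to look for shapes that are very thin relative to their perimeter. The sketch below uses the shoelace formula for area and a compactness ratio (4πA / P²); the threshold `max_compactness` and the sample coordinates are assumptions for illustration, and production GIS tools may use different criteria.

```python
import math


def polygon_area_and_perimeter(points):
    """Return (area, perimeter) of a simple polygon given as (x, y) vertices."""
    area2 = 0.0
    perimeter = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area2 += x1 * y2 - x2 * y1              # shoelace formula term
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2.0, perimeter


def is_sliver(points, max_compactness=0.05):
    """Flag polygons that are very thin relative to their perimeter."""
    area, perimeter = polygon_area_and_perimeter(points)
    if perimeter == 0:
        return True
    compactness = 4 * math.pi * area / perimeter ** 2   # 1.0 for a circle
    return compactness < max_compactness


# Hypothetical shapes: a thin gap between two map layers, and a normal parcel.
thin_gap = [(0, 0), (100, 0), (100, 0.01), (0, 0.01)]
county = [(0, 0), (10, 0), (10, 10), (0, 10)]
```

Here `is_sliver(thin_gap)` is true and `is_sliver(county)` is false; summing the areas of flagged polygons shows how much territory the slivers account for in total.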
ETL for spatial data warehousing involves systematic corrections of geometric, topological or coordinate system problems.
Another type of spatial data can be produced from a process called “geocoding”, in which points are located along a network (for example, a street network).
The quality of the underlying tabular data used as input affects the quality of geocoding. Correcting this tabular data for good results from geocoding requires the same types of ETL as does traditional data warehousing.
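To make the link between tabular quality and geocoding concrete, here is a minimal sketch of one common geocoding technique, address-range interpolation along a street segment. It is not necessarily the method used in the system demonstrated; the function name, the address range, and the coordinates are hypothetical.

```python
def interpolate_address(house_number, from_number, to_number, start_xy, end_xy):
    """Place a house number along a street segment by linear interpolation."""
    if from_number == to_number:
        fraction = 0.5                      # single-address segment: use the midpoint
    else:
        fraction = (house_number - from_number) / (to_number - from_number)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp numbers outside the range
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical segment covering house numbers 100 to 200.
point = interpolate_address(150, 100, 200, (0.0, 0.0), (10.0, 0.0))
```

A mistyped house number or a wrong address range in the tabular input shifts the resulting point directly, which is why the input data needs the same cleansing as any other warehouse source.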