14
BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Embed Size (px)

Citation preview

Page 1: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

BYLECTURER/ AISHA DAWOOD

DW Lab # 3Overview of Extraction,

Transformation, and Loading

Page 2: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Overview of ETL in Data WarehousesYou need to load your data warehouse regularly so that it can serve

its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the data warehouse.

The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL refers to a broad process, and not three well-defined steps. The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems are the IT backbone of any enterprise.

LAB EXERCISE #3 Oracle Data Warehousing

Page 3: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

ETL Basics in Data Warehousing What happens during the ETL process? During extraction, the desired data is identified and extracted from many different

sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time.

Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period of time.

After data is extracted, it has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too.

LAB EXERCISE #3 Oracle Data Warehousing

Page 4: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

The need of ETL in data warehouseAs the data warehouse is a living IT system, sources and targets

might change. Those changes must be maintained and tracked through the lifespan of the system without overwriting or deleting the old ETL process flow information.

To build and keep a level of trust about the information in the warehouse, the process flow of each individual record in the warehouse can be reconstructed at any point in time in the future in an ideal case.

LAB EXERCISE #3 Oracle Data Warehousing

Page 5: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Extraction Methods in Data Warehouses Logical Extraction Methods There are two types of logical extraction: ■ Full Extraction ■ Incremental Extraction

Full Extraction The data is extracted completely from the source system. Because this extraction reflects all the data

currently available on the source system, there's no need to keep track of changes to the data source since the last successful extraction. An example for a full extraction may be an export file of a distinct table or a remote SQL statement scanning the complete source table.

Incremental Extraction At a specific point in time, only the data that has changed since a well-defined event back in history

will be extracted. This event may be the last time of extraction or a more complex business event like the last booking day of a fiscal period. To identify this delta change there must be a possibility to identify all the changed information since this specific time event. This information can be either provided by the source data itself such as an application column, reflecting the last-changed timestamp.

LAB EXERCISE #3 Oracle Data Warehousing

Page 6: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Physical Extraction Methods Depending on the chosen logical extraction method and the capabilities and restrictions on the source

side, the extracted data can be physically extracted by two mechanisms. There are the following methods of physical extraction:

■ Online Extraction

■ Offline Extraction

Online Extraction The data is extracted directly from the source system itself. The extraction process can connect directly

to the source system to access the source tables themselves or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables).

Offline Extraction The data is not extracted directly from the source system but is staged explicitly outside the original

source system. The data already has an existing structure (for example, redo logs, archive logs or transportable tablespaces) or was created by an extraction routine.

LAB EXERCISE #3 Oracle Data Warehousing

Page 7: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Offline Extraction

You should consider the following structures:

■ Flat files

Data in a defined, generic format. Additional information about the source object

is necessary for further processing.

■ Dump files

Oracle-specific format. Information about the containing objects may or may not

be included, depending on the chosen utility.

■ Redo and archive logs

Information is in a special, additional dump file.

■ Transportable tablespaces

A powerful way to extract and move large volumes of data between Oracle

databases.

LAB EXERCISE #3 Oracle Data Warehousing

Page 8: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Change Data Capture

There are several techniques for implementing a self-developed change capture on Oracle Database source systems:

■ Timestamps

■ Partitioning

■ Triggers

These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems. Thus, each of these techniques must be carefully evaluated by the owners of the source system prior to implementation.

LAB EXERCISE #3 Oracle Data Warehousing

Page 9: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Timestamps The tables in some operational systems have timestamp columns. The timestamp specifies the time and

date that a given row was last modified. If the tables in an operational system have columns containing timestamps, then the latest data can easily be identified using the timestamp columns.

For example, the following query might be useful for extracting to’27-JUN-08’s data from an orders table in OE schema:

LAB EXERCISE #3 Oracle Data Warehousing

Such modificationwould require, first, modifying the operational system's tables to include a new timestamp column and then creating a trigger to update the timestamp column following every operation that modifies a given row.

Page 10: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Partitioning Some source systems might use range partitioning, such that the source tables are partitioned along a date key, which allows for easy identification of new data. For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week's data.

LAB EXERCISE #3 Oracle Data Warehousing

Page 11: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Data Warehousing Extraction Examples You can extract data in two ways: ■ Extraction Using Data Files ■ Extraction Through Distributed Operations

Most database systems provide mechanisms for exporting or unloading data from the

internal database format into flat files. When the source system is an Oracle database, several alternatives are available for extracting data into files: ■ Extracting into Flat Files Using SQL*Plus ■ Extracting into Flat Files Using OCI or Pro*C Programs ■ Exporting into Export Files Using the Export Utility ■ Extracting into Export Files Using External Tables

LAB EXERCISE #3 Oracle Data Warehousing

Page 12: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

Extracting into Flat Files Using SQL*Plus: The most basic technique for extracting data is to execute a SQL query in SQL*Plus and direct the

output of the query to a file. For example, to extract a flat file, country_city.log, with the pipe sign as delimiter between column values, containing a list of the cities in the US in the tables countries and customers, the following SQL script could be run:

SPOOL country_city.log

SELECT distinct t1.country_name ||'|'|| t2.cust_city

FROM countries t1, customers t2 WHERE t1.country_id = t2.country_id

AND t1.country_name= 'United States of America';

SPOOL off

LAB EXERCISE #3 Oracle Data Warehousing

Page 13: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

The output location identified by the command “host cd” … The output file:

LAB EXERCISE #3 Oracle Data Warehousing

Page 14: BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading

This extraction technique can be parallelized by initiating multiple, concurrent SQL*Plus sessions, each session running a separate query representing a different portion of the data to be extracted. For example, suppose that you wish to extract data from an orders table, and that the orders table has been range partitioned by month. (OE schema)

SPOOL order_SEP.dat

SELECT *

FROM orders

WHERE order_date BETWEEN

TO_DATE('01-SEP-07') AND

TO_DATE('29-SEP-07');

SPOOL OFF

Note: The physical method is based on a range of values. By viewing the data dictionary

LAB EXERCISE #3 Oracle Data Warehousing