Introduction to ETL process


By Omid Vahdaty

Assumptions

●ETL = Extract, Transform, Load

●SQL knowledge

●DW concepts

Concepts

●Dimensions

●Facts

●Aggregate facts

●Data mart

BI vs ETL?

●ETL moves data from DB to DB

○ Tools: Talend, Informatica, SAP BODS, Oracle Data Integrator, Microsoft SSIS

●BI is

○ Ad hoc queries

○ Dashboarding

○ Tools: SAP BO, IBM Cognos, Jaspersoft, Tableau, Oracle BI

ETL

●Extract - pull data from the source DB via jobs.

●Transform -

○ Change the format of the data before loading.

○ Cleaning the data

○ Remove bad data or fix it.

○ Data integrity

●Load - write the transformed data into the target system.

ETL tool layers

1. Staging - where extracted data is saved

2. Integration - where the data is transformed and loaded

3. Access - where the data is queried
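The three layers above can be sketched in SQL. This is a minimal illustration only; the table and view names (stg_orders, int_orders, v_orders_daily) and the source database are hypothetical, not from the original slides.

```sql
-- 1. Staging: land the raw extract as-is (hypothetical names throughout)
INSERT INTO stg_orders (order_id, amount, created_at)
SELECT order_id, amount, created_at
FROM source_db.dbo.orders;

-- 2. Integration: clean and transform the staged rows
INSERT INTO int_orders (order_id, amount, created_date)
SELECT order_id, amount, CAST(created_at AS date)
FROM stg_orders
WHERE amount IS NOT NULL;

-- 3. Access: expose a query-friendly view for reporting
CREATE VIEW v_orders_daily AS
SELECT created_date, SUM(amount) AS total_amount
FROM int_orders
GROUP BY created_date;
```

Keeping the raw staging copy separate from the integration tables lets a failed transform be re-run without re-extracting from the source.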

ETL tasks

● Understand the data to be used for reporting

● Review the Data Model

● Source to target mapping

● Data checks on source data

● Packages and schema validation

● Data verification in the target system

● Verification of data transformation calculations and aggregation rules

● Sample data comparison between the source and the target system

● Data integrity and quality checks in the target system

● Performance testing on data

ETL testing

Validation of data movement from the source to the target system.

Verification of data count in the source and the target system.

Verifying data extraction, transformation as per requirement and expectation.

Verifying if table relations – joins and keys – are preserved during the transformation.

Database testing

Verifying if primary and foreign keys are maintained.

Verifying if the columns in a table have valid data values.

Verifying data accuracy in columns. Example − Number of months column shouldn’t have a value greater than 12.

Verifying missing data in columns. Check if there are null columns which actually should have a valid value.
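The database checks above translate directly into queries that should each return zero rows. A sketch in SQL, using hypothetical table and column names (subscriptions, customers, orders) for illustration:

```sql
-- Range check: a number-of-months column should never exceed 12
SELECT * FROM subscriptions
WHERE num_months > 12 OR num_months < 0;

-- Null check: flag rows where a mandatory column is missing
SELECT * FROM customers
WHERE cust_name IS NULL;

-- Key check: orphaned foreign keys (orders pointing at no customer)
SELECT o.order_id
FROM orders o
LEFT JOIN customers c ON o.cust_id = c.cust_id
WHERE c.cust_id IS NULL;
```

Any rows returned are defects to investigate; in a test harness, each query becomes an assertion that its result set is empty.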

ETL testing categories

●Source-to-target testing

○ count testing

○ Data validation testing (duplicates, data integrity)

○ Data transformation

○ Constraint testing (null, unique, keys, ranges)

● Change/delta testing

●End report test

ETL Challenges

●Data loss during ETL

●Incorrect, incomplete or duplicate data

●DW systems contain historical data, so the data volume is too large and complex to fully test in the target system

●Performance

●Checking critical columns

●Supporting date/time formats and time zone conversion

●Supported text encodings

●Ignoring headers in CSV

●Incorrect column count due to separator usage in a text field

Extract validation

● Count check

● Reconcile records with the source data

● Data type check

● Ensure no spam data loaded

● Remove duplicate data

● Check all the keys are in place
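Two of the extract checks above, sketched in SQL. The table names (source_db.dbo.orders, stg_orders) and the load_ts/order_id columns are hypothetical; the duplicate removal uses ROW_NUMBER(), which is one common approach, not necessarily the one the original tooling used.

```sql
-- Count check: source and staging row counts should match
SELECT (SELECT COUNT(*) FROM source_db.dbo.orders) AS source_cnt,
       (SELECT COUNT(*) FROM stg_orders)           AS staging_cnt;

-- Duplicate removal: keep the newest row per business key
WITH ranked AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY load_ts DESC) AS rn
    FROM stg_orders
)
DELETE FROM ranked WHERE rn > 1;
```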

Transform validation

● Data threshold validation check, for example, age value shouldn’t be more than 100.

● Record count check, before and after the transformation logic applied.

● Data flow validation from the staging area to the intermediate tables.

● Surrogate key check.
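The threshold and count checks above can be expressed as simple queries; stg_customers and int_customers are hypothetical staging and intermediate tables used only for illustration.

```sql
-- Threshold check: age should stay within a plausible range
SELECT cust_id, age
FROM stg_customers
WHERE age > 100 OR age < 0;

-- Record count before and after the transformation logic:
-- the two counts should reconcile (or differ by a known, explained amount)
SELECT (SELECT COUNT(*) FROM stg_customers) AS before_cnt,
       (SELECT COUNT(*) FROM int_customers) AS after_cnt;
```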

Load verification

Record count check from the intermediate table to the target system.

Ensure the key field data is not missing or Null.

Check if the aggregate values and calculated measures are loaded in the fact tables.

Check modeling views based on the target tables.

Check if CDC has been applied on the incremental load table.

Data check in dimension table and history table check.

Check the BI reports based on the loaded fact and dimension table and as per the expected results.
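A few of the load checks above as queries. The fact table and keys (fact_sales, cust_key, date_key) and the intermediate table int_sales are hypothetical names for this sketch:

```sql
-- Count check from the intermediate table to the target fact table
SELECT (SELECT COUNT(*) FROM int_sales)  AS intermediate_cnt,
       (SELECT COUNT(*) FROM fact_sales) AS target_cnt;

-- Key fields must not be missing or NULL in the fact table
SELECT * FROM fact_sales
WHERE cust_key IS NULL OR date_key IS NULL;

-- Aggregate reconciliation: the loaded total should match the source total
SELECT (SELECT SUM(amount) FROM int_sales)  AS source_total,
       (SELECT SUM(amount) FROM fact_sales) AS fact_total;
```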

Data duplication validation

●Example:

SELECT Cust_Id, Cust_NAME, Quantity, COUNT(*)
FROM Customer
GROUP BY Cust_Id, Cust_NAME, Quantity
HAVING COUNT(*) > 1;

●Reasons for duplicate data:

○ If no primary key is defined, duplicate values may be inserted.

○ Due to incorrect mapping or environmental issues.

○ Manual errors while transferring data from the source to the target system.

Data integrity testing

● Number check

● Date check

● Null check

● Precision check

● Invalid characters

● Incorrect upper/lower case order
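Sample queries for the character and date checks above, assuming SQL Server. The tables and columns (customers, cust_name, country_code, orders) are hypothetical, and the case check relies on forcing a case-sensitive collation:

```sql
-- Invalid characters: names containing anything outside letters/digits/space
SELECT cust_name FROM customers
WHERE cust_name LIKE '%[^a-zA-Z0-9 ]%';

-- Case check: values not stored in the expected upper case
-- (a case-sensitive collation makes <> see 'us' vs 'US')
SELECT country_code FROM customers
WHERE country_code COLLATE Latin1_General_CS_AS <> UPPER(country_code);

-- Date check: reject future dates or dates before the system existed
SELECT * FROM orders
WHERE order_date > GETDATE() OR order_date < '1990-01-01';
```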

Detailed use cases for testing: https://www.tutorialspoint.com/etl_testing/etl_testing_scenarios.htm

Best practices

●Analyze data

●Fix bad data in the source

●Find a compatible ETL tool

●Monitor ETL job

●Apply incremental ETL techniques when a timestamp is available.
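One common way to implement the incremental technique above is a watermark: record the timestamp of the last successful load and only extract newer rows. The control table etl_watermark and all other names here are hypothetical, for illustration only.

```sql
-- Read the high-water mark left by the previous run
DECLARE @last_load datetime =
    (SELECT last_loaded_at FROM etl_watermark WHERE job_name = 'orders');

-- Extract only the delta: rows created since the last load
INSERT INTO stg_orders (order_id, amount, created_at)
SELECT order_id, amount, created_at
FROM source_db.dbo.orders
WHERE created_at > @last_load;

-- Advance the watermark for the next run
UPDATE etl_watermark
SET last_loaded_at = GETDATE()
WHERE job_name = 'orders';
```

Ideally the watermark update runs in the same transaction as the load, so a failed run does not skip rows on the next attempt.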

Courses & books

●http://www.robertomarchetto.com/talend_data_integration_free_book

●Basic Time series ETL by Omid: https://docs.google.com/document/d/1KoFMeFtxXDGiZIswcGS1o8zmp2ZlPGbX0yj6ceQ_zlU/edit?usp=sharing

Exercise: Original table, create, drop, and add new data.

/*
drop table t;
create table t (
  i int IDENTITY(1,1) NOT NULL,
  d datetime
);
*/

-- assuming unique date values in t, not null

insert into t (d) values (getdate());

SELECT * from t where d>=DATEADD(minute, -1, GETDATE());

Exercise: Staging table

insert into t_staging (i, d)
SELECT * from t where d >= DATEADD(minute, -1, GETDATE());

insert into t_presentation (d,i) select distinct(d),i from t_staging order by d desc;

truncate table t_staging;

Exercise: Presentation

select count(*) from t_presentation;
select count(*) from t;
select * from t_presentation order by d desc;

Talend

Talend sources

Add MSSQL JDBC: https://www.talendforge.org/forum/viewtopic.php?id=54068

How to connect two components: https://www.talendforge.org/forum/viewtopic.php?id=6493

How to create a loop of a job (FYI: right click on the project name, create project; under Job design, far left, upper corner): https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide62EN/tLoop

Running in parallel: https://www.talendbyexample.com/talend-job-parallelization-reference.html

Running several queries for ETL, such as insert into and truncate: http://www.vikramtakkar.com/2013/05/example-to-execute-multiple-sql-queries.html

Recommended