MaxQDPro Team Anjan.K Harish.R II Sem M.Tech CSE 06/07/22 MaxQDPro: Kettle- ETL Tool 1 A Pentaho Data Integration tool

Kettle – ETL Tool

DESCRIPTION

Pentaho Kettle ETL tool demonstration and the gist of the ETL process


MaxQDPro Team: Anjan.K, Harish.R

II Sem M.Tech CSE

04/10/23 MaxQDPro: Kettle- ETL Tool 1

A Pentaho Data Integration tool


Introduction
◦ ETL Process
◦ Pentaho’s Kettle
Data Integration Challenges
Prerequisites and Recent Releases
Pentaho DI Components
JDBC
Spoon
◦ Transformations
◦ Jobs


4 major components:
◦ Extracting
    Gathering raw data from source systems and storing it in the ETL staging environment
    Data profiling
    Identifying data that changed since the last load
◦ Transforming: Cleaning and Conforming
    Processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
    Data cleansing
    Recording error events
    Audit dimensions
    Creating and maintaining conformed dimensions and facts


Data filtering
◦ Is not null, greater than, less than, includes
Field manipulation
◦ Trimming, padding, upper and lowercase conversion
Data calculations
◦ + - X / , average, absolute value, arctangent, natural logarithm
Date manipulation
◦ First day of month, last day of month, add months, week of year, day of year
Data type conversion
◦ String to number, number to string, date to number
Merging fields & splitting fields
Looking up data
◦ Look up in a database, in a text file, an Excel sheet, …
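The transformation categories listed above can be illustrated with a small sketch. This is plain Python, not Kettle itself; the field names, the sample rows, and the `COUNTRY_LOOKUP` table are hypothetical and stand in for whatever the source systems provide.

```python
from datetime import date, datetime
import calendar

# Hypothetical lookup table standing in for a database, text file, or Excel sheet.
COUNTRY_LOOKUP = {"IN": "India", "US": "United States"}


def last_day_of_month(d: date) -> date:
    """Date manipulation: return the last day of the month containing d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def transform(rows):
    """Apply filtering, field manipulation, conversion, and lookup to each row."""
    for row in rows:
        # Data filtering: skip rows whose amount is null.
        if row["amount"] is None:
            continue
        order_date = datetime.strptime(row["order_date"], "%Y-%m-%d").date()
        yield {
            # Field manipulation: trimming and uppercase conversion.
            "customer": row["customer"].strip().upper(),
            # Data type conversion: string to number.
            "amount": float(row["amount"]),
            # Date manipulation: last day of month and ISO week of year.
            "month_end": last_day_of_month(order_date),
            "week_of_year": order_date.isocalendar()[1],
            # Looking up data: resolve a code against the lookup table.
            "country": COUNTRY_LOOKUP.get(row["country_code"], "Unknown"),
        }


sample = [
    {"customer": "  acme ", "amount": "12.50", "order_date": "2008-02-10", "country_code": "IN"},
    {"customer": "globex", "amount": None, "order_date": "2008-03-01", "country_code": "US"},
]
result = list(transform(sample))
```

In Kettle each of these operations would normally be a separate step configured in Spoon; the single function here only groups them for brevity.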


◦ Loading
    Loading data into data warehouse tables
    Managing hierarchies in dimensions
    Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
    Fact table loading
    Building and maintaining bridge dimension tables
    Handling late arriving data
    Management of conformed dimensions
    Administration of fact tables
    Building aggregations
    Building OLAP cubes
    Transferring DW data to other environments for specific purposes
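As a rough illustration of how extracting, transforming, and loading fit together, the sketch below runs all three phases against an in-memory SQLite database. The table names (`src_orders`, `fact_orders`), the columns, and the `last_load` cutoff are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER PRIMARY KEY, customer TEXT, amount TEXT, updated_at TEXT);
    CREATE TABLE fact_orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO src_orders VALUES
        (1, ' acme ', '10.0', '2008-01-05'),
        (2, 'globex', '20.5', '2008-02-01'),
        (3, 'initech', '7.25', '2008-02-15');
""")


def run_etl(conn, last_load):
    """Extract rows changed since last_load, clean them, and load the fact table."""
    # Extract: only rows that changed since the previous load.
    changed = conn.execute(
        "SELECT id, customer, amount FROM src_orders WHERE updated_at > ?",
        (last_load,),
    ).fetchall()
    # Transform: trim and uppercase names, convert amounts to numbers.
    cleaned = [(row_id, name.strip().upper(), float(amount)) for row_id, name, amount in changed]
    # Load: insert new rows or overwrite existing ones in the warehouse table.
    conn.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return len(cleaned)


loaded = run_etl(conn, "2008-01-31")
facts = conn.execute("SELECT id, customer, amount FROM fact_orders ORDER BY id").fetchall()
```

A real warehouse load would also deal with dimensions, surrogate keys, and late arriving data, which this sketch leaves out.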


Complexity and significant operational problems
Source data that exceeds the designers' expectations
Data profiling of a source
Data warehouses typically grow asynchronously
Establishing the scalability of an ETL system across its lifetime


Many off-the-shelf tools exist
High-end tools may not justify their cost for smaller warehouses
Proprietary ETL
◦ High upfront cost
◦ Long-term maintenance
Custom code
◦ Low upfront cost
◦ Support effort grows as business requirements change


Tool                                   Vendor
Oracle Warehouse Builder (OWB)         Oracle
Data Integrator (BODI)                 Business Objects
IBM Information Server (Ascential)     IBM
SAS Data Integration Studio            SAS Institute
PowerCenter                            Informatica
Oracle Data Integrator (Sunopsis)      Oracle
Data Migrator                          Information Builders
Integration Services                   Microsoft
Talend Open Studio                     Talend
DataFlow                               Group 1 Software (Sagent)
Data Integrator                        Pervasive
Transformation Server                  DataMirror
Transformation Manager                 ETL Solutions Ltd.
Data Manager                           Cognos
DT/Studio                              Embarcadero Technologies
ETL4ALL                                IKAN
DB2 Warehouse Edition                  IBM
Jitterbit                              Jitterbit
Pentaho Data Integration               Pentaho


Kettle: Kettle Extraction, Transformation, Transportation & Loading Environment

Part of the open source business intelligence suite from Pentaho, providing powerful data integration. Pentaho was founded in 2004.

Products of Pentaho
◦ Mondrian – OLAP server written in Java
◦ Kettle – ETL tool
◦ Weka – machine learning and data mining tool


Data is everywhere
Data is inconsistent
◦ Records are different in each system
Performance issues
◦ Running queries that summarize data over a long period ties up the system with that task
◦ Brings the OS to maximum load
Data is never all in the data warehouse
◦ Excel sheet, acquisition, new application


Metadata, model-driven approach
◦ What to do? And how to do it?
◦ Complex transformations with zero code
◦ Graphically design data transformations and jobs
100% Java with cross-platform support
Extensible architecture
Repository-based
Full-featured ETL
Integration with Pentaho Open BI Platform


Prerequisites
◦ Java Runtime Environment 1.5 and above
◦ Compatible with almost any platform
◦ Compatible with a wide range of database technologies

Recent Releases
4/25     Data Integration 3.0.3 GA
4/18     Data Integration 3.1 Milestone
2/8      Data Integration 3.0.2 GA
12/12    Data Integration 3.0.1 GA
11/15    Data Integration 3.0 GA
10/31    Data Integration 3.0 RC2
10/24    Data Integration 2.5.2 GA
10/08    Data Integration 3.0 RC1
08/24    Data Integration 2.5.1 GA


Pan
◦ A program to execute transformations designed by Spoon, stored in XML or in a database repository.
◦ Transformations are scheduled in batch mode to be run automatically at regular intervals.
Carte
◦ Simple web server to execute transformations and jobs remotely.
◦ Accepts an XML (using a small servlet) that contains the transformation to execute and the execution configuration.
◦ Allows you to remotely monitor, start and stop the transformations and jobs.
◦ A server running Carte is called a slave server.


Spoon
◦ GUI that allows you to design transformations and jobs that can be run with the Kettle tools Pan and Kitchen.
◦ Transformations and jobs can be described in an XML file or stored in a Kettle database repository.
◦ Spoon is available as a shell script and as a batch file, so the tool can be used in heterogeneous environments.
◦ The latest version of Spoon is the 3.2 beta.
Kitchen
◦ Executes jobs designed by Spoon, stored in XML or in a database repository.
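Since Pan and Kitchen are command-line programs, batch scheduling usually comes down to building the right command line. The sketch below assembles one with the `-file` and `-level` options of the `pan.sh` and `kitchen.sh` launchers; the file paths and the `kettle_home` directory are hypothetical, and the helper functions are illustrative, not part of Kettle.

```python
import subprocess


def build_command(tool, file_path, log_level="Basic"):
    """Build the argument list for a Pan (transformation) or Kitchen (job) run."""
    scripts = {"pan": "pan.sh", "kitchen": "kitchen.sh"}
    if tool not in scripts:
        raise ValueError(f"unknown tool: {tool}")
    return [f"./{scripts[tool]}", f"-file={file_path}", f"-level={log_level}"]


def run(tool, file_path, kettle_home, log_level="Basic"):
    """Run the tool from the Kettle install directory and return its exit code."""
    command = build_command(tool, file_path, log_level)
    return subprocess.run(command, cwd=kettle_home).returncode


# Hypothetical transformation (.ktr) and job (.kjb) files.
pan_cmd = build_command("pan", "/etl/load_customers.ktr")
kitchen_cmd = build_command("kitchen", "/etl/nightly.kjb", log_level="Detailed")
```

A scheduler such as cron could call `run(...)` at regular intervals and use the returned exit code to detect failures.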


Installing
◦ Ensure JRE 1.5 is installed.
◦ Unzip the binary distribution in any folder.
Launching
◦ spoon.bat on the Windows platform
◦ spoon.sh on Unix-like platforms
◦ Create a shortcut with spoon.ico pointing to the bat file
Supported platforms (works on most operating systems)
◦ Microsoft Windows, including Vista
◦ Linux GTK: on i386 and x86_64 processors
◦ Apple's OS X: works on both PowerPC and Intel machines
◦ Solaris: using a Motif interface
◦ AIX, HP-UX, FreeBSD


JDBC (Java Database Connectivity): the Java API for database connectivity.

Latest version: JDBC 3.0

Drivers come in four different types
◦ Type 1: JDBC-ODBC bridge
◦ Type 2: Native-API, partly Java driver
◦ Type 3: Middleware Java driver
◦ Type 4: Direct-to-database Java driver

Microsoft-based databases such as MS Access rely on Type 1 drivers.

Oracle and MySQL can be connected with the other types, but the Type 4 driver is the one traditionally used.

JDBC can also operate in a distributed environment.
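To make the driver types more concrete, the sketch below lists the driver class and connection URL pattern for one Type 1 driver and two Type 4 drivers, and fills in the URL. The class names are those commonly used by drivers of this period; host names, default ports, and database names are placeholders, and the `jdbc_url` helper is hypothetical.

```python
# Driver class names and URL templates for a few common JDBC drivers.
# For the ODBC bridge, "database" is the name of the ODBC data source (DSN).
JDBC_DRIVERS = {
    "odbc": {"type": 1, "driver": "sun.jdbc.odbc.JdbcOdbcDriver",
             "url": "jdbc:odbc:{database}"},
    "mysql": {"type": 4, "driver": "com.mysql.jdbc.Driver",
              "url": "jdbc:mysql://{host}:{port}/{database}"},
    "oracle": {"type": 4, "driver": "oracle.jdbc.driver.OracleDriver",
               "url": "jdbc:oracle:thin:@{host}:{port}:{database}"},
}

# Conventional default ports, used when the caller does not supply one.
DEFAULT_PORTS = {"mysql": 3306, "oracle": 1521}


def jdbc_url(kind, database, host="localhost", port=None):
    """Fill in the URL template for the given database kind."""
    entry = JDBC_DRIVERS[kind]
    port = port or DEFAULT_PORTS.get(kind)
    return entry["url"].format(host=host, port=port, database=database)


mysql_url = jdbc_url("mysql", "sales")
oracle_url = jdbc_url("oracle", "ORCL", host="dbhost")
access_url = jdbc_url("odbc", "AccessDSN")
```

In Spoon the same information is entered through the database connection dialog, so a URL like this rarely has to be typed by hand.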


Key improvements
◦ Execution Results pane for logs, metrics and performance graphs
◦ Improved Database Connection dialog
◦ Snap to grid (graphical workspace)
◦ Zoom (graphical workspace)
◦ Easier-to-use left panel for the objects palette
◦ Over 30 new or improved transformation steps
◦ 13 new or improved job entries
◦ Support for four new database types: MonetDB, KingbaseES, Vertica, and HP NeoView
◦ Improved translations


Repository connection establishment
Auto login
◦ By manually setting the KETTLE_REPOSITORY, KETTLE_USER and KETTLE_PASSWORD environment variables.
Login
◦ By default PDI provides admin as both the login username and the password.
◦ It is strictly advised to change the default password to avoid any security vulnerability.
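The auto login described above relies only on three environment variables, so it can be prepared by any launcher script. A minimal sketch, assuming a hypothetical repository name and placeholder credentials; the `repository_env` helper is illustrative and not part of PDI.

```python
import os


def repository_env(repository, user, password, base_env=None):
    """Return a copy of the environment with the auto-login variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["KETTLE_REPOSITORY"] = repository
    env["KETTLE_USER"] = user
    env["KETTLE_PASSWORD"] = password
    return env


# Placeholder values; real credentials should come from a protected source.
env = repository_env("dev_repo", "etl_user", "change-me", base_env={"PATH": "/usr/bin"})

# The environment could then be handed to the launcher, for example:
# subprocess.run(["./spoon.sh"], cwd="/opt/kettle", env=env)
```

Because the password sits in the process environment in plain text, this approach fits development machines better than shared servers.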


Transformation
Engine capable of performing a multitude of functions such as reading, manipulating and writing data to and from various data sources.
◦ Value: values are part of a row and can contain any type of data.
◦ Row: a row consists of 0 or more values.
◦ Output stream: an output stream is a stack of rows that leaves a step.
◦ Input stream: an input stream is a stack of rows that enters a step.
◦ Hop: a hop is a graphical representation of one or more data streams between 2 steps.
◦ Note: a note is a piece of information that can be added to a transformation.
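The vocabulary of a transformation (values, rows, streams, steps, hops) can be modelled in a few lines. In this simplified sketch each step is a Python generator and each nested call acts as a hop; the step names and sample rows are hypothetical, and Kettle itself runs steps in parallel threads instead of pulling rows one at a time.

```python
def input_step(records):
    """Input step: emits rows (dicts of field name -> value) on its output stream."""
    for record in records:
        yield dict(record)


def uppercase_step(stream, field):
    """Transformation step: reads its input stream and uppercases one value per row."""
    for row in stream:
        row[field] = row[field].upper()
        yield row


def filter_step(stream, field, minimum):
    """Filter step: passes on only rows whose value is at least `minimum`."""
    for row in stream:
        if row[field] >= minimum:
            yield row


def output_step(stream):
    """Output step: collects the rows that reach the end of the transformation."""
    return list(stream)


# Each nested call plays the role of a hop: the output stream of one step
# becomes the input stream of the next.
source = [{"name": "anna", "qty": 3}, {"name": "bob", "qty": 0}]
rows_out = output_step(filter_step(uppercase_step(input_step(source), "name"), "qty", 1))
```

The input step copies each record, so the original `source` list is left unchanged by the downstream steps.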


Jobs
A way of calling transformations and controlling the sequence of their execution. Usually jobs are scheduled in batch mode to be run automatically at regular intervals.
◦ Job entry: a job entry is one part of a job and performs a certain task.
◦ Hop: a hop is a graphical representation of one or more data streams between 2 steps.
◦ Note: a note is a piece of information that can be added to a job.


Input Steps
Output Steps
Lookup Steps
Transformation Steps
Join Steps
DW Steps
Mapping Steps
Job Steps


Table Output Step


Insert / Update Output Step


Besides the execution order, a job hop specifies the condition under which the next job entry is executed:

· “Unconditional” - the next job entry is executed regardless of the result of the originating job entry.

· “Follow when result is true” - the next job entry is executed only when the result of the originating job entry is true.

· “Follow when result is false” - the next job entry is executed only when the result of the originating job entry is false.
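The three hop conditions can be sketched as a small job runner. This is an illustration of the rule only, not Kettle's engine: the entry names, the `run_job` function, and the use of callables that return True or False are assumptions, and the sketch expects a job graph without cycles.

```python
UNCONDITIONAL = "unconditional"
ON_TRUE = "follow when result is true"
ON_FALSE = "follow when result is false"


def run_job(entries, hops, start):
    """Run job entries from `start`, following hops whose condition matches.

    entries: name -> callable returning True (success) or False (failure)
    hops: list of (from_entry, to_entry, condition)
    Returns the names of the executed entries in order.
    """
    executed = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        result = entries[name]()
        executed.append(name)
        for source, target, condition in hops:
            if source != name:
                continue
            if (condition == UNCONDITIONAL
                    or (condition == ON_TRUE and result)
                    or (condition == ON_FALSE and not result)):
                pending.append(target)
    return executed


entries = {
    "start": lambda: True,
    "load": lambda: False,  # simulate a failing job entry
    "notify_ok": lambda: True,
    "notify_error": lambda: True,
}
hops = [
    ("start", "load", UNCONDITIONAL),
    ("load", "notify_ok", ON_TRUE),
    ("load", "notify_error", ON_FALSE),
]
order = run_job(entries, hops, "start")
```

Because the simulated load entry fails, only the hop marked “Follow when result is false” is taken after it.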


Brief introduction to the ETL process
JDBC
Repository connection
Pentaho Data Integration tool
◦ Components: Pan, Carte, Kitchen, Spoon
◦ Transformations with different input data sources
◦ Jobs


kettle.pentaho.org
◦ Kettle project homepage

kettle.javaforge.com
◦ Kettle community website: forum, source, documentation, tech tips, samples, …

www.pentaho.org/download/
◦ All Pentaho modules, pre-configured with sample data
◦ Developer forums, documentation
◦ Ventana Research Open Source BI Survey

www.mysql.com
◦ White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html
◦ Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php
◦ Roland Bouman blog on Pentaho Data Integration and MySQL - http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html
