Infosys Technologies Limited. Lab Guide for Pentaho Data Integration 4.0.1 (also known as Kettle). Version No: 3.0

PDI-Labguide ETL Using Pentaho Data Integration

Table of Contents

Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE
Assignment 1: The Kettle Repository
Assignment 2: My first Data transfer using Kettle
Assignment 3: Using the ‘Add constants’, ‘Calculator’ and ‘Select Values’ steps
Assignment 4: Creating an ODBC data source
Assignment 5: Using the ‘Database Lookup’ step

Assignment 0: Installing PDI 4.0.1 and opening the PDI IDE

Learning Objective: To download and install Pentaho Data Integration 4.0.1, and open the PDI interface.

Step 1: Install the Java Runtime Environment (version 1.4 or higher) on your system.

Step 2: Go to http://www.pentaho.com and download Pentaho Data Integration 4.0.1.

Step 3: Unzip the downloaded PDI zip file. Open the ‘data-integration’ folder, and double-click on the spoon.bat file to open the PDI IDE.

Assignment 1: The Kettle Repository

Learning Objective: To learn the concept of a repository in PDI (Kettle), and how to create, connect to, or disconnect from a repository.

Concept of Repository: The Kettle repository is the workspace that the data integrator works in. This workspace is a physical region of the hard drive that is designated exclusively for Kettle. All information about transformations, jobs, schedules, etc. is stored in the repository. The repository concept promotes re-usability, which in turn saves time and effort.

A repository may be created in two ways:

1) Kettle database repository

2) Kettle file repository

When Kettle is started, the ‘Repository Connection’ dialog box appears, asking you to select a repository from the list of existing repositories, or to create a new one.

To create a file repository:

Step 1: In the ‘Repository Connection’ dialog box, click on the ‘+’ button. The ‘Select the repository type’ dialog box will appear.

Step 2: Select ‘Kettle file repository’ and click on ‘OK’.

Step 3: In the ‘File repository settings’ dialog box, click on the ‘Browse’ button and select a folder that shall exclusively be your file repository space. Fill in the ID and Name, and click on the ‘OK’ button. Then click on the ‘OK’ button of the ‘Repository Connection’ dialog box to select the newly created repository.

You are now ready to create transformations and jobs on this workspace.

To disconnect from the current working repository, go to the Tools menu:

Tools -> Repository -> Disconnect repository

…or alternatively, press Ctrl+D.

NOTE: In the course of working with Kettle, if you want to change your repository or create a new one, first disconnect from the current working repository. Then open the ‘Repository Connection’ dialog box from:

Tools -> Repository -> Connect

…or alternatively, press Ctrl+R.
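
A file repository is ultimately a folder on disk. As a rough illustration, and assuming that transformations are saved with the .ktr extension and jobs with the .kjb extension, the Python sketch below lists what such a folder contains. The function name is hypothetical and is not part of Kettle.

```python
from pathlib import Path


def list_repository_objects(repo_dir):
    """Return (transformations, jobs) found under a file-repository folder.

    Assumes transformations are stored as *.ktr files and jobs as *.kjb files.
    Paths are returned relative to the repository root, sorted by name.
    """
    root = Path(repo_dir)
    transformations = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*.ktr")
    )
    jobs = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.kjb"))
    return transformations, jobs
```

Calling list_repository_objects() with the folder selected in Step 3 should return the transformations and jobs saved so far, which can be a quick way to confirm where Kettle is writing your work.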

Assignment 2: My first Data transfer using Kettle

Learning Objective: To create a simple transformation that transfers data from a flat file to an Access database destination.

Step 1: In the Kettle IDE, select File -> New -> Transformation, or alternatively, press Ctrl+N.

Step 2: To save your transformation file, press Ctrl+S. The ‘Transformation properties’ dialog box opens up. Give the transformation a name of your choice, and then click on ‘OK’.

Step 3: In the ‘Design’ pane on the left of the IDE, expand the ‘Input’ group. Drag and drop ‘Text file input’ onto the transformation design surface.

Step 4: Double-click on the ‘Text file input’. Its properties dialog box opens up. Click on ‘Browse’ to select the flat file to be used as input.

Select the ‘Products.txt’ flat file that will be used as input for the transformation. After clicking on ‘Open’, click on the ‘Add’ button to add the file to the list of selected files.

Step 5: Go to the ‘Content’ tab. Since this is a ‘Comma separated values (CSV)’ flat file, specify the separator as comma (,).

Step 6: Open the ‘Fields’ tab and click on ‘Get fields’. When asked for the number of sample lines, enter 0 (which scans all lines). Review the scan results of the flat file and click on the ‘Close’ button.

You can also view the contents of the text file by clicking on the ‘Preview rows’ button.

Step 7: Once done, click on ‘OK’ to complete the process of defining a flat file input.
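
To make the role of the separator and the header row concrete, here is a minimal Python sketch of reading a comma-separated file. It is only an analogy for what ‘Text file input’ does. The sample rows and column names are made up for illustration, because the actual layout of Products.txt is not shown in this guide.

```python
import csv
import io

# Illustrative sample only; the real Products.txt layout is an assumption.
SAMPLE = "ProductID,ProductName,UnitPrice\n1,Chai,18\n2,Chang,19\n"


def read_delimited(text, separator=","):
    """Parse delimited text with a header row into a list of dict rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    return [dict(row) for row in reader]


rows = read_delimited(SAMPLE)
field_names = list(rows[0].keys())  # comparable to what 'Get fields' detects
preview = rows[:1]                  # comparable to 'Preview rows' with 1 row
```

Unlike the ‘Get fields’ scan, this sketch does not guess data types: every value is read as a string.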

Step 8: Expand the ‘Output’ group in the design pane, and drag and drop ‘Access output’ onto the transformation surface.

A ‘Hop’ determines the sequence of data flow from one step to the next.

To create the hop, use any one of the following methods:

a) Click on the Text file input, then hold down the <SHIFT> key and draw a line to the Access output.

b) Place the mouse pointer on the Text file input until the hover menu appears, and then drag the output connector to the Access output.

c) Place the mouse pointer on the Text file input, press the middle mouse button, then drag the hop pointer and release it on the Access output.
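
Conceptually, a hop hands the rows produced by one step to the next step. The Python sketch below models this with a generator feeding a consumer. It is a simplification under stated assumptions: the function names and sample data are hypothetical, and it does not reproduce how Kettle actually schedules steps.

```python
def text_file_input(lines, separator=","):
    """Source step: turn each text line into a row (a list of field values)."""
    for line in lines:
        yield line.rstrip("\n").split(separator)


def table_output(rows, target):
    """Sink step: append every incoming row to the target list."""
    written = 0
    for row in rows:
        target.append(row)
        written += 1
    return written


source_lines = ["1,Chai", "2,Chang"]  # made-up sample data
destination = []
# The "hop": the output of the first step becomes the input of the second.
rows_written = table_output(text_file_input(source_lines), destination)
```

The direction of the hop matters: rows only flow from the step where the hop starts to the step where it ends.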

Step 9: Double-click on the ‘Access output’ to open its properties dialog box. Since the Access database does not currently exist, enter the file name along with the full path in ‘The database filename’ field. Also enter the name of the target table in the ‘Target table’ field. Keep the ‘Create database’ and ‘Create table’ checkboxes selected, so that the database and the table are created if they do not already exist.

After this is done, click on ‘OK’.
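
The ‘Create database’ and ‘Create table’ options follow a common "create if missing, then insert" pattern. The sketch below shows that pattern in Python, using SQLite purely as a stand-in because it ships with Python. It does not write Access files, and the function, table and column names are assumptions for illustration.

```python
import sqlite3


def write_rows(db_path, table, columns, rows):
    """Create the database and table if missing, insert rows, return row count.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    conn = sqlite3.connect(db_path)  # creates the database if it is missing
    try:
        col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )
        conn.commit()
        return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()


# ":memory:" keeps the demo self-contained; a file path would persist the data.
row_count = write_rows(
    ":memory:", "Products", ["ProductID", "ProductName"],
    [("1", "Chai"), ("2", "Chang")],
)
```

Because the table is only created when it does not exist, running the same load twice against a file-based database appends the rows a second time rather than replacing them.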

Step 10: To run the transformation, click on the green-coloured triangular button.

The ‘Execute a transformation’ dialog box opens up. Click on ‘Launch’ to execute the transformation.

The ‘Execution Results’ pane appears.

In the ‘Step Metrics’ tab, the ‘Active’ column shows ‘Finished’ if the transformation was executed successfully.

Open the ‘Northwind’ Access database file. You will see that the data has been successfully populated in the ‘Products’ table.

Assignment 3: Using the ‘Add constants’, ‘Calculator’ and ‘Select Values’ steps

Learning Objective: To learn how to use the ‘Calculator’ step to calculate a new column from existing column values, and how to select specific fields to be populated in the destination using the ‘Select Values’ step.

Requirements:

i. The columns from the ‘employee’ Excel sheet that are required to be sent to the destination Excel worksheet are: EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, HireDate, City, Country, HomePhone, Extension and ReportsTo.

ii. The ‘FirstName’ and ‘LastName’ columns of the ‘Employee’ table should be stored as a single column in the destination.

Step 1: Create a new transformation called ‘Employee’. Drag and drop ‘Excel input’ onto the transformation surface. Double-click the ‘Excel input’ to open its properties dialog box. Click on ‘Browse’.

Select the Excel workbook that contains the source data for the ‘Employee’ table, and click on the ‘Add’ button to add it to the list of selected files.

Step 2: Go to the ‘Sheets’ tab, and click on ‘Get sheetnames’ to get the list of sheets in the workbook. A dialog appears that asks you to select the sheets you wish to include in the data flow.

Select the sheet named ‘employee’ and click on the ‘>’ button to include it in the list of selected sheets. Then click on ‘OK’.

Step 3: Next, go to the ‘Fields’ tab and click on the ‘Get fields from header row’ button to get a list of the field names from the first row of the Excel sheet ‘employee’.

Click on ‘Preview rows’ and enter the number of rows that you would like to preview. (This facility lets the developer verify that the data can be fetched correctly from the Excel sheet.)

Step 4: Click on ‘OK’ to complete the task of defining a connection to the Excel sheet data source.

Step 5: From the ‘Transform’ group in the Design pane of Kettle, drag and drop the ‘Add constants’ step onto the transformation surface. Create a hop from ‘Excel input’ to ‘Add constants’. Double-click on ‘Add constants’ to open its properties dialog box.

Name the new field ‘space’, and specify the data type as ‘String’ and the length as 1. The value should be a single space character.

After this is done, click on ‘OK’. The ‘Add constants’ step will now add a new field called ‘space’ to the data flow.

Step 6: From the ‘Transform’ group in the Design pane of Kettle, drag and drop the ‘Calculator’ step onto the transformation surface. Create a hop from ‘Add constants’ to ‘Calculator’.

Step 7: Double-click on the ‘Calculator’ to open its properties dialog box.

i. Specify the new field name as ‘FullName’.

ii. Select the calculation type as ‘A+B+C’.

iii. Specify ‘Field A’ as ‘FirstName’, ‘Field B’ as ‘space’, ‘Field C’ as ‘LastName’, ‘Value type’ as ‘String’ and ‘Length’ as 70. Click on ‘OK’.
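
For string fields, the ‘A+B+C’ calculation amounts to concatenation, which is why a constant ‘space’ field is added first. The Python sketch below mirrors Steps 5 to 7 on a single sample row. The function names are hypothetical and the row is sample data, not output from the guide's workbook.

```python
def add_constant(rows, name, value):
    """Add a field with the same constant value to every row."""
    for row in rows:
        yield {**row, name: value}


def calculator_a_plus_b_plus_c(rows, new_field, a, b, c):
    """Add new_field = row[a] + row[b] + row[c] to every row."""
    for row in rows:
        yield {**row, new_field: row[a] + row[b] + row[c]}


employees = [{"FirstName": "Nancy", "LastName": "Davolio"}]  # sample row
with_space = add_constant(employees, "space", " ")
result = list(
    calculator_a_plus_b_plus_c(
        with_space, "FullName", "FirstName", "space", "LastName"
    )
)
```

Note that the helper field ‘space’ is still present in each row after the calculation, which is one reason a later step is needed to drop fields that should not reach the destination.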

Step 8: From the ‘Transform’ group in the Design pane of Kettle, drag and drop the ‘Select values’ step onto the transformation surface. Create a hop from ‘Calculator’ to ‘Select values’.

[NOTE: The ‘Select values’ step is used to remove the columns that are not required further in the data flow. The columns that are retained may also be renamed and cast to another data type, if needed.]

Step 9: Double-click on the ‘Select values’ step to open its properties dialog box. Click on the ‘Get fields to select’ button to fetch the fields that are presently in the data flow.

Step 10: Go to the ‘Remove’ tab. This is where the columns that have to be excluded from the data flow are specified.

Under the ‘Fieldname’ column, click on the drop-down. It shows a list of the fields available in the data flow. Click on the name of the column you wish to exclude. For example, click on ‘Address’, since it is not required further in the data flow.

Do the same for all other fields that have to be excluded.

Step 11: Under the ‘Metadata’ tab, click on the ‘Get fields to change’ button. Remove the fields that are not required in the data flow. Specify the alternative name, data type, length, precision, etc. for each of the input fields (if required).

Once done, click on ‘OK’.
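
The three things ‘Select values’ is used for here (removing, renaming and casting fields) can be sketched in a few lines of Python. This is an illustration of the idea only. The function and its parameters are hypothetical and do not correspond to Kettle's internal API.

```python
def select_values(rows, remove=(), rename=None, cast=None):
    """Drop, rename and convert fields of each row.

    remove: field names to exclude from the output
    rename: mapping of old field name -> new field name
    cast:   mapping of field name -> conversion function (applied before renaming)
    """
    rename = rename or {}
    cast = cast or {}
    for row in rows:
        out = {}
        for name, value in row.items():
            if name in remove:
                continue
            if name in cast:
                value = cast[name](value)
            out[rename.get(name, name)] = value
        yield out


sample = [{"EmployeeID": "1", "Address": "n/a", "space": " ", "FullName": "Nancy Davolio"}]
selected = list(
    select_values(
        sample,
        remove={"Address", "space"},
        rename={"FullName": "Name"},
        cast={"EmployeeID": int},
    )
)
```

In this sample the ‘Address’ and ‘space’ fields are dropped, ‘EmployeeID’ becomes an integer, and ‘FullName’ is renamed; the rename is shown only to illustrate the option and is not required by the assignment.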

Step 12: From the ‘Output’ group in the Design pane of Kettle, drag and drop ‘Excel output’ onto the transformation surface.

Create a hop from ‘Select values’ to ‘Excel output’. Double-click on ‘Excel output’ to open its properties dialog box.

Click on the ‘Browse’ button.

Step 13: Select the folder where you want to save the destination Excel workbook. Specify the name of the file, and click on ‘Save’.

Step 14: In the ‘Content’ tab, specify the sheet name as ‘Employee’.

Step 15: In the ‘Fields’ tab, click on the ‘Get Fields’ button to fetch the fields that have to be included in the ‘Employee’ worksheet. Specify ‘#’ as the format for integer fields. Once done, click on ‘OK’.

Step 16: Your transformation is now complete and ready to be executed. Run the transformation by clicking on the green triangular button, and then clicking on the ‘Launch’ button.

After execution, the destination Excel sheet looks like this:

Assignment 4: Creating an ODBC data source

Step 1: Click on Start -> Control Panel -> Administrative Tools -> Data Sources (ODBC). In the ‘ODBC Data Source Administrator’ dialog box, select the ‘User DSN’ tab and click on ‘Add’.

Step 2: Select ‘Microsoft Access Driver (*.mdb, *.accdb)’ and click on ‘Finish’.

Step 3: Specify the data source name and description, and then click on ‘Select’ to select the Access database to be used.

Step 4: Select ‘Northwind.accdb’ from its location and click on ‘OK’.

Step 5: Click on ‘OK’ again.

Step 6: Click on ‘OK’ again.

The ODBC data source has now been created.
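
If you want to confirm outside Kettle that the new data source is reachable, one option is a short Python check. The sketch below assumes the third-party pyodbc package is installed and that the data source was named ‘Northwind’ in Step 3; the guide does not fix that name, so substitute your own.

```python
def dsn_connection_string(dsn_name):
    """Build an ODBC connection string that refers to a configured DSN by name."""
    return f"DSN={dsn_name}"


def list_tables(dsn_name):
    """Connect through the DSN and return the names of its user tables.

    Requires the third-party pyodbc package and a working ODBC driver
    for the data source (here, the Microsoft Access driver on Windows).
    """
    import pyodbc  # imported lazily so the helper above works without it

    conn = pyodbc.connect(dsn_connection_string(dsn_name))
    try:
        cursor = conn.cursor()
        return [row.table_name for row in cursor.tables(tableType="TABLE")]
    finally:
        conn.close()
```

Calling list_tables("Northwind") on the machine where the DSN was created should return the table names of the Access database if the data source is set up correctly.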

Assignment 5: Using the ‘Database Lookup’ step

Learning Objective: To learn how to look up values from a referenced table using key-value pairs, and include the value field(s) in the data flow.

Requirements:

i. The ‘OrderDetails’ sheet from the Excel workbook ‘Northwind’ contains product-wise data about orders. Replace the ‘ProductID’ field with the ‘ProductName’ and populate the data into a table named ‘OrderDetails’ in the Northwind.accdb Access database.

Step 1: Create a new transformation file, and save it as ‘OrderDetails’.

Step 2: Drag and drop an ‘Excel input’ onto the transformation surface. Edit the properties of the Excel input.

i. Select the data source as ‘Northwind.xls’.

ii. Select the source sheet as ‘orderdetails’.

iii. Click on ‘Get fields from header row’ to fetch the fields for the data flow. Click on ‘OK’ once done.

Step 3: Drag and drop ‘Database lookup’ onto the transformation surface. Create a hop from ‘Excel input’ to ‘Database lookup’.

Step 4: Double-click on ‘Database lookup’ to open its properties dialog box. To create a new connection to the ‘Northwind.accdb’ Access database, which contains the ‘Products’ table, click on ‘New’.

Step 5: Give the connection a name. Select the connection type as ‘MS Access’. Specify the name of the ODBC data source that points to the Northwind.accdb database. Click on ‘Test’ to test the connection.

If the connection is successful, a confirmation message is displayed. Click on ‘OK’.

Step 6: Click on ‘Browse’ to select the lookup table.

Step 7: Select the ‘Products’ table as the table to be looked up for value fields.

Step 8: To equate the key values between the source table and the lookup table, specify ‘Table field’ as ‘ProductID’, ‘Comparator’ as ‘=’ and ‘Field1’ as ‘ProductID’. Under ‘Values to return from the lookup table’, select ‘ProductName’.
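
The key comparison in Step 8 behaves like a per-row query of the form "select ProductName from Products where ProductID = ?". The Python sketch below illustrates this, with an in-memory SQLite table standing in for the Access ‘Products’ table. The function name and the sample rows are assumptions, and returning None for unmatched keys is a choice made for this sketch.

```python
import sqlite3


def database_lookup(rows, conn, table, table_field, stream_field, return_field):
    """Add return_field to each row by matching table_field = row[stream_field].

    Rows without a match get None; this sketch has no default-value option.
    """
    sql = f'SELECT "{return_field}" FROM "{table}" WHERE "{table_field}" = ?'
    for row in rows:
        match = conn.execute(sql, (row[stream_field],)).fetchone()
        yield {**row, return_field: match[0] if match else None}


# Stand-in lookup table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?)",
    [(11, "Queso Cabrales"), (42, "Singaporean Hokkien Fried Mee")],
)

order_details = [
    {"OrderID": 10248, "ProductID": 11, "Quantity": 12},
    {"OrderID": 10248, "ProductID": 99, "Quantity": 1},  # no matching product
]
looked_up = list(
    database_lookup(
        order_details, conn, "Products", "ProductID", "ProductID", "ProductName"
    )
)
conn.close()
```

After the lookup, each row carries both ‘ProductID’ and ‘ProductName’, so the key field still has to be removed if only the name should reach the destination.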

Step 9:

i. Drag and drop ‘Select values’ onto the transformation surface. Create a hop from ‘Database lookup’ to ‘Select values’.

ii. In the ‘Remove’ tab, select the field ‘ProductID’ to be removed.

iii. In the ‘Metadata’ tab, specify the data types of the fields that are included in the data flow.

Step 10: Drag and drop ‘Access output’ onto the transformation surface. Create a hop from ‘Select values’ to ‘Access output’.

i. Specify the database as the existing ‘Northwind.accdb’ database.

ii. Give the table name as ‘OrderDetails’.

iii. Click on ‘OK’.

Step 11: Your transformation is now complete and ready to be executed. Run the transformation by clicking on the green triangular button, and then clicking on the ‘Launch’ button.

After execution, the destination table looks like this:

<EOF>