
Installing and using Pentaho Kettle to transfer data to an LOVD database

Getting Started

From the Pentaho Kettle home page, follow the release notes for Kettle version 4.3. Download the software from SourceForge (PDI 4.3.0-stable) and unzip it.

Go to pdi-ce-4.3.0-stable/data-integration/. Spoon is the graphical interface where you can create and connect the processes. Open it by double clicking Spoon.bat (on Windows) or running spoon.sh (on Linux).

At this point we recommend following the tutorial (http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+%28Kettle%29+Tutorial). Although it doesn’t feature any of the processes that will be used, it is worth using it as a starting point to familiarise yourself with the software.

As covered in the tutorial, it is worth setting up environment variables that Kettle will recognise for directories. Create the directories ‘files’ and ‘scripts’ in the data-integration directory. The Kettle properties file will need amending to point to these directories. In Windows, find this file (in this example it is in a personal user directory, C:\Documents and Settings\user\.kettle\kettle.properties) and add the following lines to the bottom (with the appropriate paths for your directories):

FILES=C:/Documents and Settings/user.name/Desktop/pdi-ce-4.3.0-stable/data-integration/files
SCRIPTS=C:/Documents and Settings/user.name/Desktop/pdi-ce-4.3.0-stable/data-integration/scripts

On Linux, the file to modify is ~/.kettle/kettle.properties, with the appropriate paths to the directories.
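On Linux the same setup can be scripted; a minimal sketch, assuming the archive was unpacked in the home directory (all paths here are assumptions to adapt):

cd ~/pdi-ce-4.3.0-stable/data-integration
mkdir -p files scripts            # create the two working directories
mkdir -p ~/.kettle                # ensure the properties directory exists
cat >> ~/.kettle/kettle.properties <<'EOF'
FILES=/home/user/pdi-ce-4.3.0-stable/data-integration/files
SCRIPTS=/home/user/pdi-ce-4.3.0-stable/data-integration/scripts
EOF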

To use the connection to MySQL you will also need to download the MySQL Java connector and put it in the correct place. Download it from http://dev.mysql.com/downloads/connector/j/, unzip it and put the jar file in data-integration/libext/JDBC/.
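On Linux this can be done from the command line; a minimal sketch (the connector version number is an assumption – substitute whichever version you downloaded):

cd ~/pdi-ce-4.3.0-stable/data-integration
unzip ~/Downloads/mysql-connector-java-5.1.22.zip -d /tmp
cp /tmp/mysql-connector-java-5.1.22/mysql-connector-java-5.1.22-bin.jar libext/JDBC/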

Example – Connecting to a spreadsheet

Starting with a new transformation, select ‘Input->Microsoft Excel Input’ by dragging it from the steps panel on the left of the screen into the main panel, then double click the icon. Fill in the ‘File or directory’ field with the path and .xls filename, and press ‘Add’ to select this file. Select the ‘Fields’ tab and press ‘Get fields from header row...’ to populate with the fields from the spreadsheet.


Press ‘OK’ to close this window.

Scenario 1 – Connecting to a database and extracting data

First you need to establish a database connection. Click on the ‘View’ tab at the top of the left panel and double click ‘Database connections’ to configure the connection.

To connect to a MySQL database, select it from the list and fill in the settings fields on the right.

You will need to enter the database name, username and password. With the connection now established, select Input->Table input from the left hand panel and drag it into the main screen. Double click the icon and select the connection you have just made from the drop down list. Fill in the SQL query to pull out the data from the database table (e.g. SELECT * FROM lab_database_table).
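It can be worth checking the query outside Kettle first; a minimal sketch using the mysql command line client (the host, username and database name here are assumptions):

mysql -h localhost -u lab_user -p lab_database -e 'SELECT * FROM lab_database_table LIMIT 5'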


Putting information from the database into a spreadsheet (for example)

Following the database connection above, we can send this data to a spreadsheet.

Drag and drop the Output->Microsoft Excel Output icon from the left panel into the main screen. Connect the ‘Table input’ icon to the ‘Excel output’ icon (left click on it, hold shift and drag the arrow to the output icon). Double click the Excel output icon. Select the spreadsheet filename and optionally customise the headers and fields that are written to the spreadsheet by clicking on the relevant tabs.

The transformation should then be saved, creating a .ktr file.

Running the processes

The process can be run through the Spoon interface or, alternatively, on the command line using pan.sh. You will need to make it executable before running it:

chmod 755 pan.sh

To run the example in scenario 1:

./pan.sh -file ./scripts/scenario1.ktr -norep

If everything worked, you should have the .xls file in the ‘files’ directory.
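For unattended runs it can help to wrap this in a small shell script that checks Pan’s exit code (zero means success); a minimal sketch, with the installation path as an assumption:

#!/bin/sh
# Run the scenario 1 transformation and fail loudly if Pan reports an error
cd ~/pdi-ce-4.3.0-stable/data-integration
./pan.sh -file ./scripts/scenario1.ktr -norep
if [ $? -ne 0 ]; then
    echo "Kettle transformation failed" >&2
    exit 1
fi
echo "Transformation complete - output written to ./files"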

Scenario 2 – Automatic variant submission to an LOVD database – Kabuki example

The Manchester laboratory has a genetic service for Kabuki syndrome – identifying mutations in the MLL2 gene. We used this data, stored in an Excel spreadsheet, as a way of testing the Kettle software for automatic LOVD submission.

Local LOVD installation

The NGRL have LOVD2 installed on a local server, so we created a new gene (MLL2) by logging on to LOVD through the online interface.

The relevant tables for automated variant submissions are lovd_MLL2_variants, lovd_patients and lovd_patients2variants. The basic idea behind automated variant submission is that Kettle connects to this database and enters the appropriate data in the relevant tables and fields.
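To see exactly which fields these tables expect, their definitions can be inspected directly; a minimal sketch using the mysql client (the username is an assumption; the database and table names are as above):

mysql -u lovd_user -p lovd2 -e 'DESCRIBE lovd_MLL2_variants; DESCRIBE lovd_patients; DESCRIBE lovd_patients2variants'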

When you upload a file to LOVD in the conventional way you can leave many compulsory fields empty, as the underlying LOVD scripts will populate them automatically. Many of these fields are crucial for the database interface to function correctly, and so replicating the action of these scripts to automatically fill in the empty fields needs to be carried out programmatically. A Perl script forms part of the process and computes the appropriate values for entry to the database.

Null versus empty string value issue

Many of the fields in the LOVD database that we are populating do not allow NULL values; this essentially means that you are required to put something in them. I uploaded some basic variant data (DNA mutation, protein mutation etc.) through the LOVD web interface and looked at how this was reflected in the database tables where there was no data for that field. Where no data are available the fields are blank (contain empty string values), and this is ideally what we would replicate. Unfortunately the current version of Kettle (v4.3) treats empty string values as NULL, and therefore when I attempt to fill these fields with empty strings a NULL is inserted and the process fails, reporting the error. The Kettle documentation states that this issue will be addressed in version 5 (available early 2013), but in the meantime we have developed a workaround.
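To illustrate the problem (this is not the workaround used in our workflow), the unwanted NULLs could in principle be converted back to empty strings after a load; a sketch with a hypothetical column name:

mysql -u lovd_user -p lovd2 -e 'UPDATE lovd_MLL2_variants SET `Variant/Remarks` = "" WHERE `Variant/Remarks` IS NULL'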


Kettle processes

We have created a workflow for this scenario. The process is controlled by execute.kjb – a Kettle job that runs different transformations in order to carry out the task.

Breakdown of the steps:

1. START – this just starts the process and is a requirement of all jobs.
2. set_parameters – this transformation takes in some input from the user – the name of the spreadsheet and the gene name (e.g. ‘MLL2’).
3. excel2txt (excel2txt.ktr) – this transformation reads in the spreadsheet and converts it into a text format.
4. Check if lab data exists – this is a simple error check that aborts the job if no lab data is found.

The next 3 steps retrieve existing IDs from the database to ensure that the new IDs don’t conflict and that no duplicate entries are made (see the sketch after this list).

5. get variant IDs (retrieve_ids.ktr) – this transformation reads the existing variantids from the lovd2.lovd_MLL2_variants table. This is then used to ensure that the variantids aren’t duplicated.
6. get patient IDs (retrieve_ids2.ktr) – this transformation reads existing patientids from the lovd2.lovd_patients2variants table.
7. get patients2variants IDs (retrieve_ids2.ktr) – this transformation reads existing patientids and variantids from the lovd2.lovd_patients2variants table.
8. Check if current ID file exists – self explanatory.
9. run perl script… (create_DBIDs.pl) – this is a shell command that runs a Perl script to generate the appropriate information for each field in each table (as above) and print out all of the information to a text file. It fills in many of the compulsory fields (date created etc.) and ensures that patient IDs and variants aren’t duplicated in the database.
10. load_LOVD_tables (lovd_loader.ktr) – this transformation reads in the text file and outputs the data to the lovd_MLL2_variants and lovd_patients2variants database tables.
11. load lovd_patients (lovd_loader2.ktr) – this transformation loads the lovd_patients data – this is separate from the others as patient IDs can have multiple variants.
12. tidy – this is a simple script to remove any intermediate files that have been generated, to keep things tidy.
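The ID bookkeeping in steps 5–7 can also be sanity checked by hand; a minimal sketch that reports the current maximum IDs, so that new entries can be numbered above them (the username is an assumption; column names follow the tables above):

mysql -u lovd_user -p lovd2 -e 'SELECT MAX(variantid) FROM lovd_MLL2_variants; SELECT MAX(patientid) FROM lovd_patients'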

Points to consider:

- The parameters in each transformation and the location of files would have to be adapted for each laboratory setting.
- Where variants are stored in a database, step 3 would need replacing.

To run this Kettle job from the command line you use the ‘kitchen.sh’ script, making sure that all of the transformations and scripts are in the ./scripts directory and that the spreadsheet is where you point to in the transformations (these can be modified to point to different locations).

Argument1 = ‘kabuki_spreadsheet’ – the name of the spreadsheet
Argument2 = ‘MLL2’ – the gene name

./kitchen.sh -file ./scripts/execute.kjb kabuki_spreadsheet MLL2 -norep
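Once the paths are fixed, the whole job can be scheduled to run unattended; a sketch of a crontab entry for a nightly run (the installation path and log file are assumptions):

0 2 * * * cd /home/user/pdi-ce-4.3.0-stable/data-integration && ./kitchen.sh -file ./scripts/execute.kjb kabuki_spreadsheet MLL2 -norep >> /home/user/lovd_load.log 2>&1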

We provide this Kettle job and the individual transformations for modification, as well as the Perl script for preventing redundant entries and filling in necessary fields.