Pentaho Data Integration January, 2014 Alex Rayón Jerez [email protected] DeustoTech Learning – Deusto Institute of Technology – University of Deusto Avda. Universidades 24, 48007 Bilbao, Spain www.deusto.es

Kettle: Pentaho Data Integration tool


DESCRIPTION

Pentaho Data Integration prepares and blends data from any source for analytics, thus enabling data-driven decision making. Applied to education, especially academic and learning analytics.


Page 2: Kettle: Pentaho Data Integration tool

Before starting…

Who has used a relational database?

Source: http://www.agiledata.org/essays/databaseTesting.html

Page 3: Kettle: Pentaho Data Integration tool

Before starting… (II)

Who has written scripts or Java code to move data from one source and load it into another?

Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code

Page 4: Kettle: Pentaho Data Integration tool

Before starting… (III)

What did you use?
1. Scripts
2. Custom Java code
3. ETL

Page 5: Kettle: Pentaho Data Integration tool

Table of Contents

● Pentaho at a glance
● In the academic field
● ETL
● Kettle
● Big Data
● Predictive Analytics


Page 7: Kettle: Pentaho Data Integration tool

Pentaho at a glance

Business Intelligence

Page 8: Kettle: Pentaho Data Integration tool

Pentaho at a glance (II)

Page 9: Kettle: Pentaho Data Integration tool

Pentaho at a glance (III)

● Business Intelligence & Analytics
● Open Core
  ○ GPL v2
  ○ Apache 2.0
  ○ Enterprise and OEM licenses
● Java-based
● Web front-ends

Page 10: Kettle: Pentaho Data Integration tool

Pentaho at a glance (IV)

● The Pentaho Stack
  ○ Data Integration / ETL
  ○ Big Data / NoSQL
  ○ Data Modeling
  ○ Reporting
  ○ OLAP / Analysis
  ○ Data Visualization
  ○ Dashboarding
  ○ Data Mining / Predictive Analysis
  ○ Scheduling

Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/

Page 11: Kettle: Pentaho Data Integration tool

Pentaho at a glance (V)

● Modules
  ○ Pentaho Data Integration
    ■ Kettle
  ○ Pentaho Analysis
    ■ Mondrian
  ○ Pentaho Reporting
  ○ Pentaho Dashboards
  ○ Pentaho Data Mining
    ■ WEKA

Page 12: Kettle: Pentaho Data Integration tool

Pentaho at a glance (VI)

● Figures
  ○ 10,000+ deployments
  ○ 185+ countries
  ○ 1,200+ customers
  ○ Since 2012, in the Gartner Magic Quadrant for BI Platforms
  ○ 1 download every 30 seconds

Page 13: Kettle: Pentaho Data Integration tool

Pentaho at a glance (VII)

● Open Source Leader

Page 14: Kettle: Pentaho Data Integration tool

Pentaho at a glance (VIII)

Single Platform


Page 23: Kettle: Pentaho Data Integration tool

ETL: Definition and characteristics

● An ETL tool is a tool that
  ○ Extracts data from various data sources (usually legacy data)
  ○ Transforms data
    ■ from being optimized for transactions
    ■ to being optimized for reporting and analysis
    ■ synchronizes the data coming from different databases
    ■ cleanses the data to remove errors
  ○ Loads data into a data warehouse
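The three stages above can be sketched in a few lines of plain Python. This is only an illustration of the extract, transform and load idea, not how Kettle or any other ETL tool implements it. The table names (`orders`, `sales_by_customer`), the sample rows and the cleansing rules are all assumptions made up for the example.

```python
import sqlite3

def extract(source):
    """Extract: read raw rows from the (hypothetical) transactional source."""
    return source.execute("SELECT customer, amount FROM orders").fetchall()

def transform(rows):
    """Transform: cleanse bad rows and reshape for reporting (totals per customer)."""
    totals = {}
    for customer, amount in rows:
        if customer is None or amount is None or amount < 0:
            continue  # data cleansing: drop rows with missing or invalid values
        name = customer.strip().lower()  # harmonize spellings across sources
        totals[name] = totals.get(name, 0.0) + amount
    return sorted(totals.items())

def load(warehouse, records):
    """Load: write the reporting-friendly rows into the warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_customer "
        "(customer TEXT PRIMARY KEY, total REAL)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO sales_by_customer VALUES (?, ?)", records
    )
    warehouse.commit()

def run_demo():
    """Build an in-memory source, run the three stages, return the warehouse rows."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Ana", 10.0), (" ana ", 5.0), ("Bob", 7.5), (None, 3.0), ("Bob", -1.0)],
    )
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(source)))
    return warehouse.execute(
        "SELECT customer, total FROM sales_by_customer ORDER BY customer"
    ).fetchall()
```

Calling `run_demo()` merges the two spellings of the same customer and drops the row without a customer and the row with a negative amount. Hand-coded pipelines like this one are what an ETL tool aims to replace with reusable, configurable steps.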

Page 24: Kettle: Pentaho Data Integration tool

ETL: Why do I need it?

● ETL tools save time and money when developing a data warehouse by removing the need for hand-coding

● It is very difficult for database administrators to connect different brands of databases without using an external tool

● When databases are altered or new databases need to be integrated, a lot of hand-coded work has to be completely redone

Page 25: Kettle: Pentaho Data Integration tool

ETL: Business Intelligence

● ETL is the heart and soul of business intelligence (BI)
  ○ ETL processes bring together and combine data from multiple source systems into a data warehouse

Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html

Page 26: Kettle: Pentaho Data Integration tool

ETL: Business Intelligence (II)

According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project

Source: http://www.dwuser.com/news/tag/optimization/

Source: The Data Warehousing Institute. www.dw-institute.com

Page 27: Kettle: Pentaho Data Integration tool

ETL: Processing framework

Source: The Data Warehousing Institute. www.dw-institute.com

Page 28: Kettle: Pentaho Data Integration tool

ETL: Tools

Source: http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01

Page 29: Kettle: Pentaho Data Integration tool

ETL: Open Source tools

● CloverETL
● KETL
● Kettle
● Talend

Page 30: Kettle: Pentaho Data Integration tool

ETL: CloverETL

● Provides a basic archive of functions for mapping and transformations, allowing companies to move large amounts of data as quickly and efficiently as possible

● Uses building blocks called components to create a transformation graph, which is a visual depiction of the intended data processing

Page 31: Kettle: Pentaho Data Integration tool

ETL: CloverETL (II)

● The graphic presentation simplifies even complex data transformations, allowing for drag-and-drop functionality

● Limited to approximately 40 different components to simplify graph creation
  ○ Yet you may configure each component to meet specific needs

● It also features extensive debugging capabilities to ensure all transformation graphs work precisely as intended

Page 32: Kettle: Pentaho Data Integration tool

ETL: KETL

● Contains a scalable, platform-independent engine capable of supporting multiple computers and 64-bit servers

● The program also offers performance monitoring, extensive data source support, XML compatibility and a scheduling engine for time-based and event-driven job execution

Page 33: Kettle: Pentaho Data Integration tool

ETL: Kettle

● The Pentaho company produced Kettle as an open-source alternative to commercial ETL software
  ○ No relation to Kinetic Networks' KETL
● Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs
● Its XML Input Stream handles huge XML files without suffering a loss in performance or a spike in memory usage
● Users can also upgrade the free Kettle version for optional paid features and dedicated technical support

Page 34: Kettle: Pentaho Data Integration tool

ETL: Talend

● Provides a graphical environment for data integration, migration and synchronization

● Drag and drop graphic components to create the Java code required to execute the desired task, saving time and effort

● Pre-built connectors to enable compatibility with a wide range of business systems and databases

● Users gain real-time access to corporate data, allowing for the monitoring and debugging of transactions to ensure smooth data integration

Page 35: Kettle: Pentaho Data Integration tool

ETL: Comparison

● The criteria used for the ETL tools comparison were divided into nine categories:
  ○ TCO
  ○ Risk
  ○ Ease of use
  ○ Support
  ○ Deployment
  ○ Speed
  ○ Data Quality
  ○ Monitoring
  ○ Connectivity

Page 37: Kettle: Pentaho Data Integration tool

ETL: Comparison (III)

● Total Cost of Ownership (TCO)
  ○ The overall cost of a given product
  ○ This can include initial ordering, licensing, servicing, support, training, consulting, and any other payments that need to be made before the product is in full use
  ○ Commercial Open Source products are typically free to use, but the support, training and consulting are what companies need to pay for

Page 38: Kettle: Pentaho Data Integration tool

ETL: Comparison (IV)

● Risk
  ○ There are always risks with projects, especially big projects
  ○ The risks of a project failing are:
    ■ Going over budget
    ■ Going over schedule
    ■ Not meeting the requirements or expectations of the customers
  ○ Open Source products carry much lower risk than commercial ones, since they do not restrict the use of their products through pricey licenses

Page 39: Kettle: Pentaho Data Integration tool

ETL: Comparison (V)

● Ease of use
  ○ All of the ETL tools, apart from Inaport, have a GUI to simplify the development process
  ○ Having a good GUI also reduces the time needed to learn and use the tools
  ○ Of all the tools, Pentaho Kettle has a particularly easy-to-use GUI
    ■ Training can also be found online or within the community

Page 40: Kettle: Pentaho Data Integration tool

ETL: Comparison (VI)

● Support
  ○ Nowadays, all software products have support, and all of the ETL tool providers offer it
  ○ Pentaho Kettle offers support from the US and the UK, and has a partner consultant in Hong Kong
● Deployment
  ○ Pentaho Kettle is a stand-alone Java engine that can run on any machine that can run Java. It needs an external scheduler to run automatically
  ○ It can be deployed on many different machines, which are then used as “slave servers” to help with transformation processing
  ○ Recommended: one 1 GHz CPU and 512 MB of RAM

Page 41: Kettle: Pentaho Data Integration tool

ETL: Comparison (VII)

● Speed
  ○ The speed of ETL tools depends largely on the data that needs to be transferred over the network and the processing power involved in transforming the data
  ○ Pentaho Kettle is faster than Talend, but the Java connector slows it down somewhat. Like Talend, it also requires manual tweaking. It can be clustered by being placed on many machines to reduce network traffic

Page 42: Kettle: Pentaho Data Integration tool

ETL: Comparison (VIII)

● Data Quality
  ○ Data Quality (DQ) is fast becoming the most important feature in any data integration tool
  ○ Pentaho has DQ features in its GUI and allows for customized checks using SQL statements, JavaScript and regular expressions. It also has some additional modules after subscribing
● Monitoring
  ○ Pentaho Kettle has practical monitoring tools and logging
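As a rough idea of what a regular-expression data quality check looks like, here is a minimal Python sketch. It does not use Kettle's own validation steps. The field names (`email`, `enrolled_on`) and both patterns are assumptions chosen for the example, and the patterns are deliberately simple.

```python
import re

# Hypothetical rules: one deliberately simple pattern per field.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # ISO-style date, format only

RULES = {"email": EMAIL_RE, "enrolled_on": DATE_RE}

def validate(record):
    """Return the names of the fields that violate their rule."""
    return [
        field
        for field, pattern in RULES.items()
        if not pattern.match(str(record.get(field, "")))
    ]

def split_stream(records):
    """Route each record to the valid or the rejected output, keeping the reasons."""
    valid, rejected = [], []
    for record in records:
        errors = validate(record)
        (rejected if errors else valid).append((record, errors))
    return valid, rejected
```

Keeping the rejected rows together with the list of failed fields, rather than silently dropping them, is one way to feed the monitoring and logging side of an ETL process.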

Page 43: Kettle: Pentaho Data Integration tool

ETL: Comparison (IX)

● Connectivity
  ○ In most cases, ETL tools transfer data from legacy systems
  ○ Their connectivity is very important to the usefulness of the ETL tools
  ○ Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services


Page 45: Kettle: Pentaho Data Integration tool

Kettle: Introduction

Project Kettle

Powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach

Page 46: Kettle: Pentaho Data Integration tool

Kettle: Introduction (II)

● What is Kettle?
  ○ Batch data integration and processing tool written in Java
  ○ Exists to retrieve, process and load data
  ○ PDI is a synonymous term

Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230

Page 47: Kettle: Pentaho Data Integration tool

Kettle: Introduction (III)

● It uses an innovative metadata-driven approach
● It has a very easy-to-use GUI
● Strong community of 13,500 registered users
● It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files

Page 48: Kettle: Pentaho Data Integration tool

Kettle: Introduction (IV)

Page 49: Kettle: Pentaho Data Integration tool

Kettle: Data Integration Platform

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

Page 50: Kettle: Pentaho Data Integration tool

Kettle: Architecture

Source: Pentaho Corporation

Page 51: Kettle: Pentaho Data Integration tool

Kettle: Most common uses

● Data warehouse and data mart loads
● Data integration
● Data cleansing
● Data migration
● Data export
● etc.

Page 52: Kettle: Pentaho Data Integration tool

Kettle: Data Integration

● Changing input to desired output
● Jobs
  ○ Synchronous workflow of job entries (tasks)
● Transformations
  ○ Stepwise parallel & asynchronous processing of a record stream
● Distributed
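The difference between the two execution models can be sketched in plain Python. This is a conceptual illustration only and does not use the Kettle engine or its API. In it, a transformation starts all of its steps at once and lets rows stream through buffers between them, while a job runs its entries strictly one after another. All function names are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the record stream

def run_step(func, inbox, outbox):
    """One transformation step: process rows as they arrive and pass them on."""
    while True:
        row = inbox.get()
        if row is DONE:
            outbox.put(DONE)
            return
        result = func(row)
        if result is not None:  # returning None filters the row out
            outbox.put(result)

def run_transformation(rows, step_funcs):
    """Start every step at once; rows stream through bounded buffers between steps."""
    queues = [queue.Queue(maxsize=100) for _ in range(len(step_funcs) + 1)]

    def feed():
        for row in rows:
            queues[0].put(row)
        queues[0].put(DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=run_step, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(step_funcs)
    ]
    for thread in threads:
        thread.start()
    output = []
    while True:  # drain the last buffer while the steps are still running
        row = queues[-1].get()
        if row is DONE:
            break
        output.append(row)
    for thread in threads:
        thread.join()
    return output

def run_job(entries):
    """Run job entries strictly one after another; stop at the first failure."""
    for name, entry in entries:
        if not entry():
            return f"failed at {name}"
    return "success"
```

For example, `run_transformation(range(5), [lambda x: x * 2, lambda x: x if x % 4 == 0 else None])` doubles each row in one step and filters in the next, with both steps active at the same time. `run_job` only moves on to the next entry once the previous one has reported success.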

Page 53: Kettle: Pentaho Data Integration tool

Kettle: Data Integration challenges

● Data is everywhere
● Data is inconsistent
  ○ Records are different in each system
● Performance issues
  ○ Running queries that summarize data over a long stipulated period ties up the operating system with that task
  ○ Brings the OS to maximum load
● Data is never all in the Data Warehouse
  ○ Excel sheet, acquisition, new application

DeustoTech-Learning 2013/2014 - January 9, 2014

Page 54: Kettle: Pentaho Data Integration tool

Kettle: Transformations

● String and Date Manipulation
● Data Validation / Business Rules
● Lookup / Join
● Calculation, Statistics
● Cryptography
● Decisions, Flow control
● Scripting
● etc.

Page 55: Kettle: Pentaho Data Integration tool

Kettle: What is it good for?

● Mirroring data from master to slave
● Syncing two data sources
● Processing data retrieved from multiple sources and pushed to multiple destinations
● Loading data to RDBMS
● Data mart / data warehouse
  ○ Dimension lookup/update step
● Graphical manipulation of data
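The idea behind a dimension lookup/update can be sketched as follows. This is a simplified, in-memory illustration of keeping versioned dimension rows with surrogate keys. It is not the implementation or the configuration of Kettle's Dimension lookup/update step. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DimensionRow:
    surrogate_key: int   # technical key used by fact tables
    natural_key: str     # business key coming from the source system
    attributes: dict
    version: int
    current: bool = True

@dataclass
class Dimension:
    rows: list = field(default_factory=list)

    def lookup_or_update(self, natural_key, attributes):
        """Return the surrogate key of the current version, creating one if needed."""
        current = next(
            (r for r in self.rows if r.natural_key == natural_key and r.current),
            None,
        )
        if current is not None and current.attributes == attributes:
            return current.surrogate_key  # unchanged: plain lookup
        version = 1
        if current is not None:
            current.current = False       # changed: retire the old version
            version = current.version + 1
        new_row = DimensionRow(len(self.rows) + 1, natural_key, dict(attributes), version)
        self.rows.append(new_row)
        return new_row.surrogate_key
```

Looking up the same natural key with unchanged attributes returns the existing surrogate key. A changed attribute closes the old row and adds a new version, so facts loaded earlier keep pointing at the values that were valid at the time.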


Page 56: Kettle: Pentaho Data Integration tool

Kettle: Alternatives

● Code
  ○ Custom Java
  ○ Spring Batch
● Scripts
  ○ Perl, Python, shell, etc.
  ○ Possibly + db loader tool and cron
● Commercial ETL tools
  ○ Datastage
  ○ Informatica
● Oracle Warehouse Builder
● SQL Server Integration Services

Page 57: Kettle: Pentaho Data Integration tool

Kettle: Extraction

Page 58: Kettle: Pentaho Data Integration tool

Kettle: Extraction (II)

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

Page 59: Kettle: Pentaho Data Integration tool

Kettle: Extraction (III)

● RDBMS (SQL Server, DB2, Oracle, MySQL, PostgreSQL, Sybase IQ, etc.)
● NoSQL Data: HBase, Cassandra, MongoDB
● OLAP (Mondrian, Palo, XML/A)
● Web (REST, SOAP, XML, JSON)
● Files (CSV, Fixed, Excel, etc.)
● ERP (SAP, Salesforce, OpenERP)
● Hadoop Data: HDFS, Hive
● Web Data: Twitter, Facebook, Log Files, Web Logs
● Others: LDAP/Active Directory, Google Analytics, etc.

Page 60: Kettle: Pentaho Data Integration tool

Kettle: Transportation

Page 61: Kettle: Pentaho Data Integration tool

Kettle: Transformation

Page 62: Kettle: Pentaho Data Integration tool

Kettle: Loading

Page 63: Kettle: Pentaho Data Integration tool

Kettle: Environment

Page 64: Kettle: Pentaho Data Integration tool

Kettle: Comparison of Data Integration tools


Page 66: Kettle: Pentaho Data Integration tool

Big Data: Business Intelligence

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

A brief (BI) history…

Page 67: Kettle: Pentaho Data Integration tool

Big Data: WEKA

Project Weka

A comprehensive set of tools for Machine Learning and Data Mining

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

Page 68: Kettle: Pentaho Data Integration tool

Big Data: Among Pentaho’s products

Mondrian: OLAP server written in Java

Kettle: ETL tool

Weka: Machine learning and Data Mining tool

Page 69: Kettle: Pentaho Data Integration tool

Big Data: WEKA platform

● WEKA (Waikato Environment for Knowledge Analysis)
● Funded by the New Zealand Government (for more than 10 years)
  ○ Develop an open-source state-of-the-art workbench of data mining tools
  ○ Explore fielded applications
  ○ Develop new fundamental methods
● Became part of the Pentaho platform in 2006 (PDM - Pentaho Data Mining)

Page 70: Kettle: Pentaho Data Integration tool

Big Data: Data Mining with WEKA

● (One-of-the-many) Definition: Extraction of implicit, previously unknown, and potentially useful information from data
● Goal: improve marketing, sales, and customer support operations, risk assessment, etc.
  ○ Who is likely to remain a loyal customer?
  ○ What products should be marketed to which prospects?
  ○ What determines whether a person will respond to a certain offer?
  ○ How can I detect potential fraud?

Page 71: Kettle: Pentaho Data Integration tool

Big Data: Data Mining with WEKA (II)

Central idea: historical data contains information that will be useful in the future (patterns → generalizations)

Data Mining employs a set of algorithms that automatically detect patterns and regularities in data

Page 72: Kettle: Pentaho Data Integration tool

Big Data: Data Mining with WEKA (III)

● A bank’s case as an example
  ○ Problem: Prediction (Probability Score) of a Corporate Customer Delinquency (or default) in the next year
  ○ Customer historical data used include:
    ■ Customer footings behavior (assets & liabilities)
    ■ Customer delinquencies (rates and time data)
    ■ Business Sector behavioral data

Page 73: Kettle: Pentaho Data Integration tool

Big Data: Data Mining with WEKA (IV)

● Variable selection using the Information Value (IV) criterion

● Automatic binning of continuous variables was used (Chi-merge). Manual corrections were made, again using IV, to address particularities in the data distribution of some variables
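The slides do not spell out the Information Value formula. A common definition in credit scoring, assumed here, sums over the bins of a variable the difference between the share of non-defaulting ("good") and defaulting ("bad") customers in each bin, weighted by the logarithm of their ratio (the weight of evidence). The sketch below follows that definition. The function name, the input layout and the 0.5 correction for empty cells are assumptions made for the example.

```python
import math

def information_value(bins):
    """Information Value of one binned variable.

    bins: list of (good_count, bad_count) tuples, one per bin.
    Assumes there is at least one good and one bad customer overall.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        # Assumption: replace empty cells with 0.5 so the logarithm stays defined.
        good = good or 0.5
        bad = bad or 0.5
        p_good = good / total_good   # share of all goods that fall in this bin
        p_bad = bad / total_bad      # share of all bads that fall in this bin
        woe = math.log(p_good / p_bad)  # weight of evidence of the bin
        iv += (p_good - p_bad) * woe
    return iv
```

A variable whose bins all contain goods and bads in the same proportion gets an IV of zero, while a variable that separates the two groups well gets a larger value. Variables can then be ranked by IV, and the same measure can be recomputed after merging or adjusting bins by hand.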


Page 74: Kettle: Pentaho Data Integration tool

Big Data: Data Mining with WEKA (V)

Page 75: Kettle: Pentaho Data Integration tool

Big Data: Data Mining with WEKA (VI)

Page 76: Kettle: Pentaho Data Integration tool

Big Data: Data Mining with WEKA (VII)

● Limitations
  ○ Traditional algorithms need to have all data in (main) memory
    ■ big datasets are an issue
● Solution
  ○ Incremental schemes
  ○ Stream algorithms
    ■ MOA (Massive Online Analysis)
    ■ http://moa.cs.waikato.ac.nz/
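To make the idea of an incremental scheme concrete, here is a minimal Python sketch of a classifier that learns one instance at a time and keeps only a count and a running mean per class, so the dataset never has to fit in memory. It is an illustration written for this purpose, not code from WEKA or MOA. The class and method names are hypothetical.

```python
class IncrementalCentroidClassifier:
    """Nearest-centroid classifier updated one instance at a time."""

    def __init__(self):
        self.counts = {}  # label -> number of instances seen so far
        self.means = {}   # label -> running mean of the feature vector

    def learn_one(self, features, label):
        """Update the statistics of one class with a single instance."""
        n = self.counts.get(label, 0) + 1
        mean = self.means.get(label, [0.0] * len(features))
        # Running-mean update: mean += (x - mean) / n, no past data needed
        self.means[label] = [m + (x - m) / n for m, x in zip(mean, features)]
        self.counts[label] = n

    def predict_one(self, features):
        """Return the label whose centroid is closest to the instance."""
        def squared_distance(mean):
            return sum((x - m) ** 2 for x, m in zip(features, mean))
        return min(self.means, key=lambda label: squared_distance(self.means[label]))
```

Each call to `learn_one` touches only the statistics of one class, so memory use depends on the number of classes and features rather than on the number of instances in the stream.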



Page 79: Kettle: Pentaho Data Integration tool

Predictive analytics: Unified solution for Big Data Analytics

Page 80: Kettle: Pentaho Data Integration tool

Predictive analytics: Unified solution for Big Data Analytics (II)

Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive data discovery for iPad
● Full analytical power on the go – unique to Pentaho
● Mobile-optimized user interface

Page 81: Kettle: Pentaho Data Integration tool

Predictive analytics: Unified solution for Big Data Analytics (III)

Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive data discovery and development for big data
● Broadens big data access to data analysts
● Removes the need for separate big data visualization tools
● Further improves productivity for big data developers

Page 82: Kettle: Pentaho Data Integration tool

Predictive analytics: Unified solution for Big Data Analytics (IV)

Pentaho Instaview

● Instaview is simple
  ○ Created for data analysts
  ○ Dramatically simplifies ways to access Hadoop and NoSQL data stores
● Instaview is instant & interactive
  ○ Time accelerator – 3 quick steps from data to analytics
  ○ Interact with big data sources – group, sort, aggregate & visualize
● Instaview is big data analytics
  ○ Marketing analysis for weblog data in Hadoop
  ○ Application log analysis for data in MongoDB

Page 83: Kettle: Pentaho Data Integration tool

Predictive analytics: Comparison

Source: http://cdn.oreillystatic.com/en/assets/1/event/100/Using%20R%20and%20Hadoop%20for%20Statistical%20Computation%20at%20Scale%20Presentation.htm#/2

Page 84: Kettle: Pentaho Data Integration tool

References

http://cdn.oreillystatic.com/en/assets/1/event/100/Big%20Data%20Architectural%20Patterns%20Presentation.pdf

http://blog.pentaho.com/tag/strata/

http://www.slideshare.net/mattcasters/pentaho-data-integration-introduction?from_search=2

http://www.slideshare.net/infoaxon/open-source-bi-7640848

http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01

http://www.pentaho.com/Blend-of-the-Week?mkt_tok=3RkMMJWWfF9wsRonuKvNce%2FhmjTEU5z17%2BQoXaO2hokz2EFye%2BLIHETpodcMTcdgPbjYDBceEJhqyQJxPr3DJNAN1dt%2BRhDhCA%3D%3D#Analytics

Page 85: Kettle: Pentaho Data Integration tool

Copyright (c) 2014 University of Deusto

This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the Creative Commons “Attribution-ShareAlike” License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/

Alex Rayón Jerez
January 2014