All rights reserved Javlin 2011 March 2011 Mgr. Jan Ulrych ETL




Organizational Matters

• Introduction to ETL

• More about ETL

• CloverETL intro

• ETL Projects

• Current Trends


Presenter

• Jan Ulrych

› Graduated from Faculty of Mathematics and Physics at Charles University, Prague

› Has worked for Javlin a.s. as an ETL Consultant since 2008

› E-mail: [email protected]

• Professional experience

› DVRA project at DHL IT Services Europe, Prague

› ETL Consultant on various data integration projects

› Pre-sales consultant for CloverETL™ since 2010


Javlin Overview

• Javlin – since 2005

• Javlin is a software developer and services provider

› CloverETL platform

› Javlin services and ETL consulting

› Software development for major clients

• Employees: staff of 60+

› Developers and consultants

› Service, sales, support

› Executive management

• Experienced management team

› Data and ETL software development legacy

› Key industry expertise – finance, health, media, logistics, government

• Office locations

› Prague

› Brno

› Greater Washington DC


Selected Customers


Session 1: Origins & Motivation


Data Warehousing

• “A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.” Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004

• The most visible part is

“querying and analysis”

• The most complex and time consuming part is

“extracts, cleans, conforms, and delivers”


• How complex is it?

› 70-80% of the effort in a BI (DI or DW) project goes into building a reliable ETL process


Getting data into DW

• How to load data into DW?

› Scripts in Linux shell, Perl, Python, …

› sqlldr + SQL

› Hardcoded in Java, C#, C

› In-house built ETL tool

› Off-the-shelf ETL tool

• Aspects to be kept in mind

› Manageability › Maintainability › Transparency › Scalability › Flexibility › Complexity › Auditing › Job restartability › Testing


Session 1: Introduction to ETL


ETL

• How to load data into DW the right way?

› Introduce formal ETL process

• This section covers

› What ETL is

› Motivation

› Where to use ETL

› How to Implement ETL

› Key ETL Aspects


Motivation

• Is ETL an interesting area?

› 70-80% of the effort in a BI (DI or DW) project goes into building a reliable ETL process.

• Let’s have a look at the DW & DI market size

› In 2003, DI was a USD 9.3 billion market

› In 2008, DI was a USD 13 billion market

› By 2010, yearly growth estimated at USD 2.2 billion

› TCO of DI can reach USD 509,600 annually

• The more systems in the world, the more work in Data Integration!


What is ETL?

• ETL = Extract – Transform – Load

• Extract

› Get the data from the source system as efficiently as possible

• Clean

› Perform data cleansing and dimension conforming

• Transform

› Perform calculations on data

• Load

› Load the data into the target storage
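
The four steps above can be illustrated with a minimal sketch. It assumes a small in-memory CSV source and an SQLite target; the sample data, the `sales` table and all function names are hypothetical and are not taken from any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read a file, database or service.
SOURCE_CSV = """customer_id,name,amount
1, Alice ,10.50
2,Bob,not-a-number
3,Carol,7.25
"""


def extract(text):
    """Extract: read raw records from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(records):
    """Clean: trim whitespace and drop records whose amount is not numeric."""
    cleaned = []
    for rec in records:
        rec = {key: value.strip() for key, value in rec.items()}
        try:
            rec["amount"] = float(rec["amount"])
        except ValueError:
            continue  # a real process would route this record to a reject store
        cleaned.append(rec)
    return cleaned


def transform(records):
    """Transform: derive the amount in cents and fix the column order."""
    return [
        (int(rec["customer_id"]), rec["name"], round(rec["amount"] * 100))
        for rec in records
    ]


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id INTEGER, name TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(clean(extract(SOURCE_CSV))), conn)
loaded = conn.execute(
    "SELECT customer_id, name, amount_cents FROM sales ORDER BY customer_id"
).fetchall()
```

Keeping each step in its own function makes it possible to test and restart the steps independently, which is one of the aspects listed earlier in the deck.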


Why is ETL (System) Important?

• Adds value to data

› Removes mistakes and corrects data

› Provides documented measures of confidence in data

› Captures the flow of transactional data

› Adjusts data from multiple sources to be used together (conforming)

› Structures data to be usable by BI tools

› Enables subsequent business / analytical data processing


ETL Disambiguation

• ETL = Extract – Transform – Load

› Not tied specifically to DW anymore

• Process/System

› A complete process including

• Data extraction

• Enforcing DQ and consistency standards

• Conforming data from disparate systems

• Delivering data to target

• People, HW, Documentation, Support, etc.

• Tool

› A piece of software implementing the three (four) E-(C)-T-L steps

› A tool designed specifically to perform data transformations


ETL Data Integration

[Diagram: the ETL process within the data stack]

• Data: DBMS (MS SQL, MySQL, Oracle), XML, flat files, CSV, mainframe

• ETL Process: ETL

• Analytics: BI Tools (SAP, Cognos), KPI, Data Mining

• Presentation: Dashboards, Reports, Portals

› Extracting value from the database

› Integrated data value for applications


ETL Tool: True Data Integration

[Diagram: Source A and Source B (files, databases, message queues, web services) connected through ETL (read, apply logic, write)]

• True Data Integration is agnostic of source or target application

• ETL is a bridge for bi-directional flow


ETL Data Integration Solutions (1)

Data Migration

Process of transferring data between storage types or formats. An automated migration frees up human resources from tedious tasks. Design, extraction, cleansing, load and verification are done for moderate to high complexity jobs.

Data Consolidation

Usually associated with moving data from remote locations to a central location or combining data due to an acquisition or merger.

Data Integration

Process of combining data residing at different sources and providing a unified view. Emerges in both commercial and scientific fields and is the focus of extensive theoretical work. Also referred to as Enterprise Information Integration.


ETL Data Integration Solutions (2)

Master Data Management

Processes and tools to define and manage non-transactional data. Provides for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing data to an organization to ensure consistency and control.

Data Warehouse

Repository of electronically stored data. ETL facilitates populating, reporting and analysis. Includes business intelligence as well as metadata retrieval and management tools.

Data Synchronization

Process of making sure two or more locations contain the same up-to-date files. When a file is added, changed, or deleted at one location, synchronization mirrors the action at the other location.


Where is ETL used?


How to Implement ETL System (1)

Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004


How to Implement ETL System (2)

• Scripting (shell, Perl, Python)

• PL/SQL, sqlldr

• Transformation hardcoded in Java, C#

• Develop a (universal) ETL tool in-house

• Use an off-the-shelf ETL tool


ETL Tool Key Features (1)

• Extract, Load => flexible on interfaces

› Flat files, DBMS, XML data, XLS

› MQ, web services, LDAP

› Semi-structured data (emails, web logs, wiki pages)

› Unstructured data (blogs, documents)

› Extensibility with custom connectors

› Local data, remote data: FTP(S), SFTP, SCP, HTTP(S)

• Clean

› Lookups, Validations, Filters, Translations

• Transform

› Changing data structure, Joins, (De)Normalization, Aggregation, RollUp, Sorting, Partitioning, Data De-duplication

› Ability to call external tools


ETL Tool Key Features (2)

• Performance

› Symmetric Multiprocessing (SMP)

• Pipeline processing

• Multithreaded processing

› Massively Parallel Processing (MPP)

• Clustering

• MapReduce

› Load balancing

• User friendliness

› GUI

› Metadata capture

› Training time

• Development

› Reusable components

› Impact Analysis / Data Lineage

› Documentation
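
To make the performance items above more concrete, here is a rough sketch of partitioned, multithreaded processing using only the Python standard library. The record layout (`customer`, `amount`) and all function names are assumptions made for illustration. Note that CPython threads mainly help with I/O-bound work; CPU-bound transformations would more likely use processes or a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, key, n):
    """Split records into n partitions so that equal keys share a partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(rec[key]) % n].append(rec)
    return parts


def aggregate(part):
    """Sum the amount per customer within one partition."""
    totals = {}
    for rec in part:
        totals[rec["customer"]] = totals.get(rec["customer"], 0) + rec["amount"]
    return totals


def parallel_totals(records, workers=4):
    """Aggregate each partition in its own worker thread and merge the results."""
    parts = partition(records, "customer", workers)
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(aggregate, parts):
            # Safe to merge with update(): each key lives in exactly one partition.
            merged.update(partial)
    return merged


records = [
    {"customer": "acme", "amount": 10},
    {"customer": "globex", "amount": 5},
    {"customer": "acme", "amount": 20},
]
totals = parallel_totals(records)
```

Partitioning by key is what allows the partial results to be merged without any cross-partition reconciliation.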


ETL Tool Key Features (3)

• Manageability

› Team collaboration

› Transformation repository

› Metadata repository

› Development process (Dev -> Test -> Prod)

› Security

• Runtime

› Scheduler Automation

› Recovery and Restart

› Workflow

• Others

› Vendor stability

› Release cycle

› Support


ETL Market

Source: Ted Friedman, Mark A. Beyer, Eric Thoo: Magic Quadrant for Data Integration Tools; Gartner RAS Core Research Note G00207435; 19 November 2010


Well Known ETL Tools

• Commercial

› Ab Initio

› IBM DataStage

› Informatica PowerCenter

› Microsoft SQL Server Integration Services (SSIS)

› Oracle Data Integrator

› SAP Business Objects – Data Integrator

› SAS Data Integration Studio

• Open-source based

› Adeptia Integration Suite

› Apatar

› CloverETL

› Pentaho Data Integration (Kettle)

› Talend Open Studio/Integration Suite

The list above is not meant to be comprehensive.


ETL or ELT?

• ETL = Extract – Transform – Load

› Much more flexible

• ELT = Extract – Load – Transform

› Pushed forward (mainly) by Oracle

• First get the data into the database

• Then use Oracle DB tools to work with it

› Less flexible

• Tightly coupled to vendor’s database/solution

• Less flexible on output formats

• Requires staging area

› Possibly better performance

• All data are processed in the same database

• Nothing is downloaded from the database for the Transform step
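
The ELT ordering described above can be sketched as follows. SQLite stands in for the target database purely for illustration (the slide refers to Oracle), and the table names `stg_payments` and `daily_totals` are hypothetical. Raw rows are loaded unchanged into a staging table first, and the transformation then runs as SQL inside the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows land unchanged in a staging table (amounts are still text).
raw_rows = [
    ("2011-03-01", "EUR", "100.00"),
    ("2011-03-01", "EUR", "50.00"),
    ("2011-03-02", "USD", "20.00"),
]
conn.execute("CREATE TABLE stg_payments (pay_date TEXT, currency TEXT, amount TEXT)")
conn.executemany("INSERT INTO stg_payments VALUES (?, ?, ?)", raw_rows)

# Transform: performed by the database engine itself, after the load.
conn.execute(
    """
    CREATE TABLE daily_totals AS
    SELECT pay_date, currency, SUM(CAST(amount AS REAL)) AS total
    FROM stg_payments
    GROUP BY pay_date, currency
    """
)

totals = conn.execute(
    "SELECT pay_date, currency, total FROM daily_totals ORDER BY pay_date"
).fetchall()
```

The sketch also shows the trade-offs listed above: a staging table is required, and the transformation logic is written in the SQL dialect of the target database.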


Session 2: Dominate Your Data with CloverETL™


What is CloverETL

• A data integration software platform

› Manages, designs and runs your data

› Embeddable and scalable

› Integrates easily with databases, operating systems and applications

• ETL platform that dominates your data: Extract – Transform – Load

› Reads from one or more data sources

› Transforms data in almost any way imaginable

› Writes to any number of data targets

• Legacy of open source and commitment to commercial use

• CloverETL Engine can be used as embedded OEM as well

• Clover works on all OS

› Linux › Windows › HP-UX › AIX › IBM AS/400 › Solaris › Mac OS X

www.cloveretl.com


CloverETL Key Features

• Platform independent: Java, integration, library support

• Scalability: Desktop → Enterprise → Cluster

• Usability: built by ETL experts for ETL experts

• Performance: outstanding at all production levels

• Services: Clover was built to require minimal services

• OEM Embeddable: small footprint, extensible and customizable

• Cost: delivers lowest total cost of ownership


CloverETL Product Suite

• CloverETL Engine

› Pure Java 6.0

› Embeddable

› Extensible

› Ideal for OEM

• CloverETL Designer

› Visual Designer

› Transformation Developer

› Eclipse Platform

• CloverETL Server

› Production Platform

› Web apps (Tomcat, WebSphere, GlassFish)

› For Enterprise Integration

[Diagram labels: Design, Manage, Runtime]


CloverETL Designer

• Features

› Transformation design

› Intuitive GUI

› Drag & drop

› Components Library

› Debug


CloverETL Vision

• Vision: Dominate your data

› CloverETL enables companies to get more value from their data quickly without massive infrastructure expense or years of project investment.

• Approach: Get it done quickly

› CloverETL is the best value for money. It builds off the open source foundation of the CloverETL Engine and scales from desktop to enterprise to cluster.

• Investment: Low cost buy-in for our customers

› CloverETL is easy-to-use ETL software that can grow in both core features and user expertise… at a fraction of the cost of larger system vendors.


CloverETL Approach

[Diagram labels: Input, Transform, Output]

• Components

› Prebuilt algorithms

• Graph

› Processing algorithm in visual form

• Data Flow

› Edges between components

• Process

› Build > Connect > Configure

› Data processed as structured records with named fields

› Components operate on record fields
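
The component / graph / edge idea above can be mimicked in a few lines of plain Python. This is only an analogy and not the CloverETL API: each "component" is a generator function, an "edge" is the iterator passed from one component to the next, and records are dictionaries with named fields. All names are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def reader(rows: Iterable[Record]) -> Iterator[Record]:
    """Input component: emits one record at a time."""
    for row in rows:
        yield dict(row)


def filter_component(
    records: Iterable[Record], predicate: Callable[[Record], bool]
) -> Iterator[Record]:
    """Transform component: passes on only the records matching the predicate."""
    for rec in records:
        if predicate(rec):
            yield rec


def reformat(
    records: Iterable[Record], mapping: Callable[[Record], Record]
) -> Iterator[Record]:
    """Transform component: rewrites the fields of each record."""
    for rec in records:
        yield mapping(rec)


def writer(records: Iterable[Record], target: List[Record]) -> int:
    """Output component: stores records in the target and returns the count."""
    count = 0
    for rec in records:
        target.append(rec)
        count += 1
    return count


# Build > Connect > Configure: components are chained, lambdas configure them.
source = [{"name": "alice", "amount": 5}, {"name": "bob", "amount": 0}]
target: List[Record] = []
edge1 = reader(source)
edge2 = filter_component(edge1, lambda rec: rec["amount"] > 0)
edge3 = reformat(edge2, lambda rec: {**rec, "name": str(rec["name"]).upper()})
written = writer(edge3, target)
```

Because generators are lazy, records flow through the chain one at a time, which resembles pipeline processing rather than step-by-step batch processing.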


Transformation capabilities

• 50+ specialized components available for use

› Readers and writers

• Text or binary files, CSV, XML

• Archives including ZIP or GZIP

• Remote transfers over HTTP/FTP(S)

• Access to messaging via JMS

› Database connectivity

• JDBC connectivity, support for bulk loaders

• Supports any SQL statements, stored procedure calls

› Transformers and aggregators

• Variety of data manipulation algorithms: sorting, deduping, joining, arithmetic, aggregating, statistical functions and more

• Customizable with user-defined code

• Alternative implementations for efficient execution


Physical architecture

• Others: Server-centric with thin clients

› Server is necessary for development and execution

› Transformations and data are stored remotely on server

› Limited use when working on multiple sites or in restricted-access networks

• CloverETL: Designer is a standalone application

› Development and execution are possible without a central server

› Transformations and metadata are stored on local machine

› Transformations can also be deployed to a server for centralized management

› Designer available for all major platforms (Linux, Windows, Mac)


Repository

• Others: Central repository

› With proprietary storage format (binary files, tables)

› Often managed with a proprietary version control system

› Can be problematic with mass changes or when “hacking” is necessary

• CloverETL: XML format files and directories

› No proprietary repository, uses plain files and directories

› Transformation and metadata stored in human-readable XML

› Open to any version management system

• CVS, SVN, IBM Rational ClearCase, git

› Integrated with Eclipse VMS clients

• Subclipse, EGit, ClearCase


Flexibility and expressive power

• Others: Expression-based languages only

› Expressions and built-in functions for simple data manipulation

› No support for programming statements (loops, user functions)

› Limited when data manipulation requires complex coding

• CloverETL: Scripting and more

› Components are customizable with CTL or Java code

› CTL: scripting language with simple syntax

• Allows simple expressions as well as complex code

• Has variables, loops, user-defined functions

• Many built-in data validation functions

› Java: mature programming language, allows coding

• Access to variety of existing libraries

• GC-based memory management = rapid development


Extensibility

• Others: Custom components

› Limited support for developing custom components

› Impossible to extend other aspects of data processing

• CloverETL: Plugin-based extensible platform

› Ready to extend, customize and modify

› Plugin-based architecture for easy extension management

› Supports custom components, connections, functions

› Implemented in Java

• Access to libraries

• Memory management

• Developers


Session 2: CloverETL™ Designer Examples


CloverETL Server

• Features

› Automation and scheduling

› File and event triggers

› Workflows

› Monitoring and logging

› User management

› Real-time ETL

› Clustering

› Load balancing

› Failover

› Distributed processing

› Inexpensive buy-in


Key features

• Runtime automation

› Allows integration with existing infrastructure

› Simplified management and execution of data transformations

• Scalability and optimization

› Increases transformation performance

› Shortens response time in continuous/transactional processing

• Security

› Controls access to data, transformations and server configuration

› Secures communication between server and clients

• Clustering

› Allows cooperation between multiple processing nodes

› Improves scalability, performance and error resiliency


Runtime automation

• Internal scheduler

› Supports one-time or periodic execution

› Allows interval-based scheduling, including flexible cron-like rules

• Integration with enterprise schedulers

› Transformation execution can be started by external scheduler

› cron, IBM Tivoli (Maestro), Autosys, UC4

› Scheduler instructs CloverETL Server via one of its interfaces (HTTP,JMX)

• Events, tasks and dependencies

› Tasks and transformations are started on internal or external events

› Internal: transformation finished or failed

› External: file arrived or its size changed

› Allows creating dependencies between executed tasks

› Suitable for logical sequencing of execution or monitoring purposes
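
The external events mentioned above (a file arrived or its size changed) can be detected by comparing two directory snapshots. The following is a generic sketch in plain Python, not the CloverETL Server mechanism; the function names and event labels are assumptions made for illustration.

```python
import os


def snapshot(directory):
    """Map each file name in the directory to its current size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    }


def detect_events(previous, current):
    """Compare two snapshots and report arrivals and size changes."""
    events = []
    for name, size in sorted(current.items()):
        if name not in previous:
            events.append(("file_arrived", name))
        elif previous[name] != size:
            events.append(("size_changed", name))
    return events


# Example with hand-written snapshots; a poller would call snapshot() periodically.
before = {"orders.csv": 10}
after = {"orders.csv": 12, "customers.csv": 5}
events = detect_events(before, after)
```

A scheduler or polling loop could pass each reported event to whatever task or transformation depends on it.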


Runtime automation (cont.)

• Monitoring

› Supports alert emails or messages with configurable content

› Automatically populated with execution status, log, statistics

› Allows integration with ticketing systems, support team

• Execution history

› Automatically stores performance and statistics about each execution

› Stored in database tables and log files

› Open for any trend analysis and reporting

• Archiving

› Configurable cleanup of execution logs and history

› Can be further extended by scripting


Scalability and optimization

• Parallel execution

› Executes multiple transformations in parallel

› Can execute multiple instances of a single transformation (SOA)

• Graph pooling

› Improves response time

› Useful with SOA architectures and Launch Services

• Launch services

› Applications implemented as data transformations and deployed as RESTful web services

› Schema on next slide


CloverETL Cluster

[Diagram: large quantity of data loaded, processed in parallel in the cluster, written out in parallel]

• Increased processing and throughput

› Parallel execution of transformations over multiple servers or nodes

› Load balancing based on individual node utilization

• Increased fault tolerance

› Failover in case of problems with particular nodes

› Data replication

• Increased flexibility

› Cluster can be dynamically reconfigured by adding or removing nodes


Session 3: Data Integration Projects


Data Integration Projects

• This section covers

› How to manage DI projects

› Phases of DI project

› Responsibilities

› Typical issues


Data Integration Projects

• Phases of typical Data Integration Project

› Requirements

› Planning

› Analysis (of ETL steps)

› Implementation

› Documentation

› RTP

› Support


Requirements

• Functional

› Input data

› Output data

› Output data format

› Transformation logic

• Non-functional

› Time restrictions

› Availability

› Frequency of update

› Data Latency

› How to handle erroneous records

› Security requirements


Planning

• ETL system implementation must be planned properly

› Time for implementation

› Correctly prioritized

› Thorough data analysis extremely important

› Unforeseen data quality issues cause delays

› Biggest risk is unexpected data quality issues

› Communicate properly

• Keep it simple

› Do not try to save the world

› “If you think it can be done simply, do it simply”


Analysis – Extract

• In which source systems is the data we need?

• How can we access the system?

› Flat files / database access / XML / web service / JMS

› Full extract / incremental / change notification / on-request

› Local access / FTP(S) / SFTP / SCP / MQ

• Data Syntax

› What is a record?

› What is the record delimiter?

› What is the field delimiter?

› What is the data length / data type / format?

• Data Semantics

› What are the field names?

› What data does the field represent?

› OK, what data does the field really represent?

› Are there any duplicates?

• What are the limitations / restrictions?

› What are the data volumes?

› How often can the export be done?

› Any impact on network (NIA)?

• What is the expected data growth rate?
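
One of the access options above is an incremental extract. A common way to implement it is a "watermark": remember the highest modification timestamp already extracted and ask the source only for newer rows. The sketch below assumes a hypothetical `customers` table with an `updated_at` column holding ISO 8601 text timestamps; real source systems may need a different change indicator.

```python
import sqlite3


def extract_incremental(conn, last_watermark):
    """Return the rows changed after last_watermark and the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed since the last run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# Hypothetical source system, simulated with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Alice", "2011-03-01T08:00:00"),
        (2, "Bob", "2011-03-02T09:30:00"),
        (3, "Carol", "2011-03-03T10:15:00"),
    ],
)

first_batch, watermark = extract_incremental(conn, "2011-03-01T12:00:00")
second_batch, watermark_after = extract_incremental(conn, watermark)
```

This approach relies on the source reliably maintaining the timestamp; rows written later with a timestamp equal to or older than the watermark would be missed, which is one reason the questions about data semantics above matter.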


Analysis – Clean & Transform

• Which fields need to be validated?

› … against which source?

› How to handle erroneous data?

• What is the data flow?

› Between source and target data

› What is the transformation logic?

› Action on error?

• Stop transformation

• Process valid data only

• Transformation restartability

› Small data volumes – transaction based

› Huge data volumes – process in bulk mode
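
The two "action on error" options above (stop the transformation, or process valid data only) can be sketched as a policy switch. The validation rules and field names (`id`, `amount`) are hypothetical examples, not rules from the deck.

```python
def validate(rec):
    """Return a list of validation errors for one record (empty if valid)."""
    errors = []
    if not rec.get("id"):
        errors.append("missing id")
    try:
        float(rec.get("amount", ""))
    except (TypeError, ValueError):
        errors.append("amount is not numeric")
    return errors


def process(records, on_error="skip"):
    """Split records into valid and rejected, or stop on the first bad record.

    on_error="stop": raise as soon as an invalid record is found.
    on_error="skip": keep processing valid data and collect the rejects.
    """
    valid, rejected = [], []
    for rec in records:
        errors = validate(rec)
        if not errors:
            valid.append(rec)
        elif on_error == "stop":
            raise ValueError(f"invalid record {rec!r}: {errors}")
        else:
            rejected.append((rec, errors))
    return valid, rejected


records = [
    {"id": "1", "amount": "10.5"},
    {"id": "", "amount": "3"},
    {"id": "3", "amount": "abc"},
]
valid, rejected = process(records, on_error="skip")
```

Collecting rejected records together with the reason for rejection keeps the error handling auditable, whichever policy is chosen.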


Analysis – Load

• What is the target schema?

• What is the target data volume?

• What are the history requirements?

› Usually SCD type 1, 2 or 3

• Data Syntax

› What is a record?

› What is the record delimiter?

› What is the field delimiter?

› What is the data length / data type / format?

• Data Semantics

› What are the field names?

› What data does the field represent?

› OK, what data does the field really represent?

• What are the limitations / restrictions?

› What are the data volumes?

› How often can the export be done?

› Any impact on network (NIA)?
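
The history requirement above refers to slowly changing dimensions (SCD): type 1 overwrites the old value, type 2 adds a new row and closes the old one, and type 3 keeps the previous value in an extra column. Below is a minimal in-memory sketch of type 2. The row layout (`natural_key`, `attrs`, `valid_from`, `valid_to`, `is_current`) is an assumption chosen for illustration.

```python
def apply_scd2(dimension, incoming, load_date):
    """Apply SCD type 2 changes to a dimension held as a list of dict rows.

    incoming maps each natural key to its current attribute values.
    Changed keys get their current row closed and a new current row added.
    """
    current = {row["natural_key"]: row for row in dimension if row["is_current"]}
    for key, attrs in incoming.items():
        row = current.get(key)
        if row is not None and row["attrs"] == attrs:
            continue  # nothing changed, keep the existing current row
        if row is not None:
            row["valid_to"] = load_date  # close the old version
            row["is_current"] = False
        dimension.append(
            {
                "natural_key": key,
                "attrs": attrs,
                "valid_from": load_date,
                "valid_to": None,
                "is_current": True,
            }
        )
    return dimension


dimension = [
    {
        "natural_key": "C1",
        "attrs": {"city": "Prague"},
        "valid_from": "2010-01-01",
        "valid_to": None,
        "is_current": True,
    }
]
incoming = {"C1": {"city": "Brno"}, "C2": {"city": "Prague"}}
apply_scd2(dimension, incoming, "2011-03-01")
```

After the call, the dimension holds the closed Prague row for C1, a new current Brno row for C1, and a new current row for C2. Running the same load again adds nothing, which helps with job restartability.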


Implementation

• Development

› Enforce standardization

• Naming conventions

• Best practices

• Generating surrogate keys

• Looking up keys

• Applying default values

• Testing

› Review

› Testing on Production data

› Unit Testing
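Three of the practices listed above (generating surrogate keys, looking up keys, applying default values) can be sketched together. This is a simplified in-memory illustration; the class name `KeyLookup`, the `DEFAULTS` values and the field names are hypothetical, and a production job would typically back the lookup with a database table or sequence.

```python
import itertools


class KeyLookup:
    """Maps business (natural) keys to generated surrogate keys."""

    def __init__(self, start: int = 1):
        self._next = itertools.count(start)
        self._by_business_key = {}

    def get_or_create(self, business_key) -> int:
        """Look up the surrogate key, generating a new one if unseen."""
        if business_key not in self._by_business_key:
            self._by_business_key[business_key] = next(self._next)
        return self._by_business_key[business_key]


# Hypothetical default values for missing attributes.
DEFAULTS = {"country": "UNKNOWN", "segment": "N/A"}


def apply_defaults(record: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of the record with missing or None fields filled in."""
    out = dict(record)
    for field_name, default in defaults.items():
        if out.get(field_name) is None:
            out[field_name] = default
    return out
```

Small pure functions like these are also easy to cover with unit tests, which fits the testing items on the same slide.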


Implementation

• Documentation

› Data sources / targets / transformations

› Data Lineage • Important to know and publish

› Frequency of ETL process runs

› Error handling

› Support – Monitoring checklist

• RTP


Planning & Leadership

• Contact person for each source system

• Contact business person

• ETL team responsibilities

› Define ETL scope

› Perform source system data analysis

› Define data quality strategy

› Gather & document business rules from business users

› ETL Implementation

› Defining & executing Unit & QA testing

› Implementing production


ETL team Roles & Responsibilities

• ETL Manager

• ETL Architect

• ETL Developer

• System Analyst

• Data-Quality Specialist

• Database Administrator (DBA)

• Dimension Manager

• Fact Table Provider


Typical Issues

• Project Management

• Poor Data Analysis

• Data Understanding

• Performance

• Scalability


Typical Issues – Project Management

• ETL takes 70-80% of project resources

› Be aware of this from the beginning

› Communicate this to stakeholders

› Plan ETL phase properly

• Data Quality

› Unexpected data quality issues are the biggest risk

› DQ issues cause delays

› DQ issues generate extra work


Typical Issues – Data Understanding

• Source system (data)

› Not documented

› Documented incorrectly

› Represents something other than it should

› Data not clean

• Transformation/Requirements

› Not specified properly

› Not specified at all

› Initial analysis has not revealed issues/complexity

› Requirements being changed


Typical Issues – Performance & Scalability

• Performance › Is the performance ok now?

› Will performance be ok in 5 years?

• Scalability › What is the data growth rate?

› Are we testing on production data volumes?

• Change Data Capture › Time consuming task

› Issue on old systems
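When a source system offers no change timestamps, triggers or change log, one fallback for Change Data Capture is to compare the current extract with the previous one. The sketch below shows that snapshot-comparison approach under simplifying assumptions: both snapshots fit in memory, each row has a primary key, and the function names are illustrative. It is one of several CDC techniques, not the method prescribed by the slides.

```python
import hashlib


def row_hash(row: dict) -> str:
    """Hash all fields of a row in a stable (sorted) field order.

    Simplification: values containing '|' or '=' could collide.
    """
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def capture_changes(previous: dict, current: dict):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, updated, deleted) as sorted lists of keys.
    """
    prev_hashes = {k: row_hash(r) for k, r in previous.items()}
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current
        if k in previous and row_hash(current[k]) != prev_hashes[k]
    )
    return inserted, updated, deleted
```

Because every run reads the full source, the cost grows with the total data volume rather than with the number of changes, which is consistent with the slide's point that CDC is a time-consuming task.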


Session 4: Current Trends


Market Trends

• Shift to Semi-structured & Unstructured data › Emails, documents, blogs, …

• Real-time processing

› CRM, Zero-latency business › SOA, Web Services, ESB, JMS, MQ

• Cloud-mania

› Cloud, Cluster, Elastic Cluster › MapReduce, Apache Hadoop

• Reducing TCO › License costs, Development cost, Maintenance › Emphasis on value, not price

• Services for small customers

› Require better ROI


Literature

• Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley

• Ralph Kimball, Margy Ross: The Data Warehouse Toolkit; Wiley

• Len Silverston: The Data Model Resource Book; Wiley


Contact Javlin

Web: www.cloveretl.com

US
Javlin Inc.
8000 Towers Crescent Drive, Suite 1350
Vienna, VA 22182, USA
Web: www.javlininc.com
Email: [email protected]
Phone: +1 703 847 3600

Europe
Javlin a.s.
Křemencova 18
110 00 Praha 1
Czech Republic
Web: www.javlin.eu
E-mail: [email protected]
Phone: +420 277 003 200