Pentaho Best Practices - PentahoWorld 2017

Pentaho Best Practices

Presented by BNY Mellon

Scott Stewart

October 25, 2017

Pentaho ETL Architect

2

Scott Stewart

Biography

About

ETL Architect for BNY Mellon NEXEN Analytics

Over 5 Years Pentaho Experience

Over 10 years of data integration and ETL experience

Google, NetApp, Sony Ericsson, Cisco

3

External Forces are Creating Digital Disruption

• Non-traditional Competitors
• Decreasing Technology Barrier to Entry
• Enabling Productivity and Efficiency
• Security and Risk Resiliency
• Millennial Consumer Behavior
• Agile Environments
• Big Data Insights and Analytics
• Focus on Client Experience
• Investment in Innovation
• Modern Consumer Behavior
• Increasing Regulatory Change
• Global Political Turmoil
• Low Global Economic Growth
• Low Interest Rates
• Changing Investor Demands
• Increasing Cyber Security Threats
• Evolution of Marketplace Lending
• Transparency in Financial Services

4

Servicing Multiple Needs Through Common Components

[Diagram labels]

• Professionals, Investors, Developers, Employees, Traders, Machines
• Access Services: Browser / Mobile, Electronic (APIs)
• Data Solutions, Digital Pulse
• Foundational Services, Business Services, Workflows
• Third-Party Solutions, Legacy Solutions
• BXP, Non-BXP
• Infrastructure: Private Cloud, Public Cloud

5

Servicing Multiple Needs Through Common Components


BNY Mellon NEXEN℠ Analytics is a service that consumes and is consumed by several services of the NEXEN Digital Platform

• BXP Digital Cloud based platform
• Pulse integration
• API Gateway integration

6

Achieving Success with NEXEN Analytics

• Improve Client Experience
• Reduce Cost of Ownership
• Integration to BNY Technology
• Reduce Microsoft Developed Technology
• Big Data Integration

7

Why Define Best Practices

• Adaptable to Changing Environment
• Improve Maintainability
• Improve Quality
• Foster Organizational Learning
• Foster Scalability

8

Establish Clear Naming Conventions

Name Files Responsibly

Example

✓Use camel case names

✓Use verb-noun pattern

▪ loadProductData.kjb

▪ extractProductCodes.ktr

Make names simple to follow

Make each name describe the purpose of its file

Don’t use spaces or slashes
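The naming rules above can be checked automatically. The following is a minimal sketch, assuming a team-agreed verb list; `ALLOWED_VERBS` and `check_file_name` are hypothetical names chosen for illustration and are not part of Pentaho.

```python
import re

# Hypothetical verb list; a team would agree on its own set.
ALLOWED_VERBS = ("load", "extract", "transform", "validate", "export")

# Lower camel case base name followed by a job (.kjb) or
# transformation (.ktr) extension.
CAMEL_CASE = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+\.(?:kjb|ktr)$")


def check_file_name(name: str) -> list[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if " " in name or "/" in name or "\\" in name:
        problems.append("contains a space or slash")
    if not CAMEL_CASE.match(name):
        problems.append("not camel case with a .kjb/.ktr extension")
    if not name.startswith(ALLOWED_VERBS):
        problems.append("does not start with an agreed verb")
    return problems


if __name__ == "__main__":
    for candidate in ("loadProductData.kjb", "Load Product Data.kjb"):
        print(candidate, check_file_name(candidate))
```

A check like this could run in a pre-commit hook so that badly named files never reach the repository.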

9

Alternate Example

✓Use underscore delimiter

✓Prefix file name with “j_” for job, “t_” for transformation

▪ j_load_product_data.kjb

▪ t_extract_product_codes.ktr
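If a team switches from the camel case convention to the prefixed underscore convention, the renaming can be scripted. This is a sketch only; `to_prefixed_name` is a hypothetical helper, and it assumes every file already follows the camel case pattern.

```python
import re

# "j_" for jobs and "t_" for transformations, as in the alternate example.
PREFIX_BY_EXTENSION = {".kjb": "j_", ".ktr": "t_"}


def to_prefixed_name(camel_name: str) -> str:
    """Convert e.g. loadProductData.kjb to j_load_product_data.kjb."""
    base, dot, ext = camel_name.rpartition(".")
    extension = dot + ext
    if extension not in PREFIX_BY_EXTENSION:
        raise ValueError(f"not a .kjb or .ktr file: {camel_name}")
    # Insert an underscore before each interior capital, then lower-case.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", base).lower()
    return PREFIX_BY_EXTENSION[extension] + snake + extension


if __name__ == "__main__":
    print(to_prefixed_name("loadProductData.kjb"))
    print(to_prefixed_name("extractProductCodes.ktr"))
```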


10

Organize Your Project

• Use folders to organize files. Don’t put everything into a single folder
• Give every folder a README.md file that describes its purpose and content
• Keep data and code in separate trees
• Rule of thumb: if a folder holds more than 7 objects, consider breaking it into subfolders

Define Project Folder Hierarchy
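The folder guidelines on this slide lend themselves to a small audit script. The sketch below is illustrative: `audit_folder_tree` is a hypothetical name, and excluding README.md from the object count is an assumption, not something the slide specifies.

```python
from pathlib import Path

MAX_OBJECTS = 7  # threshold taken from the slide


def audit_folder_tree(root: Path) -> list[str]:
    """Report folders without a README.md or with more than MAX_OBJECTS entries."""
    findings = []
    folders = [root] + sorted(p for p in root.rglob("*") if p.is_dir())
    for folder in folders:
        # Assumption: README.md itself does not count toward the limit.
        entries = [e for e in folder.iterdir() if e.name != "README.md"]
        if not (folder / "README.md").is_file():
            findings.append(f"{folder}: missing README.md")
        if len(entries) > MAX_OBJECTS:
            findings.append(f"{folder}: {len(entries)} objects, consider subfolders")
    return findings


if __name__ == "__main__":
    for finding in audit_folder_tree(Path(".")):
        print(finding)
```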

11

Create a Folder for Each Individual Pipeline

Designate a Folder for Shared Code

Organize Your Project

Define Project Folder Hierarchy

12

Categorize Configuration Files by Environment

Distinctly Isolate Data from the Code

Organize Your Project

Define Project Folder Hierarchy
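One possible hierarchy that follows these points (a folder per pipeline, a shared-code folder, configuration split by environment, data outside the code tree) can be scaffolded as below. Every folder name here is hypothetical, chosen only to illustrate the idea; the slides do not prescribe a specific layout.

```python
from pathlib import Path

# Hypothetical layout; adapt the names to your own project.
LAYOUT = [
    "code",
    "code/pipelines",
    "code/pipelines/loadProductData",  # one folder per pipeline
    "code/shared",                     # shared jobs and transformations
    "config",
    "config/dev",                      # configuration files per environment
    "config/test",
    "config/prod",
    "data",                            # data kept outside the code tree
    "data/input",
    "data/output",
]


def scaffold(root: Path) -> list[Path]:
    """Create the folders in LAYOUT under root, each with a README.md stub."""
    created = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        if not readme.exists():
            readme.write_text(
                f"# {folder.name}\n\nDescribe the purpose and content of this folder.\n"
            )
        created.append(folder)
    return created


if __name__ == "__main__":
    for path in scaffold(Path("exampleProject")):
        print(path)
```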

13

A Little Effort Goes a Long Way

Clearly Name Each Step

What does this do?

Simple names make the intent clearer

14

Clearly Name Each Step

Step Names Should:

• Indicate Purpose of Step
• Be in Title Case Except For Specific Names

Establish guidelines for naming so the team understands what is expected

Step Type: Table Input/Output
Rule: Specify primary table or nature of join
Examples:
• Read Product Code Table
• Write prod_info Table

Step Type: Text Input/Output
Rule: Specify filename or filename pattern
Examples:
• Read Prod*.csv file
• Write ProductInventory.txt

Step Type: Filter Values
Rule: Specify what passes through filter or condition
Examples:
• Pass On Today’s Data
• Is Processed Flag Set?

Step Type: Select Values
Rule: Specify nature of columns filtered or modified
Examples:
• Clean Temp Columns
• Sync to Merge Stream

Step Type: Lookup/Merge
Rule: Indicate lookup/merge source
Examples:
• Look Up prod_type By prod_code
• Merge Product Detail with Product Options

Step Type: Get/Set Variables, Get System Info
Rule: Name variables
Examples:
• Get System Date and IP Address
• Set $hostname

Do Not Use Step Name Defaults
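Default step names can be flagged by scanning the transformation file. The sketch below assumes a .ktr file is XML with `<step>` elements that each carry a `<name>` child, and that a numeric suffix such as " 2" is appended to repeated defaults. The set of default names is a small illustrative sample rather than a complete list, and `find_default_step_names` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative sample of default step names (assumption); extend as needed.
DEFAULT_STEP_NAMES = {
    "Table input",
    "Table output",
    "Text file input",
    "Filter rows",
    "Select values",
}


def find_default_step_names(ktr_xml: str) -> list[str]:
    """Return the names of steps that still look like untouched defaults."""
    root = ET.fromstring(ktr_xml)
    flagged = []
    for step in root.iter("step"):
        name = (step.findtext("name") or "").strip()
        # Strip a trailing " 2", " 3", ... before comparing (assumed suffix style).
        base = re.sub(r" \d+$", "", name)
        if base in DEFAULT_STEP_NAMES:
            flagged.append(name)
    return flagged


SAMPLE = """<transformation>
  <step><name>Table input</name><type>TableInput</type></step>
  <step><name>Read Product Code Table</name><type>TableInput</type></step>
  <step><name>Select values 2</name><type>SelectValues</type></step>
</transformation>"""

if __name__ == "__main__":
    print(find_default_step_names(SAMPLE))
```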

15

Which would you rather maintain?

What a Mess!!

16

Make “Code” Readable

Standardize Grid Size

Top to Bottom vs. Left to Right

Clearly indicate the main stream. Visually it should show one straight line, either:

▪ Straight down
▪ An ordered "Z" pattern

Rule of thumb: at most 15 objects on the canvas

• If a job has too many objects, group them into sub-jobs
• If a transformation has too many objects, group them into sub-transformations (or mappings)

Make It Easy To Follow The Flow Of Data
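The canvas guidelines can be reviewed with a script as well. This sketch assumes that each `<step>` in a .ktr file stores its canvas position under `<GUI><xloc>` and `<GUI><yloc>`; the grid size of 32 pixels and the function name `review_canvas` are hypothetical.

```python
import xml.etree.ElementTree as ET

MAX_CANVAS_OBJECTS = 15  # limit taken from the slide
GRID_SIZE = 32           # hypothetical team standard, in pixels


def review_canvas(ktr_xml: str) -> list[str]:
    """Flag crowded canvases and steps that sit off the agreed grid."""
    root = ET.fromstring(ktr_xml)
    steps = list(root.iter("step"))
    findings = []
    if len(steps) > MAX_CANVAS_OBJECTS:
        findings.append(
            f"{len(steps)} steps on canvas; consider sub-transformations or mappings"
        )
    for step in steps:
        name = step.findtext("name", default="?")
        # Assumed location of canvas coordinates inside the step element.
        x = int(step.findtext("GUI/xloc", default="0"))
        y = int(step.findtext("GUI/yloc", default="0"))
        if x % GRID_SIZE or y % GRID_SIZE:
            findings.append(f"{name}: position ({x}, {y}) is off the {GRID_SIZE}px grid")
    return findings


SAMPLE = """<transformation>
  <step><name>Read Product Code Table</name><GUI><xloc>64</xloc><yloc>96</yloc></GUI></step>
  <step><name>Clean Temp Columns</name><GUI><xloc>70</xloc><yloc>96</yloc></GUI></step>
</transformation>"""

if __name__ == "__main__":
    for finding in review_canvas(SAMPLE):
        print(finding)
```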

17

Define a Standard for Making Comments (“Notes”)

Use Comments

Ensure that every job or transformation has a Header Comment

Additionally, make sure the format of each comment is standardized

Define how you will structure the content within the comment

Color Coordinate Your Notes

• Complex Logic
• To-Do’s
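One way to keep header comments standardized is to generate them from a fixed template. The field set below (name, purpose, author, inputs, outputs) is a hypothetical example of such a structure, not the format used in the presentation.

```python
from dataclasses import dataclass


@dataclass
class HeaderNote:
    """Hypothetical header-note structure for a job or transformation."""
    name: str
    purpose: str
    author: str
    inputs: str
    outputs: str

    def render(self) -> str:
        fields = [
            ("Name", self.name),
            ("Purpose", self.purpose),
            ("Author", self.author),
            ("Inputs", self.inputs),
            ("Outputs", self.outputs),
        ]
        # Pad labels to the same width so every note lines up identically.
        width = max(len(label) for label, _ in fields)
        return "\n".join(f"{label.ljust(width)} : {value}" for label, value in fields)


if __name__ == "__main__":
    note = HeaderNote(
        name="loadProductData.kjb",
        purpose="Load product data into the warehouse",
        author="ETL team",
        inputs="Prod*.csv",
        outputs="prod_info table",
    )
    print(note.render())
```

The rendered text can then be pasted into a note on the canvas, so every header has the same fields in the same order.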

18

• Database Connection Details
• File Path Definitions
• Built-Ins For Code Location

Don’t Hardcode, Use Variables
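Pentaho references variables with the `${NAME}` syntax, and values are commonly defined in a kettle.properties file. The sketch below imitates that substitution in plain Python to show why a variable-based connection string moves between environments without edits. The variable names and values (`DB_HOST`, `DB_PORT`, `DATA_DIR`) are hypothetical, and this is not Pentaho's own resolver.

```python
import re

# Hypothetical kettle.properties-style content; values are illustrative only.
PROPERTIES_TEXT = """
# connection details
DB_HOST=db.example.internal
DB_PORT=5432
DATA_DIR=/data/product
"""


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        variables[key.strip()] = value.strip()
    return variables


def resolve(template: str, variables: dict[str, str]) -> str:
    """Replace each ${NAME} placeholder; fail loudly on undefined names."""
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"undefined variable: {key}")
        return variables[key]

    return re.sub(r"\$\{([^}]+)\}", replace, template)


if __name__ == "__main__":
    variables = parse_properties(PROPERTIES_TEXT)
    print(resolve("jdbc:postgresql://${DB_HOST}:${DB_PORT}/sales", variables))
    print(resolve("${DATA_DIR}/Prod*.csv", variables))
```

For code location, Pentaho also provides built-in variables such as `${Internal.Transformation.Filename.Directory}`, which let a job or transformation reference sibling files without an absolute path.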

19

Java & JavaScript

Use the Right Tool For The Job

20

SQL

Use the Right Tool For The Job

21

Don’t repeat yourself

Use the Right Tool For The Job

22

Don’t Abandon Standard Development Best Practice

Use Code Repository to Track Changes

Foster Frequent Team Communication

Hold Code Reviews

Establish Testing Standards

Bringing It All Together