© 2006 IBM Corporation

Information On Demand Conference 2007

High Performance Data Transformation

Simon Tang
Manager, Technical Sales, GCG
Information On Demand
Information Management

IBM Software Group | Lotus software

Quiz Time

What do all these companies have in common?

US$10 billion Retailer migrating and consolidating financial data into Oracle Financials

Reduced projected 2,700-day manual effort to 217 days

Saved US$2 million

US$4.5 billion global Chemicals Company consolidating 13 SAP instances into 1 global instance

Would save US$37 million in annual operating costs

US$45 billion Manufacturer consolidating more than 3,300 legacy software applications down to 400 while reducing IT staff by 50%

Quiz Time: Answers

Raw, disparate data and disconnected systems

Enterprise Data Integration

Business Results that drive revenue and lower costs

Happened despite pouring hundreds of millions of dollars into new ERP, CRM, SCM, BI, BPM and DW systems

Challenges in Data Management

Inconsistent islands of information underlying applications

Complex, manual & costly copy synchronization

Inconsistent and poor quality data

Inability to exploit enterprise meta data across tools

Touching data multiple times at its source – storing multiple times and updating multiple times

Inability to share common business rules across projects, processes and applications

Lack of a single, repeatable methodology for consistency across all projects

[Diagram: application islands for CRM, Order Proc, Supply Chain and Procurement]

Different databases and systems to connect

DB2, Informix, ODBC, Oracle, Red Brick, SAS, Sybase, Teradata, etc.

Adabas, Allbase/SQL, Datacom/DB, DB2/400, DB2/OS390, Essbase, FOCUS, IDMS/SQL, IMS, NonStopSQL, RDB, VSAM, etc.

WebSphere MQ, SeeBeyond, JMS, XML, EJB, Web Services, EXML, XMLS, EDI, SWIFT, etc.

Oracle Applications, PeopleSoft, SAP R/3, SAP BW, Siebel

© 2006 IBM Corporation

IBM Software Group | Lotus software

Information On Demand Conference 2007

Complex and long Hand-coding

Hand-written Visual Basic, Java, C++ and UNIX code can be developed cheaply, and it works …

… but what happens when there is a new source or requirement?

Cheap? Works? Maybe not.

Fast growing data volume

Source: “Surviving the Perfect Storm in Data Management” DM Review, January 2001

Prediction: Your data volume is not going to get smaller

The IBM Solution: IBM Information Server

Delivering information you can trust

IBM Information Server

Understand: Discover, model, and govern information structure and content

Cleanse: Standardize, merge, and correct information

Transform: Combine and restructure information for new uses

Deliver: Synchronize, virtualize and move information for in-line delivery

Parallel Processing

Rich Connectivity to Applications, Data, and Content

Unified Deployment

Unified Metadata Management

IBM Information Server Architecture

UNIFIED USER INTERFACE: Analysis Interface, Development Interface, Web Admin Interface

COMMON SERVICES: Metadata Services, Security Services, Logging & Reporting Services, Unified Service Deployment

UNIFIED METADATA: Design, Operational

UNIFIED PARALLEL PROCESSING: Understand, Cleanse, Transform, Deliver

COMMON CONNECTIVITY: Structured, Unstructured, Applications, Mainframe

Global 2000 Profiting from Intelligent Information

Logistics

Asia Pacific Client List (Partial)

Hong Kong Customer Lists

Banking: Bank of China, HSBC, Standard Chartered, Bank of America, AIG Credit Card

Telecommunication: Smartone, CSL, Sunday, Hutchison, New World

Governments: Health Department

Others: IKEA, OOCL, TOM Group

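Major Customers in China

China Mobile: business decision analysis; network management optimization

Shanghai, Guangdong, Zhejiang, Jiangsu, Tianjin, Jiangxi, Fujian, Shandong, Jilin, Liaoning, Anhui, Guizhou, Sichuan, Yunnan, Chongqing, head office and others (20/31 customers)

China Unicom: operations support

Shanghai, Shaanxi, Beijing, Zhejiang, Liaoning Unicom...

CNC: Tianjin; Telecom...

Large enterprises

China Ocean Shipping, containers

Shanghai General Motors

– Central data storage; ODS
– SAP; Oracle; Siebel; XML …

Siemens Suzhou; TOM.COM

Nanyang Tobacco…

Bank of China head office; Guangdong branch; Bank of China (Hong Kong): data warehouse

Bank of Communications head office

China Construction Bank branches: historical data analysis, integrated customer information analysis

– Hainan branch, Sichuan branch
– Hebei branch, Shaanxi branch

Shanghai Pudong Development Bank

SAP ERP, SWIFT real-time integration

China Everbright Bank: international business, credit analysis

China Minsheng Bank

Enterprise-level centralized data platform, SAP ERP

Rural Credit Cooperatives, Ping An Insurance…

And many more...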

Physical Metadata: WebSphere Information Analyzer

Data-centric analysis of application, database and file-based sources

Secure, detailed profiling of fields, across fields, and across sources

Creation of metadata from profiling results

Results instantly promotable across IBM Information Server

Understand: Analyze source data structures, and monitor adherence to integration and quality rules

WebSphere Information Analyzer

Data Analysts

Subject Matter Experts

Physical View

Business Metadata: WebSphere Business Glossary

Web-based authoring, managing & sharing of business metadata

Aligns the efforts of IT with the goals of the business

Provides business context to information technology assets

Establishes responsibility and accountability

Understand: Create and manage business vocabulary and relationships, while linking to physical sources

WebSphere Business Glossary

Subject Matter Experts

Business Users

Business View

Business: GL Account Number

The ten digit account number. Sometimes referred to as the account ID. This value is of the form L-FIIIIVVVV.

Technical:

Database = DB2
Schema = NAACCT
Table = DLYTRANS
Column = ACCT_NO
data type = char(11)
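The link between a business term and its physical source can be sketched as a small data structure. This is only an illustration of the idea on this slide, not how WebSphere Business Glossary stores its metadata; the class names `GlossaryTerm` and `PhysicalColumn` and the `qualified_name` helper are hypothetical, while the values come from the GL Account Number example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalColumn:
    """Technical view of a field (hypothetical structure)."""
    database: str
    schema: str
    table: str
    column: str
    data_type: str

    def qualified_name(self) -> str:
        # Dotted path from database down to column.
        return f"{self.database}.{self.schema}.{self.table}.{self.column}"


@dataclass(frozen=True)
class GlossaryTerm:
    """Business view of the same field, linked to its physical source."""
    name: str
    definition: str
    source: PhysicalColumn


gl_account = GlossaryTerm(
    name="GL Account Number",
    definition="The ten digit account number, sometimes referred to as the account ID.",
    source=PhysicalColumn("DB2", "NAACCT", "DLYTRANS", "ACCT_NO", "char(11)"),
)

print(gl_account.source.qualified_name())  # DB2.NAACCT.DLYTRANS.ACCT_NO
```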

Why Should I Care About Cleansing Information?

Lack of information standards
– Different formats & structures across different systems

Data surprises in individual fields
– Data misplaced in the database

Information buried in free-form fields

Data myopia
– Lack of consistent identifiers inhibit a single view

The redundancy nightmare
– Duplicate records with a lack of standards

Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116

Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116

Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116

Name                      Tax ID         Telephone
J Smith DBA Lime Cons.    228-02-1975    6173380300
Williams & Co. C/O Bill   025-37-1888    415-392-2000
1st Natl Provident        34-2671434     3380321
HP 15 State St.           508-466-1200   Orlando

WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH

WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES

USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM

RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)

19-84-103 RS232 Cable 6' M-F CandS

CS-89641 6 ft. Cable Male-F, RS232 #87951

C&SUCH6 Male/Female 25 PIN 6 Foot Cable

90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106
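As a rough illustration of what "data surprises in individual fields" means in practice, the sketch below standardizes the Telephone values from the Name / Tax ID / Telephone sample. The function name `standardize_phone` and the rules (10 digits is complete, 7 digits lacks an area code) are assumptions made for this example, not rules taken from the presentation or from any product.

```python
import re


def standardize_phone(raw: str) -> tuple[str, str]:
    """Return (digits, status) for a raw telephone field.

    Assumed rules for illustration: 10 digits is a complete number,
    7 digits is a number without an area code, anything else is suspect.
    """
    digits = re.sub(r"\D", "", raw)  # strip everything that is not a digit
    if len(digits) == 10:
        return digits, "ok"
    if len(digits) == 7:
        return digits, "missing area code"
    return digits, "not a phone number"


# Telephone column values from the sample records
for raw in ["6173380300", "415-392-2000", "3380321", "Orlando"]:
    print(raw, "->", standardize_phone(raw))
```

Even this simple check separates the cleanly formatted value, the value needing enrichment, and the field that holds a city name instead of a number.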

Data Cleansing: WebSphere QualityStage

Specialized data quality functions seamlessly integrated with DataStage

Visual tools for defining complex matching and survivorship logic

Ensures clean, standardized, de-duplicated information

Enables a single version of the truth

Cleanse: Standardize and correct source data fields, and match records together across sources to create a single view

QualityStage™

Visual Match Rule Design

Data Analysts

Subject Matter Experts
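Matching and survivorship can be pictured with a very small sketch. It is not QualityStage's matching algorithm: the match rule (same ZIP code and same leading house number) and the survivorship rule (keep the longest value per field) are deliberately naive assumptions, and the function names are hypothetical. The records are taken from the duplicate-company sample shown earlier.

```python
from collections import defaultdict


def match_key(record: dict) -> tuple:
    # Assumed match rule: same ZIP code and same leading house number.
    house_number = record["street"].split()[0]
    return (record["zip"], house_number)


def survive(group: list) -> dict:
    # Assumed survivorship rule: per field, keep the longest value.
    fields = group[0].keys()
    return {field: max((r[field] for r in group), key=len) for field in fields}


def deduplicate(records: list) -> list:
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    # One surviving record per group of matched records.
    return [survive(group) for group in groups.values()]


records = [
    {"name": "IBM", "street": "187 N.Pk. Str.", "zip": "01456"},
    {"name": "I.B.M. Inc.", "street": "187 N.Pk. St.", "zip": "01456"},
    {"name": "Inter-Nation Consults", "street": "15 Main Street", "zip": "02341"},
]

golden = deduplicate(records)
for record in golden:
    print(record)
```

A rule this simple would miss the records whose ZIP code was mistyped, which is the kind of case that more complex matching logic is meant to handle.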

Data Transformation & Movement: WebSphere DataStage

Codeless visual design of data flows with hundreds of built-in transformation functions

Optimized reuse of data integration objects

Leverages parallel processing without requiring design changes

Capable of supporting batch and real-time operations

Transform: Transform and aggregate any volume of information in batch or real time through visually designed logic

Hundreds of Built-in Transformation Functions

Architects

Developers

WebSphere DataStage®

Deliver

Recommended Best Practices: Native Connectivity Software

Do you want to worry about which application or database you will have to connect to next?

Advice:

Go for pre-built connectors with little/no coding

Recommended Best Practices: Graphical ETL Tools

Benefits:

1. Jobs are easy to develop, understand, debug and maintain

2. Robust, fully-tested, best practices approach to data migration or extraction

Recommended Best Practices: Graphical ETL Tools

Benefits:

1. Complex transformations can be made very simple with mere point-and-click

Graphical Impact Analysis and Lineage Provide Trust

HTML View

Graphical Tree View

Path View

Spot the Differences: Easy

Pretty easy. Now try the next one.

Spot the Differences: Not so Easy

Not so easy….

Job Difference – Integrated report

Difference report displayed in Designer; jobs opened automatically from report hot links

Options available to:

– Print report
– Save report as HTML

Job, Table or Routine Difference

Tables

Available for Jobs, Tables & Routines

Textual report with hot links to the relevant editor in Designer.

HTML Job Report - sample

Header & job bitmap

Job properties

Link properties

Stage information

Strategy #6 – Implement a Highly Scalable Foundation

Two considerations in handling growth:

[Chart: Processing Time (Hours) vs. Number of Processors (1, 8, 16, 24, 32, ...)]

[Chart: Processing Throughput (Hundreds of Gigabytes) vs. Number of Processors (1, 8, 16, 24, 32, ...); throughput marks 1X, 8X, 16X, 24X, 32X, ...]

You want these

or

Not these

Strategy #6 – Implement a Highly Scalable Foundation

Three Elements of a Scalable Infrastructure

Scalable Database Platform

Database vendors have offered a scalable parallel relational database for more than 5 years.

Scalable Hardware Platform

Hardware vendors have offered scalable parallel computers for more than 5 years.

Scalable Data Integration Platform

Data integration vendors are starting to offer “scalable” “parallel” platforms

Data Partitioning

Break up big data into partitions

Automated partitioning based on needs of the input process

Run one partition on each processor

4X faster on 4 processors; 100X faster on 100 processors

This is exactly how the parallel databases work!
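The partitioning idea above can be sketched in a few lines. This is a minimal illustration, not the DataStage engine: the record layout, the `partition` and `total_amount` helpers and the modulo rule are assumptions, and Python threads are used only to show the structure (one worker per partition), since they do not give true multi-processor speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_partitions):
    """Split records into n_partitions lists by customer_id modulo n."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        partitions[record["customer_id"] % n_partitions].append(record)
    return partitions


def total_amount(partition_records):
    """Work that runs independently on a single partition."""
    return sum(record["amount"] for record in partition_records)


# Hypothetical input: eight records with customer_id 1..8
records = [{"customer_id": i, "amount": 10 * i} for i in range(1, 9)]
partitions = partition(records, 4)

# One worker per partition; results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_totals = list(pool.map(total_amount, partitions))

grand_total = sum(partial_totals)
print(partial_totals, grand_total)  # [120, 60, 80, 100] 360
```

Because each partition is processed without reference to the others, adding workers shortens the elapsed time as long as the partitions stay evenly sized.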

Data Flow Architecture: Data Pipelining

Think of a conveyor belt moving the records from process to process!

Transform, clean and load processes are executing simultaneously on the same processor
Records are moving forward through the flow
Eliminate the write to disk and the read from disk between processes
Start a downstream process while an upstream process is still running
This eliminates intermediate staging to disk, which is critical for big data
This also keeps the processors busy
Still have limits on scalability
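The conveyor-belt picture can be approximated with Python generators. This is a sketch under stated assumptions rather than the product's runtime: the stage names `extract`, `transform`, `clean` and `load` and the sample rows are hypothetical, and generators interleave the stages record by record on one thread instead of running them truly simultaneously. What it does show is that each record moves through every stage without being staged to disk in between.

```python
def extract(rows):
    """Source stage: hand records downstream one at a time."""
    for row in rows:
        yield row


def transform(rows):
    """Trim and upper-case the name field of each record."""
    for row in rows:
        yield {**row, "name": row["name"].strip().upper()}


def clean(rows):
    """Drop records whose name is empty after transformation."""
    for row in rows:
        if row["name"]:
            yield row


def load(rows, target):
    """Final stage: append each arriving record to the target."""
    for row in rows:
        target.append(row)
    return len(target)


source = [{"name": " alice "}, {"name": "   "}, {"name": "bob"}]
warehouse = []

# Each record flows through all stages before the next one is read;
# no intermediate list or file is materialized between stages.
loaded = load(clean(transform(extract(source))), warehouse)
print(loaded, warehouse)  # 2 [{'name': 'ALICE'}, {'name': 'BOB'}]
```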

In-flight Data Repartitioning

Record repartitioning occurs automatically

No need to repartition data manually when you:

– add processors
– change hardware architecture

Broad range of partitioning methods
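Repartitioning between two steps that need different keys might look like the sketch below. The helper names, the two keys and the modulo method are assumptions chosen for illustration; the presentation does not say which partitioning methods the product uses internally.

```python
def hash_partition(records, key, n_partitions):
    """Distribute records across n_partitions by key value modulo n."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        partitions[record[key] % n_partitions].append(record)
    return partitions


def repartition(partitions, key, n_partitions):
    """Redistribute already-partitioned records on a new key, in flight."""
    flowing = (record for part in partitions for record in part)
    return hash_partition(flowing, key, n_partitions)


# Hypothetical records
records = [
    {"customer_id": 1, "region_id": 10},
    {"customer_id": 2, "region_id": 11},
    {"customer_id": 3, "region_id": 10},
    {"customer_id": 4, "region_id": 12},
]

# Step 1 works per customer on 2 partitions;
# step 2 needs all records of a region together, on 3 partitions.
by_customer = hash_partition(records, "customer_id", 2)
by_region = repartition(by_customer, "region_id", 3)

print([[r["customer_id"] for r in part] for part in by_region])
# [[4], [1, 3], [2]]
```

The partition count is just a parameter here, which is the point of the slide: changing the number of processors should not require reworking the data or the job design.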

Recommended Best Practices: Parallelism

Application Execution: Sequential or Parallel

One application assembly: Source Data -> TRANSFORM -> ENRICH -> LOAD -> Data Warehouse

Auto parallel-enabled and parallel-aware run-time execution

Sequential: Uniprocessor

4-Way Parallel: SMP System

64-Way Parallel: MPP, GRID, and Clustered Systems

[Chart: Time to Process for Scan, Join and Sort, Serial vs. Parallel]

Recommended Best Practices: Parallelism

Make sure you get this

[Diagram: SMP systems, each with CPUs, shared memory and shared disk]

Not this

Scalable Performance

Benchmark: Scalable Data Integration Using WebSphere DataStage Enterprise Edition

[Chart: Rec./Sec. (0 to 100,000) vs. CPU/Node (2 to 24); 1:1 ratio, linear]

Note: Contact IBM for an audited Performance Benchmark Report.

Case Study – A major mobile operator in HK

Challenges:

The ETL process took around 11-12 hours to finish and failed to meet the service level

Stored procedures were used for the transformation logic:

– Difficult to develop, maintain and tune for performance
– Performance is unpredictable and does not scale linearly with data volume growth
– Many staging areas and heavy I/O reads and writes, which slow down the whole process
– Many indexes to build and maintain, which increases maintenance effort and complexity

Stored Proc and DS Comparison

Process A

Data Size   Time (DataStage EE)   Time (Stored Procedure)
30000       2m 30s                2m 36s
60000       2m 21s                11m 54s
100008      2m 48s                1hr 5m
1000008     6m 22s                6hrs 32m

[Chart: Product Arrangement; Time (min) vs. Source Data Size; series: Stored Procedure, DataStage EE]

Process B

Data Size   Time (DataStage EE)   Time (Stored Procedure)
1500000     27m 54s               6hrs 48m
3000000     1hr 11m               -

[Chart: DIA Daily MRT; Time (min) vs. Source Data Size (k); series: Stored Procedure, DataStage EE]
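To put the Process A figures in perspective, the elapsed times can be converted to seconds and compared. The sketch assumes that the largest run (1000008 records) pairs 6hrs 32m for the stored procedure with 6m 22s for DataStage EE, which is how the flattened table reads; the helper `to_seconds` is hypothetical.

```python
import re


def to_seconds(duration: str) -> int:
    """Parse strings such as '6hrs 32m' or '6m 22s' into seconds."""
    units = {"hrs": 3600, "hr": 3600, "m": 60, "s": 1}
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(hrs|hr|m|s)", duration):
        total += int(value) * units[unit]
    return total


stored_procedure = to_seconds("6hrs 32m")  # 23520 seconds
datastage_ee = to_seconds("6m 22s")        # 382 seconds

speedup = stored_procedure / datastage_ee
print(f"{speedup:.1f}x")  # 61.6x
```

At the smallest data size (2m 36s against 2m 30s) the same calculation gives a ratio close to 1, so the gap only opens up as volume grows.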

Stored Proc and DS Comparison (continued)

Transaction Cellular

Data Size   Time (DataStage EE)   Time (Stored Procedure)
30000       33m 26s               6m
60000       35m 31s               12m
100000      35m 10s               21m
1000000     42m 49s               > 3.5hrs

[Chart: Transaction Cellular; Time (min) vs. Source Data Size; series: Stored Procedure, DataStage EE]

Volume Test Result

Process C

Data Size (number of records)   Time (2-CPU)   Time (4-CPU)
30000                           16m 31s        9m 16s
60000                           17m 8s         11m 21s
100000                          18m 45s        11m
1000000                         35m 15s        24m 39s

Process D

Data Size   Time (2-CPU)   Time (4-CPU)
30000       4m 49s         2m 28s
60000       4m 41s         2m 18s
100008      3m 55s         2m 33s
1000008     4m 51s         2m 42s

[Chart: Product Arrangement; Time (min) vs. Source Data Size; series: 2-CPU, 4-CPU]

[Chart: Transaction Cellular; Time (min) vs. Source Data Size; series: 2-CPU, 4-CPU]

Case Study – A major mobile operator in HK

Benefits:

The graphical interface makes jobs easier to develop and maintain, reducing effort by 50% on average.

Faster response to requirement changes

No need to maintain extra table indexes and staging tables, which removes many database administration tasks

DataStage EE performance remains almost constant when data volume grows, while the time to complete with stored procedures increases substantially.

DataStage EE scales linearly, which makes hardware sizing easy to predict as data volume grows
