© 2006 IBM Corporation
Information On Demand Conference 2007
High Performance Data Transformation
Simon Tang
Manager, Technical Sales, GCG
Information On Demand
Information Management
IBM Software Group | Lotus software
Quiz Time
What do all these companies have in common?
US$10 billion Retailer migrating and consolidating financial data into Oracle Financials
Reduced projected 2,700-day manual effort to 217 days
Saved US$2 million
US$4.5 billion global Chemicals Company consolidating 13 SAP instances into 1 global instance
Would save US$37 million in annual operating costs
US$45 billion Manufacturer consolidating more than 3,300 legacy software applications
down to 400 while reducing IT staff by 50%
Quiz Time: Answers
Raw, disparate data and disconnected systems
Enterprise Data Integration
Business Results that drive revenue and lower costs
Happened despite pouring hundreds of millions of $ into
new ERP, CRM, SCB, BI, BPM and DW systems
Challenges in Data Management
Inconsistent islands of information underlying applications
Complex, manual & costly copy synchronization
Inconsistent and poor quality data
Inability to exploit enterprise meta data across tools
Touching data multiple times at its source – storing multiple times and updating multiple times
Inability to share common business rules across projects, processes and applications
Lack of a single, repeatable methodology for consistency across all projects
[Diagram: CRM, Order Proc, Supply Chain, Procurement]
Different databases and systems to connect
DB2, Informix, ODBC, Oracle,
Red Brick, SAS, Sybase,
Teradata, etc
Adabas, Allbase/SQL, Datacom/DB,
DB2/400, DB2/OS390,
Essbase, FOCUS,
IDMS/SQL, IMS, NonStopSQL,
RDB, VSAM, etc
WebSphere MQ, SeeBeyond, JMS, XML, EJB, Web Services, EXML, XMLS, EDI, SWIFT, etc
Oracle Applications, PeopleSoft, SAP R/3,
SAP BW, Siebel
Complex and long Hand-coding
Visual Basic, Java, C++ and UNIX code can be developed cheaply, and it works …
… but what happens when there is a new source or requirement?
Cheap? Works? Maybe not.
Fast growing data volume
Source: “Surviving the Perfect Storm in Data Management” DM Review, January 2001
Prediction: Your data volume is not going to get smaller
The IBM Solution: IBM Information Server
Delivering information you can trust
Understand: Discover, model, and govern information structure and content
Cleanse: Standardize, merge, and correct information
Transform: Combine and restructure information for new uses
Deliver: Synchronize, virtualize and move information for in-line delivery
Parallel Processing
Rich Connectivity to Applications, Data, and Content
Unified Deployment
Unified Metadata Management
IBM Information Server Architecture
UNIFIED USER INTERFACE: Analysis Interface, Development Interface, Web Admin Interface
COMMON SERVICES: Metadata Services, Unified Service Deployment, Security Services, Logging & Reporting Services
UNIFIED METADATA: Design, Operational
UNIFIED PARALLEL PROCESSING: Understand, Cleanse, Transform, Deliver
COMMON CONNECTIVITY: Structured, Unstructured, Applications, Mainframe
Global 2000 Profiting from Intelligent Information
Logistics
Asia Pacific Client List (Partial)
Hong Kong Customer Lists
Banking: Bank of China, HSBC, Standard Chartered, Bank of America, AIG Credit Card
Telecommunication: Smartone, CSL, Sunday, Hutchison, New World
Governments: Health Department
Others: IKEA, OOCL, TOM Group
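Major Customers in China
China Mobile: business decision analysis; network management optimization
Shanghai, Guangdong, Zhejiang, Jiangsu, Tianjin, Jiangxi, Fujian, Shandong, Jilin, Liaoning, Anhui, Guizhou, Sichuan, Yunnan, Chongqing, head office and others (20 of 31 users)
China Unicom: operations support
Shanghai, Shaanxi, Beijing, Zhejiang, Liaoning Unicom ...
CNC: Tianjin; Telecom ...
Large enterprises
China Ocean Shipping (container shipping)
Shanghai General Motors
– Central data storage; ODS
– SAP; Oracle; Siebel; XML ...
Siemens Suzhou; TOM.COM
Nanyang Tobacco ...
Bank of China head office; Guangdong branch; Bank of China (Hong Kong): data warehouse
Bank of Communications head office
China Construction Bank branches: historical data analysis, integrated customer information analysis
– CCB Hainan, CCB Sichuan
– CCB Hebei, CCB Shaanxi
Pudong Development Bank
SAP ERP and SWIFT real-time integration
China Everbright Bank: international business, credit analysis
China Minsheng Bank
Enterprise-level centralized data platform, SAP ERP
Rural Credit Cooperatives, Ping An Insurance ...
And many more ...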
Physical Metadata: WebSphere Information Analyzer
Data-centric analysis of application, database and file-based sources
Secure, detailed profiling of fields, across fields, and across sources
Creation of metadata from profiling results
Results instantly promotable across IBM Information Server
Understand
Analyze source data structures, and monitor adherence to integration and quality rules
WebSphere Information Analyzer
Data Analysts
Subject Matter Experts
Physical View
Business Metadata: WebSphere Business Glossary
Web-based authoring, managing & sharing of business metadata
Aligns the efforts of IT with the goals of the business
Provides business context to information technology assets
Establishes responsibility and accountability
Understand
Subject Matter Experts
Create and manage business vocabulary and relationships, while
linking to physical sources
WebSphere Business Glossary
Business Users
Business View
GL Account Number
The ten-digit account number. Sometimes referred to as the account ID. This value is of the form L-FIIIIVVVV.
Database = DB2
Schema = NAACCT
Table = DLYTRANS
Column = ACCT_NO
data type = char(11)
Technical Business
Why Should I Care About Cleansing Information?
Lack of information standards
– Different formats & structures across different systems
Data surprises in individual fields
– Data misplaced in the database
Information buried in free-form fields
Data myopia
– Lack of consistent identifiers inhibits a single view
The redundancy nightmare
– Duplicate records with a lack of standards
Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116
Name                      Tax ID         Telephone
J Smith DBA Lime Cons.    228-02-1975    6173380300
Williams & Co. C/O Bill   025-37-1888    415-392-2000
1st Natl Provident        34-2671434     3380321
HP 15 State St.           508-466-1200   Orlando
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 6 Foot Cable
90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106
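To make the "redundancy nightmare" concrete, here is a minimal Python sketch of a crude match key (surname plus 5-digit postal code) applied to the three Roberts records shown earlier on this slide. The function name `match_key` and the choice of key are assumptions made for illustration only; this is not how QualityStage matches records, and real matching uses far richer standardization and scoring.

```python
import re

def match_key(name: str, address: str) -> tuple[str, str]:
    """Build a crude match key: (upper-cased surname, 5-digit postal code).

    Hypothetical helper for illustration; not a product API.
    """
    # Drop punctuation such as the dots in "Mrs. K. Roberts", then take the last token.
    tokens = re.sub(r"[^A-Za-z ]", "", name).split()
    surname = tokens[-1].upper() if tokens else ""
    # Take the first standalone 5-digit number as the postal code.
    found = re.search(r"\b(\d{5})\b", address)
    return surname, found.group(1) if found else ""

# The three differently formatted records from the slide.
records = [
    ("Kate A. Roberts", "416 Columbus Ave #2, Boston, Mass 02116"),
    ("Catherine Roberts", "Four sixteen Columbus APT2, Boston, MA 02116"),
    ("Mrs. K. Roberts", "416 Columbus Suite #2, Suffolk County 02116"),
]

keys = {match_key(name, address) for name, address in records}
print(keys)
```

All three records collapse to the same key, which flags them as candidates for a single customer. A key this coarse would also group unrelated people who share a surname and postal code, which is why a second, finer comparison step is normally needed.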
Specialized data quality functions seamlessly integrated with DataStage
Visual tools for defining complex matching and survivorship logic
Ensures clean, standardized, de-duplicated information
Enables a single version of the truth
Cleanse
Subject Matter Experts
Standardize and correct source data fields, and match records together
across sources to create a single view
QualityStage™
Visual Match Rule Design
Data Analysts
Data Cleansing: WebSphere QualityStage
Data Transformation & Movement: WebSphere DataStage
Codeless visual design of data flows with hundreds of built-in transformation functions
Optimized reuse of data integration objects
Leverages parallel processing without requiring design changes
Capable of supporting batch and real-time operations
Transform
Transform and aggregate any volume of information in batch or real time through visually designed logic
Hundreds of Built-in Transformation Functions
Architects
Developers
WebSphere DataStage®
Deliver
Recommended Best Practices: Native Connectivity Software
Do you want to worry about which application or database you will need to connect to next?
Advice:
Go for pre-built connectors with little/no coding
Recommended Best Practices: Graphical ETL Tools
Benefits:
1. Jobs are easy to develop, understand, debug and maintain
2. Robust, fully-tested, best practices approach to data migration or extraction
Recommended Best Practices: Graphical ETL Tools
Benefits:
1. Complex transformations can be made very simple with mere point-and-click
Graphical Impact Analysis and Lineage Provide Trust
HTML View
Graphical Tree View
Path View
Spot the Differences: Easy
Pretty easy. Now try the next one.
Spot the Differences: Not so Easy
Not so easy….
Job Difference – Integrated report
Difference report displayed in Designer - jobs opened automatically from report hot links
Options available to:
- Print report
- Save report as HTML
Job, Table or Routine Difference
Tables
Available for Jobs, Tables & Routines
Textual report with hot links to the relevant editor in Designer.
HTML Job Report - sample
Header & job bitmap
Job properties
Link properties
Stage information
Strategy #6 – Implement a Highly Scalable Foundation
2 considerations in handling growth:
[Chart: Processing Time (Hours) vs. Number of Processors (1, 8, 16, 24, 32, ...)]
[Chart: Processing Throughput (Hundreds of Gigabytes) vs. Number of Processors (1, 8, 16, 24, 32, ...), scaling 1X, 8X, 16X, 24X, 32X, ...]
You want these
or
Not these
Strategy #6 – Implement a Highly Scalable Foundation
Three Elements of a Scalable Infrastructure
Scalable Database Platform
Database vendors have offered a scalable parallel relational database for more than 5 years.
Scalable Hardware Platform
Hardware vendors have offered scalable parallel computers for more than 5 years.
Scalable Data Integration Platform
Data integration vendors are starting to offer “scalable” “parallel” platforms
Data Partitioning
Break up big data into partitions
Automated partitioning based on needs of the input process
Run one partition on each processor
4X faster on 4 processors; 100X faster on 100 processors
This is exactly how the parallel databases work!
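The partitioning idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DataStage engine: the function names, the `amount` field and the round-robin split are all hypothetical, and a thread pool stands in for "one partition per processor" (CPython threads do not give true CPU parallelism, so a real engine would use separate processes or nodes).

```python
from concurrent.futures import ThreadPoolExecutor

def round_robin_partition(records, n_partitions):
    """Deal records out evenly, like cards, into n_partitions lists."""
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        partitions[i % n_partitions].append(rec)
    return partitions

def total_amount(partition):
    """The per-partition work: here, a simple sum over one hypothetical field."""
    return sum(rec["amount"] for rec in partition)

def run_partitioned(records, n_partitions):
    """Split the data, run the same logic on every partition, then combine."""
    partitions = round_robin_partition(records, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_totals = list(pool.map(total_amount, partitions))
    return sum(partial_totals)

# Hypothetical input: 100 records with amounts 0..99.
records = [{"id": i, "amount": i} for i in range(100)]
print(run_partitioned(records, 4))
```

The job logic (`total_amount`) is written once and does not change when the partition count changes; only the `n_partitions` argument does. The 4X and 100X figures on the slide are the ideal case, in which the work splits evenly and combining the partial results costs little.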
Data Flow Architecture: Data Pipelining
Think of a conveyor belt moving the records from process to process!
Transform, clean and load processes are executing simultaneously on the same processor
Records are moving forward through the flow
Eliminate the write to disk and the read from disk between processes
Start a downstream process while an upstream process is still running
This eliminates intermediate staging to disk, which is critical for big data
This also keeps the processors busy
Still have limits on scalability
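The conveyor-belt picture maps naturally onto Python generators. The sketch below is an analogy only: the stage names and the `name` field are hypothetical, and generators interleave stages on one thread rather than running them simultaneously. What it does show faithfully is that each record flows through transform, clean and load without any intermediate staging file.

```python
def extract(rows):
    """Source stage: hand records downstream one at a time."""
    for row in rows:
        yield row

def transform(records):
    """Transform stage: trim and upper-case the hypothetical 'name' field."""
    for rec in records:
        yield {**rec, "name": rec["name"].strip().upper()}

def clean(records):
    """Clean stage: drop records whose name is empty after trimming."""
    for rec in records:
        if rec["name"]:
            yield rec

def load(records, target):
    """Load stage: append each record to the target as soon as it arrives."""
    for rec in records:
        target.append(rec)
    return len(target)

source = [{"name": " kate "}, {"name": "   "}, {"name": "bill"}]
warehouse = []
loaded = load(clean(transform(extract(source))), warehouse)
print(loaded, warehouse)
```

The first record reaches `load` before the second has been read from the source, so no stage waits for an upstream stage to finish the whole data set.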
In-flight Data Repartitioning
Record repartitioning occurs automatically
No need to repartition data as you
– add processors
– change hardware architecture
Broad range of partitioning methods
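As a rough illustration of repartitioning between stages, the Python sketch below takes records that arrive in arbitrary partitions and redistributes them by a hash of a key, so that all records with the same key land in the same partition for a downstream aggregation. The function name, the `region` field and the use of CRC32 as the hash are assumptions made for this example; they are not the product's partitioning methods.

```python
import zlib

def repartition_by_key(partitions, key, n_partitions):
    """Redistribute records so equal key values share a partition.

    Hypothetical helper; CRC32 is used because it is stable across runs.
    """
    out = [[] for _ in range(n_partitions)]
    for partition in partitions:
        for rec in partition:
            idx = zlib.crc32(str(rec[key]).encode("utf-8")) % n_partitions
            out[idx].append(rec)
    return out

# Hypothetical upstream output: two partitions with regions mixed together.
upstream = [
    [{"region": "HK", "amount": 5}, {"region": "SH", "amount": 7}],
    [{"region": "HK", "amount": 1}, {"region": "BJ", "amount": 2}],
]

# Move from 2 partitions to 3 while the records are in flight.
downstream = repartition_by_key(upstream, "region", 3)
for i, part in enumerate(downstream):
    print(i, part)
```

Changing the partition count is just a different `n_partitions` value; the source data itself never has to be reorganized on disk.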
Recommended Best Practices: Parallelism
Application Execution: Sequential or Parallel
One application assembly
Auto parallel-enabled and parallel-aware run-time execution
Sequential: Uniprocessor
4-Way Parallel: SMP System
64-Way Parallel: MPP, GRID, and Clustered Systems
[Diagram: Source Data, then TRANSFORM, ENRICH, LOAD, then Data Warehouse]
[Chart: time to process for scan, join and sort, serial vs. parallel]
Recommended Best Practices: Parallelism
Make sure you get this
[Diagrams: groups of SMP systems, each with its own CPUs, shared memory and shared disk]
Not this
Scalable Performance
Benchmark: Scalable Data Integration Using WebSphere DataStage Enterprise Edition
[Chart: records per second (0 to 100,000) vs. CPUs/nodes (2 to 24), with a 1:1 ratio linear reference line]
Note: Contact IBM for an audited Performance Benchmark Report.
Case Study – A major mobile operator in HK
Challenges:
The ETL process took around 11-12 hours to finish and failed to meet the service level
Stored procedures were used for the transformation logic
– Difficult to develop, maintain and tune for performance
– Performance is unpredictable and does not scale linearly with data volume growth
– Many staging areas and heavy read/write I/O, which slow down the performance of the whole process
– Many indexes to build and maintain, which increases the effort and complexity of maintenance
Stored Proc and DS Comparison
Process A
Data Size   Time (Stored Procedure)   Time (DataStage EE)
30000       2m 36s                    2m 30s
60000       11m 54s                   2m 21s
100008      1hr 5m                    2m 48s
1000008     6hrs 32m                  6m 22s
[Chart: Product Arrangement, time (min) vs. source data size, Stored Procedure vs. DataStage EE]
Process B
Data Size   Time (Stored Procedure)   Time (DataStage EE)
1500000     6hrs 48m                  27m 54s
3000000     -                         1hr 11m
[Chart: DIA Daily MRT, time (min) vs. source data size (k), Stored Procedure vs. DataStage EE]
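The Process A timings above can be turned into speedup ratios with a short Python sketch. The figures are copied from the slide; the helper names `parse_duration` and `speedups` are hypothetical and exist only for this illustration.

```python
import re

# Seconds per unit for the duration notation used on the slide ("6hrs 32m", "11m 54s").
UNIT_SECONDS = {"hr": 3600, "hrs": 3600, "m": 60, "s": 1}

def parse_duration(text: str) -> int:
    """Convert a duration such as '1hr 5m' or '2m 36s' to seconds."""
    return sum(int(value) * UNIT_SECONDS[unit]
               for value, unit in re.findall(r"(\d+)\s*(hrs?|m|s)", text))

# (data size, stored procedure time, DataStage EE time) for Process A.
PROCESS_A = [
    (30000, "2m 36s", "2m 30s"),
    (60000, "11m 54s", "2m 21s"),
    (100008, "1hr 5m", "2m 48s"),
    (1000008, "6hrs 32m", "6m 22s"),
]

def speedups(rows):
    """Return (data size, stored procedure time / DataStage EE time) pairs."""
    return [(size, parse_duration(sp) / parse_duration(ds)) for size, sp, ds in rows]

for size, ratio in speedups(PROCESS_A):
    print(f"{size:>8} records: {ratio:.1f}x")
```

On these figures the two approaches are about level at 30,000 records, and the gap widens to roughly 62x at about one million records, because the stored procedure time grows much faster than the DataStage EE time.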
Stored Proc and DS Comparison (continued)
Transaction Cellular
Data Size   Time (Stored Procedure)   Time (DataStage EE)
30000       6m                        33m 26s
60000       12m                       35m 31s
100000      21m                       35m 10s
1000000     > 3.5hrs                  42m 49s
[Chart: Transaction Cellular, time (min) vs. source data size, Stored Procedure vs. DataStage EE]
Volume Test Result
Process C
Data Size (number of records)   Time (2-CPU)   Time (4-CPU)
30000                           16m 31s        9m 16s
60000                           17m 8s         11m 21s
100000                          18m 45s        11m
1000000                         35m 15s        24m 39s
Process D
Data Size   Time (2-CPU)   Time (4-CPU)
30000       4m 49s         2m 28s
60000       4m 41s         2m 18s
100008      3m 55s         2m 33s
1000008     4m 51s         2m 42s
[Chart: Product Arrangement, time (min, 0 to 6) vs. source data size, 2-CPU vs. 4-CPU]
[Chart: Transaction Cellular, time (min, 0 to 40) vs. source data size, 2-CPU vs. 4-CPU]
Case Study – A major mobile operator in HK
Benefits:
The GUI makes jobs easier to develop and maintain, reducing effort by 50% on average
Faster response to requirement changes
No need to maintain extra table indexes and staging tables, which removes many database administration tasks
DataStage EE performance remains almost constant as data volume grows, while the time to complete with stored procedures increases substantially
DataStage EE scales linearly, which makes it easy to predict hardware sizing as data volume grows