An Effective Data Integration: Strategy to Drive Innovation on the InfoSphere Platform
Simon Tang, InfoSphere Technical Manager, IBM GCG
Pain Point: Understanding Core Information Assets
“How can I see how this is used?” - Governance Steward
“What systems will be impacted by this change?” - DBA
“I’m not sure what the business wants.” - Developer
“This data does not look right.” - Business User
“I don’t have the information I need.” - Business Analyst
“We are not leveraging our information.” - Architect
Impact of NOT Managing Core Information Assets
Inaccurate or incomplete data is a leading cause of failure in
business-intelligence and CRM projects
83% of data integration projects either overrun or fail
Low data quality costs companies $611 billion annually
Undetected defects will cost 10 to 100 times as much to fix upstream
25% of time is spent clarifying bad data
Lack of consumer confidence
Lost opportunities
Scrap and rework
Increased $$$
Who Is Looking for Trusted Information?
Target Audience
• Data/Business Analysts
• Subject Matter Experts
• Architects
• Governance Stewards
What are they working on?
• Information-centric projects:
• BI & Data Warehousing
• Master Data Management
• Application Implementation, Consolidation or Migration
• Information Architecture
• Governance Initiatives
What do these roles do today?
• Manage information manually in disconnected tools, documents, and spreadsheets
What is wrong with what they do today?
• Time consuming – churn between business & IT
• Imprecise & error prone – manual processes not thorough enough
• No collaboration – different roles work in silos
• Lacks audit trail – no ongoing record
• Redundancy – duplication of effort & storage
[Diagram: Subject Matter Experts, Governance Stewards, Data/Business Analysts and Architects all rely on Trusted Information]
Trusted Information
1. Accurate
2. Complete
3. Insightful
4. Real Time
A Flexible Platform for Managing, Integrating, Analyzing and Governing Information
[Diagram: platform capabilities Manage, Integrate, Analyze and Govern (Quality, Security & Privacy, Lifecycle), connecting Transactional & Collaborative Applications, Business Analytics Applications, External Information Sources, Cubes, Streams, Big Data, Master Data, Content, Data, Streaming Information and Data Warehouses]
Challenges in Data Management
• Inconsistent islands of information underlying applications
• Complex, manual & costly copy synchronization
• Inconsistent and poor quality data
• Inability to exploit enterprise meta data across tools
• Touching data multiple times at its source – storing multiple times and updating multiple times
• Inability to share common business rules across projects, processes and applications
• Lack of a single, repeatable methodology for consistency across all projects
[Diagram: separate application silos - CRM, Order Processing, Supply Chain, Procurement]
Convert information into a trusted strategic asset
• Discover and understand the data across heterogeneous systems
• Design trusted information structures for business optimization
• Govern that information over time
Only IBM has invested to provide the breadth of capabilities to define and govern your information…
• Business Vocabulary
• Data Relationships
• Data Quality Compliance
• Data Models and Mapping
• Business Specification Rules
• Provenance of information
Remedy: 10 Proven Strategies
Consider where your organization’s most SIGNIFICANT data pain exists, and take that approach first.
No single path is THE panacea for all corporate data problems; multiple approaches must be employed.
Strategy #1 – Understand Source Systems
Business Analysis
Data Analysis
1. Discover actual characteristics of data
2. Verify if characteristics of data conform to established / known business rules
3. Report on the assessment and variances / exceptions
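The three data analysis steps above can be sketched in a few lines. This is a minimal illustration, not how any profiling product works: the function names, the sample ZIP codes and the "five digits" business rule are all assumptions made for the example.

```python
from collections import Counter

def profile_column(values):
    """Step 1: discover basic characteristics of one column."""
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "lengths": Counter(len(str(v)) for v in present),
    }

def check_rule(values, rule):
    """Step 2: return (row_index, value) for every value that breaks the rule."""
    return [(i, v) for i, v in enumerate(values) if not rule(v)]

# Hypothetical sample column and business rule ("a ZIP code is five digits").
zip_codes = ["01456", "01456", "04156", "", "2210"]
profile = profile_column(zip_codes)
exceptions = check_rule(zip_codes, lambda v: len(v) == 5 and v.isdigit())

# Step 3: report on the assessment and the exceptions.
print(profile)
print(exceptions)
```

Keeping discovery (`profile_column`) separate from verification (`check_rule`) mirrors the slide: the profile shows what the data actually looks like, and the rule check reports where it departs from what the business expects.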
Strategy #1 – Understand Source Systems
• Poor data quality costs U.S. businesses over $600 billion each year
• Data deteriorates up to 3% every month
• What is the key to integrating corporate data? Having the right data before you start
[Bar chart, scale 0 to 100, of data integration challenges: Ensuring adequate data quality; Understanding source data; Creating complex transformations; Creating complex mappings; Ensuring adequate performance; Collecting and maintaining meta data; Finding skilled programmers; Providing access to meta data; Ensuring adequate scalability; Integrating 3rd party tools; Ensuring adequate reliability]
Recommended Best Practices: Automated Data Profiling
No coding
Advice: You won’t have the time, money or energy to profile 100% of your data quickly by hand, so go automated
[Diagram: profiling across Source 1 and Source 2 - Column Analysis, Table & Primary Key Analysis, Foreign Key & Duplicate Analysis]
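As a rough illustration of what primary key, duplicate and foreign key analysis check, here is a small sketch over two in-memory sources. The table layouts, column names and helper functions are hypothetical; an automated profiling tool performs the same kinds of checks without coding.

```python
def is_candidate_key(rows, column):
    """Primary key analysis: a key column has no missing and no repeated values."""
    values = [row[column] for row in rows]
    return None not in values and len(set(values)) == len(values)

def duplicates(rows, column):
    """Duplicate analysis: values that occur more than once in a column."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[column]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

def orphans(child_rows, fk, parent_rows, pk):
    """Foreign key analysis: child rows whose key has no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]

# Hypothetical Source 1 (customers) and Source 2 (orders).
customers = [{"id": 1, "name": "IBM"}, {"id": 2, "name": "I.B.M. Inc."}, {"id": 3, "name": "IBM"}]
orders = [{"order": 10, "customer_id": 1}, {"order": 11, "customer_id": 4}]

print(is_candidate_key(customers, "id"))
print(duplicates(customers, "name"))
print(orphans(orders, "customer_id", customers, "id"))
```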
Strategy #2 – Build In Data Quality
• Same company / person?
• Same address?
• Same parts?
• Same instructions?
NAME | ADDRESS
IBM | 187 N. Pk. Str. Salem NH 01456
I.B.M. Inc. | 187 N. Pk. St. Sarem NH 01456
International Bus. M. | 187 No. Park St Salem NH 04156
Int. Bus. Machines | 187 Park Ave Salem NH 01456
Inter-Nation Consult. | 15 Main St. Andover MA 02341
Int. Bus. Consultants | PO Box 9 Boston MA 02210
I.B. Manufacturing | Park Blvd. Boston MA 04106
PART DESCRIPTION
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT ¼ INCH
WING ASSEMBLY, USE 5J868-A HEX BOLT .25” – DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) – DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 HOLES, SECURE W/KL 2301 RIVETS (10 CM)
[Callouts on the part descriptions: Spelling Errors; Lack of Standards in Synonyms, Acronyms, Abbreviations; Error Codes?; Assembly; Part Size; Instruction]
Recommended Best Practices: Data Cleansing
Data Re-Engineering
Original:
Blk 1, 1 St, 05-00
05-00 Frist St, Block 1
1 First Str, #05-00
Block 1, First Str, #05-00
1, St, #05-00
Standardize:
Building | Street | Unit
Blk 1 | First St | 05-00
Blk 1 | First St | 05-00
1 | First St | #05-00
Blk 1 | First St | #05-00
1 | St | #05-00
Match and Survive (Final Result):
#05-00, Blk 1, First St
#05-00, 1, St
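The standardize, match and survive stages can be sketched as follows. This is a simplified illustration under stated assumptions: the lookup table, the parsing rules and the "keep the most complete building" survivorship rule are invented for the example and do not reproduce the rule sets of a real cleansing tool or every row on the slide.

```python
import re

# Hypothetical word fixes; real cleansing tools ship far richer rule sets.
WORD_FIXES = {"block": "Blk", "blk": "Blk", "str": "St", "st": "St", "frist": "First"}
UNIT = re.compile(r"#?(\d{2}-\d{2})$")

def standardize(raw):
    """Parse a free-form address into building, street and unit fields."""
    building, street, unit = [], [], ""
    for token in re.split(r"[,\s]+", raw.strip()):
        unit_match = UNIT.match(token)
        if unit_match:
            unit = "#" + unit_match.group(1)
        elif token.isdigit() or token.lower() in ("blk", "block"):
            building.append(WORD_FIXES.get(token.lower(), token))
        else:
            street.append(WORD_FIXES.get(token.lower(), token.capitalize()))
    return {"building": " ".join(building), "street": " ".join(street), "unit": unit}

def match_key(record):
    """Records match when building number, street and unit agree."""
    return (re.sub(r"\D", "", record["building"]), record["street"], record["unit"])

def survive(records):
    """Keep one record per match group: the one with the most complete building."""
    groups = {}
    for record in records:
        groups.setdefault(match_key(record), []).append(record)
    return [max(group, key=lambda r: len(r["building"])) for group in groups.values()]

raw_addresses = ["05-00 Frist St, Block 1", "Block 1, First Str, #05-00", "1 First Str, #05-00"]
survivors = survive([standardize(a) for a in raw_addresses])
print(survivors)
```

The three input spellings collapse to a single surviving record because matching happens on the standardized fields rather than on the raw strings.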
Strategy #3 – Share Common Meta Data
From Data Model: Customer (CustomerNumber, Name, Address, Comments)
From ETL Tool: CustomerTbl (CustomerID, Name, Address, Address1, Comments)
From BI Tool: CustomerDetails (CustomerNumber, Name, Address, Remarks)
From Database: CustomerID, Name, Address1, Address2, Descr
Definitions of the customer identifier found across these sources:
• The Identifier of customers that are tracked for ordering purposes. Corporate customer identifiers are assigned by the Sales Data Controller according to the corporate data description and naming policy for reference identifiers.
• Unique identifier of customers that are tracked for ordering purposes. Values start with 02 for non-Corporate customers and 01 for Corporate customers.
• <NULL>
• Customer’s identifier numbers. Values start with 01 for Corporate customers, 02 for non-Corporate customers, 03 for overseas-based Customers.
Which meta data is right?
Which one is current?
Which one should be used?
Recommended Best Practices: Create a common repository
Integrated Meta Data Repository
Integrate by gathering in from diverse applications and sources:
• Modeling tool
• BI tool and BI Repository
• COBOL definition files
• Other sources’ definition files
• ETL Tool + Processes
Shared Metadata Server & Repository
Category: Costs
Term: Tax Expense
Full Name: Tax to be paid on Gross Income
“The expense due to taxes …..”
(John Walsh is responsible for updates. 90% reliable source)
Status: CURRENT
Database = DB2
Schema = NAACCT
Table = DLYTRANS
Column = TAXVL
data type = Decimal (14,2)
Derivation: SUM(TRNTXAMT)
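One way to picture the link between a business term and its physical column is as a small record structure. The sketch below simply restates the Tax Expense example above; the class and field names are assumptions for illustration and are not the repository's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalColumn:
    database: str
    schema: str
    table: str
    column: str
    data_type: str
    derivation: str

@dataclass(frozen=True)
class BusinessTerm:
    category: str
    term: str
    full_name: str
    status: str
    steward: str
    column: PhysicalColumn

    def qualified_column(self):
        """Schema-qualified name of the column that implements this term."""
        c = self.column
        return f"{c.schema}.{c.table}.{c.column}"

tax_expense = BusinessTerm(
    category="Costs",
    term="Tax Expense",
    full_name="Tax to be paid on Gross Income",
    status="CURRENT",
    steward="John Walsh",
    column=PhysicalColumn("DB2", "NAACCT", "DLYTRANS", "TAXVL", "Decimal (14,2)", "SUM(TRNTXAMT)"),
)

# A glossary keyed by term lets business and technical users look up the same entry.
glossary = {tax_expense.term: tax_expense}
print(glossary["Tax Expense"].qualified_column())
```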
Achieve a common vocabulary between business & technical users!
InfoSphere DataStage InfoSphere Business Glossary
Create a Common Vocabulary
GL Organizational Unit
STEWARD: Controllers Office
FORMAT: X(7)
DEFINITION: A seven digit number designating the organizational unit to which this account belongs.
Annotation: “I’ve noticed that the last two digits of the GL Organizational Unit, which indicate the sub-department, are often blank.”
Author Standard Definitions
Annotate and Share Feedback
Collaborate and Share Feedback
• Categorize Information Assets according to Business Logic
• Map Business Terms to Information Assets
• Find and view relevant details of Information Assets
• View the Stewardship of Information Assets
Extend Business Information
Where does a Field of Data in this Report Come From?
• Import & Browse Full BI Report Metadata
• Navigate through report attributes
• Visually navigate through data lineage across tools
• Combines operational & design viewpoint
IBM Confidential
Metadata Lineage available from Studio & Viewers
Access Business Glossary from Cognos Studios
Strategy #4 – Connect to Any System, Anywhere
• DB2, Informix, Netezza, ODBC, Oracle, Red Brick, SAS, Sybase, Teradata, etc.
• Adabas, Allbase/SQL, Datacom/DB, DB2/400, DB2/OS390, Essbase, FOCUS, IDMS/SQL, IMS, NonStopSQL, RDB, VSAM, etc.
• WebSphere MQ, SeeBeyond, JMS, XML, EJB, Web Services, EXML, XMLS, EDI, SWIFT, etc.
• Oracle Applications, PeopleSoft, SAP R/3, SAP BW, Siebel
Recommended Best Practices: Native Connectivity Software
Do you want to worry about which application or database you will have to connect to next?
Advice:
Go for pre-built connectors with little/no coding
Strategy #5 – Abandon Hand-coding
Hand-written Visual Basic, Java, C++ and UNIX code can be developed cheaply, and it works …
… but what happens when there is a new source or requirement?
Cheap? Works? Maybe not.
Recommended Best Practices: Graphical ETL Tools
Benefits:
1. Jobs are easy to develop, understand, debug and maintain
2. Robust, fully-tested, best practices approach to data migration or extraction
Recommended Best Practices: Graphical ETL Tools
Benefits:
1. Complex transformations can be made very simple with mere point-and-click
Workflow Process - Sequences
• Workflow is as important as dataflow.
• Dynamic workflow processes can be defined during the workflow itself.
• DataStage can run external processes and perform complex evaluations inline.
• Advanced concepts such as looping are supported.
[Monitoring charts: Physical Machine Utilization; Disk Throughput; Average Process Distribution; Percent CPU Utilization; Free Memory Whisker Box]
Strategy #6 – Implement a Highly Scalable Foundation
Prediction: Your data volume is not going to get smaller
44x as much Data and Content Over Coming Decade
2009: 800,000 petabytes
2020: 35 zettabytes
Strategy #6 – Implement a Highly Scalable Foundation
2 considerations in handling growth:
[Chart: Processing Time (Hours) against Number of Processors (1, 8, 16, 24, 32, ...)]
[Chart: Processing Throughput (Hundreds of Gigabytes), 1X to 32X, against Number of Processors (1, 8, 16, 24, 32, ...)]
You want these, not these
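The two scaling charts can be read as a question of how close a platform stays to linear scaling as processors are added. The sketch below works through that arithmetic; the measured hours are hypothetical numbers chosen for illustration, not results from the presentation.

```python
def ideal_hours(single_cpu_hours, processors):
    """Processing time if the job scaled perfectly linearly."""
    return single_cpu_hours / processors

def speedup(single_cpu_hours, measured_hours):
    """How many times faster the job ran than on one processor."""
    return single_cpu_hours / measured_hours

def efficiency(single_cpu_hours, measured_hours, processors):
    """Fraction of ideal linear speedup actually achieved (1.0 is perfect)."""
    return speedup(single_cpu_hours, measured_hours) / processors

# Hypothetical measurements: hours for the same job on 1, 8, 16 and 32 processors.
measured = {1: 32.0, 8: 4.0, 16: 2.5, 32: 2.0}
report = {n: round(efficiency(measured[1], hours, n), 2) for n, hours in measured.items()}
print(report)
```

In this made-up series the job scales linearly up to 8 processors and then flattens: by 32 processors only half of the added capacity turns into speedup, which is the shape of curve the slide warns against.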
Strategy #6 – Implement a Highly Scalable Foundation
Three Elements of a Scalable Infrastructure
Scalable Database Platform
Database vendors have offered a scalable parallel relational database for more than 5 years.
Scalable Hardware Platform
Hardware vendors have offered scalable parallel computers for more than 5 years.
Scalable Data Integration Platform
Data integration vendors are starting to offer “scalable” “parallel” platforms
Recommended Best Practices: Parallelism
Make sure you get this
[Diagram: many SMP System blocks, each with several CPUs, Shared Memory and Shared Disk, alongside a single larger SMP System with more CPUs]
Not this
Application Execution: Sequential or Parallel
Sequential: Uniprocessor
4-Way Parallel: SMP System
64-Way Parallel: MPP, GRID, and Clustered Systems
Source Data -> TRANSFORM -> ENRICH -> LOAD -> Data Warehouse
Recommended Best Practices: Parallelism
One application assembly
Auto parallel-enabled and parallel-aware run-time execution
[Chart: Time to Process for Scan, Join and Sort, serial versus parallel]
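"One application assembly" means the same job definition runs unchanged at any degree of parallelism. A minimal sketch of that idea, assuming hash partitioning on a key column: the function names, the `customer_id` key and the sample transform are hypothetical, and Python threads stand in for the parallel engine only to keep the example self-contained (they do not give real CPU parallelism for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key, ways):
    """Hash-partition rows so that equal keys always land in the same partition."""
    parts = [[] for _ in range(ways)]
    for row in rows:
        parts[hash(row[key]) % ways].append(row)
    return parts

def transform(row):
    """The job logic, written once with no knowledge of the degree of parallelism."""
    return {**row, "amount_cents": round(row["amount"] * 100)}

def run_job(rows, ways):
    """Run the same transform over `ways` partitions and collect the results."""
    parts = partition(rows, "customer_id", ways)
    with ThreadPoolExecutor(max_workers=ways) as pool:
        results = pool.map(lambda part: [transform(r) for r in part], parts)
    return [row for part in results for row in part]

rows = [{"customer_id": i, "amount": i * 1.5} for i in range(10)]
sequential = sorted(run_job(rows, 1), key=lambda r: r["customer_id"])
four_way = sorted(run_job(rows, 4), key=lambda r: r["customer_id"])
print(sequential == four_way)
```

Only the `ways` argument changes between the sequential and the 4-way run; the transform itself is untouched, which is the property the slide asks for.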
Strategy #7 – Architect for “Right-Time”
• In an InformationWeek 2003 survey of 467 business professionals about how often their IT systems provide business managers with timely updates of primary products or services:
- 3% no such process
- 1% annually
- 17% monthly
- 13% weekly
- 36% daily
- 5% hourly
- 8% every minute
• In that same report: “Whereas 57% of sites surveyed a year ago said that real-time business information was a key company focus, 70% see it that way today.”
Recommended Best Practices: Right-Time
Business event -> appropriate response, within an acceptable latency:
• campaign initiated -> tuning
• customer churns -> win-back
• website click -> offer made
• fraud committed -> prevention
[Diagram: Business Event Occurs -> (Recognition Latency) -> Awareness -> (Response Latency) -> Appropriate Response]
Latency is defined as the elapsed time between when an event occurs and when an appropriate response or action is made
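The definition above splits naturally into two parts: the time to recognize the event and the time to respond to it. A small worked example, with hypothetical timestamps and an assumed 24-hour tolerance for a churn win-back:

```python
from datetime import datetime, timedelta

def latency_breakdown(event_at, recognized_at, responded_at):
    """Split total latency into recognition latency and response latency."""
    recognition = recognized_at - event_at
    response = responded_at - recognized_at
    return {"recognition": recognition, "response": response, "total": recognition + response}

# Hypothetical timestamps for a "customer churns -> win-back" event.
event = datetime(2024, 1, 1, 9, 0)
breakdown = latency_breakdown(event, event + timedelta(hours=20), event + timedelta(hours=26))

acceptable = timedelta(hours=24)  # assumed business tolerance, not from the source
within_target = breakdown["total"] <= acceptable
print(breakdown["total"], within_target)
```

Here most of the 26 hours is recognition latency, so faster event capture would help more than a faster response process.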
Recommended Best Practices: Right-Time
1. Improving the ability to recognize business events (shortens recognition latency)
2. Improving the ability to respond to those events (shortens response latency)
Log-Based Change Data Capture
[Diagram: Database Logs -> Source Engine -> TCP/IP -> Target Engine, with Monitoring and Configuration. Endpoints labeled: DB2, Oracle, SQL Server, etc.; Database; Message Queue; Web Services; Flat files]
Key Benefits:
- Low impact
- Flexible implementation
- Heterogeneous platform support
- Easy to use
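Log-based change data capture reads committed changes from the database log instead of querying the source tables, then replays them on the target. The sketch below shows only the replay side; the change-record layout (`op`, `key`, `row`) is a hypothetical format for illustration and is not the format InfoSphere CDC uses.

```python
def apply_changes(target, log):
    """Replay change records, in log order, onto a target table keyed by primary key."""
    for change in log:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge the changed columns into the existing row, or create the row.
            target[key] = {**target.get(key, {}), **change["row"]}
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target

# Hypothetical target table and captured change log.
target = {1: {"name": "Ann", "city": "Salem"}, 3: {"name": "Cy", "city": "Salem"}}
log = [
    {"op": "update", "key": 1, "row": {"city": "Boston"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Andover"}},
    {"op": "delete", "key": 3},
]
apply_changes(target, log)
print(target)
```

Because only the changed rows travel, the source system is not rescanned, which is where the "low impact" benefit comes from.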
InfoSphere CDC & InfoSphere DataStage (ETL)
[Diagram: a Retail Point Of Sale database log is read natively by continuous “CDC” (Information Server Change Data Capture) and delivered to IBM Information Server for DataStage consumption and ETL load into the EDW. Delivery paths labeled in the diagram: Message Queue (Queue 1), Staging Table, Direct Connect (TCP via DataStage operator) and Flat File (out of the box, DataStage DSX file format). Databases labeled: Oracle; Teradata, DB2, Oracle, SQL Server, Sybase…, including BalOp (ELT)]
Strategy #8 – Extend Quality and Transformation Capabilities throughout the Enterprise
1. Hand-coded rules in each project/tool are not reusable in other projects/tools
2. High costs associated with building & maintaining data access, data quality and transformation rules in each project
[Diagram: each consumer and source builds its own rules - Portals; EAI, BPM, EII; Web applications; Dashboards; Legacy Apps; Packaged Apps; Business Partner Data; Data Warehouses; Master Data Stores]
Recommended Best Practices: Data Integration Services
1. A Service-Oriented Architecture (SOA) approach packages data integration logic as services for SOA-friendly applications
2. Services can be invoked as Web Services, EJB or JMS by any third-party application
[Diagram: SOA Approach - a “get customer” service in a Service-Oriented Architecture, reached through Java, Application Servers, Message Queues, EAI and Web Services, and drawing on Business Partner Data, Legacy Apps, Packaged Apps, Data Warehouses and Master Data Stores]
Strategy #9 – Choose a Proven Deployment Methodology designed for Quick Success
• Many are available out there
• How many, and which, are workable? Who knows?
• Be aware that there is as much risk in the deployment methodology as there is in tool usage
Recommended Best Practices: Iterative Deployment Plan
[Diagram: iterative cycle from Start to End over 12 - 24 weeks: Establish Business Drivers -> Deploy Solution -> Evaluate Results -> Derive Business Value]
Activities within each iteration: investigate, design, develop, deploy, operate, plan, prototype, unit test, system test, UAT, production audit, regression test, maintenance, monitor, manage, etc.
Blueprint Director: The GPS for your information project
• Palette: free-form “sketching” elements
• Diagram for a blueprint
• Method browser (displaying method content)
• Asset browser (browsing metadata repository)
• Glossary explorer (showing glossary tree view)
• Context-specific property view
• Outline (zoom in/out view)
• Blueprint explorer (shows tree view of the elements in the blueprint)
Business and IT: Working Together
Business Requirements
Successful Data Integration Project
Business Analyst
Collects business terms and business requirements; Converts into business rules in a spec
Developer
Takes those business rules and mapping spec and turns them into code, such as a DataStage job.
Business terms
Mapping specification created – critical to collaboration between IT and business
• extract
• transform
• load
Create DataStage jobs and data flows that reflect business needs.
Track business requirements to application deployment
• Single, central managed infrastructure to track requirements to deployment
• Import Excel mapping spreadsheets
• Define and link business terms to physical structures
• Generate DataStage jobs with annotated to-do tasks for developer
• Generate historical documentation for tracking
Flexible reporting and tracking
Auto-generate DataStage jobs
Define mapping specification with business rules and terms
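The idea of generating a job from a mapping specification can be sketched by treating the specification as data and deriving the transform from it. Everything here is hypothetical: the specification layout, the rule names and `build_job` are invented for illustration and do not reflect how FastTrack or DataStage represent mappings.

```python
# Hypothetical mapping specification: one entry per target column,
# as an analyst might capture it in a spreadsheet.
MAPPING_SPEC = [
    {"source": "CustomerID", "target": "CustomerNumber", "rule": "copy"},
    {"source": "Name", "target": "Name", "rule": "trim"},
    {"source": "Descr", "target": "Comments", "rule": "upper"},
]

# Named business rules the specification may refer to.
RULES = {
    "copy": lambda value: value,
    "trim": lambda value: value.strip(),
    "upper": lambda value: value.upper(),
}

def build_job(spec):
    """Turn a mapping specification into a row-level transform function."""
    steps = [(m["source"], m["target"], RULES[m["rule"]]) for m in spec]

    def job(row):
        return {target: rule(row[source]) for source, target, rule in steps}

    return job

job = build_job(MAPPING_SPEC)
result = job({"CustomerID": "0101", "Name": "  IBM ", "Descr": "key account"})
print(result)
```

Because the job is derived from the specification, a change agreed between analyst and developer is made in one place and stays traceable from requirement to deployed logic.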
Strategy #10 – Ensure Interoperability of Integration Infrastructures
The Goal
Connected, integrated, seamlessly
The Reality
Cobbled, piece-meal, manual-intensive
Data Integration Projects require a Collaborative Effort
Roles: Developer, Business Analyst, Data Modeler, Data Analyst, Business user
Artifacts exchanged: business requirements, business terms, data model, transformation rules, data flow (extract, transform, load), application
Simplification & Content: reduces project time, risk and cost!
[Diagram: numbered workflow (steps 1 to 7) built on the Metadata Server. Activities shown: Establish Platform; Import & Enhance Industry Model; Define Business Requirement & Glossary; Understand Data Relationships; Assess, Monitor, Manage Data Quality Rules; Map Sources to Target Model; Generate Logic to Load Warehouse; Deliver Reports. Tools shown: Data Architect, Business Glossary, Discovery, Information Analyzer, FastTrack, DataStage & QualityStage, Cognos. Connectors labeled: Populates, Links]
Recommended Best Practices: Integrated Tool Suites
• Business Glossary: Enterprise Data Dictionary
• DataStage: Extract, Transform, and Load in Batch or Real-time
• Parallel Processing; Rich Connectivity to Applications, Data, and Content
• Information Services Director: Publish SOA services for information integration and access
• Metadata Server / Metadata Workbench / FastTrack: Manage and track consistent metadata across information integration tasks and automate generation of data flow logic
• Information Analyzer: Data Source Profiling & Problem Diagnosis
• Federation Server: Virtualize access to disparate information
• CDC & Replication: Deliver and replicate changed data
• QualityStage: Data Quality (Standardize, Correct & Match Data)
• Global Name Recognition: Recognize & Classify Multi-cultural names
Summary
1. A number of large enterprises have successfully integrated their enterprise systems resulting in business results that drove revenue and lowered costs
2. These enterprises accomplished this through a set of technologies collectively known as Enterprise Data Integration
3. There are 10 proven strategies for success in an enterprise data integration initiative; although no single path is THE panacea to all corporate data problems - multiple approaches must be employed
Test Data Generation
Application Consolidation
Data De-identification
Data Quality
Data Integration
Data Archival
Master Data Management
Data Warehousing
Convert Data into Trusted Information
InfoSphere Information Server
Your Choice…
[Diagram: Integrated Platform versus Point Products across Models, Cleansing, ETL, MDM, Warehouse, BI and Mashups; the point-product chain shows question marks at some of the joins]