61
cohesion institute December 5, 2005 Speaker: R. Michael Pickering Speaker: R. Michael Pickering President, Cohesion Systems Consulting President, Cohesion Systems Consulting Inc. Inc. Data Warehouse Data Warehouse Architecture Best Architecture Best Practices Practices

Data Warehousing Architecture Best Practices

  • Upload
    sagar

  • View
    18

  • Download
    5

Embed Size (px)

Citation preview

Page 1: Data Warehousing Architecture Best Practices

cohesion institute

December 5, 2005

Speaker: R. Michael PickeringSpeaker: R. Michael PickeringPresident, Cohesion Systems Consulting Inc.President, Cohesion Systems Consulting Inc.

Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices

Page 2: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 2

cohesion institute

Agenda

Introductions Business Intelligence Background Architecture Best Practices Questions & Answers

Page 3: Data Warehousing Architecture Best Practices

cohesion institute

Introductions

Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices

Page 4: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 4

cohesion institute

Presenter Biography R. Michael Pickering

President and Chief Architect, Cohesion Systems Consulting Inc.

previously, Managing Consultant, BI&W, Oracle Consulting (Canada)

before that, Red Brick Systems, Inc. over 8 years DW experience

Manulife Reinsurance, Bell Canada, USDA, Kraft Foods, LCBO, Telecom Argentina, Nortel Networks, Procter & Gamble, Bayer, Syncrude, OMoHLTC…

Mr. Pickering has had DW articles published in The Handbook of Data Management

Page 5: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 5

cohesion institute

Cohesion Systems Consulting Provides DW and BI services, specializing

in: Architecture & Implementation Consulting Project Management Databases, Appliances & Emerging

Technology Training & Mentoring

Since inception in 2000, clients have included Enbridge, CIBC, The Bank of New York, Loyalty Management Group, Canada Post Borderfree, Katz Group

Page 6: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 6

cohesion institute

Audience Survey By a show of hands, please indicate your

experience with: normalization dimensional modeling operational data store data consolidation Extract Transform Load (ETL) metadata architecture DW appliances

Page 7: Data Warehousing Architecture Best Practices

cohesion institute

Business Intelligence Background

Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices

Page 8: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 8

cohesion institute

What is Business Intelligence?

A Data Warehouse is usually one component of an overall business intelligence solution

IT people may be tempted to think in terms of products and technologies

BUT...

Page 9: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 9

cohesion institute

Overarching Goal

The overarching goal of business intelligence is to provide the information necessary to MANAGE a business

This means providing information in support of management decision making, which is why BI is also called “Decision Support”

Page 10: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 10

cohesion institute

BI is about “Data Abstraction” wisdom knowledge information data

audience for a data warehouse typically considers higher slices of data abstraction pyramid

lowest level of pyramid is too detailed & unwieldy

Stages Stages (4)(4)

Page 11: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 11

cohesion institute

It’s Not Technology

Business Intelligence is about delivering business value

provide tangible benefit by answering important questions that can help the business to achieve its strategic focus

Improving profitability Who are our five most profitable clients? What are our least profitable products?

Reducing cost Who are our lowest cost suppliers? Which materials incur highest spoilage costs?

Improving customer satisfaction What factors may lead to lost customers?

Page 12: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 12

cohesion institute

Business of BI

In some cases, legislation such as Sarbanes-Oxley or Basel II makes some kind of BI fundamental to doing business

Many leading companies use BI to achieve competitive advantage

E.g. Walmart, Dell, Amazon.com, Kraft, American Express, etc…

Page 13: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 13

cohesion institute

Data Warehouse Architecture

architecture is about delivering an elegant solution that meets the solution requirements this means really understanding the

problem DW architecture is part art, part

science

Page 14: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 14

cohesion institute

Good Architecture

‘It’s not easy to describe a good design, but I’ll know it when I see it’

Page 15: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 15

cohesion institute

BI Architecture Requirements must recognize change as a

constant take incremental development

approach existing applications must

continue to work need to allow more data and

new types of data to be added

Page 16: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 16

cohesion institute

End User Acceptance

understandability understandability is in the eyes

of the beholder want to hide the complexity try to make it:

intuitive, obvious visible, memorable

Page 17: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 17

cohesion institute

End User Acceptance

performance don’t want to interrupt

the thinking process provide one click,

instantaneous access warehouse must be

available, “production” system

Page 18: Data Warehousing Architecture Best Practices

cohesion institute

Architecture Best Practices

Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices

Page 19: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 19

cohesion institute

High Level Architecture remember the different “worlds”

on-line transaction processing (OLTP) business intelligence systems (BIS)

users are different data content is different data structures are different architecture & methodology must be

different

Page 20: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 20

cohesion institute

Two Different Worlds

On-Line Transaction Processing Entity Relational Data Model

created in 1960’s to address performance issues with relational database implementations

normalized to most efficiently get data in divides the data into many discrete entities many relationships between these entities this approach was documented by C.J. Date in

An Introduction to Database Systems

Page 21: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 21

cohesion institute

Two Different Worlds

Business Intelligence Systems Dimensional Data Model

also called star schema designed to easily get information out fewer relationships than ERD, the only

table with multiple joins connecting to other tables is the central table

developed in 1960’s by data service providers, formalized by Ralph Kimball in The Data Warehouse Toolkit

Page 22: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 22

cohesion institute

Entity Relation Disadvantages

all tables look the same people can’t visualize/remember

diagrams software can’t navigate as schema

becomes too complex business processes mixed together many artificial keys created

Page 23: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 23

cohesion institute

Dimensional Model Advantages

simplicity humans can navigate and remember software can navigate

deterministically business process explicitly separated

(Data Mart) not so many keys (keys = # of

attendant tables)

Page 24: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 24

cohesion institute

Best Practice #1

Use a data model that is optimized for information retrieval dimensional model denormalized hybrid approach

Page 25: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 25

cohesion institute

Data Acquisition Processes Extract Transform Load (ETL)

the process of unloading or copying data from the source systems, transforming it into the format and data model required in the BI environment, and loading it to the DW

also, a software development tool for building ETL processes (an ETL tool)

many production DWs use COBOL or other general-purpose programming languages to implement ETL

Page 26: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 26

cohesion institute

Data Quality Assurance

data cleansing the process of validating and enriching

the data as it is published to the DW also, a software development tool for

building data cleansing processes (a data cleansing tool)

many production DWs have only very rudimentary data quality assurance processes

Page 27: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 27

cohesion institute

Data Acquisition & Cleansing

getting data loaded efficiently and correctly is critical to the success of your DW implementation of data acquisition &

cleansing processes represents from 50 to 80% of effort on typical DW projects

inaccurate data content can be ‘the kiss of death’ for user acceptance

Page 28: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 28

cohesion institute

Best Practice #2

Carefully design the data acquisition and cleansing processes for your DW Ensure the data is processed

efficiently and accurately Consider acquiring ETL and Data

Cleansing tools Use them well!

Page 29: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 29

cohesion institute

Data Model

Already discussed the benefits of a dimensional model

No matter whether dimensional modeling or any other design approach is used, the data model must be documented

Page 30: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 30

cohesion institute

Documenting the Data Model

The best practice is to use some kind of data modeling tool

CA ERwin Sybase PowerDesigner Oracle Designer IBM Rational Rose Etc.

Different tools support different modeling notations, but they are more or less equivalent anyway

Most tools allow sharing of their metadata with an ETL tool

Page 31: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 31

cohesion institute

Data Model Standards data model standards appropriate for the

environment and tools chosen in your data warehouse should be adopted

considerations should be given to data access tool(s) and integration with overall enterprise standards

standards must be documented and enforced within the DW team

someone must ‘own’ the data model to ensure a quality data model, all changes

should be reviewed thru some formal process

Page 32: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 32

cohesion institute

Data Model Metadata

Business definitions should be recorded for every field (unless they are technical fields only)

Domain of data should be recorded Sample values should be included As more metadata is populated into the

modeling tool it becomes increasingly important to be able to share this data across ETL and Data Access tools

Page 33: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 33

cohesion institute

Metadata Architecture

The strategy for sharing data model and other metadata should be formalized and documented

Metadata management tools should be considered & the overall metadata architecture should be carefully planned

Page 34: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 34

cohesion institute

Best Practice #3

Design a metadata architecture that allows sharing of metadata between components of your DW consider metadata standards such as

OMG’s Common Warehouse Metamodel (CWM)

Page 35: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 35

cohesion institute

Alternative Architecture Approaches

Bill Inmon: “Corporate Information Factory”

Hub and Spoke philosophy “JBOC” – just a bunch of cubes Let it evolve naturally

Page 36: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 36

cohesion institute

What We Want(Architectural Principal) In most cases, business and IT

agree that the data warehouse should provide a ‘single version of the truth’

Any approach that can result in disparate data marts or cubes is undesireable

This is known as data silos or…

Page 37: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 37

cohesion institute

Enterprise DW Architecture how to design an enterprise data

warehouse and ensure a ‘single version of the truth’?

according to Kimball: start with an overall data architecture

phase use “Data Warehouse Bus” design to

integrate multiple data marts use incremental approach by building one

data mart at a time

Page 38: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 38

cohesion institute

Data Warehouse Bus Architecture

named for the bus in a computer standard interface that allows you to plug

in cdrom, disk drive, etc. these peripherals work together smoothly

provides framework for data marts to fit together

allows separate data marts to be implemented by different groups, even at different times

Page 39: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 39

cohesion institute

Data Mart Definition

data mart is a complete subset of the overall data warehouse a single business process OR a group of related business processes

think of a data mart as a collection of related fact tables sharing conformed dimensions, aka a ‘fact constellation’

Page 40: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 40

cohesion institute

Designing The DW Bus determine which dimensions will be

shared across multiple data marts conform the shared dimensions

produce a master suite of shared dimensions

determine which facts will be shared across data marts

conform the facts standardize the definitions of facts

Page 41: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 41

cohesion institute

Dimension Granularity

conformed dimensions will usually be granular makes it easy to integrate with various

base level fact tables easy to extend fact table by adding

new facts no need to drop or reload fact tables,

and no keys have to be changed

Page 42: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 42

cohesion institute

Conforming Dimensions

by adhering to standards, the separate data marts can be plugged together e.g. customer, product, time

they can even share data usefully, for example in a drill across report

ensures reports or queries from different data marts share the same context

Page 43: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 43

cohesion institute

Conforming Dimensions (cont’d)

accomplish this by adding any dimension attribute(s) needed in any data mart(s) to the standard dimension definition

attributes not needed everywhere can always be ignored

typically harder to determine how to load conformed dimensions than to design them initially

need a single integrated ETL process what is the SOR for each attribute? how do we deal with attributes for which there is

more than one possible SOR?

Page 44: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 44

cohesion institute

Conforming Facts

in an enterprise, some metrics may not have the same generally accepted definition across all business units

conforming facts is generally a bigger design challenge than conforming dimensions why?

Page 45: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 45

cohesion institute

Conforming Facts - Benefits ensures the constituent data marts can as

clearly as possible represent fact data expressed on the same basis using consistent definitions

ensures reports or queries from different data marts share consistent content

success of an Enterprise DW hinges on successfully conformed facts

any perceived inconsistencies in fact definitions across data marts will generally be considered to be a DW bug or data problem by users

if users don’t have full confidence in data quality they may stop using the DW

Page 46: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 46

cohesion institute

Data Consolidation

a current trend in BI/DW is ‘data consolidation’

from a software vendor perspective, it is tempting to simplify this: ‘we can keep all the tables for all your

disparate applications in one physical database’

Page 47: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 47

cohesion institute

Data Integration

To truly achieve ‘a single version of the truth’, must do more than simply consolidating application databases

Must integrate data models and establish common terms of reference

Page 48: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 48

cohesion institute

Best Practice #4

Take an approach that consolidates data into ‘a single version of the truth’ Data Warehouse Bus

conformed dimensions & facts OR?

Page 49: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 49

cohesion institute

Operational Data Store (ODS) a single point of integration for disparate

operational systems contains integrated data at the most

detailed level (transactional) may be loaded in ‘near real time’ or

periodically can be used for centralized operational

reporting

Page 50: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 50

cohesion institute

Role of an ODS in DW Architecture

In the case where an ODS is a necessary component of the overall DW, it should be carefully integrated into the overall architecture

Can also be used for: Staging area Master/reference data management Etc…

Page 51: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 51

cohesion institute

ODS Data Model

Not clear if any design approach for an ODS data model has emerged as a best practice normalized dimensional denormalized/hybrid any suggestions?

Page 52: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 52

cohesion institute

Best Practice #5 Consider implementing an ODS only

when information retrieval requirements are near the bottom of the data abstraction pyramid and/or when there are multiple operational sources that need to be accessed Must ensure that the data model is

integrated, not just consolidated May consider 3NF data model Avoid at all costs a ‘data dumping ground’

Page 53: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 53

cohesion institute

Capacity Planning DW workloads are typically very

demanding, especially for I/O capacity Successful implementations tend to grow

very quickly, both in number of users and data volume

Rules of thumb do exist for sizing the hardware platform to provide adequate initial performance typically based on estimated ‘raw’ data size

of proposed database e.g. 100-150 Gb per modern CPU

Page 54: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 54

cohesion institute

SMP Server Scale Up Scaling performance within a single SMP

server is referred to as ‘scale up’ Database benchmarks suggest Windows

scalability is near that of Linux IBM claims near-linear scalability for Linux

(on commodity hardware) up to about 4 processors Probably not cost effective to scale up Linux

much beyond 4 processors IBM claims near-linear scalability for AIX

on POWER5 up to about 8 processors

Page 55: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 55

cohesion institute

Scale Out

There is an increasing trend in IT to ‘scale out’ processing capacity by deploying many small, commodity servers rather than a single large SMP system

This strategy tends to work well for relatively simple applications such as network or web servers

For very complex workloads such as a data warehouse, this strategy is much more difficult to effectively implement

Especially so for the database server itself

Page 56: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 56

cohesion institute

Scale Up vs. Scale Out

To obtain the total number of processors required for the estimated DW workload, must plan either to scale up or scale out

Both options are viable but, all other things being equal, scaling up is less disruptive to end users and requires less work to implement scaling up can offer lower hardware

investment, if practical however, network bandwidth or latency

issues can limit effectiveness of parallelism

Page 57: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 57

cohesion institute

Best Practice #6

Create a capacity plan for your BI application & monitor it carefully

Consider future additional performance demands Establish standard performance benchmark

queries and regularly run them Implement capacity monitoring tools Build scalability into your architecture May need to allow for scaling both up and

out!

Page 58: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 58

cohesion institute

Open Source Affordability

Another emerging trend in IT generally is to utilize Open Source software running on commodity hardware

this is expected to offer lower total cost of ownership certainly, GNU/Linux and other Open Source initiatives

do provide very good functionality and quality for minimal cost

This trend also applies to BI & DW: most traditional rdbms’s are now supported on Linux however, open source rdbms’s lag behind on

providing good performance for DW queries

Page 59: Data Warehousing Architecture Best Practices

April 8, 2023 DW Architecture Best Practices 59

cohesion institute

DW Appliances

DW appliances, consisting of packaged solutions providing all required software and hardware, are beginning to offer very promising price/performance

production experience is limited so far, so this is not yet a ‘best practice’

Page 60: Data Warehousing Architecture Best Practices

cohesion institute

Q & A

Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices

Page 61: Data Warehousing Architecture Best Practices

cohesion institute

the modern art of data abstraction

cohesion systems consulting inc