Four Ways to Build a Data Warehouse

  • 8/12/2019 Four Ways to Build a Data Warehouse

    1/4

    6/17/11 7:0our Ways to Build a Data Warehouse

    Page ttp://www.tdan.com/view-articles/4770

    THE DATA ADMINISTRATION NEWSLETTER, TDAN.com. ROBERT S. SEINER, PUBLISHER

    Four Ways to Build a Data Warehouse

    by Wayne Eckerson
    Published: May 29, 2007

    It has been said there are as many ways to build data warehouses as there are companies to build them. Each data warehouse is unique because it must adapt to the needs of business users in different functional areas, whose companies face different business conditions and competitive pressures.

    Nonetheless, four major approaches to building a data warehousing environment exist.

    These architectures are generally referred to as 1) top-down, 2) bottom-up, 3) hybrid, and 4) federated. Most organizations, wittingly or not, follow one or another of these approaches as a blueprint for development.

    Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among these architectures. This is especially true of the top-down and bottom-up approaches, which have existed the longest and occupy the polar ends of the development spectrum.

    As a result, some companies fail to adopt a clear vision for the way the data warehousing environment can and should evolve. Others, paralyzed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations. Ideally, organizations need to borrow concepts and tactics from each approach to create environments that uniquely meet their needs.

    Semantic and Substantive Differences

    The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field. Inmon, who is credited with coining the term "data warehousing" in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse followed by data marts. Kimball's approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse altogether.

    On the surface, there is considerable friction between top-down and bottom-up approaches. But in reality, the differences are not as stark as they may appear. Both approaches advocate building a robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature. For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while top-down subscribers call this a data warehouse, bottom-up adherents often call this a staging area.

    Nonetheless, significant differences exist between the two approaches (see chart). Data warehousing professionals need to understand the substantial, subtle, and semantic differences among the approaches and which industry gurus or consultants advocate each approach. This will provide a clearer understanding of the different routes to achieve data warehousing success and how to translate between the advice and rhetoric of the different approaches.

    Top-Down Approach

    The top-down approach views the data warehouse as the linchpin of the entire analytic environment. The data warehouse holds atomic or transaction data that is extracted from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is summarized, dimensionalized, and distributed to one or more dependent data marts. These data marts are dependent because they derive all their data from a centralized data warehouse.
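
    The flow described above, from a normalized warehouse holding atomic data to a summarized dependent mart, can be sketched roughly as follows. This is only an illustration: the table names (dw_customer, dw_order_line, mart_sales_by_region), the columns, and the sample rows are assumptions made for the example and do not come from any particular methodology or product. An in-memory SQLite database stands in for the warehouse platform.

```python
import sqlite3


def build_top_down_demo():
    """Build a toy warehouse and one dependent data mart (hypothetical names)."""
    con = sqlite3.connect(":memory:")
    # Warehouse tier: normalized tables holding atomic (transaction-level) data.
    con.executescript("""
        CREATE TABLE dw_customer (
            customer_id INTEGER PRIMARY KEY,
            region      TEXT
        );
        CREATE TABLE dw_order_line (
            order_id    INTEGER,
            customer_id INTEGER REFERENCES dw_customer(customer_id),
            order_date  TEXT,
            amount      REAL
        );
    """)
    con.executemany("INSERT INTO dw_customer VALUES (?, ?)",
                    [(1, "East"), (2, "West")])
    con.executemany("INSERT INTO dw_order_line VALUES (?, ?, ?, ?)", [
        (100, 1, "2007-05-01", 20.0),
        (101, 1, "2007-05-02", 30.0),
        (102, 2, "2007-05-02", 50.0),
    ])
    # Dependent data mart: summarized, and derived only from warehouse tables.
    con.execute("""
        CREATE TABLE mart_sales_by_region AS
        SELECT c.region      AS region,
               SUM(o.amount) AS total_amount,
               COUNT(*)      AS order_lines
        FROM dw_order_line o
        JOIN dw_customer c ON c.customer_id = o.customer_id
        GROUP BY c.region
    """)
    con.commit()
    return con


if __name__ == "__main__":
    con = build_top_down_demo()
    for row in con.execute(
            "SELECT region, total_amount, order_lines "
            "FROM mart_sales_by_region ORDER BY region"):
        print(row)
```

    The mart here never reads from a source system directly, which is what makes it dependent in the sense used above.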

    Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems.

    The major benefit of a top-down approach is that it provides an integrated, flexible architecture to support downstream analytic data structures. First, this means the data warehouse provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse lets organizations repurpose that data in any number of ways to meet new and unexpected business needs. For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data.

    On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or reports. (Of course, depending on the size of an implementation, organizations can deploy all three tiers within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports.

    Bottom-Up Approach

    In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the data, both atomic and summary, that users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data.
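
    As a rough illustration of a single data mart that holds both atomic and summary data in a star schema, consider the sketch below. The schema (dim_product, dim_date, fact_sales, agg_sales_by_category_month) and the sample rows are hypothetical and chosen only for the example; an in-memory SQLite database stands in for the mart.

```python
import sqlite3


def build_sales_mart():
    """Build a toy star-schema mart with atomic facts and one summary table."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimension tables (hypothetical attributes).
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);

        -- Fact table at atomic grain: one row per sale line.
        CREATE TABLE fact_sales (
            product_key INTEGER, date_key INTEGER,
            quantity INTEGER, amount REAL);

        INSERT INTO dim_product VALUES
            (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
        INSERT INTO dim_date VALUES
            (20070501, '2007-05-01', '2007-05'),
            (20070502, '2007-05-02', '2007-05');
        INSERT INTO fact_sales VALUES
            (1, 20070501, 2, 20.0),
            (1, 20070502, 1, 10.0),
            (2, 20070502, 3, 45.0);

        -- Summary table stored in the same mart as the atomic facts.
        CREATE TABLE agg_sales_by_category_month AS
        SELECT p.category AS category, d.month AS month,
               SUM(f.amount) AS amount
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d ON d.date_key = f.date_key
        GROUP BY p.category, d.month;
    """)
    return con


def detail_for_category(con, category):
    """Return the atomic rows behind a summary figure, from the same mart."""
    return con.execute("""
        SELECT d.day, p.name, f.quantity, f.amount
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d ON d.date_key = f.date_key
        WHERE p.category = ?
        ORDER BY d.day
    """, (category,)).fetchall()
```

    Because the detail rows live next to the summary table, a user who questions a monthly total can reach the underlying transactions without leaving the mart.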

    The bottom-up approach consciously tries to minimize back-office operations, preferring to focus an organization's effort on developing dimensional designs that meet end-user requirements. The bottom-up staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas.

    Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value rapidly because it doesn't lay down a heavy infrastructure up front.

    Without an integration infrastructure, the bottom-up approach relies on a dimensional bus to ensure that data marts are logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use conformed dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical enterprise dimensional model.
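
    To make the idea of conformed dimensions concrete, here is a small sketch in which two marts share one product dimension and can therefore be queried together. The data, the field names, and the helper functions (total_by_category, drill_across) are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Conformed dimension: both marts use the same keys and attribute values.
# All names and rows below are hypothetical sample data.
PRODUCT_DIM = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Manual", "category": "Books"},
}

# Facts from two separate data marts, each keyed by the shared product_key.
SALES_FACTS = [
    {"product_key": 1, "amount": 30.0},
    {"product_key": 2, "amount": 45.0},
]
RETURNS_FACTS = [
    {"product_key": 1, "amount": 5.0},
]


def total_by_category(facts, product_dim):
    """Sum a mart's fact amounts by the conformed 'category' attribute."""
    totals = defaultdict(float)
    for row in facts:
        category = product_dim[row["product_key"]]["category"]
        totals[category] += row["amount"]
    return dict(totals)


def drill_across(sales, returns, product_dim):
    """Combine results from two marts on the shared dimension attribute."""
    sold = total_by_category(sales, product_dim)
    returned = total_by_category(returns, product_dim)
    return {
        category: {
            "sales": sold.get(category, 0.0),
            "returns": returned.get(category, 0.0),
        }
        for category in sorted(set(sold) | set(returned))
    }


if __name__ == "__main__":
    print(drill_across(SALES_FACTS, RETURNS_FACTS, PRODUCT_DIM))
```

    If each mart kept its own product list with different keys or category names, the final merge step would have nothing reliable to join on, which is the stovepipe problem the dimensional bus is meant to prevent.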

    Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to drill through from a data mart to another structure to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts.

    One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and business units to adhere to and reuse references and rules for calculating facts. There can be a tendency for organizations to create independent or non-integrated data marts.

    In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a data mart at night when users are not active on the system.

    Hybrid Approach

    The hybrid approach tries to blend the best of both top-down and bottom-up approaches. It attempts to capitalize on the speed and user-orientation of the bottom-up approach without sacrificing the integration enforced by a data warehouse in a top-down approach. Pieter Mimno, an independent consultant who teaches at TDWI conferences, is currently the most vocal proponent of this approach.

    The hybrid approach recommends spending about two weeks developing an enterprise model in third normal form before developing the first data mart. The first several data marts are also designed in third normal form but deployed using star schema physical models. This dual modeling approach fleshes out the enterprise model without sacrificing the usability and query performance of a star schema.

    The hybrid approach relies on an extraction, transformation, and load (ETL) tool to store and manage the enterprise and local models in the data marts as well as synchronize the differences between them. This lets local groups, for example, develop their own definitions or rules for data elements that are derived from the enterprise model without sacrificing long-term integration. Organizations also use the ETL tool to extract and load data from source systems into the dimensional data marts at both the atomic and summary levels. Most ETL tools today can create summary tables on the fly.

    After deploying the first few dependent data marts, an organization then backfills a data warehouse behind the data marts, instantiating the fleshed-out version of the enterprise data model. The organization then transfers atomic data from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money, and processing resources. Organizations typically backfill a data warehouse once business users request views of atomic data across multiple data marts.
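
    One way to picture the backfill step is as a consolidation of the atomic rows that several marts already hold. The sketch below is a simplification under stated assumptions: the function backfill_warehouse, the txn_id and amount fields, and the sample marts are all hypothetical, and a real backfill would also involve remodeling, re-pointing ETL feeds, and application changes that are not shown here.

```python
def backfill_warehouse(marts):
    """Consolidate atomic rows from several marts into one warehouse store.

    `marts` maps a mart name to a list of atomic rows. Each row is a dict
    with a hypothetical 'txn_id' key identifying the source transaction.
    Rows that appear in more than one mart are stored once, and the marts
    that carried them are recorded so redundant feeds can be identified.
    """
    warehouse = {}
    for mart_name, rows in marts.items():
        for row in rows:
            record = warehouse.setdefault(
                row["txn_id"],
                {"txn_id": row["txn_id"], "amount": row["amount"], "seen_in": []},
            )
            record["seen_in"].append(mart_name)
    return warehouse


def redundant_transactions(warehouse):
    """List transaction IDs that more than one mart had been loading."""
    return sorted(
        txn_id for txn_id, record in warehouse.items()
        if len(record["seen_in"]) > 1
    )


if __name__ == "__main__":
    # Hypothetical sample data: transaction 2 was fed to both marts.
    marts = {
        "sales": [{"txn_id": 1, "amount": 10.0}, {"txn_id": 2, "amount": 20.0}],
        "finance": [{"txn_id": 2, "amount": 20.0}, {"txn_id": 3, "amount": 5.0}],
    }
    dw = backfill_warehouse(marts)
    print(len(dw), redundant_transactions(dw))
```

    Once the warehouse holds a single copy of each atomic row, the marts can be repopulated from it rather than from separate source extracts.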

    The major benefit of a hybrid approach is that it combines rapid development techniques within an enterprise architecture framework. It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once it's really needed (e.g., when executives start asking for reports that cross data mart boundaries).

    However, backfilling a data warehouse can be a highly disruptive process that delivers no ostensible value and therefore may never be funded. In addition, few query tools can dynamically and intelligently query atomic data in one database (i.e., the data warehouse) and summary data in another database (i.e., the data marts). Users may be confused about when to query which database.

    This approach also relies heavily on an ETL tool to synchronize meta data between enterprise and local versions, develop aggregates, load detail data, and orchestrate the transition to a data warehousing infrastructure. Although ETL tools have matured considerably, they can never enforce adherence to architecture. The hybrid approach may make it too easy for local groups to stray irrevocably from the enterprise data model.

    Federated Approach

    The federated approach is sometimes confused with the hybrid approach above or with hub-and-spoke data warehousing architectures that are a reflection of a top-down approach.

    However, the federated approach, as defined by its most vocal proponent, Doug Hackney, is not a methodology or architecture per se, but a concession to the natural forces that undermine the best-laid plans for deploying a perfect system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions. In short, it's a salve for the soul of the stressed-out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of his business users.

    Hackney says the federated approach is "an architecture of architectures." It recommends how to integrate a multiplicity of heterogeneous data warehouses, data marts, and packaged applications that companies have already deployed and will continue to implement in spite of the IT group's best effort to enforce standards and adhere to a specific architecture.

    Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards as an optimal solution. He says it provides the maximum amount of architecture possible in a given political and implementation reality. The approach merely encourages organizations to share the highest value metrics, dimensions, and measures wherever possible, however possible. This may mean, for example, creating a common staging area to eliminate redundant data feeds or building a data warehouse that sources data from multiple data marts, data warehouses, or analytic applications.

    The major problem with the federated approach is that it is not well documented. There are only a few columns written on the subject. But perhaps this is enough, as it doesn't prescribe a specific end-state or approach. Another potential problem is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta data is a pernicious problem in a heterogeneous, ever-changing environment.

    Summary

    The four approaches described here represent the dominant strains of data warehousing methodologies. Data warehousing managers need to be aware of these methodologies but not wedded to them. These methodologies have shaped the debate about data warehousing best practices, and comprise the building blocks for methodologies developed by practicing consultants.

    Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket. Since each organization must respond to unique needs and business conditions, having a foundation of best practice models to start with augurs a successful outcome.

    Top-Down | Bottom-Up | Hybrid | Federated

    Major Characteristics

    Top-Down:
      • Emphasizes the DW.
      • Starts by designing an enterprise model for a DW.
      • Deploys multi-tier architecture comprised of a staging area, a DW, and dependent data marts.
      • The staging area is persistent.
      • The DW is enterprise-oriented; data marts are function-specific.
      • The DW has atomic-level data; data marts have summary data.
      • The DW uses an enterprise-based normalized model; data marts use a subject-specific dimensional model.
      • Users can query the data warehouse and data marts.

    Bottom-Up:
      • Emphasizes data marts.
      • Starts by designing a dimensional model for a data mart.
      • Uses a flat architecture consisting of a staging area and data marts.
      • The staging area is largely non-persistent.
      • Data marts contain both atomic and summary data.
      • Data marts can provide both enterprise and function-specific views.
      • A data mart consists of a single star schema, logically or physically deployed.
      • Data marts are deployed incrementally and integrated using conformed dimensions.

    Hybrid:
      • Emphasizes DW and data marts; blends top-down and bottom-up methods.
      • Starts by designing enterprise and local models synchronously.
      • Spends 2-3 weeks creating a high-level, normalized, enterprise model; fleshes out model with initial marts.
      • Populates marts with atomic and summary data via a non-persistent staging area.
      • Models marts as one or more star schemas.
      • Uses ETL tool to populate data marts and exchange meta data between ETL tool and data marts.
      • Backfills a DW behind the marts when users want views at atomic level across marts; instantiates the fleshed-out enterprise model, and moves atomic data to the DW.

    Federated:
      • Emphasizes the need to integrate new and existing heterogeneous BI environments.
      • An architecture of architectures.
      • Acknowledges the reality of change in organizations and systems that make it difficult to implement a formalized architecture.
      • Rationalizes the use of whatever means possible to implement or integrate analytical resources to meet changing needs or business conditions.
      • Encourages organizations to share dimensions, facts, rules, definitions, and data wherever possible, however possible.

    Pros

    Top-Down:
      • Enforces a flexible, enterprise architecture.
      • Once built, minimizes the possibility of renegade independent data marts.
      • Supports other analytical structures in an architected environment, including data mining sets, ODSs, and operational reports.
      • Keeps detailed data in normalized form so it can be flexibly re-purposed to meet new and unexpected needs.
      • Data warehouse eliminates redundant extracts.

    Bottom-Up:
      • Focuses on creating user-friendly, flexible data structures.
      • Minimizes back-office operations and redundant data structures to accelerate deployment and reduce cost.
      • No drill-through required since atomic data is always stored in the data marts.
      • Creates new views by extending existing stars or building new ones within the same logical model.
      • Staging area eliminates redundant extracts.

    Hybrid:
      • Provides rapid development within an enterprise architecture framework.
      • Avoids creation of renegade independent data marts.
      • Instantiates enterprise model and architecture only when needed and once data marts deliver real value.
      • Synchronizes meta data and database models between enterprise and local definitions.
      • Backfilled DW eliminates redundant extracts.

    Federated:
      • Provides a rationale for band-aid approaches that solve real business problems.
      • Alleviates the guilt and stress data warehousing managers might experience by not adhering to formalized architectures.
      • Provides pragmatic way to share data and resources.

    Cons

    Top-Down:
      • Upfront modeling and platform deployment mean the first increments take longer to deploy and cost more.
      • Requires building and managing multiple data stores and platforms.
      • Difficult to drill through from summary data in marts to detail data in DW.
      • Might need to store detail data in data marts anyway.

    Bottom-Up:
      • Few query tools can easily join data across multiple, physically distinct marts.
      • Requires groups throughout an organization to consistently use dimensions and facts to ensure a consolidated view.
      • Not designed to support operational data stores or operational reporting data structures or processes.

    Hybrid:
      • Requires organizations to enforce standard use of entities and rules.
      • Backfilling a DW is disruptive, requiring corporate commitment, funding, and application rewrites.
      • Few query tools can dynamically query atomic and summary data in different databases.

    Federated:
      • The approach is not fully articulated.
      • With no predefined end-state or architecture in mind, it may give way to unfettered chaos.
      • It might encourage rather than rein in independent development and perpetuate the disintegration of standards and controls.

    Major Proponents

    Top-Down: Bill Inmon and co-authors
    Bottom-Up: Ralph Kimball and co-authors
    Hybrid: Many practitioners
    Federated: Doug Hackney

    Wayne Eckerson - Wayne Eckerson has been a thought leader and consultant in the business intelligence (BI) field since 1995. He has conducted numerous in-depth research studies and is a noted speaker and blogger. He is the author of the best-selling book Performance Dashboards: Measuring, Monitoring, and Managing Your Business. For many years, he served as director of education and research at The Data Warehousing Institute (TDWI) where he chaired its BI Executive Summit and created a popular BI Maturity Model and Assessment. Wayne is currently director of research at TechTarget and president of BI Leader Consulting, which provides advisory services to user and vendor organizations. He can be reached at [email protected].

    Quality Content for Data Management Professionals Since 1997

    Copyright 1997-2011, The Data Administration Newsletter, LLC -- www.TDAN.com

    TDAN.com is an affiliate of the BeyeNETWORK
