8/12/2019 Four Ways to Build a Data Warehouse
1/4
6/17/11 7:0our Ways to Build a Data Warehouse
Page ttp://www.tdan.com/view-articles/4770
THE DATA ADMINISTRATION NEWSLETTER - TDAN.com
ROBERT S. SEINER - PUBLISHER
Four Ways to Build a Data Warehouse
by Wayne Eckerson
Published: May 29, 2007
It has been said there are as many ways to build data warehouses as there are companies to build them. Each data warehouse is unique because it must adapt to the needs of business users in different functional areas, whose companies face different business conditions and competitive pressures.
Nonetheless, four major approaches to building a data warehousing environment exist.
These architectures are generally referred to as 1) top-down, 2) bottom-up, 3) hybrid, and 4) federated. Most organizations, wittingly or not, follow one or another of these approaches as a blueprint for development.
Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among these architectures. This is especially true of the top-down and bottom-up approaches, which have existed the longest and occupy the polar ends of the development spectrum.
As a result, some companies fail to adopt a clear vision for the way the data warehousing environment can and should evolve. Others, paralyzed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations. Ideally, organizations need to borrow concepts and tactics from each approach to create environments that uniquely meet their needs.
Semantic and Substantive Differences
The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field. Inmon, who is credited with coining the term "data warehousing" in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse followed by data marts. Kimball's approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse altogether.
On the surface, there is considerable friction between top-down and bottom-up approaches. But in reality, the differences are not as stark as they may appear. Both approaches advocate building a robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature. For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while top-down subscribers call this a "data warehouse," bottom-up adherents often call this a "staging area."
Nonetheless, significant differences exist between the two approaches (see chart). Data warehousing professionals need to understand the substantial, subtle, and semantic differences among the approaches and which industry gurus or consultants advocate each approach. This will provide a clearer understanding of the different routes to achieve data warehousing success and how to translate between the advice and rhetoric of the different approaches.
Top-Down Approach
The top-down approach views the data warehouse as the linchpin of the entire analytic environment. The data warehouse holds atomic or transaction data that is extracted from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is summarized, dimensionalized, and distributed to one or more "dependent" data marts. These data marts are "dependent" because they derive all their data from a centralized data warehouse.
Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems.
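
The flow described above, from atomic data in a normalized warehouse to a summarized dependent data mart, can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module and a single in-memory database for all tiers. The table names (dw_customer, dw_sale, mart_sales_by_region), the columns, and the sample rows are hypothetical and chosen only for illustration; they do not come from the article.

```python
import sqlite3


def build_top_down_demo():
    """Build a toy normalized warehouse and one dependent data mart."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Warehouse tier: normalized tables holding atomic transactions
        -- (hypothetical schema, for illustration only).
        CREATE TABLE dw_customer (
            customer_id INTEGER PRIMARY KEY,
            region      TEXT NOT NULL
        );
        CREATE TABLE dw_sale (
            sale_id     INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES dw_customer(customer_id),
            sale_date   TEXT NOT NULL,
            amount      REAL NOT NULL
        );
        INSERT INTO dw_customer VALUES (1, 'East'), (2, 'West');
        INSERT INTO dw_sale VALUES
            (1, 1, '2007-05-01', 100.0),
            (2, 1, '2007-05-02', 50.0),
            (3, 2, '2007-05-01', 75.0);

        -- Mart tier: a summary table derived only from the warehouse,
        -- which is what makes the mart "dependent".
        CREATE TABLE mart_sales_by_region AS
        SELECT c.region      AS region,
               SUM(s.amount) AS total_amount,
               COUNT(*)      AS sale_count
        FROM dw_sale AS s
        JOIN dw_customer AS c ON c.customer_id = s.customer_id
        GROUP BY c.region;
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_top_down_demo()
    for row in conn.execute(
        "SELECT region, total_amount, sale_count "
        "FROM mart_sales_by_region ORDER BY region"
    ):
        print(row)
```

Because the mart is rebuilt from warehouse tables rather than from source systems, any other mart derived the same way would share the same atomic data.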
The major benefit of a top-down approach is that it provides an integrated, flexible architecture to support downstream analytic data structures. First, this means the data warehouse provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs. For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data.
On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or reports. (Of course, depending on the size of an implementation, organizations can deploy all three "tiers" within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports.
Bottom-Up Approach
In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the data, both atomic and summary, that users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data.
The bottom-up approach consciously tries to minimize back-office operations, preferring to focus an organization's efforts on developing dimensional designs that meet end-user requirements. The bottom-up staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas.
Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value rapidly because it doesn't lay down a heavy infrastructure up front.
Without an integration infrastructure, the bottom-up approach relies on a "dimensional bus" to ensure that data marts are logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use "conformed" dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical enterprise dimensional model.
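
As a rough illustration of how conformed dimensions let users query across data marts, the sketch below builds two hypothetical star schemas (a sales mart and a returns mart) that share one dimension table, dim_product, and then combines them. All table names, columns, and sample values are assumptions made for this example. Each fact table is aggregated separately by the shared dimension attribute before the results are joined, so rows from one fact table do not multiply rows from the other.

```python
import sqlite3


def drill_across_by_category():
    """Combine two marts that share a conformed product dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Conformed dimension shared by both marts (hypothetical schema).
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            category    TEXT NOT NULL
        );
        -- Fact table of the sales mart.
        CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
        -- Fact table of the returns mart.
        CREATE TABLE fact_returns (product_key INTEGER, units_returned INTEGER);

        INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Music');
        INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 8);
        INSERT INTO fact_returns VALUES (1, 2), (2, 1), (2, 1);
        """
    )
    query = """
        SELECT s.category, s.units_sold, r.units_returned
        FROM (SELECT d.category AS category, SUM(f.units_sold) AS units_sold
              FROM fact_sales AS f
              JOIN dim_product AS d ON d.product_key = f.product_key
              GROUP BY d.category) AS s
        JOIN (SELECT d.category AS category,
                     SUM(f.units_returned) AS units_returned
              FROM fact_returns AS f
              JOIN dim_product AS d ON d.product_key = f.product_key
              GROUP BY d.category) AS r
          ON r.category = s.category
        ORDER BY s.category
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in drill_across_by_category():
        print(row)
```

The combined result is only meaningful because both marts use the same product keys and category values; if each mart defined its own product dimension, the join on category would not be reliable.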
Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to drill through from a data mart to another structure to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts.
One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and business units to adhere to and reuse references and rules for calculating facts. There can be a tendency for organizations to create "independent" or non-integrated data marts.
In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a data mart at night when users are not active on the system.
Hybrid Approach
The hybrid approach tries to blend the best of both top-down and bottom-up approaches. It attempts to capitalize on the speed and user-orientation of the bottom-up approach without sacrificing the integration enforced by a data warehouse in a top-down approach. Pieter Mimno, an independent consultant who teaches at TDWI conferences, is currently the most vocal proponent of this approach.
The hybrid approach recommends spending about two weeks developing an enterprise model in third normal form before developing the first data mart. The first several data marts are also designed in third normal form but deployed using star schema physical models. This dual modeling approach fleshes out the enterprise model without sacrificing the usability and query performance of a star schema.
The hybrid approach relies on an extraction, transformation, and load (ETL) tool to store and manage the enterprise and local models in the data marts as well as synchronize the differences between them. This lets local groups, for example, develop their own definitions or rules for data elements that are derived from the enterprise model without sacrificing long-term integration. Organizations also use the ETL tool to extract and load data from source systems into the dimensional data marts at both the atomic and summary levels. Most ETL tools today can create summary tables on the fly.
After deploying the first few "dependent" data marts, an organization then backfills a data warehouse behind the data marts, instantiating the fleshed out version of the enterprise data model. The organization then transfers atomic data from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money, and processing resources. Organizations typically backfill a data warehouse once business users request views of atomic data across multiple data marts.
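
The backfill step can be pictured as merging the atomic rows held in several marts into one warehouse store while collapsing rows that more than one mart loaded. The sketch below is a deliberately simplified illustration in plain Python, not a description of how any ETL tool does this. The function name backfill_warehouse, the txn_id key, and the sources field are hypothetical, and it assumes that every atomic row carries a transaction identifier that is consistent across marts.

```python
def backfill_warehouse(marts):
    """Merge atomic rows from several marts into one warehouse store.

    `marts` maps a mart name to a list of atomic rows (dicts). Rows are
    assumed to share a hypothetical `txn_id` key across marts, so a
    transaction loaded into several marts is stored only once.
    """
    warehouse = {}
    for mart_name, rows in marts.items():
        for row in rows:
            # Keep the first copy of each transaction and record every
            # mart that held it, to show which feeds were redundant.
            record = warehouse.setdefault(row["txn_id"], {**row, "sources": []})
            record["sources"].append(mart_name)
    return warehouse


if __name__ == "__main__":
    marts = {
        "sales_mart": [
            {"txn_id": 1, "amount": 100.0},
            {"txn_id": 2, "amount": 50.0},
        ],
        "finance_mart": [
            {"txn_id": 2, "amount": 50.0},
            {"txn_id": 3, "amount": 75.0},
        ],
    }
    for txn_id, record in sorted(backfill_warehouse(marts).items()):
        print(txn_id, record["amount"], record["sources"])
```

Transactions listed under more than one source point to data feeds that could be consolidated once the warehouse sits behind the marts.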
The major benefit of a hybrid approach is that it combines rapid development techniques within an enterprise architecture framework. It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once it's really needed (e.g., when executives start asking for reports that cross data mart boundaries).
However, backfilling a data warehouse can be a highly disruptive process that delivers no ostensible value and therefore may never be funded. In addition, few query tools can dynamically and intelligently query atomic data in one database (i.e., the data warehouse) and summary data in another database (i.e., the data marts). Users may be confused about when to query which database.
This approach also relies heavily on an ETL tool to synchronize meta data between enterprise and local versions, develop aggregates, load detail data, and orchestrate the transition to a data warehousing infrastructure. Although ETL tools have matured considerably, they can never enforce adherence to architecture. The hybrid approach may make it too easy for local groups to stray irrevocably from the enterprise data model.
Federated Approach
The federated approach is sometimes confused with the hybrid approach above or "hub-and-spoke" data warehousing architectures that are a reflection of a top-down approach.
However, the federated approach, as defined by its most vocal proponent, Doug Hackney, is not a methodology or architecture per se, but a concession to the natural forces that undermine the best laid plans for deploying a perfect system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions. In short, it's a salve for the soul of the stressed out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of his business users.
Hackney says the federated approach is "an architecture of architectures." It recommends how to integrate a multiplicity of heterogeneous data warehouses, data marts, and packaged applications that companies have already deployed and will continue to implement in spite of the IT group's best effort to enforce standards and adhere to a specific architecture.
Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards as an optimal solution. He says it provides the maximum amount of architecture possible in a given political and implementation reality. The approach merely encourages organizations to share the highest value metrics, dimensions, and measures wherever possible, however possible. This may mean, for example, creating a common staging area to eliminate redundant data feeds or building a data warehouse that sources data from multiple data marts, data warehouses, or analytic applications.
The major problem with the federated approach is that it is not well documented. There are only a few columns written on the subject. But perhaps this is enough, as it doesn't prescribe a specific end-state or approach. Another potential problem is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta data is a pernicious problem in a heterogeneous, ever-changing environment.
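
One of the federated tactics mentioned in this section, a common staging area that eliminates redundant data feeds, can be sketched as a small cache placed in front of the source extracts. This is only an illustration under stated assumptions: the StagingArea class, its get method, and the extract_orders function are hypothetical names, and a real staging area would be a database or file store rather than an in-memory dictionary.

```python
class StagingArea:
    """Toy common staging area: each source is extracted at most once."""

    def __init__(self, extract_fn):
        # extract_fn is a hypothetical callable that pulls rows from a
        # named source system; it stands in for a real extract job.
        self._extract_fn = extract_fn
        self._staged = {}
        self.extract_calls = 0

    def get(self, source_name):
        """Return staged rows, extracting from the source only if needed."""
        if source_name not in self._staged:
            self.extract_calls += 1
            self._staged[source_name] = list(self._extract_fn(source_name))
        return self._staged[source_name]


def extract_orders(source_name):
    """Hypothetical extract returning two sample rows for any source."""
    return [
        {"source": source_name, "order_id": 1},
        {"source": source_name, "order_id": 2},
    ]


if __name__ == "__main__":
    staging = StagingArea(extract_orders)
    # Two downstream consumers (say, a data mart and an analytic
    # application) read the same source through the staging area.
    mart_rows = staging.get("orders")
    app_rows = staging.get("orders")
    print(len(mart_rows), len(app_rows), staging.extract_calls)
```

Both consumers see the same rows while the source system is read only once, which is the redundancy the article describes the shared staging area as removing.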
Summary
The four approaches described here represent the dominant strains of data warehousing methodologies. Data warehousing managers need to be aware of these methodologies but not wedded to them. These methodologies have shaped the debate about data warehousing best practices, and comprise the building blocks for methodologies developed by practicing consultants.
Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket. Since each organization must respond to unique needs and business conditions, having a foundation of best practice models to start with augurs a successful outcome.
Chart: Top-Down, Bottom-Up, Hybrid, and Federated Approaches

Major Characteristics

Top-Down
- Emphasizes the DW.
- Starts by designing an enterprise model for a DW.
- Deploys multi-tier architecture comprised of a staging area, a DW, and dependent data marts.
- The staging area is persistent.
- The DW is enterprise-oriented; data marts are function-specific.
- The DW has atomic-level data; data marts have summary data.
- The DW uses an enterprise-based normalized model; data marts use a subject-specific dimensional model.
- Users can query the data warehouse and data marts.

Bottom-Up
- Emphasizes data marts.
- Starts by designing a dimensional model for a data mart.
- Uses a flat architecture consisting of a staging area and data marts.
- The staging area is largely non-persistent.
- Data marts contain both atomic and summary data.
- Data marts can provide both enterprise and function-specific views.
- A data mart consists of a single star schema, logically or physically deployed.
- Data marts are deployed incrementally and integrated using conformed dimensions.

Hybrid
- Emphasizes DW and data marts; blends top-down and bottom-up methods.
- Starts by designing enterprise and local models synchronously.
- Spends 2-3 weeks creating a high-level, normalized, enterprise model; fleshes out model with initial marts.
- Populates marts with atomic and summary data via a non-persistent staging area.
- Models marts as one or more star schemas.
- Uses ETL tool to populate data marts and exchange meta data between ETL tool and data marts.
- Backfills a DW behind the marts when users want views at atomic level across marts; instantiates the fleshed out enterprise model, and moves atomic data to the DW.

Federated
- Emphasizes the need to integrate new and existing heterogeneous BI environments.
- An architecture of architectures.
- Acknowledges the reality of change in organizations and systems that make it difficult to implement a formalized architecture.
- Rationalizes the use of whatever means possible to implement or integrate analytical resources to meet changing needs or business conditions.
- Encourages organizations to share dimensions, facts, rules, definitions, and data wherever possible, however possible.

Pros

Top-Down
- Enforces a flexible, enterprise architecture.
- Once built, minimizes the possibility of renegade independent data marts.
- Supports other analytical structures in an architected environment, including data mining sets, ODSs, and operational reports.
- Keeps detailed data in normalized form so it can be flexibly re-purposed to meet new and unexpected needs.
- Data warehouse eliminates redundant extracts.

Bottom-Up
- Focuses on creating user-friendly, flexible data structures.
- Minimizes back office operations and redundant data structures to accelerate deployment and reduce cost.
- No drill-through required since atomic data is always stored in the data marts.
- Creates new views by extending existing stars or building new ones within the same logical model.
- Staging area eliminates redundant extracts.

Hybrid
- Provides rapid development within an enterprise architecture framework.
- Avoids creation of renegade independent data marts.
- Instantiates enterprise model and architecture only when needed and once data marts deliver real value.
- Synchronizes meta data and database models between enterprise and local definitions.
- Backfilled DW eliminates redundant extracts.

Federated
- Provides a rationale for band aid approaches that solve real business problems.
- Alleviates the guilt and stress data warehousing managers might experience by not adhering to formalized architectures.
- Provides pragmatic way to share data and resources.

Cons

Top-Down
- Upfront modeling and platform deployment mean the first increments take longer to deploy and cost more.
- Requires building and managing multiple data stores and platforms.
- Difficult to drill through from summary data in marts to detail data in DW.
- Might need to store detail data in data marts anyway.

Bottom-Up
- Few query tools can easily join data across multiple, physically distinct marts.
- Requires groups throughout an organization to consistently use dimensions and facts to ensure a consolidated view.
- Not designed to support operational data stores or operational reporting data structures or processes.

Hybrid
- Requires organizations to enforce standard use of entities and rules.
- Backfilling a DW is disruptive, requiring corporate commitment, funding, and application rewrites.
- Few query tools can dynamically query atomic and summary data in different databases.

Federated
- The approach is not fully articulated.
- With no predefined end-state or architecture in mind, it may give way to unfettered chaos.
- It might encourage rather than rein in independent development and perpetuate the disintegration of standards and controls.

Major Proponents
- Top-Down: Bill Inmon and co-authors
- Bottom-Up: Ralph Kimball and co-authors
- Hybrid: Many practitioners
- Federated: Doug Hackney
Wayne Eckerson - Wayne Eckerson has been a thought leader and consultant in the business intelligence (BI) field since 1995. He has conducted numerous in-depth research studies and is a noted speaker and blogger. He is the author of the best-selling book Performance Dashboards: Measuring, Monitoring, and Managing Your Business. For many years, he served as director of education and research at The Data Warehousing Institute (TDWI) where he chaired its BI Executive Summit and created a popular BI Maturity Model and Assessment. Wayne is currently director of research at TechTarget and president of BI Leader Consulting, which provides advisory services to user and vendor organizations. He can be reached at [email protected].
Copyright 1997-2011, The Data Administration Newsletter, LLC -- www.TDAN.com