Designing a Data Warehouse: Issues in DW Design

  • Designing a Data Warehouse: Issues in DW design

  • Data Warehouse

    A read-only database for decision analysis that is subject oriented, integrated, time variant, and nonvolatile, consisting of time-stamped operational and external data.
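
    As a minimal sketch of what "time variant" and "nonvolatile" can mean in practice, the example below keeps every load as a time-stamped row and never updates one in place, so a query can ask what a value was on a given date. The class and method names (SnapshotStore, load, value_as_of) and the sample data are assumptions made for illustration; they do not come from the slides.

```python
from datetime import date


class SnapshotStore:
    """Append-only store of time-stamped values (hypothetical illustration)."""

    def __init__(self):
        # Each row is (as_of, key, value). Rows are only appended, never changed.
        self._rows = []

    def load(self, as_of, key, value):
        """Add a new time-stamped snapshot; earlier snapshots are preserved."""
        self._rows.append((as_of, key, value))

    def value_as_of(self, key, when):
        """Return the most recent value for key dated on or before 'when'."""
        candidates = [
            (as_of, value)
            for as_of, k, value in self._rows
            if k == key and as_of <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda pair: pair[0])[1]


store = SnapshotStore()
store.load(date(1996, 1, 1), "C001", "East")
store.load(date(1996, 6, 1), "C001", "West")
```

    An operational database would typically overwrite the region; here both snapshots survive, so historical questions remain answerable.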

  • Data Warehouse vs Operational Databases

    Operational databases: highly tuned; real-time data; detailed records; current values; access small amounts of data in a predictable manner.

    Data warehouse: flexible access; consistent timing; summarized as appropriate; historical; accesses large amounts of data in unexpected ways.

  • Data Warehouse Purpose

    Identify problems in time to avoid them.

    Locate opportunities you might otherwise miss.

  • Data Warehouse: New Approach

    An old idea with new interest because of:

    Cheap computing power

    Special-purpose hardware

    New data structures

    Intelligent software

  • Warehousing Problems

    Business issues: data quantity, data accuracy, maintenance, ownership, cost

  • Warehousing Problems

    Business issues

    Database issues: DBMS software, technology, complexity

  • Warehousing Problems

    Business issues

    Data issues

    Analysis issues: user interface, intelligent processing

  • Three Approaches

    Classical Enterprise Database: contains operational data from all areas of the organization.

    Data Mart: extracted and managerial support data designed for departmental or EUC applications.

    Data Package: data required for a specific application.

  • Classical Warehouse

    Source: Archived data

    Extraction: Batch extraction programs

    Data: Atomic transaction data

    Tool: VLDB technology

    Analysis: IT driven software

  • Mart

    Source: Deposit or External sources

    Extraction: Batch summary

    Data: Designed departmental database

    Tool: OLAP, ROLAP, MDBMS

    Analysis: IT driven or trained user

  • Package

    Source: Mart

    Extraction: Sample and summary

    Data: Problem specific dataset

    Tool: PC tools

    Analysis: Trained user

  • Three Fundamental Processes

    Data Acquisition

    Data Storage

    Data Access

  • Data Acquisition

    Handles acquisition of data from legacy systems and outside sources. Data is identified, copied, formatted and prepared for loading into the warehouse.

  • Acquisition steps

    Catalog the data: develop an inventory of where it is and what it means.

    Clean and prepare the data: extract from legacy files and reformat to make it usable.

    Transport data from one location to another.
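
    The acquisition steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: the legacy extract, the field names (cust_id, sale_date, amount), and the rule that rows with a blank amount are dropped are all hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical legacy extract with inconsistent padding, case, and date formats.
LEGACY_EXTRACT = """cust_id,sale_date,amount
 C001 ,03/15/1996,100.50
c002,03/16/1996,
C003,1996-03-17,75
"""

# Catalog the data: an inventory of each field and what it means.
CATALOG = {
    "cust_id": "customer identifier, upper-case, no padding",
    "sale_date": "date of sale; MM/DD/YYYY or YYYY-MM-DD in the source",
    "amount": "sale amount in dollars; blank means unknown",
}


def parse_date(text):
    """Try each known source format and return an ISO date string."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(row):
    """Clean and prepare one record; return None if the amount is missing."""
    if not row["amount"].strip():
        return None
    return {
        "cust_id": row["cust_id"].strip().upper(),
        "sale_date": parse_date(row["sale_date"].strip()),
        "amount": float(row["amount"]),
    }


def extract(source_text):
    """Extract records from the legacy text and keep only the cleanable ones."""
    reader = csv.DictReader(io.StringIO(source_text))
    cleaned = (clean(row) for row in reader)
    return [row for row in cleaned if row is not None]


def transport(rows):
    """Write cleaned rows as a load-ready CSV string for the warehouse loader."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(CATALOG), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = extract(LEGACY_EXTRACT)
load_file = transport(rows)
```

    In a real project the transport step would move a file or stream between machines; here it only produces the text that would be shipped.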

  • Storage

    The storage component holds the data so that the many different data mining, executive information and decision support systems can make use of it effectively.

  • The Storage Area

    Managed by:

    Relational databases like those from Oracle Corp. or Informix Software Inc.

    Specialized hardware: symmetric multiprocessor (SMP) or massively parallel processor (MPP) machines

  • Storage

    The majority of warehouse storage today is being managed by relational databases running on Unix platforms. Oracle, Sybase Inc., IBM Corp. and Informix control 65 percent of the warehouse storage market. Meta Group Inc. (1996)

  • Access

    Different end-user PCs and workstations draw data from the warehouse with the help of multidimensional analysis products, neural networks, data discovery tools or analysis tools. These powerful, "smart" software products are the real driving force behind the viability of data warehousing.

  • Access Tools

    Intelligent Agents and Agencies

    Query Facilities and Managed Query Environments

    Statistical Analysis

    Data Discovery (decision support, artificial intelligence and expert systems)

    OLAP

    Data Visualization

  • Hardware Budget

    A typical startup warehouse project allocates more than 60 percent of its budget for hardware and software to the creation of a powerful storage component, spending just 30 percent on data mining and user access technologies.

  • Systems Analysis Budget

    Budgeting for systems analysis and development, however, follows a very different pattern. More than 50 percent of development dollars are spent on building acquisition capabilities, 30 percent fund the development of user solutions and 20 percent are dedicated to the creation of databases in the storage component.

  • Design Issues

    Relational and Multidimensional Models

    Denormalized and indexed relational models are more flexible.

    Multidimensional models are simpler to use and more efficient.

  • Star Schemas in an RDBMS

    In most companies doing ROLAP, the DBAs have created countless indexes and summary tables to avoid I/O-intensive table scans against large fact tables. As the indexes and summary tables proliferate to optimize performance for the known queries and aggregations that users perform, the build times and disk space needed to create them have grown enormously, often requiring more time than is allotted and more space than the original data!
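
    To make the trade-off concrete, here is a small sketch using Python's built-in sqlite3 module. It builds one index and one summary table over a tiny fact table, so that a known monthly query reads the summary instead of scanning the facts. The table and column names (sales_fact, sales_by_month_product, and so on) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A small fact table plus one index for a known access path.
conn.executescript("""
CREATE TABLE sales_fact (
    date_key    TEXT,
    product_key INTEGER,
    store_key   INTEGER,
    amount      REAL
);
CREATE INDEX idx_sales_product ON sales_fact (product_key);
""")

conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("1996-03-01", 1, 10, 20.0),
        ("1996-03-01", 1, 11, 30.0),
        ("1996-03-02", 2, 10, 15.0),
        ("1996-04-01", 1, 10, 5.0),
    ],
)

# A summary table precomputes one known aggregation (month by product).
# Every such table must be rebuilt after each load and takes extra space.
conn.execute("""
CREATE TABLE sales_by_month_product AS
SELECT substr(date_key, 1, 7) AS month,
       product_key,
       SUM(amount) AS total_amount,
       COUNT(*)    AS row_count
FROM sales_fact
GROUP BY substr(date_key, 1, 7), product_key
""")

# The known query now reads the small summary table, not the fact table.
summary = conn.execute("""
SELECT month, product_key, total_amount
FROM sales_by_month_product
ORDER BY month, product_key
""").fetchall()
```

    Each additional known query tends to add another index or summary table of this kind, which is how build time and disk space come to exceed what the original data needs.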

  • Building a Data Warehouse from a Normalized Database

    The steps:

    Develop a normalized entity-relationship business model of the data warehouse.

    Translate this into a dimensional model. This step reflects the information and analytical characteristics of the data warehouse.

    Translate this into the physical model. This reflects the changes necessary to reach the stated performance objectives.

  • The Business Model

    Identify the data structure, attributes and constraints for the client's data warehousing environment.

    Stable

    Optimized for update

    Flexible

  • Business Model

    As always in life, there are some disadvantages to 3NF:

    Performance can be truly awful. Most of the work that is performed on denormalizing a data model is an attempt to reach performance objectives.

    The structure can be overwhelmingly complex. We may wind up creating many small relations which the user might think of as a single relation or group of data.

  • Structural Dimensions

    The first step is the development of the structural dimensions. This step corresponds very closely to what we normally do in a relational database. The star architecture that we will develop here depends upon taking the central intersection entities as the fact tables and building the foreign key => primary key relations as dimensions.
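
    A minimal sketch of this idea, again with Python's sqlite3 module: a hypothetical intersection entity (a sale linking a customer and a product) becomes the fact table, and each foreign key => primary key relation becomes a dimension. All table names, column names, and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Dimensions: the entities the intersection entity points to.
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    region       TEXT
);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT
);

-- Fact table: the central intersection entity, keyed by its foreign keys.
CREATE TABLE sales_fact (
    customer_key INTEGER REFERENCES customer_dim (customer_key),
    product_key  INTEGER REFERENCES product_dim (product_key),
    quantity     INTEGER,
    amount       REAL
);

INSERT INTO customer_dim VALUES (1, 'Acme', 'East'), (2, 'Bolt', 'West');
INSERT INTO product_dim VALUES (10, 'Widget', 'Hardware'), (20, 'Manual', 'Books');
INSERT INTO sales_fact VALUES (1, 10, 2, 40.0), (1, 20, 1, 15.0), (2, 10, 5, 100.0);
""")

# A typical star query: join the fact table to its dimensions, then
# aggregate the measure by dimension attributes.
by_region = conn.execute("""
SELECT c.region, p.category, SUM(f.amount)
FROM sales_fact AS f
JOIN customer_dim AS c ON c.customer_key = f.customer_key
JOIN product_dim  AS p ON p.product_key  = f.product_key
GROUP BY c.region, p.category
ORDER BY c.region, p.category
""").fetchall()
```

    Every query against the star follows the same shape, one join per dimension from the central fact table, which is a large part of why users find the model easier to navigate than a fully normalized one.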

  • Simple DW pattern.

  • Other Dimensions

    Categorical dimensions: generated groups (additional key components)

    Partitioning dimensions: subtypes (planned vs. actual)

    Informational dimensions: generate different types of data (messy).
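
    One possible reading of a partitioning dimension is a "scenario" key that splits otherwise identical facts into planned and actual subtypes, so the two can be compared cell by cell. The sketch below follows that reading; the scenario labels, the row layout, and the function name variance_by_key are assumptions, not something the slides specify.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, product, scenario, amount).
# "scenario" plays the role of the partitioning dimension.
facts = [
    ("1996-03", "Widget", "planned", 100.0),
    ("1996-03", "Widget", "actual", 90.0),
    ("1996-04", "Widget", "planned", 120.0),
    ("1996-04", "Widget", "actual", 135.0),
]


def variance_by_key(rows):
    """Return actual minus planned for each (month, product) cell."""
    cells = defaultdict(dict)
    for month, product, scenario, amount in rows:
        cell = cells[(month, product)]
        cell[scenario] = cell.get(scenario, 0.0) + amount
    return {
        key: values.get("actual", 0.0) - values.get("planned", 0.0)
        for key, values in cells.items()
    }


variance = variance_by_key(facts)
```

    Because both partitions share the same structural dimensions, the comparison needs no extra mapping between planned and actual rows.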

    DNA of Data Warehousing, Techguide.com.

    Warehousing Wherewithal, Rob Mattison, CIO, April 1996.

    Designing the Perfect Data Warehouse (the paper formerly known as: Data Modeling for Data Warehouses), Frank McGuff, http://members.aol.com/fmcguff/dwmodel/