Data Warehousing Quick Guide

  • 8/11/2019 Data Warehousing Quick Guide

    1/66

    4/9/2014 Data Warehousing Quick Guide

    http://www.tutorialspoint.com/dwh/dwh_quick_guide.htm

    Data Warehousing - Quick Guide

    Data Warehousing - Overview

    The term "Data Warehouse" was first coined by Bill Inmon in 1990. He said that a data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data. This data supports the decision making process of analysts in an organization.

    The operational database undergoes day-to-day transactions, which cause frequent changes to the data on a daily basis. But if in future a business executive wants to analyse past information on any data such as product, supplier, or consumer data, the analyst will have no data available to analyse, because the previous data has been updated by transactions.

    Data warehouses provide us generalized and consolidated data in a multidimensional view. Along with a generalized and consolidated view of data, data warehouses also provide us Online Analytical Processing (OLAP) tools. These tools help us in the interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining.

    Data mining functions like association, clustering, classification and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.

    Understanding Data Warehouse

    A data warehouse is a database that is kept separate from the organization's operational database.

    There is no frequent updating done in a data warehouse.

    A data warehouse possesses consolidated historical data, which helps the organization to analyse its business.

    A data warehouse helps executives to organize, understand and use their data to take strategic decisions.

    Data warehouse systems help in the integration of a diversity of application systems.

    A data warehouse system allows analysis of consolidated historical data.

    Definition

    A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process.

    Why Data Warehouse Separated from Operational Databases

    The following are the reasons why data warehouses are kept separate from operational databases:

    The operational database is constructed for well-known tasks and workloads such as searching particular records, indexing etc., but data warehouse queries are often complex and present a general form of data.

    Operational databases support the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure the robustness and consistency of the database.

    Operational database queries allow read and modify operations, while an OLAP query needs only read-only access to stored data.

    An operational database maintains current data; a data warehouse, on the other hand, maintains historical data.

    Data Warehouse Features

    The key features of a data warehouse, namely Subject Oriented, Integrated, Nonvolatile and Time-Variant, are discussed below:

    Subject Oriented - The data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on the modelling and analysis of data for decision making.

    Integrated - The data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

    Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

    Non Volatile - Non volatile means that the previous data is not removed when new data is added to it. The data warehouse is kept separate from the operational database, therefore frequent changes in the operational database are not reflected in the data warehouse.

    Note: A data warehouse does not require transaction processing, recovery and concurrency control because it is physically stored separately from the operational database.
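
    The time-variant and non-volatile features can be illustrated with a small sketch. It assumes a toy in-memory layout: the names `operational`, `warehouse`, `record_price` and `price_history` are hypothetical and are not part of any real warehouse product. The operational store keeps only the current value, while the warehouse keeps every value together with its date.

```python
from datetime import date

# Operational store: one entry per product, overwritten on every change.
operational = {}

# Warehouse: append-only list of time-stamped rows (hypothetical layout).
warehouse = []

def record_price(product, price, day):
    """Apply a price change to both stores."""
    operational[product] = price                       # the old value is lost
    warehouse.append({"product": product, "price": price, "as_of": day})

def price_history(product):
    """Return every (date, price) pair the warehouse has kept for a product."""
    return [(row["as_of"], row["price"]) for row in warehouse
            if row["product"] == product]

record_price("pen", 1.00, date(2014, 1, 1))
record_price("pen", 1.20, date(2014, 2, 1))

print(operational["pen"])      # only the current price survives
print(price_history("pen"))   # both prices, each tied to a date
```

    The design choice shown here is that the warehouse side never updates a row in place, which is what makes historical analysis possible later.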

    Data Warehouse Applications

    As discussed before, a data warehouse helps business executives to organize, understand and use their data for decision making. A data warehouse serves as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:

    Financial services

    Banking services

    Consumer goods

    Retail sectors

    Controlled manufacturing

    Data Warehouse Types

    Information processing, analytical processing and data mining are the three types of data warehouse applications that are discussed below:

    Information Processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

    Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analysed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.

    Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
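
    The basic OLAP operations named above can be sketched on a handful of rows. This is only an illustration under stated assumptions: the fact rows, the dimension names in `DIMS` and the helper functions `aggregate`, `slice_` and `dice` are made up for the example and do not come from any OLAP tool.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, item, city, units_sold)
facts = [
    (2013, "Q1", "phone", "Delhi", 10),
    (2013, "Q2", "phone", "Delhi", 12),
    (2013, "Q1", "laptop", "Delhi", 4),
    (2013, "Q1", "phone", "Mumbai", 7),
    (2014, "Q1", "phone", "Mumbai", 9),
]
DIMS = ("year", "quarter", "item", "city")

def aggregate(rows, keep):
    """Sum units over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]

def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in rows
            if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())]

by_quarter = aggregate(facts, ("year", "quarter"))   # more detailed level
by_year = aggregate(facts, ("year",))                # drill up from quarter to year
delhi = slice_(facts, "city", "Delhi")
phones_2013 = dice(facts, item={"phone"}, year={2013})

print(by_quarter)
print(by_year)
print(len(delhi), len(phones_2013))
```

    Drill down is the reverse of drill up: moving from `by_year` back to `by_quarter` adds the quarter dimension again.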

    SN | Data Warehouse (OLAP) | Operational Database (OLTP)
    1 | This involves historical processing of information. | This involves day-to-day processing.
    2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerks, DBAs and database professionals.
    3 | This is used to analyse the business. | This is used to run the business.
    4 | It focuses on information out. | It focuses on data in.
    5 | This is based on the Star Schema, Snowflake Schema and Fact Constellation Schema. | This is based on the Entity Relationship Model.
    6 | This is subject oriented. | This is application oriented.
    7 | This contains historical data. | This contains current data.
    8 | This provides summarized and consolidated data. | This provides primitive and highly detailed data.
    9 | This provides a summarized and multidimensional view of data. | This provides a detailed and flat relational view of data.
    10 | The number of users is in hundreds. | The number of users is in thousands.
    11 | The number of records accessed is in millions. | The number of records accessed is in tens.
    12 | The database size is from 100 GB to TB. | The database size is from 100 MB to GB.
    13 | These are highly flexible. | These provide high performance.

    Data Warehousing - Concepts

    What is Data Warehousing?

    Data warehousing is the process of constructing and using the data warehouse. The data warehouse is constructed by integrating data from multiple heterogeneous sources. This data warehouse supports analytical reporting, structured and/or ad hoc queries and decision making. Data warehousing involves data cleaning, data integration and data consolidation.

    Using Data Warehouse Information

    There are decision support technologies available which help to utilize the data warehouse. These technologies help executives to use the warehouse quickly and effectively. They can gather data, analyse it and take decisions based on the information in the warehouse. The information gathered from the warehouse can be used in any of the following domains:

    Tuning production strategies - The product strategies can be well tuned by repositioning the products and managing product portfolios by comparing the sales quarterly or yearly.

    Customer Analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles etc.

    Operations Analysis - Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyse business operations.

    Integrating Heterogeneous Databases

    To integrate heterogeneous databases we have two approaches:

    Query Driven Approach

    Update Driven Approach

    Query Driven Approach

    This is the traditional approach to integrating heterogeneous databases. This approach is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

    PROCESS OF QUERY DRIVEN APPROACH:

    When a query is issued on the client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous sites involved.

    These queries are then mapped and sent to the local query processors.

    The results from the heterogeneous sites are integrated into a global answer set.
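
    The three steps above can be sketched as a tiny mediator. Everything in this sketch is an assumption made for illustration: the two in-memory "sites", the `metadata` dictionary of per-site translators, and the functions `local_query` and `mediator` are hypothetical names, not a real integration API.

```python
# Two hypothetical sites that store the same kind of data in different layouts.
site_a = [{"cust": "Ann", "amt": 30}, {"cust": "Bob", "amt": 20}]
site_b = [("Ann", 5.0), ("Cy", 12.5)]
sources = {"site_a": site_a, "site_b": site_b}

# The metadata dictionary knows how each site's rows map to the global schema.
metadata = {
    "site_a": lambda row: {"customer": row["cust"], "amount": float(row["amt"])},
    "site_b": lambda row: {"customer": row[0], "amount": row[1]},
}

def local_query(site, customer):
    """Wrapper: run the query at one site and return rows in the global schema."""
    to_global = metadata[site]
    return [g for g in map(to_global, sources[site]) if g["customer"] == customer]

def mediator(customer):
    """Integrator: send the query to every site at query time and merge the answers."""
    answer = []
    for site in sources:
        answer.extend(local_query(site, customer))
    return answer

result = mediator("Ann")
print(result)
```

    Note that every call to `mediator` touches every site again, which is the source of the cost described under the disadvantages of this approach.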

    DISADVANTAGES

    The Query Driven Approach needs complex integration and filtering processes.

    This approach is very inefficient.

    This approach is very expensive for frequent queries.

    This approach is also very expensive for queries that require aggregations.

    Update Driven Approach

    We are provided with an alternative to the traditional approach. Today's data warehouse systems follow the update driven approach rather than the traditional approach discussed earlier. In the update driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

    ADVANTAGES

    This approach has the following advantages:

    This approach provides high performance.

    The data are copied, processed, integrated, annotated, summarized and restructured in a semantic data store in advance.

    Query processing does not require an interface with the processing at local sources.
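
    A minimal sketch of the update driven idea, under the assumption of two small in-memory sources: the data is integrated once by a hypothetical `refresh` function, and the hypothetical `total_for` query then reads only the warehouse.

```python
from collections import defaultdict

# Hypothetical source systems with different layouts.
site_a = [{"cust": "Ann", "amt": 30}, {"cust": "Bob", "amt": 20}]
site_b = [("Ann", 5.0), ("Cy", 12.5)]

warehouse = {}   # customer -> total amount, built ahead of any query

def refresh():
    """Copy, integrate and summarize the source data in advance."""
    totals = defaultdict(float)
    for row in site_a:
        totals[row["cust"]] += float(row["amt"])
    for cust, amt in site_b:
        totals[cust] += amt
    warehouse.clear()
    warehouse.update(totals)

def total_for(customer):
    """Answer a query from the warehouse alone; no source is contacted."""
    return warehouse.get(customer, 0.0)

refresh()
print(total_for("Ann"))
```

    The trade-off is that answers are only as fresh as the last `refresh`, in exchange for queries that never wait on the local sources.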

    Data Warehouse Tools and Utilities Functions

    The following are the functions of data warehouse tools and utilities:

    Data Extraction - Data extraction involves gathering data from multiple heterogeneous sources.

    Data Cleaning - Data cleaning involves finding and correcting errors in the data.

    Data Transformation - Data transformation involves converting data from legacy format to warehouse format.

    Data Loading - Data loading involves sorting, summarizing, consolidating, checking integrity and building indices and partitions.

    Refreshing - Refreshing involves propagating updates from the data sources to the warehouse.

    Note: Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
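
    The functions listed above can be sketched as a small pipeline. This is an illustration only: the CSV text in `RAW`, its column names, the assumed legacy date format (DD/MM/YYYY) and the function names are all hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical source extract: a CSV export with messy values.
RAW = """order_id,item,amount,order_date
1,Pen ,2.50,2014-04-01
2,pen,,2014-04-01
3,INK,4.00,01/04/2014
4,ink,1.00,2014-04-02
"""

def extract(text):
    """Extraction: read rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning: drop rows with a missing amount and normalize item names."""
    out = []
    for r in rows:
        if not r["amount"]:
            continue
        out.append({**r, "item": r["item"].strip().lower()})
    return out

def transform(rows):
    """Transformation: convert to the warehouse format (ISO dates, float amounts)."""
    out = []
    for r in rows:
        d = r["order_date"]
        if "/" in d:                      # assumed legacy format: DD/MM/YYYY
            day, month, year = d.split("/")
            d = f"{year}-{month}-{day}"
        out.append({"item": r["item"], "date": d, "amount": float(r["amount"])})
    return out

def load(rows):
    """Loading: summarize per (date, item) and store the result in sorted order."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["date"], r["item"])] += r["amount"]
    return dict(sorted(totals.items()))

warehouse = load(transform(clean(extract(RAW))))
print(warehouse)
```

    Refreshing, in this sketch, would amount to running the same pipeline again on the changed source data.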

    Data Warehousing - Terminologies

    In this article, we will discuss some of the commonly used terms in data warehousing.

    Data Warehouse

    A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process. Let's explore this definition of data warehouse.

    Subject Oriented - The data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on ongoing operations; rather it focuses on the modelling and analysis of data for decision making.

    Integrated - The data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

    Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

    Non Volatile - Non volatile means that the previous data is not removed when new data is added to it. The data warehouse is kept separate from the operational database, therefore frequent changes in the operational database are not reflected in the data warehouse.

    Metadata - Metadata is simply defined as data about data. The data that are used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

    In terms of the data warehouse we can define metadata as follows:

    Metadata is a road map to the data warehouse.

    Metadata in a data warehouse defines the warehouse objects.

    The metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.

    Metadata Repository

    The metadata repository is an integral part of the data warehouse system. The metadata repository contains the following metadata:

    Business Metadata - This metadata has the data ownership information, business definition and changing policies.

    Operational Metadata - This metadata includes currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data migrated and the transformations applied on it.

    Data for mapping from operational environment to data warehouse - This metadata includes source databases and their contents, data extraction, data partition, cleaning, transformation rules, data refresh and purging rules.

    The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing etc.
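
    One way to picture a repository entry is as a record with one field per kind of metadata. The sketch below is an assumption for illustration: the class `TableMetadata`, its field names and the `repository` dictionary are hypothetical and do not describe any particular metadata tool.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Hypothetical repository entry for one warehouse table."""
    name: str
    owner: str                      # business metadata: data ownership
    definition: str                 # business metadata: business definition
    currency: str = "active"        # operational metadata: active, archived or purged
    lineage: list = field(default_factory=list)   # operational metadata: history of steps
    source: str = ""                # mapping metadata: originating source system
    granularity: str = ""           # summarization metadata

    def record_step(self, step):
        """Append one migration or transformation step to the lineage."""
        self.lineage.append(step)

repository = {}

def register(entry):
    """Store an entry in the repository under its table name."""
    repository[entry.name] = entry

sales = TableMetadata("daily_sales", owner="retail",
                      definition="Units sold per item per day",
                      source="pos_orders", granularity="day")
sales.record_step("extracted from pos_orders")
sales.record_step("item names normalized")
register(sales)

print(repository["daily_sales"].lineage)
```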

    Data cube

    A data cube helps us to represent data in multiple dimensions. The data cube is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps its records.

    Illustration of Data cube

    Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch and location. These dimensions allow it to keep track of monthly sales and at which branch the items were sold. There is a table associated with each dimension. This table is known as a dimension table. The dimension table further describes the dimension. For example, the item dimension table may have attributes such as item_name, item_type and item_brand.

    The following table represents a 2-D view of sales data for a company with respect to the time, item and location dimensions.

    But here in this 2-D table we have records with respect to time and item only. The sales are shown with respect to the time and item dimensions according to the type of item sold. Suppose we want to view the sales data with one new dimension, say the location dimension. The 3-D view of the sales data with respect to time, item, and location is shown in the table below:

    The above 3-D table can be represented as a 3-D data cube as shown in the following figure:
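
    The relationship between the 2-D and 3-D views can also be sketched in code. The sales figures, item types and city names below are made up for the example, and the helper names `view_2d` and `total_by_location` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, item_type, location, units_sold)
sales = [
    ("Q1", "mobile", "Delhi", 605),
    ("Q1", "modem", "Delhi", 825),
    ("Q2", "mobile", "Delhi", 680),
    ("Q1", "mobile", "Mumbai", 818),
    ("Q2", "modem", "Mumbai", 746),
]

# 3-D cube: one cell per (time, item, location) combination.
cube = defaultdict(int)
for quarter, item, location, units in sales:
    cube[(quarter, item, location)] += units

def view_2d(cube, location):
    """2-D view (time x item) for a single location."""
    return {(q, i): u for (q, i, loc), u in cube.items() if loc == location}

def total_by_location(cube):
    """Collapse the time and item dimensions, leaving only location."""
    totals = defaultdict(int)
    for (_, _, loc), u in cube.items():
        totals[loc] += u
    return dict(totals)

print(view_2d(cube, "Delhi"))
print(total_by_location(cube))
```

    Each cell of the cube holds a fact (units sold), and each coordinate of the cell is a value from one dimension.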

    Data mart

    A data mart contains a subset of organisation-wide data. This subset of data is valuable to a specific group of an organisation. In other words, we can say that a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.

    Points to remember about data marts:

    Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low cost servers.

    The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.

    The life cycle of a data mart may be complex in the long run if its planning and design are not organisation-wide.

    Data marts are small in size.

    Data marts are customized by department.

    The source of a data mart is a departmentally structured data warehouse.

    Data marts are flexible.
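
    As a rough sketch, a data mart can be pictured as a subject-specific subset of the warehouse. The rows, the `subject` tag and the function `build_data_mart` are hypothetical and chosen only to mirror the marketing example above.

```python
# Hypothetical organisation-wide warehouse rows, each tagged with a subject.
warehouse = [
    {"subject": "sales", "item": "pen", "customer": "Ann", "amount": 3.0},
    {"subject": "sales", "item": "ink", "customer": "Bob", "amount": 4.0},
    {"subject": "payroll", "employee": "Cy", "amount": 900.0},
    {"subject": "inventory", "item": "pen", "on_hand": 120},
]

def build_data_mart(rows, subjects):
    """Copy only the rows whose subject is relevant to one department."""
    return [row for row in rows if row["subject"] in subjects]

marketing_mart = build_data_mart(warehouse, {"sales"})
print(len(marketing_mart))
```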

    Graphical Representation of data mart.

    Virtual Warehouse

    The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building the virtual warehouse requires excess capacity on operational database servers.
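
    A database view is one way to picture this. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `orders` table and the `sales_by_item` view are hypothetical names chosen for the example. No data is copied, so every query against the view runs on the operational server, which is why excess capacity is needed there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER, price REAL);
    INSERT INTO orders (item, qty, price) VALUES
        ('pen', 10, 1.5), ('pen', 4, 1.5), ('ink', 2, 4.0);

    -- The virtual warehouse: a summary view, no data is copied
    CREATE VIEW sales_by_item AS
        SELECT item, SUM(qty) AS units, SUM(qty * price) AS revenue
        FROM orders GROUP BY item;
""")

rows = conn.execute(
    "SELECT item, units, revenue FROM sales_by_item ORDER BY item"
).fetchall()
print(rows)

# A new operational row is visible at once, because the view is computed on demand.
conn.execute("INSERT INTO orders (item, qty, price) VALUES ('ink', 1, 4.0)")
ink_units = conn.execute(
    "SELECT units FROM sales_by_item WHERE item = 'ink'"
).fetchone()[0]
print(ink_units)
```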

    Data Warehousing - Delivery Process

    Introduction

    A data warehouse is never static. It evolves as the business grows. Today's needs may be different from future needs. We must design the data warehouse to change constantly. The real problem is that the business itself is not aware of its requirements for information in the future. As the business evolves, its needs also change; therefore the data warehouse must be designed to ride with these changes. Hence data warehouse systems need to be flexible.

    There should be a delivery process to deliver the data warehouse. But there are many issues in data warehouse projects that make it very difficult to complete the tasks and deliverables in the strict, ordered fashion demanded by the waterfall method, because the requirements are hardly ever fully understood. Only when the requirements are completed can the architectures, designs, and build components be completed.

    Delivery Method

    The delivery method is a variant of the joint application development approach, adopted for the delivery of a data warehouse. The data warehouse delivery process is staged to minimize risk. The approach discussed here does not reduce the overall delivery time-scales but ensures that business benefits are delivered incrementally through the development process.

    Note: The delivery process is broken into phases to reduce the project and delivery risk.

    The following diagram explains the stages in the delivery process:

    IT Strategy

    Data warehouses are strategic investments that require business processes to generate the projected benefits. An IT strategy is required to procure and retain funding for the project.

    Business Case

    The objective of the business case is to know the projected business benefits that should be derived from using the data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If the data warehouse does not have a clear business case then the business tends to suffer from credibility problems at some stage during the delivery process. Therefore in a data warehouse project we need to understand the business case for investment.

    Education and Prototyping

    The organization will experiment with the concept of data analysis and educate itself on the value of a data warehouse before determining that a data warehouse is the appropriate solution. This is addressed by prototyping. The prototyping activity helps in understanding the feasibility and benefits of a data warehouse. Prototyping on a small scale can further the educational process as long as:

    The prototype addresses a defined technical objective.

    The prototype can be thrown away after the feasibility concept has been shown.

    The activity addresses a small subset of the eventual data content of the data warehouse.

    The activity timescale is non-critical.

    Points to remember to produce an early release of a part of a data warehouse to deliver business benefits:

    Identify the architecture that is capable of evolving.

    Focus on the business requirements and technical blueprint phases.

    Limit the scope of the first build phase to the minimum that delivers business benefits.

    Understand the short term and medium term requirements of the data warehouse.

    Business Requirements

    To provide quality deliverables we should make sure that the overall requirements are understood. The business requirements and the technical blueprint stages are required because of the following reasons:

    If we understand the business requirements for both the short and medium term, then we can design a solution that satisfies the short term need.

    This solution would be capable of growing to the full solution.

    Things to determine in this stage are the following:

    The business rule to be applied on data.

    The logical model for information within the data warehouse.

    The query profiles for the immediate requirement.

    The source systems that provide this data.

    Technical Blueprint

    This phase needs to deliver an overall architecture satisfying the long term requirements. This phase also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following:

    The overall system architecture.

    The data retention policy.

    The backup and recovery strategy.

    The server and data mart architecture.

    The capacity plan for hardware and infrastructure.

    The components of database design.

    Building the version

    In this stage the first production deliverable is produced.

    This production deliverable is the smallest component of the data warehouse.

    This smallest component adds business benefit.

    History Load

    This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase we do not add new entities, but additional physical tables would probably be created to store the increased data volumes.

    Let's take an example. Suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This information will allow the user to analyse only the recent trends and address short term issues. The user cannot identify annual and seasonal trends. So 2 years' worth of sales history could be loaded from the archive to let the user analyse yearly and seasonal sales trends. Now the 40GB of data is extended to 400GB.

    Note: The backup and recovery procedures may become complex, therefore it is recommended to perform this activity within a separate phase.

    Ad hoc Query

    In this phase we configure an ad hoc query tool.

    This ad hoc query tool is used to operate the data warehouse.

    These tools can generate the database query.

    Note: It is recommended not to use these access tools when the database is being substantially modified.

    Automation

    In this phase operational management processes are fully automated. These would include:

    Transforming the data into a form suitable for analysis.

    Monitoring query profiles and determining the appropriate aggregations to maintain system performance.

    Extracting and loading the data from different source systems.

    Generating aggregations from predefined definitions within the data warehouse.

    Backing up, restoring and archiving the data.

    Extending Scope

    In this phase the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways:

    By loading additional data into the data warehouse.

    By introducing new data marts using the existing information.

    Note: This phase should be performed separately since it involves substantial effort and complexity.

    Requirements Evolution

    From the perspective of the delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system.

    This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.

    The architecture is designed to change and grow to match the business needs. The process operates as a pseudo application development process, where new requirements are continually fed into the development activities. Partial deliverables are produced. These partial deliverables are fed back to users and then reworked, ensuring that the overall system is continually updated to meet the business needs.

    Data Warehousing - System Process

    We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques such as using normalized data, keeping tables small etc. These techniques are suitable for delivering a solution. But in the case of a decision support system we do not know what queries and operations will need to be executed in future. Therefore the techniques applied on operational databases are not suitable for data warehouses.

    In this chapter we'll focus on designing a data warehousing solution built on top of open-system technologies like Unix and relational databases.

    Process Flow in Data Warehouse

    There are four major processes that build a data warehouse. Here is the list of the four processes:

    Extract and load the data.

    Clean and transform the data.

    Back up and archive the data.

    Manage queries and direct them to the appropriate data sources.

    Extract and Load Process

    Data extraction takes data from the source systems.

    Data load takes the extracted data and loads it into the data warehouse.

    Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.

    Points to remember in the extract and load process:

    Controlling the process

    When to Initiate Extract

    Loading the Data

    CONTROLLING THE PROCESS

    For example in a retail sales analysis data warehouse, it may be required to keep da

    latest 6 months data being kept online. In this kind of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case we require some data to be restored from the archive.

    Query Management Process

    This process performs the following functions:

    It manages the queries.

    It speeds up query execution.

    It directs the queries to the most effective data sources.

    It should also ensure that all system sources are used in the most effective way.

    It is also required to monitor actual query profiles.

    Information from this process is used by the warehouse management process to determine which aggregations to generate.

    This process does not generally operate during the regular load of information into the data warehouse.

    Data Warehousing - Architecture

    In this article, we will discuss the business analysis framework for data wareh

    architecture of a data warehouse.

    Business Analysis Framework

    The business analyst gets the information from the data warehouses to measure the

    make critical adjustments in order to win over other business holders in the ma

    warehouse has the following advantages for the business.

    Since the data warehouse can gather the information quickly and efficiently

    enhance the business productivity.

    The data warehouse provides us the consistent view of customers and items

    manage the customer relationship.

    The data warehouse also helps in bringing cost reduction by tracking trends

    long period in a consistent and reliable manner.

    To design an effective and efficient data warehouse we are required to understand

    business needs and construct a business analysis framework. Each person ha

    regarding the design of a data warehouse. These views are as follows:

    The top-down view - This view allows the selection of relevant information

    warehouse.

    The data source view - This view presents the information being captu

    managed by operational system.

    The data warehouse view - This view includes the fact tables and dime

    represent the information stored inside the data warehouse.

    The business query view - It is the view of the data from the viewpoint of the en

    Three-Tier Data Warehouse Architecture

    Generally the data warehouses adopt the three-tier architecture. Following are the t

    warehouse architecture.

    Bottom Tier - The bottom tier of the architecture is the data warehouse databa

    relational database system. We use the back-end tools and utilities to feed

    tier. These back-end tools and utilities perform the Extract, Clean, Load, and ref

    Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be

    either of the following ways.

    By relational OLAP (ROLAP), which is an extended relational databa

    system. ROLAP maps the operations on multidimensional data to s

    operations.

    By the multidimensional OLAP (MOLAP) model, which directly implements

    data and operations.

    Top Tier - This tier is the front-end client layer. This layer holds the query tools a

    analysis tools and data mining tools.

    The following diagram explains the three-tier architecture of a data warehouse:

    Data Warehouse Models

    From the perspective of data warehouse architecture we have the following data ware

    Virtual Warehouse

    Data mart

    Enterprise Warehouse

    VIRTUAL WAREHOUSE

    A view over an operational data warehouse is known as a virtual warehouse. It

    virtual warehouse.

    Building the virtual warehouse requires excess capacity on operational databas

    DATA MART

    Data mart contains the subset of organisation-wide data.

    This subset of data is valuable to a specific group within an organisation.

    Note: In other words, we can say that a data mart contains only that data which is spec

    group. For example, the marketing data mart may contain only data related to item

    sales. Data marts are confined to subjects.

    Points to remember about data marts

    Windows-based or Unix/Linux-based servers are used to implement data

    implemented on low cost server.

    The implementation cycle of a data mart is measured in short periods of time, i.e

    than months or years.

    The life cycle of a data mart may be complex in the long run if its planning an

    organisation-wide.

    Data marts are small in size.

    Data marts are customized by department.

    The source of a data mart is a departmentally structured data warehouse.

    Data marts are flexible.

    ENTERPRISE WAREHOUSE

    The enterprise warehouse collects all the information about all the subjects sp

    organization.

    This provides us with enterprise-wide data integration.

    The data is integrated from operational systems and external information provid

    This information can vary from a few gigabytes to hundreds of gigabytes, teraby

    Load Manager

    This component performs the operations required for the extract and load process.

    The size and complexity of the load manager varies between specific solu

    warehouse to data warehouse.

    LOAD MANAGER ARCHITECTURE

    The load manager performs the following functions:

    Extract the data from source system.

    Fast Load the extracted data into temporary data store.

    Perform simple transformations into structure similar to the one in the data war

    EXTRACT DATA FROM SOURCE

    The data is extracted from the operational databases or the external information provi

    the application programs that are used to extract data. It is supported by underlying

    client program to generate SQL to be executed at a server. Open Database Connect

    Database Connectivity (JDBC) are examples of gateways.

    FAST LOAD

    In order to minimize the total load window, the data needs to be loaded into the

    fastest possible time.

    Transformations affect the speed of data processing.

    It is more effective to load the data into a relational database prior to applying tra

    checks.

    Gateway technology proves unsuitable, since gateways tend not to be performan

    volumes are involved.

    SIMPLE TRANSFORMATIONS

    While loading, it may be required to perform simple transformations. After this has be

    are in a position to do the complex checks. Suppose we are loading the EPOS sale

    need to perform the following checks:

    Strip out all the columns that are not required within the warehouse.

    Convert all the values to required data types.
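
The two checks above can be sketched in Python. This is only an illustration under stated assumptions: the column names (`store_id`, `item_id`, `qty`, `amount`, `sale_date`, `till_operator`) and the `REQUIRED` mapping are hypothetical and do not come from any real EPOS feed.

```python
from datetime import datetime

# Hypothetical mapping: columns the warehouse keeps -> converter to the required type.
REQUIRED = {
    "store_id": int,
    "item_id": int,
    "qty": int,
    "amount": float,
    "sale_date": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
}

def simple_transform(raw_row):
    """Strip columns that are not required and convert the rest to their types."""
    return {col: convert(raw_row[col]) for col, convert in REQUIRED.items()}

# A raw record as it might arrive from the source: every value is a string.
raw = {"store_id": "16", "item_id": "30", "qty": "5", "amount": "3.67",
       "sale_date": "2013-08-03", "till_operator": "A17"}
clean = simple_transform(raw)
```

Keeping the converters in one mapping means the list of retained columns and their target types is declared in a single place, which is convenient when the load has to stay fast and simple.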

    Warehouse Manager

    Warehouse manager is responsible for the warehouse management process.

    The warehouse manager consists of third-party system software, C programs an

    The size and complexity of warehouse manager varies between specific solutio

    WAREHOUSE MANAGER ARCHITECTURE

    The warehouse manager includes the following:

    The Controlling process

    Stored procedures or C with SQL

    Backup/Recovery tool

    SQL Scripts

    OPERATIONS PERFORMED BY WAREHOUSE MANAGER

    Warehouse manager analyses the data to perform consistency and referential

    Creates the indexes, business views, and partition views against the base data.

    Generates new aggregations and also updates the existing aggregatio

    normalizations.

    Warehouse manager transforms and merges the sou

    temporary store into the published data warehouse.

    Backs up the data in the data warehouse.

    Warehouse Manager archives the data that has reached the end of its captured

    Note: Warehouse Manager also analyses query profiles to determine index and

    appropriate.

    Query Manager

    Query Manager is responsible for directing the queries to the suitable tables.

    By directing the queries to the appropriate table, the query request and response

    up.

    Query Manager is responsible for scheduling the execution of the queries pose

    QUERY MANAGER ARCHITECTURE

    Query Manager includes the following:

    The query redirection via C tool or RDBMS.

    Stored procedures.

    Query Management tool.

    Query Scheduling via C tool or RDBMS.

    Query scheduling via third-party software.

    Detailed information

    The following diagram shows the detailed information

    The detailed information is not kept online; rather, it is aggregated to the next level o

    archived to the tape. The detailed information part of the data warehouse keeps the detai

    the starflake schema. The detailed information is loaded into the data warehouse to

    aggregated data.

    Note: If the detailed information is held offline to minimize the disk storage we shou

    the data has been extracted, cleaned up, and transformed into the starflake sch

    archived.

    Summary Information

    In this area of data warehouse the predefined aggregations are kept.

    These aggregations are generated by warehouse manager.

    This area changes on ongoing basis in order to respond to the changing query

    This area of the data warehouse must be treated as transient.

    Points to remember about summary information:

    The summary data speeds up the performance of common queries.

    It increases the operational cost.

    It needs to be updated whenever new data is loaded into the data warehouse.

    It may not have been backed up, since it can be generated fresh from the detaile
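
The points above (summaries are transient, must be refreshed after each load, and can be regenerated from detail) can be sketched with Python's built-in `sqlite3` module. The table names `sales_detail` and `sales_by_month` and the sample figures are hypothetical; a real warehouse manager would be considerably more involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (sale_month TEXT, item TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [("2013-08", "Mobile", 5), ("2013-09", "Mobile", 4), ("2013-09", "Modem", 7)],
)

def rebuild_summary(conn):
    """Drop the predefined aggregation and regenerate it from the detailed data."""
    conn.execute("DROP TABLE IF EXISTS sales_by_month")
    conn.execute(
        "CREATE TABLE sales_by_month AS "
        "SELECT sale_month, SUM(units) AS units FROM sales_detail GROUP BY sale_month"
    )

rebuild_summary(conn)
# New data is loaded, so the summary has to be refreshed.
conn.execute("INSERT INTO sales_detail VALUES ('2013-09', 'Modem', 3)")
rebuild_summary(conn)
summary = dict(conn.execute("SELECT sale_month, units FROM sales_by_month"))
```

Because the summary is rebuilt entirely from `sales_detail`, losing it costs only recomputation time, which is why it need not be backed up.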

    Data Warehousing - OLAP

    Introduction

    Online Analytical Processing Server (OLAP) is based on multidimensional data mo

    managers and analysts to gain insight into the information through fast, consistent, inte

    information. In this chapter we will discuss types of OLAP, operations on

    between OLAP and Statistical Databases and OLTP.

    Types of OLAP Servers

    We have four types of OLAP servers that are listed below.

    Relational OLAP (ROLAP)

    Multidimensional OLAP (MOLAP)

    Hybrid OLAP (HOLAP)

    Specialized SQL Servers

    Relational OLAP (ROLAP)

    The Relational OLAP servers are placed between relational back-end server and clie

    To store and manage warehouse data, Relational OLAP uses a relational or ex

    DBMS.

    ROLAP includes the following.

    implementation of aggregation navigation logic.

    optimization for each DBMS back end.

    additional tools and services.

    Multidimensional OLAP (MOLAP)

    Multidimensional OLAP (MOLAP) uses the array-based multidimensional stor

    multidimensional views of data. With multidimensional data stores, the storage utiliza

    the data set is sparse. Therefore many MOLAP servers use two levels of data storag

    to handle dense and sparse data sets.

    Hybrid OLAP (HOLAP)

    Hybrid OLAP is a combination of both ROLAP and MOLAP. It has both the h

    ROLAP and the faster computation of MOLAP. HOLAP servers allow storing the large

    detail data. The aggregations are stored separately in the MOLAP store.

    Specialized SQL Servers

    Specialized SQL servers provide advanced query language and query processing

    queries over star and snowflake schemas in a read-only environment.

    OLAP Operations

    As we know, the OLAP server is based on the multidimensional view of data, henc

    the OLAP operations in multidimensional data.

    Here is the list of OLAP operations.

    Roll-up

    Drill-down

    Slice and dice

    Pivot (rotate)

    ROLL-UP

    This operation performs aggregation on a data cube in any of the following ways:

    By climbing up a concept hierarchy for a dimension

    By dimension reduction.

    Consider the following diagram showing the roll-up operation.

    The roll-up operation is performed by climbing up a concept hierarchy for the di

    Initially the concept hierarchy was "street < city < province < country".

    On rolling up the data is aggregated by ascending the location hierarchy from

    level of country.

    The data is grouped into countries rather than cities.

    When roll-up operation is performed then one or more dimensions from th

    removed.
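
Both forms of roll-up can be sketched in a few lines of Python. The cube contents and the `CITY_TO_COUNTRY` mapping below are made-up illustrative values, not data from the original diagram.

```python
from collections import defaultdict

# Hypothetical city -> country step of the location hierarchy.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Chicago": "USA", "New York": "USA"}

# Cube cells keyed by (city, quarter); values are units sold (made-up figures).
cube = {("Toronto", "Q1"): 10, ("Vancouver", "Q1"): 20,
        ("Chicago", "Q1"): 30, ("New York", "Q1"): 40,
        ("Toronto", "Q2"): 5}

def roll_up_location(cells):
    """Climb the location hierarchy from city to country, summing the measure."""
    out = defaultdict(int)
    for (city, quarter), units in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter)] += units
    return dict(out)

def drop_time(cells):
    """Roll up by dimension reduction: remove the time dimension entirely."""
    out = defaultdict(int)
    for (location, _quarter), units in cells.items():
        out[location] += units
    return dict(out)

by_country = roll_up_location(cube)
by_country_all_time = drop_time(by_country)
```

In both cases the result has fewer cells than the input, because several detailed cells are summed into one coarser cell.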

    DRILL-DOWN

    Drill-down is the reverse of roll-up. This operation is performed by either of t

    By stepping down a concept hierarchy for a dimension.

    By introducing a new dimension.

    Consider the following diagram showing the drill-down operation:

    The drill-down operation is performed by stepping down a concept hierarchy f

    time.

    Initially the concept hierarchy was "day < month < quarter < year."

    On drilling down, the time dimension is descended from the level of quarter to the level o

    When drill-down operation is performed then one or more dimensions from t

    added.

    It navigates the data from less detailed data to highly detailed data.
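
A minimal sketch of drilling down along the time hierarchy, assuming the detailed month-level cells are still available. The `MONTH_TO_QUARTER` mapping and the figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical month -> quarter step of the time hierarchy.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Detailed cells keyed by (item, month); values are units sold (made-up figures).
detail = {("Mobile", "Jan"): 3, ("Mobile", "Feb"): 4,
          ("Mobile", "Mar"): 5, ("Mobile", "Apr"): 6}

def view_at_quarter(cells):
    """The less detailed view: months summed into quarters."""
    out = defaultdict(int)
    for (item, month), units in cells.items():
        out[(item, MONTH_TO_QUARTER[month])] += units
    return dict(out)

def drill_down(cells, quarter):
    """Step down from one quarter to the month-level cells behind it."""
    return {key: units for key, units in cells.items()
            if MONTH_TO_QUARTER[key[1]] == quarter}

quarterly = view_at_quarter(detail)
q1_months = drill_down(detail, "Q1")
```

Drill-down cannot invent detail: it only works when the finer-grained data has been kept, which is one reason the detailed information matters even when queries mostly hit summaries.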

    SLICE

    The slice operation selects one dimension of a given cube and gives us

    Consider the following diagram showing the slice operation.

    The slice operation is performed for the dimension time using the criterion time

    It will form a new sub cube by selecting one or more dimensions.
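
A slice can be sketched as fixing one axis of the cube to a single value. The criterion `time = "Q1"` and all figures below are assumptions chosen for illustration.

```python
# Cube cells keyed by (location, time, item); the figures are made up.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Mobile"): 680,
    ("Vancouver", "Q1", "Modem"): 825,
    ("Vancouver", "Q2", "Modem"): 952,
}
TIME_AXIS = 1  # position of the time dimension in each key

def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop it from the key."""
    return {key[:axis] + key[axis + 1:]: measure
            for key, measure in cells.items() if key[axis] == value}

q1 = slice_cube(cube, TIME_AXIS, "Q1")
```

The resulting subcube has one dimension fewer than the original: its keys are `(location, item)` pairs.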

    DICE

    The dice operation selects two or more dimensions of a given cube a

    subcube. Consider the following diagram showing the dice operation:

    The dice operation on the cube based on the following selection criteria that involve th

    (location = "Toronto" or "Vancouver")

    (time = "Q1" or "Q2")

    (item = "Mobile" or "Modem").
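
The three selection criteria above can be applied with a small Python sketch. The cube contents are made-up figures; only the criteria come from the text.

```python
# Cube cells keyed by (location, time, item); the figures are made up.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q3", "Mobile"): 812,
    ("Vancouver", "Q2", "Modem"): 952,
    ("Chicago", "Q1", "Modem"): 440,
    ("Vancouver", "Q1", "Phone"): 14,
}

# Axis index -> allowed values, mirroring the criteria listed above.
criteria = {
    0: {"Toronto", "Vancouver"},
    1: {"Q1", "Q2"},
    2: {"Mobile", "Modem"},
}

def dice(cells, criteria):
    """Keep only the cells that satisfy every per-dimension criterion."""
    return {key: measure for key, measure in cells.items()
            if all(key[axis] in allowed for axis, allowed in criteria.items())}

subcube = dice(cube, criteria)
```

Unlike a slice, the dice keeps all three dimensions in the key; it only narrows the range of values on each of them.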

    PIVOT

    The pivot operation is also known as rotation. It rotates the data axes in view in or

    alternative presentation of data. Consider the following diagram showing the pivot ope

    In this the item and location axes in the 2-D slice are rotated.
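
Rotating a 2-D slice amounts to swapping its row and column axes. A minimal sketch, with made-up figures:

```python
# A 2-D slice with items as rows and locations as columns (made-up figures).
slice_2d = {
    "Mobile": {"Toronto": 605, "Vancouver": 825},
    "Modem": {"Toronto": 14, "Vancouver": 31},
}

def pivot(table):
    """Swap the row and column axes; the measures themselves are unchanged."""
    rotated = {}
    for row, columns in table.items():
        for col, measure in columns.items():
            rotated.setdefault(col, {})[row] = measure
    return rotated

by_location = pivot(slice_2d)
```

Pivoting twice returns the original layout, which shows that the operation changes only the presentation and not the data.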

    OLAP vs OLTP

    SN | Data Warehouse (OLAP) | Operational Database (OLTP)

    1 | This involves historical processing of information. | This involves day to day process i

    2 | OLAP systems are used by knowledge workers such as executives, managers and analysts. | OLTP systems are used by clerk, database professionals.

    3 | This is used to analyze the business. | This is used to run the business

    4 | It focuses on information out. | It focuses on data in.

    5 | This is based on Star Schema, Snowflake Schema and Fact Constellation Schema. | This is based on Entity Relations

    6 | This is subject oriented. | This is application oriented.

    7 | This contains historical data. | This contains current data.

    8 | This provides summarized and consolidated data. | This provides primitive and highly

    9 | This provides a summarized and multidimensional view of data. | This provides detailed and flat re

      |  | data.

    10 | The number of users is in hundreds. | The number of users is in thou

    11 | The number of records accessed is in millions. | The number of records accessed

    12 | The database size is from 100 GB to TB. | The database size is from 100 M

    13 | This is highly flexible. | This provides high performance.

    Data Warehousing - Relational OLA

    Introduction

    The Relational OLAP servers are placed between relational back-end server and clie

    To store and manage warehouse data, Relational OLAP uses a relational or ex

    DBMS.

    ROLAP includes the following.

    implementation of aggregation navigation logic.

    optimization for each DBMS back end.

    additional tools and services.

    Note: The ROLAP servers are highly scalable.

    Points to remember

    The ROLAP tools need to analyze large volume of data across multiple dimens

    The ROLAP tools need to store and analyze highly volatile and changeable data

    Relational OLAP Architecture

    The ROLAP includes the following.

    Database Server

    ROLAP Server

    Front end tool

    Advantages

    The ROLAP servers are highly scalable.

    They can be easily used with the existing RDBMS.

    Data can be stored efficiently, since no zero facts need to be stored.

    ROLAP tools do not use pre-calculated data cubes.

    The DSS server of MicroStrategy adopts the ROLAP approach.

    Disadvantages

    Poor query performance.

    Some limitations of scalability depending on the technology architecture that is

    Data Warehousing - Multidimensional O

    Introduction

    Multidimensional OLAP (MOLAP) uses the array-based multidimensional stor

    multidimensional views of data. With multidimensional data stores, the storage utiliza

    the data set is sparse. Therefore many MOLAP servers use two levels of data storag

    to handle dense and sparse data sets.

    Points to remember:

    MOLAP tools need to process information with consistent response time rega

    summarizing or calculations selected.

    The MOLAP tools need to avoid many of the complexities of creating a relat

    store data for analysis.

    The MOLAP tools need the fastest possible performance.

    MOLAP servers adopt two levels of storage representation to handle dense and

    Denser subcubes are identified and stored as array structures.

    Sparse subcubes employ compression technology.
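
The two-level storage idea can be sketched roughly as follows. The density threshold of 0.5 is an arbitrary illustrative value, and a plain dictionary of non-empty cells stands in for whatever compression technique a real MOLAP server would use.

```python
def density(cells, shape):
    """Fraction of the subcube's cells that actually hold a value."""
    total = 1
    for extent in shape:
        total *= extent
    return len(cells) / total

def store_subcube(cells, shape, threshold=0.5):
    """Store a 2-D subcube as a dense array or as a sparse mapping.

    The threshold is an assumption made for this sketch.
    """
    rows, cols = shape
    if density(cells, shape) >= threshold:
        array = [[0] * cols for _ in range(rows)]
        for (r, c), value in cells.items():
            array[r][c] = value
        return "dense", array
    # Sparse case: keep only the non-empty cells.
    return "sparse", dict(cells)

dense_kind, dense_store = store_subcube({(0, 0): 1, (0, 1): 2, (1, 0): 3}, (2, 2))
sparse_kind, sparse_store = store_subcube({(0, 0): 9}, (10, 10))
```

The dense form gives direct positional access, while the sparse form avoids allocating space for the 99 empty cells of the second subcube.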

    MOLAP Architecture

    MOLAP includes the following components.

    Database server

    MOLAP server

    Front end tool

    Advantages

    Here is the list of advantages of Multidimensional OLAP:

    MOLAP allows the fastest indexing to the precomputed summarized data.

    Helps the users who are connected to a network and need to analyze larger, less

    Easier to use, therefore MOLAP is best suited to inexperienced users.

    Disadvantages

    MOLAP is not capable of containing detailed data.

    The storage utilization may be low if the data set is sparse.

    MOLAP vs ROLAP

    SN | MOLAP | ROLAP

    1 | The information retrieval is fast. | Information retrieval is comparati

    2 | It uses the sparse array to store the data sets. | It uses relational table.

    3 | MOLAP is best suited for inexperienced users since it is very easy to use. | ROLAP is best suited for experie

    4 | The separate database for data cube. | It may not require space other tha

      |  | Data warehouse.

    5 | DBMS facility is weak. | DBMS facility is strong.

    Data Warehousing - Schemas

    Introduction

    The schema is a logical description of the entire database. The schema include

    description of records of all record types including all associated data-items and agg

    the database, the data warehouse also requires a schema. The database uses the

    on the other hand the data warehouse uses the star, snowflake and fact constellatio

    chapter we will discuss the schemas used in a data warehouse.

    Star Schema

    In star schema each dimension is represented with only one dimension table.

    This dimension table contains the set of attributes.

    In the following diagram we have shown the sales data of a company with r

    dimensions namely, time, item, branch and location.

    There is a fact table at the centre. This fact table contains the keys to each of fou

    The fact table also contains the attributes, namely dollars sold and units sold.

    Note: Each dimension has only one dimension table and each table holds a set

    example the location dimension table contains the

    {location_key, street, city, province_or_state, country}. This constraint may cause data

    example, "Vancouver" and "Victoria" are both cities in the Canadian province of

    The entries for such cities may cause data redundancy along the attributes provin

    country.
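
A minimal sketch of the star layout using Python's built-in `sqlite3` module. Only the location dimension and the fact table are created; the other three dimension tables are omitted for brevity. The street values and sales figures are made up, and "British Columbia" is filled in here as the province shared by the two cities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
-- Fact table: one key per dimension plus the two measures.
CREATE TABLE sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER,
    location_key INTEGER REFERENCES location(location_key),
    dollars_sold REAL, units_sold INTEGER
);
INSERT INTO location VALUES
    (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
    (2, '2 Bay St', 'Victoria', 'British Columbia', 'Canada');
INSERT INTO sales VALUES (1, 1, 1, 1, 100.0, 2), (1, 1, 1, 2, 50.0, 1);
""")

# A typical star join: fact table joined to one dimension, then aggregated.
totals = conn.execute("""
    SELECT l.province_or_state, SUM(s.dollars_sold), SUM(s.units_sold)
    FROM sales s JOIN location l ON l.location_key = s.location_key
    GROUP BY l.province_or_state
""").fetchall()

# The redundancy noted above: province and country are repeated per city row.
repeated = conn.execute(
    "SELECT COUNT(*) FROM location WHERE province_or_state = 'British Columbia'"
).fetchone()[0]
```

Every query needs at most one join per dimension, which is the main attraction of the star layout; the price is the repeated province and country values.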

    Snowflake Schema

    In Snowflake schema some dimension tables are normalized.

    Normalization splits up the data into additional tables.

    Unlike the star schema, the dimension tables in the snowflake schema are normalize

    item dimension table in the star schema is normalized and split into two dimensi

    item and supplier table.

    Therefore now the item dimension table contains the attributes item_key,

    brand, and supplier_key.

    The supplier key is linked to the supplier dimension table. The supplier dimensio

    the attributes supplier_key and supplier_type.

    Note: Due to normalization in the snowflake schema the redundancy is reduced, therefore

    to maintain and saves storage space.
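
The split of the item dimension can be sketched in plain Python. The rows below are hypothetical, and only the attributes named in the text (`item_key`, `brand`, `supplier_key`, `supplier_type`) are modelled.

```python
# Star-style item dimension: supplier_type is repeated on every item row.
star_item_rows = [
    {"item_key": 1, "brand": "Acme", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 2, "brand": "Bolt", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 3, "brand": "Acme", "supplier_key": 20, "supplier_type": "retail"},
]

def snowflake_item(rows):
    """Split the item dimension into an item table and a supplier table."""
    item_table = []
    suppliers = {}
    for row in rows:
        item_table.append({"item_key": row["item_key"],
                           "brand": row["brand"],
                           "supplier_key": row["supplier_key"]})
        # Each supplier is stored once, keyed by supplier_key.
        suppliers[row["supplier_key"]] = {"supplier_key": row["supplier_key"],
                                          "supplier_type": row["supplier_type"]}
    return item_table, list(suppliers.values())

item_table, supplier_table = snowflake_item(star_item_rows)
```

Three item rows collapse to two supplier rows, so each `supplier_type` value is held once; the cost is an extra join whenever a query needs it.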

    Fact Constellation Schema

    In a fact constellation there are multiple fact tables. This schema is also

    schema.

    In the following diagram we have two fact tables namely, sales and shipping.

    The sales fact table is the same as that in the star schema.

    The shipping fact table has the five dimensions namely, item_key, time_key, s

    location.

    The shipping fact table also contains two measures namely, dollars cost and u

    It is also possible to share dimension tables between fact tables. For examp

    location dimension tables are shared between the sales and shipping fact tables.

    Schema Definition

    The multidimensional schema is defined using Data Mining Query Language(

    primitives namely, cube definition and dimension definition can be used for d

    warehouses and data marts.

    SYNTAX FOR CUBE DEFINITION

    define cube <cube_name> [<dimension_list>]: <measure_list>

    SYNTAX FOR DIMENSION DEFINITION

    define dimension <dimension_name> as (<attribute_or_dimension_list>)

    The Snowflake schema that we have discussed can be defined using the Data Mining

    (DMQL) as follows:

    define cube sales [time,item,branch,location]:

    dollars sold = sum(sales in dollars), units sold = count(*)

    define dimension time as (time key,day,day of week,month,quart

    define dimension item as (item key,item name,brand,type,suppli

    define dimension branch as (branch key,branch name,branch type)

    define dimension location as (location key,street,city,province

    define cube shipping [time,item,shipper,from location,to locat

    dollars cost = sum(cost in dollars), units shipped = count(*)

    define dimension time as time in cube sales

    define dimension item as item in cube sales

    define dimension shipper as (shipper key,shipper name,location a

    location in cube sales,shipper type)

    define dimension from location as location in cube sales

    define dimension to location as location in cube sales

    Data Warehousing - Partitioning Strate

    Introduction

    Partitioning is done to enhance the performance and make the management

    also helps in balancing the various requirements of the system. It will optimi

    performance and simplify the management of the data warehouse. In this we partition ea

    multiple separate partitions. In this chapter we will discuss the partitioning strat

    Why to Partition

    Here is the list of reasons.

    For easy management

    To assist backup/recovery

    To enhance performance

    FOR EASY MANAGEMENT

    The fact table in a data warehouse can grow to many hundreds of gigabytes in size. Th

    fact table is very hard to manage as a single entity. Therefore it needs partitioning.

    TO ASSIST BACKUP/RECOVERY

    If we have not partitioned the fact table then we have to load the complete fact

    data. Partitioning allows us to load only the data that is required on a regular basis. This w

    to load and also enhances the performance of the system.

    Note: To cut down on the backup size all partitions other than the current partitions ca

    only. We can then put these partitions into a state where they cannot be modified.

    backed up. This means that only the current partition is to be backed up.

    TO ENHANCE PERFORMANCE

    By partitioning the fact table into sets of data the query procedures can be enha

    performance is enhanced because now the query scans the partitions that are rel

    have to scan the large amount of data.

    Horizontal Partitioning

    There are various ways in which a fact table can be partitioned. In horizontal partitioning w

    mind the requirements for manageability of the data warehouse.

    PARTITIONING BY TIME INTO EQUAL SEGMENTS

    In this partitioning strategy the fact table is partitioned on the basis of time period

    period represents a significant retention period within the business. For example if th

    month to date data then it is appropriate to partition into monthly segments. W

    partitioned tables by removing the data in them.
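
Partitioning into equal monthly segments can be sketched as routing each fact row to a partition named after its month. The `sales_YYYY_MM` naming scheme and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date

def partition_name(sale_date):
    """Name of the monthly partition a row belongs to (hypothetical naming)."""
    return f"sales_{sale_date:%Y_%m}"

def partition_by_month(rows):
    """Group fact rows into one partition per calendar month."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_name(row["sale_date"])].append(row)
    return dict(partitions)

rows = [
    {"sale_date": date(2013, 8, 3), "units": 5},
    {"sale_date": date(2013, 9, 3), "units": 4},
    {"sale_date": date(2013, 9, 17), "units": 7},
]
partitions = partition_by_month(rows)

# A month-to-date query only has to scan a single partition.
september_units = sum(row["units"] for row in partitions["sales_2013_09"])
```

Because every partition covers the same length of time, a partition can be emptied and reused for a later month without changing the overall layout.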

    PARTITIONING BY TIME INTO DIFFERENT-SIZED SEGMENTS

    This kind of partition is done where the aged data is accessed infrequently.

    implemented as a set of small partitions for relatively current data, larger partition for i

    Following is the list of advantages.

    The detailed information remains available online.

    The number of physical tables is kept relatively small, which reduces the opera

    This technique is suitable where the mix of data dipping recent history, and da

    entire history is required.

    Following is the list of disadvantages.

    This technique is not useful where the partitioning profile changes on regular b

    repartitioning will increase the operation cost of data warehouse.

    PARTITION ON A DIFFERENT DIMENSION

    The fact table can also be partitioned on the basis of dimensions other than time

    group, region, supplier, or any other dimensions. Let's have an example.

    Suppose a market function which is structured into distinct regional departments for

    state basis. If each region wants to query on information captured within its region,

    be more effective to partition the fact table into regional partitions. This will cause the

    up because it does not need to scan information that is not relevant.

    Following is the list of advantages.

    The query does not have to scan irrelevant data, which speeds up the qu

    Following is the list of disadvantages.

    This technique is not appropriate where the dimensions are likely to change

    worth determining that the dimension does not change in future.

    If the dimension changes then the entire fact table would have to be repartitione

    Note: We recommend partitioning only on the basis of the time dimension unle

    that the suggested dimension grouping will not change within the life of data warehou

    PARTITION BY SIZE OF TABLE

    When there is no clear basis for partitioning the fact table on any dimension then we

    the fact table on the basis of its size. We can set the predetermined size as a c

    the table exceeds the predetermined size a new table partition is created.

    Following is the list of disadvantages.

    This partitioning is complex to manage.

    Note: This partitioning requires metadata to identify what data is stored in each partition

    PARTITIONING DIMENSIONS

    If the dimension contains a large number of entries then it is required to partition dim

    have to check the size of the dimension.

    Suppose a large design which changes over time. If we need to store all the variation

    comparisons, that dimension may be very large. This would definitely affect the respo

    ROUND ROBIN PARTITIONS

    In round robin technique when the new partition is needed the old one is archived.

    metadata is used to allow the user access tool to refer to the correct table partition.

    Following is the list of advantages.

    This technique makes it easy to automate table management facilities within the

    Vertical Partition

    In Vertical Partitioning the data is split vertically.

    The Vertical Partitioning can be performed in the following two ways.

    Normalization

    Row Splitting

    NORMALIZATION

    Normalization is the standard relational method of database organization. I

    rows are collapsed into a single row, hence reducing the space.

    Table before normalization

    Product_id | Quantity | Value | sales_date | Store_id | Store_name | Locat

    30 | 5 | 3.67 | 3-Aug-13 | 16 | sunny | Bang

    35 | 4 | 5.33 | 3-Sep-13 | 16 | sunny | Bang

    40 | 5 | 2.50 | 3-Sep-13 | 64 | san | Mumb

    45 | 7 | 5.66 | 3-Sep-13 | 16 | sunny | Bang

    Table after normalization

    Store_id | Store_name | Location | R

    16 | sunny | Bangalore | W

    64 | san | Mumbai | S

    Product_id | Quantity | Value | sales_date | S

    30 | 5 | 3.67 | 3-Aug-13 | 1

    35 | 4 | 5.33 | 3-Sep-13 | 1

    40 | 5 | 2.50 | 3-Sep-13 | 6

    45 | 7 | 5.66 | 3-Sep-13 | 1
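
The split shown in the tables above can be reproduced with a short Python sketch. The rows are taken from the tables; the final column of the store table is cut off in the source, so it is left out here.

```python
# Rows of the table before normalization:
# (product_id, quantity, value, sales_date, store_id, store_name, location)
rows = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore"),
]

def normalize(rows):
    """Split the wide table into a store table and a sales table."""
    stores = {}
    sales = []
    for product_id, quantity, value, sales_date, store_id, store_name, location in rows:
        # Store details are kept once per store_id.
        stores[store_id] = (store_id, store_name, location)
        # Sales rows keep only the store_id as a reference.
        sales.append((product_id, quantity, value, sales_date, store_id))
    return sorted(stores.values()), sales

store_table, sales_table = normalize(rows)
```

The three repeated rows for store 16 collapse into a single store row, which is where the space saving comes from.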

    ROW SPLITTING

    Row splitting tends to leave a one-to-one map between partitions. The motive of

    speed the access to a large table by reducing its size.

    Note: While using vertical partitioning make sure that there is no requirement to p

    operations between two partitions.

    Identify Key to Partition

    It is very crucial to choose the right partition key. Choosing the wrong partition key will lead

    the fact table. Let's have an example. Suppose we want to partition the following table

    Account_Txn_Table

    transaction_id

    account_id

    transaction_type

    value

    transaction_date

    region

    branch_name


    We can choose to partition on any key. The two possible keys could be

    region

    transaction_date

    Now suppose the business is organised in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that the vast majority of queries are restricted to the user's own business region.

    If we partition by transaction_date instead of region, the latest transactions from every region will be in one partition. A user who wants to look at data within his own region then has to query across multiple partitions.

    Hence it is worth determining the right partitioning key.
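
    To make the trade-off concrete, the following sketch partitions a handful of made-up transactions by each candidate key and counts how many partitions a single-region query has to touch. The sample rows, region names, and helper names are hypothetical and serve only to illustrate the argument above.

```python
from collections import defaultdict

# Hypothetical sample rows: (transaction_id, region, transaction_date as year-month)
transactions = [
    (1, "north", "2013-08"),
    (2, "north", "2013-09"),
    (3, "south", "2013-08"),
    (4, "south", "2013-09"),
    (5, "north", "2013-10"),
]


def partition_by(rows, key_index):
    """Group rows into partitions keyed by the column at key_index."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index]].append(row)
    return partitions


def partitions_touched(partitions, region):
    """Count partitions holding at least one row for the given region."""
    return sum(
        1 for rows in partitions.values() if any(r[1] == region for r in rows)
    )


by_region = partition_by(transactions, 1)
by_date = partition_by(transactions, 2)
print(partitions_touched(by_region, "north"))
print(partitions_touched(by_date, "north"))
```

    With region as the key, a regional query reads one partition; with transaction_date as the key, the same query has to visit every partition that contains rows for that region.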

    Data Warehousing - Metadata Concepts

    What is Metadata

    Metadata is simply defined as data about data. The data that are used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents in the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows.

    Metadata is a road map to data warehouse.

    Metadata in a data warehouse defines the warehouse objects.

    Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.

    Note: In data warehouse we create metadata for the data names and definitions

    warehouse. Along with this metadata additional metadata are also created for t

    extracted data, the source of extracted data.

    Categories of Metadata

    The metadata can be broadly categorized into three categories:

    Business Metadata - This metadata has the data ownership information, bu

    and changing policies.

    Technical Metadata- Technical metadata includes database system names,

    names and sizes, data types and allowed values. Technical metadata also in

    information such as primary and foreign key attributes and indices.

    Operational Metadata- This metadata includes currency of data and data lin

    data means whether data is active, archived or purged. Lineage of data mea

    migrated and transformation applied on it.
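
    One way to picture the three categories is as fields on a single record per warehouse table. The sketch below is an assumption about how such a record might be laid out; the class and field names are invented for illustration and do not come from any specific metadata tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Currency(Enum):
    """Operational metadata: whether the data is active, archived or purged."""
    ACTIVE = "active"
    ARCHIVED = "archived"
    PURGED = "purged"


@dataclass
class TableMetadata:
    name: str
    owner: str                              # business metadata: data ownership
    column_types: dict                      # technical metadata: column -> data type
    currency: Currency = Currency.ACTIVE    # operational metadata: currency of data
    lineage: list = field(default_factory=list)  # operational metadata: history of steps

    def record_step(self, step):
        """Append a migration or transformation step to the lineage."""
        self.lineage.append(step)


meta = TableMetadata(name="sales", owner="retail", column_types={"value": "float"})
meta.record_step("extracted from source system")
meta.record_step("converted value to float")
print(meta)
```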


    Role of Metadata

    Metadata has a very important role in a data warehouse. The role of metadata in the warehouse is different from that of the warehouse data, yet it is just as important. The various roles of metadata are explained below.

    Metadata acts as a directory.

    This directory helps the decision support system to locate the contents of the data warehouse.

    Metadata helps the decision support system in mapping data when data are transformed from the operational environment to the data warehouse environment.

    Metadata helps in summarization between current detailed data and highly summarized data.

    Metadata also helps in summarization between lightly detailed data and highly summarized data.

    Metadata are also used for query tools.

    Metadata are used in reporting tools.

    Metadata are used in extraction and cleansing tools.

    Metadata are used in transformation tools.

    Metadata also plays an important role in loading functions.

    Diagram to understand role of Metadata.


    Metadata Repository

    The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:

    Definition of data warehouse- This includes the description of structure of data

    description is defined by schema, view, hierarchies, derived data definition

    locations and contents.

    Business Metadata - This metadata has the data ownership information, bu

    and changing policies.

    Operational Metadata- This metadata includes currency of data and data lin

    data means whether data is active, archived or purged. Lineage of data mea

    migrated and transformation applied on it.

    Data for mapping from operational environment to data warehouse- This m

    source databases and their contents, data extraction,data partition cleanin

    rules, data refresh and purging rules.

    The algorithms for summarization- This includes dimension algorithms, da

    aggregation, summarizing etc.

    Challenges for Metadata Management

    The importance of metadata can not be overstated. Metadata helps in driving the ac

    validates data transformation and ensures the accuracy of calculations. The metad

    the consistent definition of business terms to business end users. With all these u

    also has challenges for metadata management. The some of the challenges are disc

    The Metadata in a big organization is scattered across the organization. T

    spread across spreadsheets, databases, and applications.

    The metadata could present in text file or multimedia file. To use this dat

    management solution, this data need to be correctly defined.

    There are no industry wide accepted standards. The data management solut

    narrow focus.


    There are no easy and accepted methods of passing metadata.

    Data Warehousing - Data Marting

    Why Create a Data Mart

    The following are the reasons to create a data mart:

    To partition data in order to impose access control strategies.

    To speed up the queries by reducing the volume of data to be scanned.

    To segment data into different hardware platforms.

    To structure data in a form suitable for a user access tool.

    Note: Do not data mart for any other reason, since the operation cost of data marting can be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.

    Steps to determine that data mart appears to fit the bill

    The following steps need to be followed to make data marting cost effective:

    Identify the Functional Splits

    Identify User Access Tool Requirements

    Identify Access Control Issues

    IDENTIFY THE FUNCTIONAL SPLITS

    In this step we determine whether there is a natural functional split in the organization. We look for departmental splits, and we determine whether the way in which departments use information tends to be in isolation from the rest of the organization. Let's have an example.

    Suppose that in a retail organization each merchant is accountable for maximizing the sales of a group of products. For this, the following information is valuable:

    sales transaction on daily basis

    sales forecast on weekly basis

    stock position on daily basis

    stock movements on daily basis

    As the merchant is not interested in the products they are not dealing with, the data mart is a subset of the data dealing with the product group of interest. The following diagram shows data marting for different users.


    Issues in determining the functional split:

    The structure of the department may change.

    The products might switch from one department to another.

    The merchant could query the sales trend of other products to analyse what is

    sales.

    These are issues that need to be taken into account while determining the functional split.

    Note: We need to determine the business benefits and technical feasibility of using a data mart.

    IDENTIFY USER ACCESS TOOL REQUIREMENTS

    For user access tools that require internal data structures, we need data marts to support such tools. The data in such structures is outside the control of the data warehouse but needs to be populated and updated on a regular basis.

    There are some tools that can be populated directly from the source system, but some cannot. Therefore additional requirements outside the scope of the tool need to be identified for the future.

    Note: In order to ensure consistency of data across all access tools, the data should not be populated directly from the data warehouse; rather, each tool must have its own data mart.

    IDENTIFY ACCESS CONTROL ISSUES

    There need to be privacy rules to ensure the data is accessed only by authorised users. For example, a data warehouse for a retail banking institution must ensure that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.

    Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems, the detailed data can be removed from the data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with detailed account data.
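
    The idea of one data mart per legal entity can be sketched as a simple grouping step. The account rows, entity names, and function name below are hypothetical; a real load would read from the warehouse and write to physically separate stores.

```python
from collections import defaultdict

# Hypothetical warehouse rows: (account_id, legal_entity, balance)
warehouse_accounts = [
    (1001, "bank_a", 250.0),
    (1002, "bank_b", 90.5),
    (1003, "bank_a", 12.0),
]


def build_data_marts(rows):
    """Return one list of rows per legal entity, so each mart holds only its own accounts."""
    marts = defaultdict(list)
    for account_id, legal_entity, balance in rows:
        marts[legal_entity].append((account_id, balance))
    return dict(marts)


marts = build_data_marts(warehouse_accounts)
print(sorted(marts))
```

    Because each mart is built only from rows of its own entity, a user of one mart has no path to the accounts of another entity.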


    Designing Data Marts

    The data marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. This helps in maintaining control on database instances.

    The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all dimension data in the starflake schema.

    Cost Of Data Marting

    The following are the cost measures for Data marting:

    Hardware and Software Cost

    Network Access

    Time Window Constraints

    HARDWARE AND SOFTWARE COST

    Although the data marts are created on the same hardware, they still require some additional hardware and software. To handle the user queries there is a need for additional processing power and disk storage. If the detailed data and the data mart exist within the data warehouse, then we face additional cost to store and manage replicated data.

    Note: Data marting is more expensive than aggregations; therefore it should be used as an additional strategy, not as an alternative strategy.

    NETWORK ACCESS

    The data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.

    TIME WINDOW CONSTRAINTS

    The extent to which the data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The feasibility of data marts depends on:

    Network Capacity.

    Time Window Available

    Volume of data being transferred

    Mechanisms being used to insert data into data mart
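
    A rough back-of-the-envelope check ties these factors together. The sketch below estimates how long a given volume takes to ship over a link and whether that fits the load window. The function names, the decimal unit conversion (1 GB = 8000 megabits), and the 0.7 efficiency factor are assumptions for illustration; real throughput depends on the insert mechanism and the transformations involved.

```python
def transfer_hours(volume_gb, network_mbps, efficiency=0.7):
    """Estimate hours needed to ship volume_gb over a link of network_mbps megabits/s.

    efficiency is an assumed fraction of nominal bandwidth actually achieved.
    """
    megabits = volume_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    seconds = megabits / (network_mbps * efficiency)
    return seconds / 3600


def fits_window(volume_gb, network_mbps, window_hours, efficiency=0.7):
    """Return True if the estimated transfer time fits the available window."""
    return transfer_hours(volume_gb, network_mbps, efficiency) <= window_hours


print(round(transfer_hours(200, 100), 2))
print(fits_window(200, 100, window_hours=4))
print(fits_window(200, 1000, window_hours=4))
```

    Under these assumptions, 200 GB over a 100 Mbit/s link takes a little over six hours, so it would not fit a four-hour window, whereas a 1 Gbit/s link would.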

    Data Warehousing - System Managers

    Introduction

    System management is a must for the successful implementation of a data warehouse. In this chapter we will discuss the most important system managers, such as the following:

    System Configuration Manager

    System Scheduling Manager

    System Event Manager

    System Database Manager

    System Backup Recovery Manager

    System Configuration Manager

    The system configuration manager is responsible for the management of the configuration of the data warehouse.

    The structure of the configuration manager varies from operating system to operating system.

    In Unix, the structure of the configuration manager varies from vendor to vendor.

    Configuration managers have a single user interface.

    The interface of the configuration manager allows us to control all aspects of the system.

    Note: The most important configuration tool is the I/O manager.

    System Scheduling Manager

    The System Scheduling Manager is also responsible for the successful implementation of the data warehouse. The purpose of this scheduling manager is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The features of the System Scheduling Manager are as follows.

    Work across cluster or MPP boundaries.

    Deal with international time differences.

    Handle job failure.

    Handle multiple queries.


    Support job priorities.

    Restart or requeue the failed jobs.

    Notify the user or a process when a job is completed.

    Maintain the job schedules across system outages.

    Requeue jobs to other queues.

    Support the stopping and starting of queues.

    Log Queued jobs.

    Deal with interqueue processing.

    Note: The list above gives the parameters for evaluating a good scheduler.
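
    A few of these features (job priorities, handling failure, requeueing failed jobs, and logging queued jobs) can be sketched with a priority queue. This is a minimal single-process illustration, not a cluster-aware scheduler; the class, method, and job names are assumptions made for the example.

```python
import heapq
import itertools


class Scheduler:
    """Run jobs in priority order and requeue failed jobs a limited number of times."""

    def __init__(self, max_retries=1):
        self._queue = []
        self._counter = itertools.count()  # tie-breaker: FIFO order within a priority
        self.max_retries = max_retries
        self.log = []

    def submit(self, name, job, priority=10, attempts=0):
        """Queue a job; a lower priority number runs first."""
        entry = (priority, next(self._counter), name, job, attempts)
        heapq.heappush(self._queue, entry)
        self.log.append(("queued", name))

    def run(self):
        while self._queue:
            priority, _, name, job, attempts = heapq.heappop(self._queue)
            try:
                job()
                self.log.append(("completed", name))
            except Exception:
                if attempts < self.max_retries:
                    # Requeue the failed job with the same priority.
                    self.submit(name, job, priority, attempts + 1)
                else:
                    self.log.append(("failed", name))


scheduler = Scheduler(max_retries=1)
scheduler.submit("backup", lambda: None, priority=5)
scheduler.submit("data load", lambda: None, priority=1)
scheduler.run()
print(scheduler.log)
```

    The log doubles as a simple record of queued jobs; a production scheduler would persist it so that schedules survive system outages.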

    Some important jobs that the scheduler must be able to handle are as follows:

    Daily and ad hoc query scheduling.

    execution of regular report requirements.

    Data load

    Data Processing

    Index creation

    Backup

    Aggregation creation

    data transformation

    Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.

    System Event Manager

    The event manager is a kind of software. The event manager manages the events that are defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex. Therefore we need a tool that automatically handles all the events without intervention of the user.

    Note: The event manager monitors event occurrences and deals with them. The event manager also tracks the myriad of things that can go wrong on this complex data warehouse system.

    EVENTS

    What is an event? An event is an action generated by the user or the system itself. Note that an event is a measurable, observable occurrence of a defined action.

    The following are the common events that are required to be tracked.

    Hardware failure.

    Running out of space on certain key disks.

    A process dying.

    A process returning an error.

    CPU usage exceeding an 80% threshold.


    Internal contention on database serialization points.

    Buffer cache hit ratios exceeding or falling below threshold.

    A table reaching its maximum size.

    Excessive memory swapping.

    A table failing to extend due to lack of space.

    Disk exhibiting I/O bottlenecks.

    Usage of temporary or sort area reaching a certain threshold.

    Any other database shared memory usage.

    The most important thing about events is that they should be capable of executing on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as the event handler. This code is executed whenever an event occurs.
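
    The relationship between events and event handlers can be sketched as a registry that maps event names to functions. The event names, the decorator, and the message formats are hypothetical, and the CPU check assumes the threshold in the list above is meant as 80 percent.

```python
handlers = {}


def on(event_name):
    """Decorator that registers a function as the handler for event_name."""
    def register(func):
        handlers[event_name] = func
        return func
    return register


@on("cpu_high")
def handle_cpu_high(details):
    return f"alert: CPU at {details['percent']}%"


@on("disk_full")
def handle_disk_full(details):
    return f"alert: disk {details['disk']} is out of space"


def check_cpu(percent, threshold=80):
    """Return a cpu_high event if usage exceeds the threshold, else None."""
    if percent > threshold:
        return ("cpu_high", {"percent": percent})
    return None


def dispatch(event):
    """Run the handler registered for the event without user intervention."""
    name, details = event
    handler = handlers.get(name)
    if handler is None:
        return f"unhandled event: {name}"
    return handler(details)


event = check_cpu(92)
if event is not None:
    print(dispatch(event))
```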

    System and Database Manager

    System and Database manager are the two separate piece of software but they do t

    objective of these tools is to automate the certain processes and to simplify the ex

    The criteria for choosing the system and database manager are an ability to:

    increase user's quota.

    assign and deassign roles to the users.

    assign and deassign profiles to the users.

    perform database space management.

    monitor and report on space usage.

    tidy up fragmented and unused space.

    add and expand the space.

    add and remove users.

    manage user passwords.

    manage summary or temporary tables.

    assign or deassign temporary space to and from the user.

    reclaim the space from old or out-of-date temporary tables.

    manage error and trace logs.

    browse log and trace files.

    redirect error or trace information.

    switch on and off error and trace logging.

    perform system space management.

    monitor and report on space usage.

    clean up old and unused file directories.

    add or expand space.


    System Backup Recovery Manager

    The backup and recovery tool makes it easy for operations and management staff to back up the data. It is worth noting that the system backup manager must be integrated with the schedule manager software being used. The important features required for the management of backups are as follows.

    Scheduling

    Backup data tracking

    Database awareness.

    Backups are taken only to protect the data against loss. Following are the important points to remember.

    The backup software will keep some form of database of where and when the

    backed up.

    The backup recovery manager must have a good front end to that database.

    The backup recovery software should be database aware.

    Being aware of database the software then can be addressed in database te

    perform backups that would not be viable.

    Data Warehousing - Process Managers

    Data Warehouse Load Manager

    This component performs the operations required by the extract and load process.

    The size and complexity of the load manager varies between specific solutions, from data warehouse to data warehouse.

    LOAD MANAGER ARCHITECTURE

    The load manager does the following functions.

    Extract the data from source system.

    Fast Load the extracted data into temporary data store.

    Perform simple transformations into a structure similar to the one in the data warehouse.


    EXTRACT DATA FROM SOURCE

    The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

    FAST LOAD

    In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

    Transformations affect the speed of data processing.

    It is more effective to load the data into a relational database prior to applying transformations and checks.

    Gateway technology proves not to be suitable, since gateways tend not to be performant when large data volumes are involved.

    SIMPLE TRANSFORMATIONS

    While loading, it may be required to perform simple transformations. After this has been completed, we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks.

    Strip out all the columns that are not required within the warehouse.

    Convert all the values to required data types.
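
    The two checks listed above can be sketched as a single pass over each extracted record. The raw record, its extra columns, and the REQUIRED schema below are hypothetical; they only illustrate stripping unneeded columns and converting text values to typed values.

```python
from datetime import datetime

# Hypothetical raw EPOS record as extracted: every value arrives as text.
raw = {
    "product_id": "30",
    "quantity": "5",
    "value": "3.67",
    "sales_date": "3-Aug-13",
    "till_operator": "jk",       # not required within the warehouse
    "receipt_footer": "thanks",  # not required within the warehouse
}

# Columns the warehouse keeps, with the converter for each (assumed schema).
REQUIRED = {
    "product_id": int,
    "quantity": int,
    "value": float,
    "sales_date": lambda s: datetime.strptime(s, "%d-%b-%y").date(),
}


def simple_transform(record):
    """Keep only the required columns and convert each value to its target type."""
    return {col: convert(record[col]) for col, convert in REQUIRED.items()}


print(simple_transform(raw))
```

    Columns that are not in REQUIRED are dropped simply by never being copied, which keeps the transformation cheap enough to run during the load.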

    Warehouse Manager

    The warehouse manager is responsible for the warehouse management process.

    The warehouse manager consists of third party system software, C programs an

    The size and complexity of the warehouse manager varies between specific solutions.

    WAREHOUSE MANAGER ARCHITECTURE


    The warehouse manager includes the following.

    The Controlling process

    Stored procedures or C with SQL

    Backup/Recovery tool

    SQL Scripts

    OPERATIONS PERFORMED BY WAREHOUSE MANAGER

    The warehouse manager analyses the data to perform consistency and referential integrity checks.

    Creates the indexes, business views, and partition views against the base data.

    Generates the new aggregations and also updates the existing aggregations.

    Generates the normalizations.

    The warehouse manager transforms and merges the source data from the temporary store into the published data warehouse.

    Backs up the data in the data warehouse.

    The warehouse manager archives the data that has reached the end of its captured life.

    Note: The warehouse manager also analyses query profiles to determine which indexes and aggregations are appropriate.

    Query Manager

    The query manager is responsible for directing the queries to the suitable tables.

    By directing the queries to the appropriate table, the query request and response process is sped up.

    The query manager is responsible for scheduling the execution of the queries posed by the user.


    QUERY MANAGER ARCHITECTURE

    Query Manager includes the following.

    The query redirection via C tool or RDBMS.

    Stored procedures.

    Query Management tool.

    Query Scheduling via C tool or RDBMS.

    Query Scheduling via third party Software.

    OPERATIONS PERFORMED BY QUERY MANAGER

    The query manager directs queries to the appropriate tables.

    The query manager schedules the execution of the queries posed by the end user.

    The query manager stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.
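
    Query redirection and query profiling can be sketched together: pick the smallest table that can answer a query, and count how often each column combination is requested. The table catalogue, row counts, and function name are hypothetical and exist only to illustrate the operations listed above.

```python
from collections import Counter

# Hypothetical catalogue: table name -> (columns available, approximate row count)
TABLES = {
    "sales_detail": ({"product_id", "store_id", "sales_date", "value"}, 10_000_000),
    "sales_by_store_month": ({"store_id", "month", "value"}, 5_000),
    "sales_by_month": ({"month", "value"}, 60),
}

# Query profiles: how often each set of columns has been requested.
query_profiles = Counter()


def direct_query(columns_needed):
    """Pick the smallest table containing every requested column and record the profile."""
    needed = frozenset(columns_needed)
    candidates = [
        (rows, name) for name, (cols, rows) in TABLES.items() if needed <= cols
    ]
    if not candidates:
        raise LookupError(f"no table covers {sorted(needed)}")
    query_profiles[needed] += 1
    return min(candidates)[1]
```

    For example, direct_query({"month", "value"}) would be routed to the small monthly summary rather than the detail table, and the accumulated query_profiles show which column combinations are requested often enough to deserve an aggregation or index.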

    Data Warehousing - Security

    Introduction

    The objective of a data warehouse is to allow large amounts of data to be easily accessible by the users, hence allowing users to extract information about the business as a whole. But we know that there could be some security restrictions applied on the data, which can prove an obstacle for accessing the information. If the analyst has a restricted view of data, then it is impossible to capture a complete picture of the trends within the business.

    The data from each analyst can be summarised and passed on to management, where the different summaries can be created. As the aggregation of summaries cannot be the same as the aggregation as a whole, it is possible to miss some information trends in the data unless someone is analysing the data as a whole.

    Requirements

    Adding security will affect the performance of the data warehouse, therefore it is worth determining the security requirements as early as possible. Adding security after the data warehouse has gone live is very difficult.

    During the design phase of the data warehouse we should keep in mind what data sources may be added later and what the impact of adding those data sources would be. We should consider the following possibilities during the design phase.

    Will the new data sources require new security and/or audit restrictions to be implemented?

    Will new users be added who have restricted access to data that is already generally available?

    This situation arises when the future users and the data sources are not well known. In such a situation we need to use our knowledge of the business and the objective of the data warehouse to know the likely requirements.

    Factors to Consider for Security Requirements

    The following are the parts that are affected by security, hence it is worth considering these factors:

    User Access


    Data Load

    Data Movement

    Query Generation

    USER ACCESS

    We need to classify the data first, and then classify the users by what data they can access. In other words, the users are classified according to the data they can access.

    Data Classification

    The