Data Warehousing Design Considerations

  • 7/31/2019 Data Warehousing Design Considerations

    Data Warehouse Design Considerations

    M. Tech. Course Seminar Report

    Submitted in partial fulfillment of the requirements

    for the degree of

    Master of Technology

    by

    Abhishek Sugandhi

    Roll No: 04305016

    under the guidance of

    Prof. N.L.Sarda

    Department of Computer Science and Engineering

    Indian Institute of Technology, Bombay

    Mumbai

    Acknowledgment

    I would like to thank my seminar guide, Prof. N. L. Sarda, for his valuable guidance and encouragement, without which it would not have been possible for me to complete my work.

    Abstract

    A data warehouse is a complex information system used primarily in the decision-making process by means of On-Line Analytical Processing (OLAP) applications. Over the last few years, data warehouses have received a lot of attention from both the industrial and the research community. The reason lies in their great importance: making predictions about the (near) future has always been desirable for business companies. In chapter 1, I discuss the basics of data warehouses and their modeling techniques.

    Decision support places rather different requirements on database technology compared to traditional on-line transaction processing. Data warehouses are usually modeled using dimensional modeling, for better understandability and easy extensibility. As data warehouses store huge amounts of both current and historical data, special attention should be given to changing dimensions, time and date dimensions, and hierarchical dimensions while modeling a data warehouse. In chapter 2, I focus on handling these issues while modeling the data warehouse.

    Software vendors have quickly developed products and services for improving the efficiency of querying data warehouses. In chapter 3, I discuss the querying features provided by Oracle 9i for improving the efficiency of aggregate queries, and the querying features provided by MDX. MDX stands for Multidimensional Expressions. It is a language used to manipulate multidimensional information in Microsoft SQL Server 2000 Analysis Services.

    Contents

    1 Introduction
    1.1 What is Data Warehouse
    1.2 Warehouse Schemas
    1.3 Dimensional Model Vs. ER Model

    2 Data Warehouse Design Issues
    2.1 How to model time and date dimension
    2.2 Dimension normalization
    2.3 Surrogate keys
    2.4 Slowly Changing Dimensions
    2.4.1 Type 1: Overwrite the Value
    2.4.2 Type 2: Add a new Dimension Row
    2.4.3 Type 3: Add a new Dimension Column
    2.5 Rapidly Changing Dimension
    2.6 Handling Hierarchies
    2.6.1 Fixed Depth Hierarchy
    2.6.2 Variable Depth Hierarchy
    2.7 Multivalued Dimension
    2.8 Heterogeneous Dimension
    2.9 Dimension Role Playing
    2.10 Conformed Dimensions

    3 Querying on Data Warehouse
    3.1 Oracle 9i SQL extension for Aggregation Queries in data warehouse
    3.1.1 Syntax
    3.1.2 Applications in building Cross-Tabular Report
    3.2 Writing MDX queries for Data Warehouse
    3.2.1 Common Terms
    3.2.2 Rules
    3.2.3 MDX Query Structure
    3.2.4 Specifying Axis Dimensions
    3.2.5 Establishing Cube Context
    3.2.6 Specifying Slicer Dimensions
    3.2.7 Example Queries
    3.2.8 Difference of MDX with SQL

    4 Conclusion

    Figure 1.1: Star Schema [?]
    [Diagram: a central Sales fact table (customer_id, time_id, channel_id, promotion_id, quantity sold, cost, amount) whose foreign keys point to four dimension tables, Customer, Channels, Promotion and Time, each holding its key and other attributes.]

    requirements, dimension attributes are usually short identifiers that are foreign keys into other tables called dimension tables. Usually, a fact table is associated with many dimension tables and contains a foreign key for each of them. The fact table is kept highly normalized to reduce space requirements, whereas dimension tables are highly denormalized to ease browsing among the different attributes of a dimension and to enable us to write simple, easily understandable queries.

    The resulting schema, with a fact table, multiple dimension tables, and foreign keys from the fact table to the dimension tables, is called a star schema. If we normalize the dimension tables, so that a dimension table contains foreign keys to other dimension tables, the resulting schema has multiple levels of dimension tables and is called a snowflake schema. Some complex data warehouses may have more than one fact table [?].
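    The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustrative miniature, not the schema of Figure 1.1: the table contents, the customer names and the two-dimension layout are assumptions made for the example.

```python
import sqlite3

# Hypothetical miniature star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE channel  (channel_id  INTEGER PRIMARY KEY, channel_desc TEXT);
    -- The fact table holds only foreign keys and numeric measures.
    CREATE TABLE sales (
        customer_id   INTEGER REFERENCES customer(customer_id),
        channel_id    INTEGER REFERENCES channel(channel_id),
        quantity_sold INTEGER,
        amount        REAL
    );
    INSERT INTO customer VALUES (1, 'Asha', 'Mumbai'), (2, 'Ravi', 'Pune'), (3, 'Meera', 'Mumbai');
    INSERT INTO channel  VALUES (1, 'Store'), (2, 'Web');
    INSERT INTO sales    VALUES (1, 1, 2, 20.0), (2, 2, 1, 15.0), (3, 2, 3, 30.0), (1, 2, 1, 10.0);
""")

# Every analytical query follows the same shape: join the fact table to a
# dimension, group by a descriptive attribute, aggregate a measure.
amount_by_city = conn.execute("""
    SELECT c.city, SUM(s.amount)
    FROM sales s JOIN customer c ON c.customer_id = s.customer_id
    GROUP BY c.city ORDER BY c.city
""").fetchall()

amount_by_channel = conn.execute("""
    SELECT ch.channel_desc, SUM(s.amount)
    FROM sales s JOIN channel ch ON ch.channel_id = s.channel_id
    GROUP BY ch.channel_desc ORDER BY ch.channel_desc
""").fetchall()

print(amount_by_city)
print(amount_by_channel)
```

    Because every dimension is exactly one join away from the fact table, the two queries differ only in which dimension and attribute they name.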

    1.3 Dimensional Model Vs. ER Model

    The main difference between the dimensional model and the ER model is that dimension tables in the dimensional model are denormalized, whereas tables in the ER model are highly normalized. The ER design technique seeks to remove redundancy in the data by normalizing the relations, so that less disk space is wasted and there are no insert or update anomalies. In a data warehouse, however, if the dimension tables are normalized into typical snowflake structures, two bad things happen. First, the data model becomes too complex to be presented to the user. Second, linking the elements among the various branches of the snowflake compromises browsing performance. Even when a long text string appears redundantly in the dimension table and could be moved to an outrigger table (a table formed by normalization), we won't save enough disk space to justify moving it, because most of the disk space is consumed by the fact table (which is highly normalized) [?].

    In many cases, normalization can actually increase the storage requirements. If the cardinality of the repeated dimension data element is high (in other words, there are just a few duplications), the outrigger table may be nearly as big as the main dimension table, and we have introduced another key structure that is now repeated in both tables [?].

    Another argument given for normalizing the dimensions is to improve insert or update performance. This is rarely important in a decision-support environment. The dimension tables are typically updated only once per night, and the processing associated with loading perhaps millions of fact records dominates the minor processing associated with inserting or updating dimension records. A dimensional database design has a fixed structure with no alternative join paths. This greatly simplifies the optimization and evaluation of queries on these schemas [?].

    The fact table in a dimensional model represents a many-to-many relationship among the dimension tables. We can convert an ER model into a dimensional model in the presence of such a many-to-many relationship, and such a relationship is always present in data warehouses.

    Chapter 2

    Data Warehouse Design Issues

    2.1 How to model time and date dimension

    The date dimension is one of the most important dimensions in a data warehouse. It is guaranteed to be present in every data mart because virtually every data mart is a time series. Instead of keeping the date as an attribute in the fact table or in another dimension table, we should build a separate date dimension table, because it allows analysts to query the data warehouse on special attributes such as holidays or major events. SQL date functions do not support filtering by these attributes, so if the business process needs to slice the data by such nonstandard date attributes, an explicit dimension table is essential. Calendar logic belongs in the dimension table rather than in application code [?]. Unlike most other dimensions, the date dimension can be built in advance. Every day is represented as a row in the date dimension, so keeping 10 years of history needs only about 3650 rows. The date dimension key should be an integer rather than a date data type; this is explained further in the section on surrogate keys.
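    As a rough sketch of building the date dimension in advance, the following Python code fills a SQLite table with one row per day. The column names, the ten-year range and the single holiday are assumptions chosen for illustration; a real calendar table would carry many more attributes.

```python
import sqlite3
from datetime import date, timedelta

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def build_date_dimension(conn, start, end, holidays):
    """Create date_dim with one row per day in [start, end]; return the row count."""
    conn.execute("""
        CREATE TABLE date_dim (
            date_key    INTEGER PRIMARY KEY,  -- integer surrogate key, not a DATE
            full_date   TEXT,
            day_of_week TEXT,
            month       INTEGER,
            year        INTEGER,
            is_holiday  INTEGER               -- calendar logic lives in the table
        )""")
    key, day = 0, start
    while day <= end:
        key += 1
        conn.execute(
            "INSERT INTO date_dim VALUES (?, ?, ?, ?, ?, ?)",
            (key, day.isoformat(), DAY_NAMES[day.weekday()],
             day.month, day.year, int(day in holidays)))
        day += timedelta(days=1)
    return key

conn = sqlite3.connect(":memory:")
# Hypothetical holiday list; in practice it would come from the business calendar.
row_count = build_date_dimension(
    conn, date(2001, 1, 1), date(2010, 12, 31), {date(2005, 1, 26)})

holiday_rows = conn.execute(
    "SELECT full_date, day_of_week FROM date_dim WHERE is_holiday = 1").fetchall()
print(row_count, holiday_rows)
```

    Filtering on is_holiday is an ordinary column constraint here, which is exactly what SQL date functions alone cannot offer.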

    If we want access to the time of a transaction for day-part analysis, we should not keep time-of-day attributes such as hour and minute as fields in the date dimension table. Instead, we should handle them through a separate Time Of Day dimension joined to the fact table [?]. This saves a good amount of space: rather than keeping 24 * 60 = 1440 rows per day in the date dimension table to describe every minute (3650 * 1440 = 5,256,000 rows for 10 years, which is very large for a dimension table), we build a single Time Of Day table that contains only 1440 rows in total. The date dimension and the Time Of Day dimension are completely independent.
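    The row counts above can be checked with a short Python sketch that generates the Time Of Day rows. The day-part labels and their hour boundaries are assumptions for illustration only.

```python
def day_part(hour):
    """Map an hour to a coarse day part (band boundaries are illustrative)."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"

def build_time_of_day_rows():
    """Return one (time_key, hour, minute, day_part) row per minute of the day."""
    rows = []
    for minute_of_day in range(24 * 60):
        hour, minute = divmod(minute_of_day, 60)
        rows.append((minute_of_day + 1, hour, minute, day_part(hour)))
    return rows

time_rows = build_time_of_day_rows()
separate_design = 3650 + len(time_rows)   # date rows plus time-of-day rows
combined_design = 3650 * 24 * 60          # one row per minute of every day

print(len(time_rows), separate_design, combined_design)
```

    Keeping the two dimensions independent means the fact table carries two small integer keys instead of pointing into a table with millions of rows.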

    2.2 Dimension normalization

    Dimension table normalization is usually referred to as snowflaking. Redundant attributes are removed from the flat, denormalized dimension table and placed in normalized secondary dimension tables. We should generally avoid snowflaking for the following reasons [?]:

    1. Snowflaked tables make for a much more complex presentation.

    2. Numerous tables and joins usually translate into slower query performance.

    3. The minor space savings associated with snowflaked dimension tables are insignificant, as dimension tables are generally small and most of the space is consumed by the fact table.

    4. It slows down the user's ability to browse within a dimension.

    5. Finally, snowflaking defeats the use of bitmap indexes, which are very useful for indexing low-cardinality fields in dimension tables.

    There are times, however, when snowflaking is permissible, for instance when a clump of correlated attributes is used repeatedly in various independent roles. For example, in the promotion dimension we need to store both a promotion begin date and a promotion end date attribute [?]. Another example is a multivalued attribute, for which we need a bridge table.

    2.3 Surrogate keys

    Surrogate keys are integers that are assigned sequentially as needed to populate a dimension. They merely serve to join the dimension tables to the fact table. Surrogate keys are beneficial for the following reasons [?]:

    1. We should avoid operational codes or other smart keys as data warehouse keys, because operational codes are normally recycled after some period, say one year, whereas the data warehouse retains data for years. One of the primary benefits of surrogate keys is that they buffer the data warehouse environment from operational changes. If we rely on operational codes, we are also vulnerable to key overlap problems.

    2. Smaller surrogate keys translate into smaller fact tables, smaller fact table indexes, and more fact table rows per I/O operation.

    3. Surrogate keys can be used to record dimension conditions that have no operational code. For example, our dimensional model may have dates that are yet to be determined. There is no SQL date value for such a date, but it can be handled with surrogate keys: we keep one more row in the date dimension table, with its own unique key, to identify the YET TO BE DETERMINED condition, and so avoid a null date dimension key in the fact table.
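    The third point can be sketched as follows, using sqlite3. The table names, the choice of key 0 for the special row and the sample orders are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (
        date_key    INTEGER PRIMARY KEY,   -- surrogate key
        full_date   TEXT,                  -- NULL only for the special row
        description TEXT NOT NULL
    );
    CREATE TABLE order_fact (
        order_id       INTEGER,
        order_date_key INTEGER NOT NULL REFERENCES date_dim(date_key),
        ship_date_key  INTEGER NOT NULL REFERENCES date_dim(date_key),
        amount         REAL
    );
    -- Key 0 is a hypothetical choice for the "yet to be determined" row.
    INSERT INTO date_dim VALUES
        (0, NULL,         'Yet to be determined'),
        (1, '2005-03-01', '1 March 2005'),
        (2, '2005-03-02', '2 March 2005');
    -- Order 102 has not shipped yet, so it points at key 0 instead of NULL.
    INSERT INTO order_fact VALUES (101, 1, 2, 40.0), (102, 1, 0, 60.0);
""")

# A plain inner join keeps every fact row, because no foreign key is NULL.
shipping_status = conn.execute("""
    SELECT d.description, COUNT(*), SUM(f.amount)
    FROM order_fact f JOIN date_dim d ON d.date_key = f.ship_date_key
    GROUP BY d.date_key, d.description
    ORDER BY d.date_key
""").fetchall()
print(shipping_status)
```

    Had the unshipped order carried a NULL ship date key, the same inner join would have silently dropped it from the report.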

    2.4 Slowly Changing Dimensions

    While dimension attributes are relatively static, they are not fixed forever. Dimension attributes change, albeit rather slowly, over time. Accurate tracking of changes is necessary so that business users can see the impact of each dimension change. When we need to track changes, it is unacceptable to put everything into the fact table or to make every dimension time-dependent. Instead, we can take advantage of the fact that most dimensions are nearly constant over time. We can preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions [?].

    For each attribute in a dimension table, we must specify a strategy to handle change. There are three basic techniques for dealing with attribute changes [?]:

    1. Overwrite the value

    2. Add a dimension row

    3. Add a dimension column

    For example, suppose that manufacturing operations makes a slight change in the packaging of SKU 38 (a unique product number assigned by the organization), and the packaging description changes from "glued box" to "pasted box". Along with this change, manufacturing operations decides not to change the SKU number of the product or the bar code (UPC) that is printed on the box. Let us see how each of the above methods handles this changing dimension.

    2.4.1 Type 1: Overwrite the Value

    With Type 1, we merely overwrite the old attribute value in the dimension row, replacing it with the current value, so that the attribute always reflects the most recent assignment. The Type 1 response is the simplest and fastest to implement, but it does not maintain any history of prior attribute values [?]. Nevertheless, overwriting is frequently used when the data warehouse team legitimately decides that the old value of the changed dimension attribute is not interesting [?].

    In the above example, the original row

    Product Key   Product Description   Packaging    SKU No.
    12345         Scent                 glued box    ABC922

    will be updated as

    Product Key   Product Description   Packaging    SKU No.
    12345         Scent                 pasted box   ABC922
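    A minimal sketch of the Type 1 response with sqlite3 follows. The table and column names and the helper function apply_type1 are hypothetical; only the row values come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        description TEXT,
        packaging   TEXT,
        sku         TEXT
    )""")
conn.execute("INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922')")

def apply_type1(conn, sku, new_packaging):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    conn.execute("UPDATE product_dim SET packaging = ? WHERE sku = ?",
                 (new_packaging, sku))

apply_type1(conn, "ABC922", "pasted box")
type1_rows = conn.execute(
    "SELECT product_key, packaging, sku FROM product_dim").fetchall()
print(type1_rows)
```

    The product key is unchanged, so every existing fact row now appears under the new packaging value.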

    2.4.2 Type 2: Add a new Dimension Row

    The second technique is the most common and has a number of powerful advantages. If the data warehouse team decides to track the change of an attribute, it issues another record (a new row in the dimension table) with the changed attribute value. The only difference between the two records is the changed attribute; even the operational codes are the same.

    This technique for tracking slowly changing dimensions is very powerful because new dimension records automatically partition history in the fact table. The old version of the dimension record points to all history in the fact table prior to the change, and the new version points to all history after the change [?]. There is no need for a time stamp in the product table to record the change; the change is best recorded by the fact table records that carry the key of the newly added record [?].

    Another advantage of this technique is that we can gracefully track as many changes to a dimensional item as we wish. Each change generates a new dimension record, and each record partitions history perfectly. The main drawbacks of the technique are the requirement to generalize the dimension key and the growth of the dimension table itself [?].

    Using the Type 2 technique for the previous example, we would have two product dimension rows (the original and the updated one):

    Product Key   Product Description   Packaging    SKU No.
    12345         Scent                 glued box    ABC922
    34567         Scent                 pasted box   ABC922
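    The partitioning of history can be sketched with sqlite3 as below. The fact table, the quantities and the helper apply_type2 are assumptions for illustration, and the new surrogate key is whatever SQLite assigns next rather than the 34567 shown in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,   -- surrogate key, not the SKU
        description TEXT,
        packaging   TEXT,
        sku         TEXT
    );
    CREATE TABLE sales_fact (
        product_key INTEGER REFERENCES product_dim(product_key),
        quantity    INTEGER
    );
    INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922');
    INSERT INTO sales_fact  VALUES (12345, 10);   -- sold before the change
""")

def apply_type2(conn, sku, new_packaging):
    """Type 2: add a new dimension row with a new surrogate key; return that key."""
    (description,) = conn.execute(
        "SELECT description FROM product_dim WHERE sku = ? "
        "ORDER BY product_key DESC LIMIT 1", (sku,)).fetchone()
    cursor = conn.execute(
        "INSERT INTO product_dim (description, packaging, sku) VALUES (?, ?, ?)",
        (description, new_packaging, sku))
    return cursor.lastrowid

new_key = apply_type2(conn, "ABC922", "pasted box")
conn.execute("INSERT INTO sales_fact VALUES (?, 7)", (new_key,))  # sold after the change

quantity_by_packaging = conn.execute("""
    SELECT p.packaging, SUM(f.quantity)
    FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.packaging ORDER BY p.packaging
""").fetchall()
print(new_key, quantity_by_packaging)
```

    Both rows share the SKU ABC922, so constraining on the SKU still returns the whole history, while grouping by packaging splits it at the change.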

    2.4.3 Type 3: Add a new Dimension Column

    While the Type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history or vice versa. However, we sometimes want the ability to see fact data as if the change never occurred. We can meet this requirement not by creating a new dimension record, as in the Type 2 technique, but by creating a new field in the dimension row, so that both the current and the previous value are kept. The Type 3 technique allows us to see new and historical fact data by either the new or the prior attribute value [?].

    Using the Type 3 technique for the previous example, we would update the original row as

    Product Key   Product Description   Current Packaging   Previous Packaging   SKU No.
    12345         Scent                 pasted box          glued box            ABC922
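    A minimal sqlite3 sketch of the Type 3 response is given below. The column name previous_packaging, the fact rows and the helper apply_type3 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,
        description TEXT,
        packaging   TEXT,                  -- always the current value
        sku         TEXT
    );
    CREATE TABLE sales_fact (product_key INTEGER, quantity INTEGER);
    INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922');
    INSERT INTO sales_fact  VALUES (12345, 10), (12345, 7);
    -- Type 3 needs one extra column to hold the prior value.
    ALTER TABLE product_dim ADD COLUMN previous_packaging TEXT;
""")

def apply_type3(conn, sku, new_packaging):
    """Type 3: move the current value to the 'previous' column, then overwrite."""
    # The right-hand side 'packaging' is read before the row is modified.
    conn.execute(
        "UPDATE product_dim SET previous_packaging = packaging, packaging = ? "
        "WHERE sku = ?", (new_packaging, sku))

apply_type3(conn, "ABC922", "pasted box")

def quantity_by(column):
    """Total quantity grouped by either the current or the previous packaging."""
    assert column in ("packaging", "previous_packaging")
    return conn.execute(
        f"SELECT p.{column}, SUM(f.quantity) FROM sales_fact f "
        f"JOIN product_dim p ON p.product_key = f.product_key "
        f"GROUP BY p.{column}").fetchall()

by_current = quantity_by("packaging")
by_previous = quantity_by("previous_packaging")
print(by_current, by_previous)
```

    All facts can be reported under either label, but unlike Type 2 there is no way to tell which sales happened before the change.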

    2.5 Rapidly Changing Dimension

    Normally, we will not use any of the techniques mentioned previously for handling changing dimensions if the dimension already contains millions of rows. Unfortunately, huge dimensions are also more likely to change than moderately sized ones. We sometimes call this situation a rapidly changing monster dimension [?].

    The solution is to break off the frequently analyzed or frequently changing attributes into a separate dimension, referred to as a minidimension [?]. There is one row in the minidimension for every unique combination of these attribute values encountered in the data (not one row per customer). We leave the more constant or less frequently queried attributes behind in the original huge customer table. When creating the minidimension, continuously varying attributes should be converted to banded ranges. In other words, we force the attributes in the minidimension to take on a relatively small number of discrete values [?]. Although this restricts us to the predefined bands, it drastically reduces the number of combinations in the minidimension.

    Every time we build a fact table row, we include two foreign keys related to the dimension: the regular dimension key and the minidimension key. The minidimension key should be part of the fact table's set of foreign keys to provide efficient access to the fact table.

    This design delivers browsing and constraining performance benefits by providing a smaller point of entry to the fact table, and we can avoid joins to the huge dimension table when the (static) attributes from that table are not constrained. When the minidimension key participates as a foreign key in the fact table, another benefit is that the fact table captures the minidimension's attribute changes. When an attribute of a dimension member changes, newly loaded fact rows simply carry the new minidimension key, while earlier rows still carry the old one. Thus we keep track of history as well [?].
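    The banding and key assignment can be sketched in plain Python. The band boundaries, the MiniDimension class and the sample customers are assumptions for illustration, not values from the source.

```python
def age_band(age):
    """Convert a continuously varying age into one of a few predefined bands."""
    if age <= 20:
        return "20 or under"
    if age <= 40:
        return "21-40"
    if age <= 60:
        return "41-60"
    return "over 60"

def income_band(income):
    if income < 30000:
        return "under 30k"
    if income < 60000:
        return "30k-60k"
    return "60k and over"

class MiniDimension:
    """One row per distinct (age band, gender, income band) combination."""
    def __init__(self):
        self.rows = {}   # profile tuple -> minidimension surrogate key

    def key_for(self, age, gender, income):
        profile = (age_band(age), gender, income_band(income))
        if profile not in self.rows:
            self.rows[profile] = len(self.rows) + 1
        return self.rows[profile]

mini = MiniDimension()
fact_rows = []   # (customer_key, minidimension_key, amount)

def load_fact(customer_key, age, gender, income, amount):
    """Each fact row carries both the customer key and the minidimension key."""
    fact_rows.append((customer_key, mini.key_for(age, gender, income), amount))

load_fact(1, 34, "F", 52000, 100.0)
load_fact(2, 38, "F", 58000, 40.0)   # same bands as customer 1: same key
load_fact(1, 41, "F", 52000, 70.0)   # customer 1 has moved to the next age band

print(fact_rows, len(mini.rows))
```

    Customer 1's move between age bands leaves the customer table untouched; the change shows up only as a different minidimension key on later fact rows.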

    Figure 2.1: Example of Handling Rapidly Changing large Dimension [?]
    [Diagram: the original Customer Dimension (Customer Key, Customer ID, Customer Name, ..., Age, Gender, Annual Income) becomes a Customer Dimension (Customer Key, Customer ID, Customer Name, ...) and a Customer Minidimension (Customer Minidimension Key, Customer Age Band, Customer Gender, Customer Income Band); the Fact Table carries Customer Key, Customer Minidimension Key, more foreign keys, and the facts.]

    2.6 Handling Hierarchies

    In many dimensions, a hierarchy is inherent. We consider two approaches to handling hierarchies. The first is straightforward and handles the hierarchy adequately with a simple design. The second is much more advanced and complicated, but also much more extensible.

    2.6.1 Fixed Depth Hierarchy

    This case is rare: we are confronted with a dimension whose hierarchy is highly predictable, with a fixed number of levels (say N). In this case, we can keep N attributes in the dimension, corresponding to these N levels [?]. If some records of the dimension table do not have a hierarchy up to the maximum number of levels, we duplicate the lower-level attributes into the higher-level attributes. In this way, we can report at any level of the hierarchy for every record of that dimension.

    2.6.2 Variable Depth Hierarchy

    Representing an arbitrary variable depth hierarchy is an inherently difficult task in a relational environment. A simple computer science approach to storing such information would add a Parent Key field to the Customer dimension. The Parent Key field would be a recursive pointer containing the key value of the parent of any given customer. A special null value would be required for the topmost customer in any given overall enterprise [?].

    The problem with this recursive pointer approach is that it cannot be used effectively with standard SQL. The standard SQL GROUP BY clause cannot follow the recursive pointers downward to aggregate an additive fact in the fact table [?]. Instead of using a recursive pointer, we can solve this modeling problem by inserting a bridge table (helper table) between the dimension table and the fact table. The bridge table contains one record for each separate path from each node in the organization tree to itself and to every node below it. Each path record contains the key of the parent roll-up entity, the key of the subsidiary entity, the number of levels between the parent and the subsidiary, a bottom-most flag indicating that there are no further nodes beneath the subsidiary, and a top-most flag indicating that there are no further nodes above the parent [?].

    To descend the hierarchy, we join the dimension table to the bridge table by connecting the dimension's primary key with the bridge table's parent dimension key. We can then constrain the dimension to any particular member and request an aggregate measure of all members at or below it. The number-of-levels attribute can be used to control the depth of the analysis. Similarly, to ascend the hierarchy, we reverse the join by connecting the dimension key with the bridge table's subsidiary dimension key [?].

    When a group of nodes is moved from one part of an organization to another, only the bridge table rows that refer to paths from outside parents into the moved structure need to be altered. Rows referring to paths within the moved structure are not affected. We need to add rows if the moved structure has a new parent.

    When issuing SQL statements that use the bridge table, we need to be cautious about overcounting the facts. When connecting the tables, we must constrain the customer dimension to a single value and then join to the bridge table [?].
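    The bridge table construction and a descending query can be sketched as follows. The four-node organization tree, the fact amounts and the function names are assumptions for illustration; the bridge columns follow the description above.

```python
import sqlite3

# Hypothetical organization tree given as child -> parent (None marks the top).
parent_of = {1: None, 2: 1, 3: 1, 4: 2}

def build_bridge(parent_of):
    """One row per path from every node to itself and to each node below it."""
    has_children = {p for p in parent_of.values() if p is not None}
    rows = []
    for node in parent_of:
        ancestor, levels = node, 0
        while ancestor is not None:
            rows.append((ancestor,                          # parent key
                         node,                              # subsidiary key
                         levels,                            # levels between them
                         int(node not in has_children),     # bottom-most flag
                         int(parent_of[ancestor] is None))) # top-most flag
            ancestor, levels = parent_of[ancestor], levels + 1
    return rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hierarchy_bridge (
        parent_key INTEGER, subsidiary_key INTEGER, levels INTEGER,
        bottom_flag INTEGER, top_flag INTEGER);
    CREATE TABLE sales_fact (customer_key INTEGER, amount REAL);
    INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0), (3, 25.0), (4, 10.0);
""")
bridge_rows = build_bridge(parent_of)
conn.executemany("INSERT INTO hierarchy_bridge VALUES (?, ?, ?, ?, ?)", bridge_rows)

def total_at_or_below(customer_key, max_levels=None):
    """Descend the hierarchy: constrain ONE parent, then sum its subsidiaries' facts."""
    if max_levels is None:
        max_levels = len(parent_of)   # deeper than any possible path
    return conn.execute("""
        SELECT SUM(f.amount)
        FROM hierarchy_bridge b
        JOIN sales_fact f ON f.customer_key = b.subsidiary_key
        WHERE b.parent_key = ? AND b.levels <= ?
    """, (customer_key, max_levels)).fetchone()[0]

print(len(bridge_rows), total_at_or_below(1), total_at_or_below(2),
      total_at_or_below(1, max_levels=1))
```

    Constraining parent_key to a single value is what prevents overcounting: without it, each fact row would be counted once for every ancestor it has.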

    Figure 2.2: Handling hierarchy through bridge table [?]
    [Diagram: Customer Dimension (Customer Key, Customer ID, Customer Name) joined to the Hierarchy Bridge (Parent Customer Key, Subsidiary Customer Key, Level Name, Bottom Flag, Top Flag), which is joined to the Fact table (Customer Key, Date Key).]

    2.7 Multivalued Dimension

    There are situations where we need to attach a multivalued dimension to the fact table, for example when many customers are associated with one account, or when multiple diagnoses are associated with a single patient. Database designers usually take one of the following approaches to handling multivalued dimensions [?]:

    1. Choose one value and omit the other values.

    2. Extend the dimension list to have a fixed number of multivalued dimension slots.

    3. Put a bridge (helper) table between the fact table and the multivalued dimension table.

    Frequently, designers choose a single value (the first approach). The modeling problem then goes away, but we will remain in doubt about whether the multivalued dimension data is useful.

    The second approach, creating a fixed number of additional multivalued dimension slots in the fact table key, is also not a good idea, because there can be situations where the number of values exceeds the number of slots we have allocated. Also, the multiple separate dimension slots cannot easily be queried together [?].

    A bridge table placed between the multivalued dimension and the fact table is the best solution. The multivalued dimension key in the fact table is replaced by a multivalued dimension group key. The helper table in the middle is the multivalued dimension group table. It has one record for each member of a group of multivalued dimension values [?].

    The group table is joined to the original multivalued dimension on the multivalued dimension key. The group table contains a very important numeric attribute: the weighting factor. The weighting factor allows reports to be created that don't double count the additive facts in the fact table (the Billed Amount in the patient example).

    We can assign the weighting factors equally within a group. If there are three values in a group, each gets a weighting factor of 1/3. If we have some other rational basis for assigning the weighting factors differently, we can change them, as long as all the factors in a group (a Diagnosis Group in the patient example) always add up to one.
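    The effect of the weighting factor can be sketched with sqlite3. The table names, the diagnosis codes D1 to D3 and the billed amounts are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE diagnosis_group_bridge (
        group_key INTEGER, diagnosis_key TEXT, weight REAL);
    CREATE TABLE billing_fact (group_key INTEGER, billed_amount REAL);
    INSERT INTO billing_fact VALUES (1, 300.0), (2, 50.0);
""")
# Group 1 has three diagnoses weighted equally; group 2 has a single diagnosis.
conn.executemany(
    "INSERT INTO diagnosis_group_bridge VALUES (?, ?, ?)",
    [(1, "D1", 1 / 3), (1, "D2", 1 / 3), (1, "D3", 1 / 3), (2, "D1", 1.0)])

# Weighted report: each billed amount is split across its group's diagnoses.
weighted_by_diagnosis = dict(conn.execute("""
    SELECT b.diagnosis_key, SUM(f.billed_amount * b.weight)
    FROM billing_fact f
    JOIN diagnosis_group_bridge b ON b.group_key = f.group_key
    GROUP BY b.diagnosis_key
""").fetchall())

# Unweighted join: the 300.0 bill is counted once per diagnosis in its group.
unweighted_total = conn.execute("""
    SELECT SUM(f.billed_amount)
    FROM billing_fact f
    JOIN diagnosis_group_bridge b ON b.group_key = f.group_key
""").fetchone()[0]

true_total = conn.execute("SELECT SUM(billed_amount) FROM billing_fact").fetchone()[0]
print(weighted_by_diagnosis, unweighted_total, true_total)
```

    Because the weights in each group add up to one, the weighted amounts sum back to the true total, up to floating-point rounding, while the unweighted join overstates it.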

    Figure 2.3: Handling Multivalued Diagnosis Dimension through bridge table [?]
    [Diagram: Patient Fact Table (Diagnosis Group Key, other attributes) joined to the Diagnosis Group Helper Table (Diagnosis Group Key, Diagnosis Key, Weight), which is joined to the Diagnosis Dimension Table (Diagnosis Key, other attributes).]

    2.8 Heterogeneous Dimension

    Many times in the real world, a business provides heterogeneous services or products. For example, a retail bank offers a variety of products, such as mortgages and checking accounts, to the same customer. These products have attributes and facts that are specific to them, as well as general attributes and facts that are common among them. In this case, business users typically require two different perspectives that are difficult to present in a single fact table. The first perspective is the global view, including the ability to slice and dice all general facts simultaneously, regardless of product type. The second perspective is the specific line-of-business view, which focuses on the in-depth details of one business, such as checking or mortgages [?].

    There is a long list of attributes and facts specific to any particular line of business. We cannot add these special facts to one fact table; if we did so for each line of business, we would end up with several hundred facts, most of which would be null in any given row. Likewise, if we attempted to include the line-of-business-specific attributes in one dimension table, it would have hundreds of attributes, almost all of which would be empty for any given row.

    The solution to this dilemma is to create a custom schema for, say, the checking line of business that is limited to checking accounts only. Both the custom checking fact table and the corresponding product dimension are widened to describe all the specific facts and attributes that make sense only for checking products [?].

    These custom tables also contain the core attributes and facts, so that we can avoid joining the core and custom schemas in order to get the complete set of facts and attributes. The keys of the custom product dimension are the same as those used in the core product dimension, which contains all possible product keys [?]. As conformed dimensions are essential, each custom product dimension is a subset of the rows of the core product dimension table.

    A family of core and custom fact tables is needed when a business has heterogeneous products that have naturally different facts and descriptors, but a single customer base that demands an integrated view.

    We can consider handling the line-of-business-specific attributes as a context-dependent outrigger to the core dimension. We can isolate the core attributes in the base product dimension table, and include a snowflake key in each base record that points to its proper custom dimension outrigger [?].

    If the lines of business are kept separate, so that the custom and core tables cannot reside in the same space, the data in the core fact table needs to be duplicated exactly once to implement all the custom tables. Otherwise, we can avoid duplicating both the core fact keys and the core facts in the custom line-of-business fact tables [?].

    Figure 2.4: Handling Heterogeneous Dimensions [?]
    [Diagram labels: General Dimension 1 (Dimension 1 Key, Core Attributes ...); General Dimension 2 (Dimension 2 Key, Core Attributes ...); Fact Table (Dimension 1 Key, Dimension 2 Key, Date Key, More Foreign Keys, Core Facts ..., Custom Facts ...); Specific line of Business Dimension 1 (Specific Dimension 1 Key, Custom Attributes ...); Specific line of Business Dimension 2 (Specific Dimension 2 Key, Custom Attributes ...).]

    2.9 Dimension Role Playing

    A role in a data warehouse is a situation in which a single dimension appears several times

    in the same fact table.In certain kinds of fact tables, Date can appear repeatedly. For

    example, a typical Fact table can include Order Date, Packaging Date, Shipping Date,Delivery Date, Payment Date, Return Date, Refer to Collection Date, and other facts [ ?].

    We cannot join these seven foreign keys to the same table. SQL would interpret such

    a seven-way simultaneous join as requiring that all of the dates be the same. Instead of a

    seven-way join, we have to create an illusion of seven independent Date dimension tables.

    We even need to go to the length of labeling all of the columns in each of the tables

    uniquely. If we do not label the columns uniquely, we will not be able to tell the

    columns apart when several of them have been dragged into a report [?].

    For the user, we can create the illusion of seven independent time tables in a couple

    of ways. We can either make seven identical physical copies of the time table, or we

    can create seven virtual copies of the time table with the SQL SYNONYM command.

    Regardless of the approach, once we have made these seven clones, we still have to define

    a SQL view on each copy in order to make the field names uniquely different [?].

    Now that we have seven differently described Time dimensions, they can be used as

    if they were independent. They can have completely unrelated constraints, and they can

    play different roles in a report.
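
    The view-based approach can be sketched as follows. This is only an illustration, written

    in Python with the standard sqlite3 module rather than Oracle. The tables date_dim and

    orders_fact, their columns, and the sample rows are hypothetical, and only two of the

    seven date roles are shown. SQLite has no SYNONYM command, so each role is simply a

    view over the single physical date table, with uniquely named columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one physical date dimension, one fact table
    -- that references it twice (order date and ship date).
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE orders_fact (
        order_id       INTEGER PRIMARY KEY,
        order_date_key INTEGER REFERENCES date_dim(date_key),
        ship_date_key  INTEGER REFERENCES date_dim(date_key),
        amount         REAL
    );
    INSERT INTO date_dim VALUES
        (1, '2004-09-01'), (2, '2004-09-03'), (3, '2004-09-07');
    INSERT INTO orders_fact VALUES
        (100, 1, 2, 250.0), (101, 1, 3, 80.0), (102, 2, 3, 40.0);

    -- One view per role; every column gets a role-specific name.
    CREATE VIEW order_date_dim AS
        SELECT date_key AS order_date_key, full_date AS order_date
        FROM date_dim;
    CREATE VIEW ship_date_dim AS
        SELECT date_key AS ship_date_key, full_date AS ship_date
        FROM date_dim;
""")

# Each role is joined and constrained independently of the other.
rows = conn.execute("""
    SELECT f.order_id, o.order_date, s.ship_date
    FROM orders_fact f
    JOIN order_date_dim o ON f.order_date_key = o.order_date_key
    JOIN ship_date_dim  s ON f.ship_date_key  = s.ship_date_key
    WHERE o.order_date = '2004-09-01' AND s.ship_date = '2004-09-07'
    ORDER BY f.order_id
""").fetchall()

for row in rows:
    print(row)
```

    Because the two views expose different column names, a report can constrain the order

    date and the ship date to different values without the two roles interfering.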

    2.10 Conformed Dimensions

    A conformed dimension is a dimension that means the same thing with every possible

    fact table to which it can be joined. Generally this means that a conformed dimension is

    identical in each data mart. A major responsibility of the central data warehouse design

    team is to establish, publish, maintain, and enforce the conformed dimensions.

    Conformed dimensions are enormously important to the data warehouse. Without

    a strict adherence to conformed dimensions, the data warehouse cannot function as an

    integrated whole. Conformed dimensions make possible a single dimension table to be

    used against multiple fact tables in the same database space, consistent user interfaces and

    data content whenever the dimension is used, and a consistent interpretation of attributes

    and, therefore, roll ups across data marts [?].

    It is possible to create a subset of a conformed dimension table for certain data marts

    if you know that the domain of the associated fact table only contains that subset. For

    example, the master Product table can be restricted to just those products manufactured

    at a particular location if the data mart in question pertains only to that location. We

    could call this a simple data subset, because the reduced dimension table preserves all

    the attributes of the original dimension and exists at the original granularity [?].

    Chapter 3

    Querying on Data Warehouse

    3.1 Oracle 9i SQL Extensions for Aggregation Queries

    in a Data Warehouse

    In this section, all example queries are run against the Sales History schema shown in

    figure 1.1. All examples and theory are taken from [?].

    Aggregation is a fundamental part of data warehousing. To improve aggregation

    performance in your warehouse, Oracle provides the following extensions to the GROUP

    BY clause to make query reporting faster and easier:

    ROLLUP Extension ROLLUP calculates aggregate functions such as SUM, COUNT,

    MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed

    up to a grand total. It is very helpful for subtotaling along a hierarchical dimension

    such as time or geography. It creates subtotals that roll up from the most detailed

    level to a grand total, following a grouping list specified in the ROLLUP clause.

    CUBE Extension CUBE is an extension similar to ROLLUP, enabling a single statement

    to calculate all possible combinations of aggregations. CUBE can generate the

    information needed in cross-tabulation reports with a single query. CUBE is typi-

    cally most suitable in queries that use columns from multiple dimensions rather than

    columns representing different levels of a single dimension. CUBE takes a specified

    set of grouping columns and creates subtotals for all of their possible combinations.

    In terms of multidimensional analysis, CUBE generates all the subtotals that could

    be calculated for a data cube with the specified dimensions. Multiple SELECT

    statements combined with UNION ALL statements could provide the same information

    gathered through CUBE or ROLLUP. However, this might require many

    SELECT statements. The more columns used in a CUBE or ROLLUP clause, the

    greater the savings compared to the UNION ALL approach.

    GROUPING Functions The GROUPING functions help you identify the group each

    row belongs to and enable sorting subtotal rows and filtering results. First, GROUPING

    helps in differentiating NULL values created by CUBE or ROLLUP from stored

    NULL values. Second, it helps in finding out programmatically the level of

    aggregation for a given subtotal. GROUPING returns 1 when it encounters a NULL

    value created by a ROLLUP or CUBE operation. That is, if the NULL indicates

    the row is a subtotal, GROUPING returns a 1. Any other type of value, including

    a stored NULL, returns a 0.

    GROUPING_ID Function GROUPING_ID returns a single number that enables you to

    determine the exact GROUP BY level. For each row, GROUPING_ID takes the

    set of 1s and 0s that would be generated if you used the appropriate GROUPING

    functions and concatenates them, forming a bit vector. The bit vector is treated as a

    binary number, and the number's base-10 value is returned by the GROUPING_ID

    function.

    GROUPING SETS Expression Computing a full cube creates a heavy processing

    load, so replacing cubes with grouping sets can significantly increase performance. You

    can selectively specify the set of groups that you want to create using a GROUPING

    SETS expression within a GROUP BY clause. This allows precise specification

    across multiple dimensions without computing the whole CUBE. CUBE and

    ROLLUP can be thought of as grouping sets with very specific semantics.
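
    The bit-vector rule behind GROUPING and GROUPING_ID can be sketched in a few lines of

    Python. This is not Oracle code; the helper names grouping_bits and grouping_id are

    hypothetical, and the column names are taken from the Sales History examples. A column

    contributes 1 when it has been aggregated away in the current row (its NULL marks a

    subtotal) and 0 when the row is still grouped by it. The leftmost column is the most

    significant bit.

```python
def grouping_bits(group_by_cols, grouped_cols):
    """Return one GROUPING-style flag per column, in GROUP BY order.

    0 means the row is still grouped by the column,
    1 means the column was aggregated away (subtotal NULL).
    """
    return [0 if col in grouped_cols else 1 for col in group_by_cols]


def grouping_id(group_by_cols, grouped_cols):
    """Read the flags as a binary number and return its base-10 value."""
    value = 0
    for bit in grouping_bits(group_by_cols, grouped_cols):
        value = value * 2 + bit
    return value


cols = ["channel_desc", "country_id"]

# The four aggregation levels produced by CUBE(channel_desc, country_id).
levels = [
    {"channel_desc", "country_id"},  # detail rows
    {"channel_desc"},                # subtotal per channel
    {"country_id"},                  # subtotal per country
    set(),                           # grand total
]
ids = [grouping_id(cols, level) for level in levels]
print(ids)
```

    Filtering or sorting on this single number is what lets a report pick out, for example,

    only the grand-total row without testing each column separately.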

    3.1.1 Syntax

    Extension        Syntax

    ROLLUP           SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)

    Partial ROLLUP   GROUP BY expr1, ROLLUP(expr2, expr3)

    CUBE             SELECT ... GROUP BY CUBE(grouping_column_reference_list)

    Partial CUBE     GROUP BY expr1, CUBE(expr2, expr3)

    GROUPING         SELECT ... [GROUPING(dimension_column) ...] ...

                     GROUP BY ... {CUBE | ROLLUP}(dimension_column)

    GROUPING SETS    GROUP BY [GROUPING SETS(dimension_column) ...]

    3.1.2 Applications in building Cross-Tabular Report

    These extensions are used to generate cross-tabular reports easily and efficiently.

    For example, consider the cross-tabular report in figure 3.1, which shows the total

    sales by country_id and channel_desc for the US and UK through the Internet and Direct

    Sales channels in September 2004.

    [Cross-tabular report: rows are Channel (Direct Sales, Internet, Total); columns are

    Country (UK, US, Total). Column totals: UK 175000, US 245000; grand total 420000.]

    Figure 3.1: Cross Tabular Report

    For this report we need to calculate 4 subtotals and one grand total. Half of the values

    needed for this report would not be calculated with a query that requested

    SUM(amount_sold) and did a GROUP BY (channel_desc, country_id). To get the higher-level

    aggregates we would require additional queries. But we can generate all these subtotals

    and the grand total with a single query by using the CUBE extension in its GROUP BY clause.

    The query is:

    SELECT channel_desc, country_id, SUM(amount_sold)

    FROM sales, customers, times, channels

    WHERE sales.time_id = times.time_id AND

    sales.cust_id = customers.cust_id AND

    sales.channel_id = channels.channel_id AND

    channels.channel_desc IN ('Direct Sales', 'Internet') AND

    times.calendar_month_desc = '2004-09' AND

    country_id IN ('UK', 'US')

    GROUP BY CUBE(channel_desc, country_id);

    The result of this query appears in the table shown below.

    We can also generate the same results using the GROUPING SETS extension. With a

    GROUPING SETS expression, we have to explicitly specify the levels of aggregation we

    wish to perform. The query is:

    SELECT channel_desc, country_id, SUM(amount_sold)

    FROM sales, customers, times, channels

    WHERE sales.time_id = times.time_id AND

    sales.cust_id = customers.cust_id AND

    sales.channel_id = channels.channel_id AND

    channels.channel_desc IN ('Direct Sales', 'Internet') AND

    times.calendar_month_desc = '2004-09' AND

    country_id IN ('UK', 'US')

    GROUP BY GROUPING SETS((channel_desc, country_id), (channel_desc), (country_id), ());
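
    The earlier remark that several SELECT statements joined by UNION ALL can provide the

    same information can be checked on a small scale. The sketch below uses Python with the

    standard sqlite3 module, because SQLite offers neither CUBE nor GROUPING SETS and so

    forces the long-hand form. The single table sales_flat and its five rows are hypothetical

    stand-ins for the joined Sales History tables; the amounts are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical pre-joined sales rows (not the real Sales History data).
    CREATE TABLE sales_flat (
        channel_desc TEXT, country_id TEXT, amount_sold REAL
    );
    INSERT INTO sales_flat VALUES
        ('Direct Sales', 'UK', 10.0),
        ('Direct Sales', 'US', 20.0),
        ('Internet',     'UK', 30.0),
        ('Internet',     'US', 40.0),
        ('Internet',     'US',  5.0);
""")

# One SELECT per grouping set: (channel, country), (channel), (country), ().
cube_rows = conn.execute("""
    SELECT channel_desc, country_id, SUM(amount_sold) FROM sales_flat
        GROUP BY channel_desc, country_id
    UNION ALL
    SELECT channel_desc, NULL, SUM(amount_sold) FROM sales_flat
        GROUP BY channel_desc
    UNION ALL
    SELECT NULL, country_id, SUM(amount_sold) FROM sales_flat
        GROUP BY country_id
    UNION ALL
    SELECT NULL, NULL, SUM(amount_sold) FROM sales_flat
""").fetchall()

# Index the result by (channel, country); None plays the role of the
# subtotal NULL that CUBE would produce.
totals = {(channel, country): total for channel, country, total in cube_rows}
for key, total in totals.items():
    print(key, total)
```

    Two grouping columns already need four SELECT statements; each additional column doubles

    that number, which is the saving the CUBE extension provides.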

    Both CUBE and ROLLUP can be thought of as GROUPING SETS with very specific

    semantics.

    CUBE(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c),

    (a), (b), (c), ())

    ROLLUP(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a), ())
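
    These expansions follow a simple pattern: ROLLUP yields every prefix of its column list,

    and CUBE yields every subset. A short Python sketch makes the pattern explicit; the

    function names rollup_sets and cube_sets are hypothetical helpers, not part of any

    database API.

```python
from itertools import combinations


def rollup_sets(cols):
    """Grouping sets of ROLLUP(cols): every prefix, longest first."""
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]


def cube_sets(cols):
    """Grouping sets of CUBE(cols): every subset, largest first."""
    return [combo
            for n in range(len(cols), -1, -1)
            for combo in combinations(cols, n)]


print(rollup_sets(["a", "b", "c"]))
print(cube_sets(["a", "b", "c"]))
```

    For n columns, ROLLUP therefore produces n + 1 grouping sets while CUBE produces 2^n,

    which is why a full cube becomes expensive as columns are added.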

    3.2 Writing MDX queries for Data Warehouse

    MDX stands for Multidimensional Expressions. It is a syntax that supports the definition

    and manipulation of multidimensional objects and data. MDX is similar in many ways

    to the Structured Query Language (SQL) syntax, but is not an extension of the SQL

    language; in fact, some of the functionality that is supplied by MDX can be supplied,

    although not as efficiently or intuitively, by SQL.

    As with an SQL query, an MDX query is built from a SELECT clause, a FROM clause,

    and an optional WHERE clause. These and other keywords provide the tools used to

    extract specific portions of data from a cube (multidimensional structure) for analysis.

    MDX also supplies a robust set of functions for the manipulation of retrieved data, as

    well as the ability to extend MDX with user-defined functions.

    Figure 3.2: Multidimensional Structure : Cube [?]

    3.2.1 Common Terms

    Cube Cube is a multidimensional structure that contains dimensions and measures. Di-

    mensions define the structure of the cube, while measures provide the numerical

    values of interest to the end user. Cell positions in the cube are defined by the in-

    tersection of dimension members, and the measure values are aggregated to provide

    the values in the cells [?].

    Member A member is the lowest level of reference when describing cell data in a Cube. A

    member is an item in a dimension representing one or more occurrences of data.

    Members are combined to form Tuples and Tuples are combined to form Sets. These

    Sets are used in the SELECT clause of MDX for retrieving data from a Cube [?].

    Tuples A Tuple is used to define a slice of data from a Cube; it is composed of an ordered

    collection of one Member from one or more dimensions. A Tuple is used to identify

    specific sections of multidimensional data from a cube; a tuple composed of one

    member from each dimension in a cube completely describes a cell value [?].

    Sets A Set is an ordered collection of zero, one or more Tuples. A Set is most commonly

    used to define Axis and Slicer dimensions in an MDX query, and as such may have

    only a single Tuple or may be, in certain cases, empty. In MDX syntax, tuples are

    enclosed in braces to construct a set [?].

    Axis and Slicer Dimensions A SELECT statement is used to select the Dimensions

    and Members to be returned, referred to as Axis dimensions. The WHERE clause

    is used to restrict the returned data to specific Dimension and Member criteria,

    referred to as a slicer dimension. An axis dimension is expected to return data for

    multiple members, while a slicer dimension is expected to return data for a single

    member [?].

    3.2.2 Rules

    Rules for specifying Members

    1. By specifying the actual name or the alias, for example [Packages].

    If the member name starts with a number or contains spaces, it should be enclosed in

    square brackets.

    2. By specifying the dimension name or any one of the ancestor member names as a prefix

    to the member name, for example [Measures].[Packages]. (The Measures dimension is

    associated with all the facts.)

    3. By specifying the name of a calculated member defined in the WITH section [?].

    Rules for specifying Tuples

    1. A tuple consists of one or more members.

    2. If a tuple is composed of members from more than one dimension, the members

    represented by the tuple must be enclosed in parentheses, for example (Time.[2nd

    half], Route.nonground.air).

    3. If a tuple consists of only one member, we can omit the parentheses [?].

    Rules for specifying Sets

    1. A set consists of one or more tuples enclosed in braces, except in some cases where

    the set is represented by an MDX function which returns a set [?]. For example,

    { (Time.[1st half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) }

    [?].

    2. A set can contain more than one occurrence of the same tuple, for example

    {Time.[2nd half], Time.[2nd half] }.

    3. When a set has more than one tuple, the members in each tuple of the set must

    represent the same dimensions as do the members of other tuples of the set. Addi-

    tionally, the dimensions must be represented in the same order. In other words, each

    tuple of the set must have the same dimensionality [?]. For example { (Time.[1st

    half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) } [?].

    4. A set can also be a collection of sets, and it can also be empty (containing no tuples)

    [?].
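
    The dimensionality rule for sets can be expressed as a small check. The Python sketch

    below is only a model of the rule, not an MDX parser: a tuple is represented as a

    sequence of (dimension, member) pairs, and the helper names dimensionality and

    is_valid_set are hypothetical.

```python
def dimensionality(tup):
    """Ordered dimensions that a tuple draws its members from."""
    return tuple(dimension for dimension, _member in tup)


def is_valid_set(tuples):
    """True if all tuples share the same dimensions in the same order.

    An empty set and a set with repeated tuples both pass, as the
    rules above allow.
    """
    shapes = {dimensionality(t) for t in tuples}
    return len(shapes) <= 1


first_half_air = (("Time", "1st half"), ("Route", "air"))
second_half_sea = (("Time", "2nd half"), ("Route", "sea"))
sea_second_half = (("Route", "sea"), ("Time", "2nd half"))  # order swapped

print(is_valid_set([first_half_air, second_half_sea]))
print(is_valid_set([first_half_air, sea_second_half]))
```

    The second call fails the check because its tuples name the same dimensions in a

    different order, which the third rule forbids.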

    3.2.3 MDX Query Structure

    A basic Multidimensional Expressions (MDX) query is structured in a fashion similar to

    the following example [?] :

    SELECT [axis specification] [, [axis specification] ...]

    FROM [cube specification]

    WHERE [slicer specification]

    In MDX, the SELECT statement is used to specify a dataset containing a subset of

    multidimensional data. To specify a dataset, an MDX query must contain information

    about:

    1. The number of axes. You can specify up to 128 axes in an MDX query.

    2. The members from each dimension to include on each axis of the MDX query.

    3. The name of the cube that sets the context of the MDX query.

    4. The members from a slicer dimension on which data is sliced for members from the

    axis dimensions [?].

    3.2.4 Specifying Axis Dimensions

    Axis dimensions determine the layout of query results from a database. Multidimensional

    Expressions (MDX) uses the SELECT clause to specify axis dimensions by assigning a set

    to a particular axis. In the SELECT syntax, each [axis specification] value

    defines one axis dimension. The number of axes in the dataset is equal to the number of

    [axis specification] values in the MDX query. An MDX

    query can support up to 128 specified axes, but very few MDX queries will use more than

    5 axes [?].

    The breakdown of the syntax is:

    [axis specification] ::= [set] ON [axis name]

    [axis name] ::= COLUMNS | ROWS | PAGES | SECTIONS | CHAPTERS |

    AXIS([index])

    Each axis dimension is associated with a number: 0 for the x-axis, 1 for the y-axis, 2

    for the z-axis, and so on. The [index] value is the axis number. For the first 5 axes, the

    aliases COLUMNS, ROWS, PAGES, SECTIONS, and CHAPTERS can be used in place

    of AXIS(0), AXIS(1), AXIS(2), AXIS(3), and AXIS(4), respectively [?].

    An MDX query cannot skip axes. That is, a query that includes one or more [axis name]

    values must not exclude lower-numbered or intermediate axes. For example, a query

    cannot have a ROWS axis without a COLUMNS axis, or have COLUMNS and PAGES axes

    without a ROWS axis [?].
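
    The alias mapping and the no-skipping rule can be modelled in a few lines. The Python

    sketch below is an illustration of the rule only; axis_index and axes_are_contiguous are

    hypothetical helper names, and the input is assumed to be a list of axis names already

    extracted from a query.

```python
AXIS_ALIASES = ["COLUMNS", "ROWS", "PAGES", "SECTIONS", "CHAPTERS"]


def axis_index(name):
    """Map an alias or AXIS(n) to its axis number."""
    name = name.strip().upper()
    if name in AXIS_ALIASES:
        return AXIS_ALIASES.index(name)
    if name.startswith("AXIS(") and name.endswith(")"):
        return int(name[5:-1])
    raise ValueError(f"unknown axis name: {name}")


def axes_are_contiguous(axis_names):
    """True if the axes used are exactly 0, 1, ..., n-1 with none skipped."""
    indexes = sorted(axis_index(n) for n in axis_names)
    return indexes == list(range(len(indexes)))


print(axes_are_contiguous(["COLUMNS", "ROWS"]))   # both lower axes present
print(axes_are_contiguous(["ROWS"]))              # COLUMNS is missing
print(axes_are_contiguous(["COLUMNS", "PAGES"]))  # ROWS is skipped
```

    The same check also rejects a query that names one axis twice, for example COLUMNS

    together with AXIS(0).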

    3.2.5 Establishing Cube Context

    To establish cube context, indicate the cube on which you want the Multidimensional

    Expressions (MDX) query to run. The FROM clause in an MDX query determines the

    cube context. The following syntax indicates which cube supplies the context for the

    MDX query [?] :

    FROM cube specification

    The cube specification is completed with the name of a single cube.

    For example, if an MDX query is to be run against the SalesCube cube, the FROM

    clause would be:

    FROM SalesCube

    3.2.6 Specifying Slicer Dimensions

    Slicer dimensions are used optionally in the WHERE clause of the query, to limit a query

    to apply only to a specific area of the database. Dimensions that are not explicitly

    assigned to an axis are assumed to be slicer dimensions and filter with their default

    members. The default member is usually the All member if an (All) level exists, or else an

    arbitrary member of the highest level.

    The breakdown of the WHERE clause syntax is [?]:

    WHERE [(slicer specification)]

    A slicer dimension can accept only expressions that evaluate into a single tuple. This

    does not mean that only a single tuple can be explicitly stated in the slicer dimension.

    For example: WHERE ( [Time].[1st half], [Route].[nonground] )

    If the slicer specification cannot be resolved into a single tuple, an error will occur [?].

    3.2.7 Example Queries

    For example, for the cube shown in the figure, to calculate the total Unit Sales and

    total Store Sales for all USA CA stores in the years 1997 and 1998 for a sales schema,

    we would issue the following query:

    SELECT

    { [Measures].[Unit Sales], [Measures].[Store Sales] }

    ON COLUMNS,

    { [Time].[1997], [Time].[1998] }

    ON ROWS

    FROM Sales

    WHERE( [Store].[USA].[CA] )

    This query will return the result as shown in the following table :

    Unit Sales Store Sales

    1997 75000 100000

    1998 140000 200000

    We can also rewrite the above query as

    SELECT

    { [Measures].[Unit Sales], [Measures].[Store Sales] }

    ON AXIS(0),

    { [Time].[1997], [Time].[1998] }

    ON AXIS(1)

    FROM Sales

    WHERE( [Store].[USA].[CA] )
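
    How the two axes and the slicer together pick out cells can be imitated with a toy cube

    in Python. This is a simplified model, not an MDX engine: the cube is a dictionary keyed

    by (measure, year, store), [Store].[USA].[CA] is treated as a single leaf member rather

    than an aggregate over its stores, and the evaluate helper is hypothetical. The USA.CA

    values are those of the result table above; the USA.WA values are invented filler so that

    the slicer has something to exclude.

```python
# Toy cube: one cell per (measure, year, store) combination.
cells = {
    ("Unit Sales", "1997", "USA.CA"): 75000,
    ("Store Sales", "1997", "USA.CA"): 100000,
    ("Unit Sales", "1998", "USA.CA"): 140000,
    ("Store Sales", "1998", "USA.CA"): 200000,
    # Hypothetical second store, excluded by the slicer below.
    ("Unit Sales", "1997", "USA.WA"): 1000,
    ("Store Sales", "1997", "USA.WA"): 2000,
    ("Unit Sales", "1998", "USA.WA"): 3000,
    ("Store Sales", "1998", "USA.WA"): 4000,
}


def evaluate(cells, columns, rows, slicer):
    """Return one row per ROWS member and one value per COLUMNS member.

    Each cell is addressed by a full tuple: one member from the
    COLUMNS axis, one from the ROWS axis, and the slicer member.
    """
    return [[cells[(measure, year, slicer)] for measure in columns]
            for year in rows]


result = evaluate(
    cells,
    columns=["Unit Sales", "Store Sales"],  # ON COLUMNS / AXIS(0)
    rows=["1997", "1998"],                  # ON ROWS / AXIS(1)
    slicer="USA.CA",                        # WHERE
)
for year, values in zip(["1997", "1998"], result):
    print(year, values)
```

    Every lookup uses one member from each dimension, which matches the statement in

    section 3.2.1 that such a tuple completely describes a cell value.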

    3.2.8 Differences between MDX and SQL

    The main differences between MDX and SQL are:

    1. The principal difference between SQL and MDX is the ability of MDX to reference

    multiple dimensions. SQL refers to only two dimensions, columns and rows, when

    processing queries. Because SQL was designed to handle only two-dimensional tabular

    data, the terms column and row have meaning in SQL syntax. MDX, in

    comparison, can process one, two, three, or more dimensions in queries. Because

    multiple dimensions can be used in MDX, each dimension is referred to as an axis

    [?].

    2. In SQL, the SELECT clause is used to define the column layout for a query, while

    the WHERE clause is used to define the row layout. However, in MDX the SELECT

    clause can be used to define several axis dimensions, while the WHERE clause is

    used to restrict multidimensional data to a specific dimension or member [ ?].

    3. In SQL, the WHERE clause is used to filter the data returned by a query. In

    MDX, the WHERE clause is used to provide a slice of the data returned by a query.

    While the two concepts are similar, they are not equivalent. The SQL query uses the

    WHERE clause to contain an arbitrary list of items that should (or should not) be

    returned in the result set. While a long list of conditions in the filter can narrow

    the scope of the data that is retrieved, there is no requirement that the elements

    in the clause will produce a clear and concise subset of data. In MDX, however,

    the concept of a slice means that each member in the WHERE clause identifies a

    distinct portion of data from a different dimension. Because of the organizational

    structure of multidimensional data, it is not possible to request a slice for multiple

    members of the same dimension. Because of this, the WHERE clause in MDX can

    provide a clear and concise subset of data [?].

    Chapter 4

    Conclusion

    Design of the data warehouse greatly influences the quality of the analysis that is possible

    with data in it. If invalid or corrupt data is allowed to get into the data warehouse, the

    analysis done with this data is likely to be invalid. So, special attention should be given

    to issues such as slowly changing dimensions, rapidly changing dimensions, and multivalued

    dimensions, which are discussed here, while designing the data warehouse.

    Dimensional modeling should be used for designing a data warehouse instead of ER

    modeling, because the main focus in a data warehouse is not on removing redundancy

    from dimensions but on queries that are simple to understand and easy to write.

    After the rapid acceptance of data warehousing systems during the past three years, there

    will continue to be many more enhancements and adjustments to the data warehous-

    ing system model. Further evolution of the hardware and software technology will also

    continue to greatly influence the capabilities that are built into data warehouses.

    Bibliography

    [1] Basic MDX. World Wide Web, http://www.msdn.microsoft.com/library.

    [2] Essbase Analytic Services Database Administrators Guide. World Wide Web,

    http://dev.hyperion.com/techdocs/essbase/essbase 71/Docs/dbag/frameset.htm.

    [3] Ralph Kimball. Dimensional Modelling Manifesto. World Wide Web,

    http://www.dbmsmag.com.

    [4] Ralph Kimball and Margy Ross. The Data Warehouse ToolKit. second edition, 2004.

    [5] Paul Lane. Oracle 9i Data Warehousing Guide. Release 1 (9.0.1) edition, 2001.

    [6] Michael J. Corey, Michael Abbey, Ian Abramson and Ben Taub. Oracle 8 Data

    Warehousing. Oracle Press edition, 1998.

    [7] Silberschatz, Korth and Sudarshan. Database System Concepts. Fourth edition, 2002.