Olap-Oct2013 Data Mining


  • 8/13/2019 Olap-Oct2013 Data Mining

    1/32

    Unit - 2

    Online Analytical Processing (OLAP)

    G. K. Gupta, Second Edition


    An OLAP dimension is an attribute or an ordinate within a multidimensional structure, with a list of values.

    On-line Analytical Processing (OLAP) is a technique for providing management decision support using historical and summarized data that is consolidated in the data warehouse.

    A fact is determined by combining dimension values.

    A fact table is a way to flatten a multidimensional cube: each row pairs one combination of dimension values with the values of the measures.

    The SQL command GROUP BY is used as the aggregation operator.

    OLAP systems are data warehouse front-ends that make aggregate data available to managers.

    Both the data warehouse (DW) and OLAP are based on a multidimensional conceptual view of enterprise data.


    University student data with the dimensions degree, country, scholarship and year is given below (multidimensional view):

    Country \ Degree   BSc   LLB   MBBS   B.Com   BIT   ALL
    Australia            5    20     15      50    11   101
    India               10     0     15      25    17    67
    Malaysia             5     1     10      12    23    51

    The above table is for the year 2000. We have collected data for other years as well.

    Using a spreadsheet has a scalability problem, as it is difficult to represent millions of rows or thousands of formulas.

    Data cubes generalize spreadsheets to any number of dimensions.
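The ALL column of the two-dimensional view is simply the total of each row. A minimal Python sketch, using the year 2000 figures from the table; the names `counts_2000` and `with_all_column` are illustrative and not part of the original slides:

```python
# Student counts for the year 2000, one list per country,
# in the order BSc, LLB, MBBS, B.Com, BIT (figures from the table above).
counts_2000 = {
    "Australia": [5, 20, 15, 50, 11],
    "India": [10, 0, 15, 25, 17],
    "Malaysia": [5, 1, 10, 12, 23],
}


def with_all_column(table):
    """Return a copy of the table with each row's total appended as ALL."""
    return {country: row + [sum(row)] for country, row in table.items()}


view_2000 = with_all_column(counts_2000)
for country, row in view_2000.items():
    print(country, row)
```

A spreadsheet would hold the same totals as one formula per row, which is where the scalability problem mentioned above starts to show.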


    1. Definition - E. F. Codd

    OLAP is the dynamic enterprise analysis required to create, manipulate, animate and synthesise information from exegetical, contemplative and formulaic data analysis models.

    Information is manipulated from the point of view of a manager (exegetical), from the point of view of someone who has thought about it (contemplative) and according to some formula (formulaic).


    OLAP Features

    Four enterprise data models:

    Categorical - comparison of historical values

    Exegetical - discovering the reasons for what the categorical model found

    Contemplative - what-if analysis of the data

    Formulaic - how to reach a desired goal


    2. Characteristics of OLAP systems

    1. Users: OLTP systems are designed for office workers, while OLAP systems are designed for decision makers.

    2. Functions: OLTP systems are mission-critical and support the enterprise's day-to-day operations; OLAP systems are called management-critical and support the enterprise's decision-support functions.

    3. Nature: OLTP is designed to process one record at a time; OLAP deals with many customer records at a time to provide summary or aggregate data to a manager.


    4. Design: OLTP is application-oriented and views enterprise data as a collection of tables; OLAP is subject-oriented and views enterprise data as multidimensional.

    5. Data: OLTP systems deal with the current status of information, so data about, e.g., an employee who left three years ago may no longer be available; OLAP requires historical data over several years.

    6. Kind of use: OLTP systems are used for read and write operations; OLAP systems do not update the data.


                         OLTP                                    OLAP

    users                clerk, IT professional                  knowledge worker

    function             day to day operations                   decision support

    DB design            application-oriented                    subject-oriented

    data                 current, up-to-date, detailed,          historical, summarized, multidimensional,
                         flat relational, isolated               integrated, consolidated

    usage                repetitive                              ad-hoc

    access               read/write,                             lots of scans
                         index/hash on primary key

    unit of work         short, simple transaction               complex query

    # records accessed   tens                                    millions

    # users              thousands                               hundreds

    DB size              100MB-GB                                100GB-TB

    metric               transaction throughput                  query throughput, response


    FASMI Characteristics

    The name is derived from the first letters of the characteristics of OLAP systems:

    Fast: OLAP queries should be answered quickly, like a search engine. Such performance is difficult to achieve, so it takes a good data structure, suitable hardware and precomputation of the most commonly queried aggregates.

    Analytic: OLAP queries should be answered without any programming. The vendor tool should cope with any relevant query for the application and the user.


    Shared: An OLAP system is a shared resource, but it is not shared by many people; it is accessed only by a group of managers and used by selected users. Concurrency control is needed if users write or update data in the database.

    Multidimensional: OLAP should provide a multidimensional conceptual view of the data, which treats the data as a cube whose dimensions are shown as parent/child relationships.

    Information: OLAP obtains its information from a data warehouse, so it must handle a large amount of input data.


    Codd's OLAP Characteristics

    Codd, in his 1993 paper, lists 12 rules for evaluating OLAP products, including the following:

    1. Multidimensional conceptual view: makes a variety of manipulations (e.g. slice and dice) relatively easy.

    2. Accessibility: the OLAP software should sit between the data sources and an OLAP front-end.

    3. Batch extraction vs interpretive: the OLAP system should provide multidimensional data staging plus partial pre-calculation of aggregates.

    4. Multi-user support: the system should provide normal database operations such as retrieval, update, concurrency control, integrity and security.

    5. Storing OLAP results: results should be kept separate from the source data. Read-write OLAP applications should not be implemented directly on transaction data.


    6. Extraction of missing values: the system should distinguish missing values from zero values, otherwise aggregates will be computed incorrectly. Large cubes may have a large number of zeroes.

    7. Treatment of missing values: the system should ignore all missing values, regardless of their source.

    8. Uniform reporting performance: reporting performance should stay consistent as the number of dimensions grows.

    9. Generic dimensionality: different dimensions should not be treated differently.

    10. Unlimited dimensions and aggregation levels: some applications need as many as 15-20 dimensions, so the system should allow unlimited dimensions.
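Rules 6 and 7 matter because an aggregate such as an average changes when a missing value is silently treated as zero. A small Python sketch with made-up numbers, where `None` stands for a missing value; the data and function name are assumptions for illustration only:

```python
def average(values):
    """Average of the known values; None marks a missing value and is skipped."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None


# Hypothetical enrolment counts for four semesters; the third was never recorded.
enrolments = [10, 20, None, 30]

# Ignoring the missing value divides the total by the three known values.
avg_ignoring_missing = average(enrolments)

# Treating the missing value as a real zero divides the same total by four.
avg_missing_as_zero = average([0 if v is None else v for v in enrolments])

print(avg_ignoring_missing, avg_missing_as_zero)
```

The two results differ, so a system that cannot tell a missing cell from a genuine zero would report the wrong average.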


    3. Multidimensional view and Data Cube

    A data warehouse is based on a multidimensional data model which views data in the form of a data cube.

    Consider the following database:

    Student(sid, name1, stu_name, country, DOB, address)

    Enrolment(sid, Degree_id, SSemester)

    Degree(Degree_id, Degree_name, Degree_length, Fee, Dept)

    A detailed example is given on page 413. Here a two-dimensional view is considered, i.e. country X degree.


    For year 2000:

    Country \ Degree   BSc   LLB   MBBS   B.Com   BIT   ALL
    Australia            5    20     15      50    11   101
    India               10     0     15      25    17    67
    Malaysia             5     1     10      12    23    51

    For year 2001:

    Country \ Degree   BSc   LLB   MBBS   B.Com   BIT   ALL
    Australia            7    10     16      53    10    96
    India                9     0     17      22    13    61
    Malaysia             5     1     19      19    20    64

    Aggregates for both semesters:

    Country \ Degree   BSc   LLB   MBBS   B.Com   BIT   ALL
    Australia           12    30     31     103    21   197
    India               19     0     32      47    30   128
    Malaysia            10     2     29      31    43   115
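The aggregate table is the cell-by-cell sum of the two tables that precede it. A short Python sketch that reproduces it from the slide figures; the variable and function names are illustrative:

```python
# Counts per country in the order BSc, LLB, MBBS, B.Com, BIT (from the slides).
counts_2000 = {
    "Australia": [5, 20, 15, 50, 11],
    "India": [10, 0, 15, 25, 17],
    "Malaysia": [5, 1, 10, 12, 23],
}
counts_2001 = {
    "Australia": [7, 10, 16, 53, 10],
    "India": [9, 0, 17, 22, 13],
    "Malaysia": [5, 1, 19, 19, 20],
}


def add_tables(first, second):
    """Add two country x degree tables cell by cell."""
    return {
        country: [a + b for a, b in zip(first[country], second[country])]
        for country in first
    }


aggregate = add_tables(counts_2000, counts_2001)
totals = {country: sum(row) for country, row in aggregate.items()}
print(aggregate)
print(totals)
```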


    4. Data Cube

    Number of students as a function of country, degree and semester.

    [Figure: data cube with axes country, degree and semester]

    Dimensions: country, degree, semester

    Hierarchical summarization paths:

    country:   continent > region > country
    degree:    school > ug/pg > degree
    semester:  year > semester


    Each edge of the cube is called a dimension.

    A user therefore has a multidimensional conceptual view of the data, which is represented by the cube.

    The points inside a cube provide aggregations. For example, a point may provide the number of students from Malaysia admitted to B.Com in the year 1998.

    The cube is not always three-dimensional.

    Each dimension may be associated with a table that describes the dimension.

    For example, a dimension table for country would contain the country names and could contain other information, e.g. category.

    Other dimensions, like time, do not naturally have such a table of information.


    A cube is represented in three dimensions: country X degree X semester for any country (x), any degree (y) and any start semester (z).

    [Figure: three-dimensional data cube with axes degree (BSc, LLB, MBBS, B.Com, BIT, All), country and semester, with All totals along each axis]

    For the query

    SELECT degree_id, count(*) FROM enrolment GROUP BY degree_id

    the result is:

    BSc   LLB   MBBS   B.Com   BIT
     71    68    192     315   217
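The same GROUP BY query can be tried on a tiny in-memory table. This is a sketch using Python's standard sqlite3 module; the six enrolment rows and the semester labels are invented for illustration and do not reproduce the totals shown on the slide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrolment (sid INTEGER, degree_id TEXT, ssemester TEXT)")

# Hypothetical enrolments: (student id, degree, start semester).
rows = [
    (1, "BSc", "2000-01"),
    (2, "BSc", "2000-02"),
    (3, "LLB", "2000-01"),
    (4, "BIT", "2000-01"),
    (5, "BIT", "2000-02"),
    (6, "BIT", "2000-02"),
]
conn.executemany("INSERT INTO enrolment VALUES (?, ?, ?)", rows)

# One output row per degree: GROUP BY acts as the aggregation operator.
per_degree = dict(
    conn.execute("SELECT degree_id, count(*) FROM enrolment GROUP BY degree_id")
)
conn.close()
print(per_degree)
```

Each row of the result corresponds to one cell along the degree edge of the cube, with country and semester aggregated away.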


    Each edge of the cube represents a dimension; the members of the degree dimension are BSc, LLB, MBBS, B.Com and BIT.

    The cells marked All give totals, e.g. the total number of students who joined each course in the respective country.

    Measures that cannot be summed across every dimension are called semi-additive or non-additive.

    A data cube allows data to be modeled and viewed in multiple dimensions.

    Dimension tables describe the dimensions, such as item (item_name, brand, type) or time (day, week, month, quarter, year).

    The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.

    Eight types of aggregations or queries are possible: null, degree, semester, country, degree & semester, semester & country, degree & country, and all.

    In general, 2^n aggregations are possible in n dimensions.
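The count of 2^n follows because each dimension is either kept in the grouping or aggregated away. A brief Python sketch that enumerates the eight groupings for the three dimensions; the function name is illustrative:

```python
from itertools import combinations


def group_by_sets(dimensions):
    """Every subset of the dimensions, from the empty set up to all of them."""
    return [
        subset
        for size in range(len(dimensions) + 1)
        for subset in combinations(dimensions, size)
    ]


sets = group_by_sets(["degree", "semester", "country"])
for subset in sets:
    print(subset)
```

The empty tuple corresponds to grouping by nothing (a single grand total), and the full tuple to grouping by all three dimensions.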


    5. Data cube implementations

    Solutions to aggregate and store the data are:

    1. Pre-compute and store all: millions of aggregates need to be computed and stored, and indexing such large amounts of data is also expensive.

    2. Pre-compute (and store) none: aggregates are computed when a query is executed; no extra space is needed for storing the cube, but query response time is very poor.

    3. Pre-compute and store some: pre-compute and store the most frequently queried aggregates.

    Let a be the degree dimension, b the country and c the starting semester. The queries will then be based on (ALL, ALL, ALL), (a, ALL, ALL), (ALL, ALL, c), (ALL, b, ALL), (a, ALL, c), (ALL, b, c), (a, b, ALL) and (a, b, c).

    Data cube implementations use many techniques for pre-computing aggregates and storing them.
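The "pre-compute and store some" strategy can be sketched as a small cache of aggregates keyed by query pattern, with a fallback that computes the rest on demand. The facts, the boolean mask representation and the choice of which patterns to store are all assumptions made for this illustration:

```python
from collections import defaultdict

ALL = "ALL"

# Hypothetical base facts: (degree, country, semester) -> number of students.
facts = {
    ("BSc", "India", "2000-01"): 10,
    ("BSc", "Malaysia", "2000-01"): 5,
    ("BIT", "India", "2000-01"): 17,
    ("BIT", "India", "2000-02"): 13,
}


def aggregate(mask):
    """Sum the facts, replacing each dimension whose mask entry is False by ALL."""
    result = defaultdict(int)
    for key, count in facts.items():
        rolled = tuple(value if keep else ALL for value, keep in zip(key, mask))
        result[rolled] += count
    return dict(result)


# Pre-compute and store only the patterns assumed to be queried most often.
materialized = {
    (True, False, False): aggregate((True, False, False)),    # (a, ALL, ALL)
    (False, False, False): aggregate((False, False, False)),  # (ALL, ALL, ALL)
}


def query(mask):
    """Answer from the stored aggregates when possible, else compute on demand."""
    if mask in materialized:
        return materialized[mask]
    return aggregate(mask)


by_degree = query((True, False, False))   # served from the stored aggregates
by_country = query((False, True, False))  # computed when the query runs
print(by_degree, by_country)
```

Storing every mask would correspond to "pre-compute and store all", and an empty `materialized` dictionary to "pre-compute none".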


    November 13 GKGupta 21

    Example of fact and dimension tables

    Sales Fact Table
        keys:      time_key, item_key, branch_key, location_key
        measures:  units_sold, dollars_sold, avg_sales

    time dimension table:      time_key, day, day_of_the_week, month, quarter, year

    item dimension table:      item_key, item_name, brand, type, supplier_type

    branch dimension table:    branch_key, branch_name, branch_type

    location dimension table:  location_key, street, city, province_or_street, country
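A fact table and its dimension tables can be tried out with Python's standard sqlite3 module. The sketch below keeps only two of the dimensions and a few of their columns, and every row in it is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time (time_key INTEGER PRIMARY KEY, month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    -- The fact table holds keys into the dimension tables plus the measures.
    CREATE TABLE sales (
        time_key INTEGER REFERENCES time(time_key),
        item_key INTEGER REFERENCES item(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO time VALUES (1, 'Jan', 'Q1', 2013), (2, 'Apr', 'Q2', 2013);
    INSERT INTO item VALUES (1, 'pen', 'Acme', 'stationery'), (2, 'lamp', 'Acme', 'home');
    INSERT INTO sales VALUES (1, 1, 10, 20.0), (1, 2, 1, 30.0), (2, 1, 5, 10.0);
    """
)

# Join the fact table to a dimension table and aggregate a measure.
dollars_by_quarter = dict(
    conn.execute(
        """
        SELECT t.quarter, SUM(s.dollars_sold)
        FROM sales AS s JOIN time AS t ON s.time_key = t.time_key
        GROUP BY t.quarter
        """
    )
)
conn.close()
print(dollars_by_quarter)
```

The join followed by GROUP BY is how a relational system answers a cube query from fact and dimension tables.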


    OLAP implementation models

    Relational OLAP (ROLAP)

    Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces.

    Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services.

    The data warehouse provides multidimensional capabilities by representing data in fact tables and dimension tables.

    Advantage: it is more easily used with an existing RDBMS, and the data can be stored efficiently in tables because zero-valued facts need not be stored.

    Disadvantage: poor query performance.


    Multidimensional OLAP (MOLAP)

    Based on a multidimensional DBMS (top-down approach).

    There is no standard approach to storing and maintaining the data.

    Uses an array-based multidimensional storage engine (sparse matrix techniques).

    Provides fast indexing to pre-computed summarized data.

    Hybrid OLAP (HOLAP)

    Offers user flexibility, e.g. low level: relational, high level: array.

    Specialized SQL servers


    Data Cube

    In the two-dimensional situation, we don't just want to find out the number of students for any country (x) and any degree (y). We may have many other queries, e.g.

    1. How many students are doing MIT?

    2. How many students are from Thailand?

    3. How many Asian students are doing Law degrees?

    Thus there is a kind of hierarchy that we wish to use: for example, the world, the continents, the regions, the countries, etc. In degrees, we may want a hierarchy of university, schools, UG and PG, and individual degrees.
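A query such as "How many Asian students are doing Law degrees?" needs the hierarchy as well as the cell values. A minimal Python sketch using some of the year 2000 counts from the slides; the continent and school mappings are assumptions added for illustration:

```python
# (country, degree) -> students, taken from the year 2000 table.
counts_2000 = {
    ("Australia", "LLB"): 20, ("Australia", "BIT"): 11,
    ("India", "LLB"): 0, ("India", "BIT"): 17,
    ("Malaysia", "LLB"): 1, ("Malaysia", "BIT"): 23,
}

# Assumed hierarchy tables: country -> continent and degree -> school.
CONTINENT = {"Australia": "Oceania", "India": "Asia", "Malaysia": "Asia"}
SCHOOL = {"LLB": "Law", "BIT": "Information Technology"}


def count_students(continent=None, school=None):
    """Total students, optionally restricted to one continent and/or one school."""
    return sum(
        n
        for (country, degree), n in counts_2000.items()
        if (continent is None or CONTINENT[country] == continent)
        and (school is None or SCHOOL[degree] == school)
    )


asian_law = count_students(continent="Asia", school="Law")
asian_total = count_students(continent="Asia")
print(asian_law, asian_total)
```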


    Data Cube operations

    A number of operations may be applied to data cubes. The common ones are:

    - roll-up (increasing the level of abstraction)

    - drill-down (increasing detail)

    - slice and dice (selection and projection)

    - pivot (re-orienting the view)


    Roll-up (less detail)

    Zooming out on the data cube, i.e. performing further aggregation on the data.

    Used for further abstraction (i.e. less detail), e.g. going from single degree programs to all programs offered by a school, from single countries to continents, or from three dimensions to two dimensions.

    Drill-down (increasing detail)

    The reverse of roll-up, used when we wish to partition more finely or want to focus on some particular values of certain dimensions.

    Drill-down adds more detail to the data; it may involve adding another dimension.
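Roll-up from three dimensions to two can be sketched by summing one axis away, and drill-down by returning to the detailed cells for a chosen member. The cube below uses a subset of the year 2000 and 2001 counts from the slides; the function names and the tuple-key layout are assumptions for illustration:

```python
from collections import defaultdict

# (country, degree, year) -> students, a subset of the slide figures.
cube = {
    ("India", "BSc", 2000): 10, ("India", "BSc", 2001): 9,
    ("India", "BIT", 2000): 17, ("India", "BIT", 2001): 13,
    ("Malaysia", "BSc", 2000): 5, ("Malaysia", "BSc", 2001): 5,
    ("Malaysia", "BIT", 2000): 23, ("Malaysia", "BIT", 2001): 20,
}


def roll_up(cells, drop_axis):
    """Aggregate one dimension away, leaving a cube with one dimension fewer."""
    result = defaultdict(int)
    for key, count in cells.items():
        result[key[:drop_axis] + key[drop_axis + 1:]] += count
    return dict(result)


def drill_down(cells, axis, value):
    """Keep only the detailed cells whose coordinate on one axis equals value."""
    return {key: count for key, count in cells.items() if key[axis] == value}


by_country_degree = roll_up(cube, drop_axis=2)        # three dimensions -> two
by_country = roll_up(by_country_degree, drop_axis=1)  # two dimensions -> one
india_detail = drill_down(cube, axis=0, value="India")
print(by_country_degree, by_country, india_detail)
```

Rolling up along a hierarchy (e.g. countries to continents) works the same way once each key is first mapped to its parent level.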


    Slice and dice (selection and projection)

    The slice operation performs a selection on one dimension of the cube (e.g. degree = MIT).

    The dice operation performs a selection on two or more dimensions (e.g. degree = BIT and country = Australia or India).

    Pivot (re-orienting the view)

    An alternate presentation of the data, e.g. rotating the axes in a 3-D cube.
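Slice, dice and pivot can each be sketched as a small function over a dictionary of cells. The example uses some of the year 2000 counts from the slides and slices on BIT rather than MIT, since MIT does not appear in the tables; the function names are illustrative:

```python
# (country, degree) -> students in the year 2000 (subset of the slide figures).
counts = {
    ("Australia", "BSc"): 5, ("Australia", "BIT"): 11,
    ("India", "BSc"): 10, ("India", "BIT"): 17,
    ("Malaysia", "BSc"): 5, ("Malaysia", "BIT"): 23,
}


def slice_degree(cells, degree):
    """Slice: select on a single dimension (here, degree)."""
    return {key: n for key, n in cells.items() if key[1] == degree}


def dice(cells, countries, degrees):
    """Dice: select on two or more dimensions at once."""
    return {
        key: n
        for key, n in cells.items()
        if key[0] in countries and key[1] in degrees
    }


def pivot(cells):
    """Pivot: swap the two axes so that degree comes first."""
    return {(degree, country): n for (country, degree), n in cells.items()}


bit_slice = slice_degree(counts, "BIT")
bit_dice = dice(counts, {"Australia", "India"}, {"BIT"})
by_degree_first = pivot(counts)
print(bit_slice, bit_dice, by_degree_first)
```

None of the three operations changes a cell value; they only restrict or re-orient the view.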


    Guidelines for OLAP

    1. Vision: a clear vision should be developed in consultation with users, including clearly defined and understood business objectives that are shared by the stakeholders.

    2. Senior management support: the project should be supported by senior managers.

    3. Selecting an OLAP tool: the team should be familiar with the ROLAP and MOLAP tools the enterprise requires. Sometimes a combination of ROLAP and MOLAP is the cost-effective choice.

    4. Corporate strategy: the OLAP strategy should fit with the enterprise strategy and business objectives.

    5. Focus on the users: the project should be built around its users, technical or non-technical, based on their personal skills and information needs.

    6. Joint management: the project should be managed jointly by IT and business professionals, with a committee of people involved to provide ideas.

    7. Review and adapt: regular reviews of the project are required to ensure that it meets the current needs of the enterprise.


    Consider a university which is spread across 5 countries, and the number of students it admitted to courses such as BSc, LLB, MBBS and B.Com in the 3 years from 2010. Construct the 2-dimensional view of each year's course entry and the aggregates after three years. Finally, develop a data cube with dimensions country X degree X year.