07_dw

  • 7/28/2019 07_dw

    Cube Computation and Indexes for Data Warehouses

    CPS 196.03

    Notes 7

    Processing

    ROLAP servers vs. MOLAP servers

    Index Structures

    Cube computation

    What to Materialize?

    Algorithms

    [Figure: data warehouse architecture with Client, Query & Analysis, Warehouse, Metadata, Integration, and Source components]

    ROLAP Server

    Relational OLAP Server

    [Figure: tools and utilities on top of a ROLAP server, which runs on a relational DBMS]

    sale
    prodId  date  sum
    p1      1     62
    p2      1     19
    p1      2     48

    Special indices, tuning; schema is denormalized

    MOLAP Server

    Multi-Dimensional OLAP Server

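    [Figure: M.D. tools and utilities on top of a multi-dimensional server, which could also sit on a relational DBMS]

    [Figure: Sales array with dimensions Product (milk, soda, eggs, soap) and Date (1, 2, 3, 4), and a third axis with values A, B]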

    MOLAP

    [Figure: 3-D sales cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Country (U.S.A, Canada, Mexico, sum), and products (TV, VCR, PC, sum). The highlighted cell is the total annual sales of TV in U.S.A.]

    MOLAP

    [Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]

    Challenges in MOLAP

    Storing large arrays for efficient access

    Row-major, column-major

    Chunking

    Compressing sparse arrays

    Creating array data from data in tables

    Efficient techniques for Cube computation

    Topics are discussed in the paper for reading
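As a rough illustration of the storage options listed above, the sketch below computes where a cell of a multi-dimensional array lands in row-major and column-major order, and which chunk holds it. The function names and the example shapes are assumptions made for illustration; the reading paper may lay out and number chunks differently.

```python
def row_major_offset(index, shape):
    """Linear offset of a cell when the last dimension varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def column_major_offset(index, shape):
    """Linear offset of a cell when the first dimension varies fastest."""
    return row_major_offset(tuple(reversed(index)), tuple(reversed(shape)))


def chunk_of(index, chunk_shape):
    """Coordinates of the chunk that holds a cell (hypothetical chunk grid)."""
    return tuple(i // c for i, c in zip(index, chunk_shape))


cell = (1, 2, 3)
shape = (4, 5, 6)
print(row_major_offset(cell, shape))
print(column_major_offset(cell, shape))
print(chunk_of((5, 9, 2), (4, 4, 4)))
```

Chunking keeps cells that are close in every dimension close on disk, whereas either linear order favors scans along a single dimension.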

    Index Structures

    Traditional Access Methods

    B-trees, hash tables, R-trees, grids, ...

    Popular in Warehouses

    inverted lists

    bit map indexes

    join indexes

    text indexes

    Inverted Lists

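    [Figure: an age index (keys 18, 19, 20, 21, 22, 23, 25, 26) points to inverted lists, which point to data records]

    inverted lists
    age = 20: r4, r18, r34, r35
    age = 21: r5, r19, r37, r40

    data records
    rId  name   age
    r4   joe    20
    r18  fred   20
    r19  sally  21
    r34  nancy  20
    r35  tom    20
    r36  pat    25
    r5   dave   21
    r41  jeff   26
    ...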

    Using Inverted Lists

    Query:

    Get people with age = 20 and name = fred

    List for age = 20: r4, r18, r34, r35

    List for name = fred: r18, r52

    Answer is intersection: r18
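The lookup above can be sketched as follows. This is a minimal illustration, not the notes' implementation: the helper names are hypothetical, and the record set contains only the rows visible on the Inverted Lists slide, so the list for name = fred holds r18 alone (r52 is not among the listed rows).

```python
from collections import defaultdict

# rId -> (name, age), copied from the visible data records.
records = {
    "r4": ("joe", 20), "r18": ("fred", 20), "r19": ("sally", 21),
    "r34": ("nancy", 20), "r35": ("tom", 20), "r36": ("pat", 25),
    "r5": ("dave", 21), "r41": ("jeff", 26),
}


def build_inverted_lists(records, field):
    """Map each value of one field to the list of record ids holding it."""
    lists = defaultdict(list)
    for rid, row in records.items():
        lists[row[field]].append(rid)
    return lists


def intersect(list_a, list_b):
    """Record ids present in both lists, in the order of list_a."""
    keep = set(list_b)
    return [rid for rid in list_a if rid in keep]


name_index = build_inverted_lists(records, field=0)
age_index = build_inverted_lists(records, field=1)

answer = intersect(age_index[20], name_index["fred"])
print(answer)
```

Only the record ids in the intersection need to be fetched from the base table.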

    Bit Maps

    [Figure: an age index (keys 18, 19, 20, 21, 22, 23, 25, 26) points to bit maps, one per age value, with one bit per data record; the bit map for age = 20 is 1101100000]

    data records
    id  name   age
    1   joe    20
    2   fred   20
    3   sally  21
    4   nancy  20
    5   tom    20
    6   pat    25
    7   dave   21
    8   jeff   26
    ...

    Bitmap Index

    Index on a particular column

    Each value in the column has a bit vector: bit-op is fast

    The length of the bit vector: # of records in the base table

    The i-th bit is set if the i-th row of the base table has the value for the indexed column

    Not suitable for high cardinality domains

    Base table
    Cust  Region   Type
    C1    Asia     Retail
    C2    Europe   Dealer
    C3    Asia     Dealer
    C4    America  Retail
    C5    Europe   Dealer

    Index on Region
    RecID  Asia  Europe  America
    1      1     0       0
    2      0     1       0
    3      1     0       0
    4      0     0       1
    5      0     1       0

    Index on Type
    RecID  Retail  Dealer
    1      1       0
    2      0       1
    3      0       1
    4      1       0
    5      0       1
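A minimal sketch of how such an index could be built from the base table and used to answer a conjunctive query. The function names are hypothetical, and the bit vectors are plain Python lists for readability; a real system would pack the bits.

```python
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is 1 if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[i] = 1
    return index


def bitwise_and(a, b):
    """AND two bit vectors of equal length."""
    return [x & y for x, y in zip(a, b)]


region_index = build_bitmap_index(base_table, column=1)
type_index = build_bitmap_index(base_table, column=2)

# Customers with Region = Asia and Type = Dealer.
hits = bitwise_and(region_index["Asia"], type_index["Dealer"])
customers = [row[0] for row, bit in zip(base_table, hits) if bit]
print(customers)
```

Each distinct value costs one vector as long as the table, which is why the approach is a poor fit for high cardinality domains.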

    Using Bit Maps

    Query:

    Get people with age = 20 and name = fred

    List for age = 20: 1101100000

    List for name = fred: 0100000001

    Answer is intersection: 0100000000

    Good if domain cardinality small

    Bit vectors can be compressed
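The intersection and the compression remark can be illustrated together. This is a sketch under assumptions: Python integers stand in for bit vectors, and run-length encoding is used as one simple compression scheme; the notes do not say which scheme is meant.

```python
from itertools import groupby

age_20 = "1101100000"
name_fred = "0100000001"

# Python ints behave as arbitrary-length bit vectors, so AND is one operation.
answer = int(age_20, 2) & int(name_fred, 2)
answer_bits = format(answer, "0{}b".format(len(age_20)))


def run_length_encode(bits):
    """Collapse a bit string into (bit, run length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


encoded = run_length_encode(answer_bits)
print(answer_bits)
print(encoded)
```

Sparse vectors, which are typical when a value is rare, collapse into a few long runs of zeros.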

    Join

    sale
    prodId  storeId  date  amt
    p1      c1       1     12
    p2      c1       1     11
    p1      c3       1     50
    p2      c2       1     8
    p1      c1       2     44
    p1      c2       2     4

    Combine SALE, PRODUCT relations

    In SQL: SELECT * FROM SALE, PRODUCT WHERE ...

    product
    id  name  price
    p1  bolt  10
    p2  nut   5

    joinTb
    prodId  name  price  storeId  date  amt
    p1      bolt  10     c1       1     12
    p2      nut   5      c1       1     11
    p1      bolt  10     c3       1     50
    p2      nut   5      c2       1     8
    p1      bolt  10     c1       2     44
    p1      bolt  10     c2       2     4

    Join Indexes

    product
    id  name  price  jIndex
    p1  bolt  10     r1,r3,r5,r6
    p2  nut   5      r2,r4

    sale
    rId  prodId  storeId  date  amt
    r1   p1      c1       1     12
    r2   p2      c1       1     11
    r3   p1      c3       1     50
    r4   p2      c2       1     8
    r5   p1      c1       2     44
    r6   p1      c2       2     4

    The jIndex column is the join index: for each product, the rIds of its matching sale rows
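A small sketch of the idea, assuming the two tables are held as in-memory dictionaries. The helper names are hypothetical; the point is only that the join index lets the matching sale rows be fetched by rId without scanning the whole sale table.

```python
# rId -> (prodId, storeId, date, amt)
sale = {
    "r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11),
    "r3": ("p1", "c3", 1, 50), "r4": ("p2", "c2", 1, 8),
    "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4),
}
# id -> (name, price)
product = {"p1": ("bolt", 10), "p2": ("nut", 5)}


def build_join_index(sale):
    """For each prodId, list the rIds of the sale rows that join with it."""
    index = {}
    for rid, row in sale.items():
        index.setdefault(row[0], []).append(rid)
    return index


join_index = build_join_index(sale)


def joined_rows(prod_id):
    """Joined tuples for one product, found through the join index."""
    name, price = product[prod_id]
    return [(prod_id, name, price) + sale[rid][1:] for rid in join_index[prod_id]]


print(join_index)
print(joined_rows("p2"))
```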

    Cube Computation for Data Warehouses

    Counting Exercise

    How many cuboids are there in a cube?

    The full or nothing case

    When dimension hierarchies are present

    What is the size of each cuboid?

    Lattice of Cuboids

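    city, product, date
    city, product | city, date | product, date
    city | product | date
    all

    Example (sales amounts):

    day 1    c1   c2   c3
    p1       12        50
    p2       11    8

    day 2    c1   c2   c3
    p1       44    4
    p2

    summed over date (city, product):
             c1   c2   c3
    p1       56    4   50
    p2       11    8

    summed over date and product (city):
             c1   c2   c3
             67   12   50

    summed over all dimensions (all): 129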
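The lattice can be reproduced by grouping the base data on every subset of the dimensions. The sketch below is a naive illustration that uses the six rows of the sale table from these notes; it recomputes each cuboid from the base rows and makes no attempt to share work between cuboids. The names `DIMENSIONS`, `SALES`, and `cuboid` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "city", "date")
# (product, city, date, amt)
SALES = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def cuboid(group_by):
    """Sum amt for each combination of values of the group-by dimensions."""
    positions = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in SALES:
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


# One cuboid per subset of dimensions, from the base cuboid down to "all".
lattice = {
    dims: cuboid(dims)
    for k in range(len(DIMENSIONS), -1, -1)
    for dims in combinations(DIMENSIONS, k)
}

print(len(lattice))
print(lattice[("city",)])
print(lattice[()])
```

The empty group-by is the apex cuboid "all", which holds the single grand total.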

    Dimension Hierarchies

    Hierarchy (coarsest to finest): all, state, city

    cities
    city  state
    c1    CA
    c2    NY

    Dimension Hierarchies

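    [Figure: lattice of cuboids extended with the state level of the city hierarchy. Nodes shown: (city, product, date); (city, product); (city, date); (product, date); (state, product, date); (state, product); (state, date); city; product; date; state; all. Not all arcs shown.]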

    Efficient Data Cube Computation

    Data cube can be viewed as a lattice of cuboids

    The bottom-most cuboid is the base cuboid

    The top-most cuboid (apex) contains only one cell

    How many cuboids in an n-dimensional cube with L levels?

    $T = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels of dimension $i$ (the +1 accounts for the top level all)

    Materialization of data cube

    Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)

    Selection of which cuboids to materialize

    Based on size, sharing, access frequency, etc.
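A quick check of the counting, assuming the usual convention that each dimension contributes its own levels plus one extra top level "all", so the total is the product of (levels + 1) over all dimensions. The function names are hypothetical; `math.prod` requires Python 3.8 or later.

```python
from math import prod


def cuboids_without_hierarchies(n):
    """Each of the n dimensions is either grouped on or aggregated away."""
    return 2 ** n


def cuboids_with_hierarchies(levels):
    """levels[i] is the number of levels of dimension i, excluding 'all'."""
    return prod(l + 1 for l in levels)


# Three flat dimensions: city, product, date.
print(cuboids_without_hierarchies(3))
# Same cube, with the location dimension having two levels (city, state).
print(cuboids_with_hierarchies([2, 1, 1]))
```

With one level per dimension the product reduces to 2^n, which matches the plain lattice of city, product, and date.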

    Derived Data

    Derived Warehouse Data

    indexes

    aggregates

    materialized views (next slide)

    When to update derived data?

    Incremental vs. refresh
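To make the two update strategies concrete, here is a minimal sketch for one derived aggregate, total amount per product. It assumes insert-only changes and a SUM aggregate, which is the easy case; deletions and other aggregates need more bookkeeping. The function names are hypothetical.

```python
def apply_increment(totals, new_sales):
    """Incremental: fold only the newly arrived rows into the aggregate."""
    for prod_id, amt in new_sales:
        totals[prod_id] = totals.get(prod_id, 0) + amt
    return totals


def refresh(all_sales):
    """Refresh: recompute the aggregate from scratch over all rows."""
    return apply_increment({}, all_sales)


old_sales = [("p1", 12), ("p2", 11)]
new_sales = [("p1", 50), ("p2", 8)]

totals = refresh(old_sales)
totals = apply_increment(totals, new_sales)
print(totals)
```

Both routes reach the same totals; the incremental one touches only the new rows, while the refresh rereads everything.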

    Idea of Materialized Views

    Define new warehouse tables/arrays

    sale
    prodId  storeId  date  amt
    p1      c1       1     12
    p2      c1       1     11
    p1      c3       1     50
    p2      c2       1     8
    p1      c1       2     44
    p1      c2       2     4

    product
    id  name  price
    p1  bolt  10
    p2  nut   5

    joinTb
    prodId  name  price  storeId  date  amt
    p1      bolt  10     c1       1     12
    p2      nut   5      c1       1     11
    p1      bolt  10     c3       1     50
    p2      nut   5      c2       1     8
    p1      bolt  10     c1       2     44
    p1      bolt  10     c2       2     4

    joinTb does not exist at any source

    Efficient OLAP Processing

    Determine which operations should be performed on available cuboids

    Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection

    Determine which materialized cuboid(s) should be selected for OLAP:

    Let the query to be processed be on {brand, province_or_state} with the

    condition year = 2004, and there are 4 materialized cuboids available:

    1) {year, item_name, city}

    2) {year, brand, country}

    3) {year, brand, province_or_state}

    4) {item_name, province_or_state} where year = 2004

    Which should be selected to process the query?

    Explore indexing structures & compressed vs. dense arrays in MOLAP
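One way to reason about the cuboid selection question is to check, for each materialized cuboid, whether it still holds every dimension the query needs at the same or a finer level. The sketch below does only that feasibility check. The hierarchies item_name to brand and city to province_or_state to country are assumptions read off the attribute names, not something the notes state, and all function names are hypothetical.

```python
# Assumed concept hierarchies, finest level first (not given in the notes).
HIERARCHIES = {
    "item": ["item_name", "brand"],
    "location": ["city", "province_or_state", "country"],
    "time": ["year"],
}


def locate(level):
    """Return (dimension, depth) for a level name; depth 0 is the finest."""
    for dimension, levels in HIERARCHIES.items():
        if level in levels:
            return dimension, levels.index(level)
    raise KeyError(level)


def can_answer(cuboid_levels, cuboid_filter, query_levels, query_filter):
    # A cuboid filtered differently from the query has lost needed rows.
    if any(query_filter.get(k) != v for k, v in cuboid_filter.items()):
        return False
    have = dict(locate(level) for level in cuboid_levels)
    # Levels still needed: group-by levels plus filters not yet applied.
    needed = list(query_levels) + [k for k in query_filter if k not in cuboid_filter]
    for level in needed:
        dimension, depth = locate(level)
        if dimension not in have or have[dimension] > depth:
            return False
    return True


CUBOIDS = {
    1: (["year", "item_name", "city"], {}),
    2: (["year", "brand", "country"], {}),
    3: (["year", "brand", "province_or_state"], {}),
    4: (["item_name", "province_or_state"], {"year": 2004}),
}

usable = [
    number
    for number, (levels, applied) in CUBOIDS.items()
    if can_answer(levels, applied, ["brand", "province_or_state"], {"year": 2004})
]
print(usable)
```

Under these assumptions, cuboid 2 is ruled out because country is coarser than province_or_state. Choosing among the remaining candidates depends on their sizes and available indexes, which the sketch does not model.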

    What to Materialize?

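    Store in warehouse results useful for common queries

    Example (sales amounts):

    day 1    c1   c2   c3
    p1       12        50
    p2       11    8

    day 2    c1   c2   c3
    p1       44    4
    p2

    summed over date:
             c1   c2   c3
    p1       56    4   50
    p2       11    8

    summed over date and product:
             c1   c2   c3
             67   12   50

    summed over date and city:
    p1  110
    p2   19

    total sales: 129

    [Figure: the tables above arranged as a lattice, with a "materialize" label]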

    Materialization Factors

    Type/frequency of queries

    Query response time

    Storage cost

    Update cost

    Will study a concrete algorithm later

    Iceberg Cube

    Computing only the cuboid cells whose count or other aggregates satisfy a condition like

    HAVING COUNT(*) >= minsup

    Motivation

    Only a small portion of cube cells may be above the water in a sparse cube

    Only calculate interesting cells: data above a certain threshold
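The condition can be illustrated with a deliberately naive sketch: count every cell of every cuboid, then keep the cells whose count reaches minsup. It shows what an iceberg cube contains, not how to compute one efficiently, since it does none of the pruning an efficient algorithm would rely on. The function name and sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations


def iceberg_cells(rows, dimensions, minsup):
    """Cells of every cuboid whose COUNT(*) is at least minsup."""
    result = {}
    for k in range(len(dimensions) + 1):
        for positions in combinations(range(len(dimensions)), k):
            group_by = tuple(dimensions[p] for p in positions)
            counts = Counter(tuple(row[p] for p in positions) for row in rows)
            for cell, count in counts.items():
                if count >= minsup:
                    result[(group_by, cell)] = count
    return result


rows = [("p1", "c1"), ("p1", "c2"), ("p2", "c1"), ("p1", "c1")]
cells = iceberg_cells(rows, ("product", "city"), minsup=2)
for key, count in cells.items():
    print(key, count)
```

Of the eight non-empty cells in this tiny cube, only four stay above the threshold.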
