DWH_VeryUseful


  • 8/10/2019 DWH_VeryUseful


    IS 4420 Database Fundamentals

    Chapter 11: Data Warehousing

    Leon Chen


    Overview

    What is a data warehouse?
    Why a data warehouse?
    Data reconciliation: ETL process
    Data warehouse architectures
    Star schema: dimensional modeling
    Data analysis


    Definition

    Data Warehouse:
    A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
        Subject-oriented: e.g. customers, patients, students, products
        Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
        Time-variant: can study trends and changes
        Non-updatable: read-only, periodically refreshed

    Data Mart:
    A data warehouse that is limited in scope


    Need for Data Warehousing

    Integrated, company-wide view of high-quality information (from disparate databases)
    Separation of operational and informational systems and data (for improved performance)


    Source: adapted from Strange (1997).


    Data Reconciliation

    Typical operational data is:
        Transient: not historical
        Not normalized (perhaps due to denormalization for performance)
        Restricted in scope: not comprehensive
        Sometimes poor quality: inconsistencies and errors

    After ETL, data should be:
        Detailed: not summarized yet
        Historical: periodic
        Normalized: 3rd normal form or higher
        Comprehensive: enterprise-wide perspective
        Timely: data should be current enough to assist decision-making
        Quality controlled: accurate with full integrity


    The ETL Process

    Capture/Extract
    Scrub or data cleansing
    Transform
    Load and Index


    Static extract = capturing a snapshot of the source data at a point in time

    Incremental extract = capturing changes that have occurred since the last static extract
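The two extract styles can be sketched in a few lines of Python. This is only an illustration, assuming each source row carries a `modified` timestamp; the `SOURCE_ROWS` data and both function names are hypothetical and not from the slides.

```python
from datetime import datetime

# Hypothetical source table: each row records when it was last modified.
SOURCE_ROWS = [
    {"id": 1, "name": "Ann", "modified": datetime(2019, 8, 1)},
    {"id": 2, "name": "Bob", "modified": datetime(2019, 8, 5)},
    {"id": 3, "name": "Cy", "modified": datetime(2019, 8, 9)},
]


def static_extract(rows):
    """Capture a snapshot: copy every source row as it stands right now."""
    return [dict(row) for row in rows]


def incremental_extract(rows, last_extract_time):
    """Capture only the rows modified after the previous extract."""
    return [dict(row) for row in rows if row["modified"] > last_extract_time]


snapshot = static_extract(SOURCE_ROWS)
changes = incremental_extract(SOURCE_ROWS, datetime(2019, 8, 4))
print(len(snapshot), [row["id"] for row in changes])
```

A timestamp comparison like this cannot see rows that were deleted from the source, so it covers only inserts and updates.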


    Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

    Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
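A few of these cleansing steps can be sketched as follows. The record layout (`name`, `joined`) and the `scrub` function are assumptions made for illustration; the sketch covers only reformatting, erroneous dates, missing data, duplicate data, and error logging.

```python
from datetime import datetime


def scrub(records):
    """Return (clean, rejected) after simple cleansing of hypothetical customer rows."""
    clean, rejected, seen = [], [], set()
    for rec in records:
        # Reformatting: collapse stray whitespace and normalize capitalization.
        name = " ".join(rec.get("name", "").split()).title()
        try:
            joined = datetime.strptime(rec.get("joined", ""), "%Y-%m-%d").date()
        except ValueError:
            rejected.append((rec, "erroneous date"))  # error detection/logging
            continue
        if not name:
            rejected.append((rec, "missing data"))
            continue
        key = (name, joined)
        if key in seen:
            rejected.append((rec, "duplicate data"))
            continue
        seen.add(key)
        clean.append({"name": name, "joined": joined})
    return clean, rejected


raw = [
    {"name": "  ann  LEE ", "joined": "2019-08-01"},
    {"name": "Ann Lee", "joined": "2019-08-01"},
    {"name": "Bob Ray", "joined": "2019-02-30"},
    {"name": "", "joined": "2019-03-01"},
]
clean, rejected = scrub(raw)
```

Rejected rows are kept alongside a reason rather than dropped silently, so they can be reviewed or corrected at the source.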


    Record-level:
        Selection: data partitioning
        Joining: data combining
        Aggregation: data summarization

    Field-level:
        Single-field: from one field to one field
        Multi-field: from many fields to one, or one field to many
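Each kind of transformation above can be shown on a small in-memory data set. The `orders` and `customers` data and all field names here are hypothetical; this is a minimal sketch, not a full transformation layer.

```python
from collections import defaultdict

# Hypothetical source data.
orders = [
    {"order_id": 1, "cust_id": 10, "amount": 50.0, "status": "shipped"},
    {"order_id": 2, "cust_id": 11, "amount": 20.0, "status": "cancelled"},
    {"order_id": 3, "cust_id": 10, "amount": 30.0, "status": "shipped"},
]
customers = {
    10: {"first": "Ann", "last": "Lee", "state": "ut"},
    11: {"first": "Bob", "last": "Ray", "state": "ca"},
}

# Record-level selection: keep only the rows that meet a condition.
shipped = [o for o in orders if o["status"] == "shipped"]

# Record-level joining: combine each order with its customer row.
joined = [{**o, **customers[o["cust_id"]]} for o in shipped]

# Record-level aggregation: summarize amounts per customer.
totals = defaultdict(float)
for row in joined:
    totals[row["cust_id"]] += row["amount"]

# Field-level, single-field: one source field becomes one target field.
states = [row["state"].upper() for row in joined]

# Field-level, multi-field: many source fields become one target field.
full_names = [f'{row["first"]} {row["last"]}' for row in joined]
```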


    Refresh mode: bulk rewriting of target data at periodic intervals

    Update mode: only changes in source data are written to data warehouse
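The difference between the two load modes can be sketched with a dictionary standing in for the target table. The function names and row layout are assumptions for illustration only.

```python
def refresh_load(target, source_rows):
    """Refresh mode: discard the target contents and rewrite them in bulk."""
    target.clear()
    target.update({row["id"]: dict(row) for row in source_rows})


def update_load(target, changed_rows):
    """Update mode: write only the changed source rows into the target."""
    for row in changed_rows:
        target[row["id"]] = dict(row)


warehouse = {}
refresh_load(warehouse, [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}])
update_load(warehouse, [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}])
```

For brevity this sketch overwrites a changed row by key. A warehouse that keeps history would instead append the new version next to the old one.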


    Data Warehouse Architectures

    Generic Two-Level Architecture
    Independent Data Mart
    Dependent Data Mart and Operational Data Store
    Logical Data Mart and @ctive Warehouse
    Three-Layer Architecture


    Independent data mart

    Data marts: mini-warehouses, limited in scope
    Separate ETL for each independent data mart
    Data access complexity due to multiple data marts


    Dependent data mart with operational data store

    Single ETL for enterprise data warehouse (EDW)
    ODS provides option for obtaining current data
    Dependent data marts loaded from EDW


    Near real-time ETL for @ctive Data Warehouse
    ODS and data warehouse are one and the same
    Data marts are NOT separate databases, but logical views of the data warehouse
    Easier to create new data marts
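One way to picture a logical data mart is as a SQL view over the warehouse tables. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `sales` table and the `west_sales_mart` view are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('West', 'Lamp', 40.0), ('West', 'Desk', 60.0), ('East', 'Lamp', 25.0);
    -- The "data mart" is only a view; no data is copied out of the warehouse.
    CREATE VIEW west_sales_mart AS
        SELECT product, amount FROM sales WHERE region = 'West';
""")

west_rows = conn.execute(
    "SELECT product, amount FROM west_sales_mart ORDER BY product"
).fetchall()

# A new warehouse row shows up in the mart immediately, with no separate load.
conn.execute("INSERT INTO sales VALUES ('West', 'Chair', 15.0)")
west_after = conn.execute("SELECT COUNT(*) FROM west_sales_mart").fetchone()[0]
```

Creating another mart is then a single `CREATE VIEW` statement, which matches the point that new data marts are easier to create.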


    Three-layer data architecture


    Data Characteristics
    Status vs. Event Data

    Status

    Event: a database action (create/update/delete) that results from a transaction


    Data Characteristics
    Transient vs. Periodic Data

    Transient data: changes to existing records are written over previous records, thus destroying the previous data content

    Periodic data: data are never physically altered or deleted once they have been added to the store
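The contrast can be made concrete with two tiny in-memory stores. The class names and the sample customer record are hypothetical; this is a sketch of the idea, not a storage design.

```python
from datetime import date


class TransientStore:
    """Keeps only the latest version of each record."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value, as_of):
        self.rows[key] = (value, as_of)  # overwrites the previous content


class PeriodicStore:
    """Appends every version; nothing is altered or deleted."""

    def __init__(self):
        self.rows = []

    def write(self, key, value, as_of):
        self.rows.append((key, value, as_of))

    def history(self, key):
        return [(value, as_of) for k, value, as_of in self.rows if k == key]


transient, periodic = TransientStore(), PeriodicStore()
for store in (transient, periodic):
    store.write("cust-1", "Salt Lake City", date(2019, 1, 1))
    store.write("cust-1", "Provo", date(2019, 6, 1))
```

After both writes the transient store holds only the second city, while the periodic store can still report both values with their dates.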


    Star schema example

    Fact table provides statistics for sales broken down by product, period and store dimensions
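A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The table and column names below (`units_sold`, `dollars_sold`, and so on) and the sample rows are assumptions for illustration; the slide's own figure is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (hypothetical columns).
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: one row per product, period and store combination.
    CREATE TABLE sales (
        product_key  INTEGER REFERENCES product(product_key),
        period_key   INTEGER REFERENCES period(period_key),
        store_key    INTEGER REFERENCES store(store_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );

    INSERT INTO product VALUES (1, 'Lamp'), (2, 'Desk');
    INSERT INTO period  VALUES (1, 2019, 1), (2, 2019, 2);
    INSERT INTO store   VALUES (1, 'Provo'), (2, 'Ogden');
    INSERT INTO sales VALUES
        (1, 1, 1, 3, 60.0), (1, 2, 1, 2, 40.0),
        (2, 1, 2, 1, 150.0), (1, 1, 2, 4, 80.0);
""")

# Break sales down by two of the dimensions: product and quarter.
query = """
    SELECT p.description, t.quarter, SUM(s.dollars_sold)
    FROM sales s
    JOIN product p ON p.product_key = s.product_key
    JOIN period  t ON t.period_key  = s.period_key
    GROUP BY p.description, t.quarter
    ORDER BY p.description, t.quarter
"""
report = conn.execute(query).fetchall()
```

The fact table holds only keys and numeric measures. The descriptive attributes used for grouping live in the dimension tables.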


    Modeling dates

    Fact tables contain time-period data
    Date dimensions are important
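A date dimension is usually generated rather than extracted from a source system. The sketch below builds one row per calendar day; the column names and the choice of attributes are assumptions, not taken from the slides.

```python
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows, day, key = [], start, 1
    while day <= end:
        rows.append({
            "date_key": key,  # sequential surrogate key
            "full_date": day,
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": DAY_NAMES[day.weekday()],
            "is_weekend": day.weekday() >= 5,
        })
        day += timedelta(days=1)
        key += 1
    return rows


date_dim = build_date_dimension(date(2019, 8, 9), date(2019, 8, 11))
```

With such a table, a fact row stores only `date_key`, and queries can group by year, quarter or weekday without date arithmetic.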


    Issues Regarding Star Schema

    Dimension table keys must be surrogate (non-intelligent and non-business related), because:
        Keys may change over time
        Length/format consistency

    Granularity of fact table: what level of detail do you want?
        Transactional grain: finest level
        Aggregated grain: more summarized
        Finer grain: better market basket analysis capability
        Finer grain: more dimension tables, more rows in fact table

    Duration of the database: how much history should be kept?
        Natural duration: 13 months or 5 quarters
        Financial institutions may need longer duration
        Older data is more difficult to source and cleanse
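The surrogate-key point can be sketched as a lookup that hands out meaningless integers. The `SurrogateKeyMap` class, the source system names and the business keys are all hypothetical.

```python
import itertools


class SurrogateKeyMap:
    """Assigns a meaningless integer key to each business key on first sight."""

    def __init__(self):
        self._next = itertools.count(1)
        self._keys = {}

    def key_for(self, source_system, business_key):
        natural = (source_system, business_key)
        if natural not in self._keys:
            self._keys[natural] = next(self._next)
        return self._keys[natural]


keys = SurrogateKeyMap()
a = keys.key_for("billing", "CUST-0042")  # first sighting gets key 1
b = keys.key_for("crm", "42")             # different format, same key type
c = keys.key_for("billing", "CUST-0042")  # repeat lookup returns the same key
```

Business keys with different lengths and formats all map to plain integers, so the dimension key stays uniform whatever the source systems use.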


    On-Line Analytical Processing (OLAP)

    The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

    Relational OLAP (ROLAP)
        Traditional relational representation
    Multidimensional OLAP (MOLAP)
        Cube structure

    OLAP Operations
        Cube slicing: come up with 2-D view of data
        Drill-down: going from summary to more detailed views
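Both operations can be sketched on a small three-dimensional cube held in a dictionary. The cube contents, the dimension order (product, region, month) and the function names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (product, region, month).
cube = {
    ("Lamp", "West", "Jan"): 10, ("Lamp", "West", "Feb"): 12,
    ("Lamp", "East", "Jan"): 7, ("Desk", "West", "Jan"): 4,
    ("Desk", "East", "Feb"): 6,
}


def slice_cube(cells, product):
    """Fix one product, leaving a 2-D (region, month) view."""
    return {(r, m): v for (p, r, m), v in cells.items() if p == product}


def summarize(cells, dims):
    """Sum cells over every dimension not listed in dims.

    dims holds positions in the key: 0 = product, 1 = region, 2 = month.
    """
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[d] for d in dims)] += value
    return dict(totals)


lamp_slice = slice_cube(cube, "Lamp")
summary = summarize(cube, dims=(0,))    # summary view: totals by product
detail = summarize(cube, dims=(0, 1))   # drill-down: product and region
```

Drill-down here is the same aggregation run with one more dimension kept, which is why the detailed totals always add up to the summary totals.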


    Slicing a data cube


    Example: Drill-down

    Summary report

    Drill-down with color added


    Data Mining and Visualization

    Knowledge discovery using a blend of statistical, AI, and computer graphics techniques

    Goals:
        Explain observed events or conditions
        Confirm hypotheses
        Explore data for new or unexpected relationships

    Data mining techniques
        Statistical regression
        Association rules
        Classification
        Clustering

    Data visualization: representing data in graphical/multimedia formats for analysis