8/10/2019 DWH_VeryUseful
IS 4420 Database Fundamentals
Chapter 11: Data Warehousing
Leon Chen
Overview
What is a data warehouse?
Why data warehouse?
Data reconciliation: ETL process
Data warehouse architectures
Star schema: dimensional modeling
Data analysis
Definition
Data Warehouse:
    A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
    Subject-oriented: e.g. customers, patients, students, products
    Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
    Time-variant: can study trends and changes
    Non-updatable: read-only, periodically refreshed
Data Mart:
    A data warehouse that is limited in scope
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)
Source: adapted from Strange (1997).
Data Reconciliation
Typical operational data is:
    Transient: not historical
    Not normalized (perhaps due to denormalization for performance)
    Restricted in scope: not comprehensive
    Sometimes poor quality: inconsistencies and errors
After ETL, data should be:
    Detailed: not summarized yet
    Historical: periodic
    Normalized: third normal form or higher
    Comprehensive: enterprise-wide perspective
    Timely: data should be current enough to assist decision-making
    Quality controlled: accurate with full integrity
The ETL Process
Capture/Extract
Scrub or data cleansing
Transform
Load and Index
Capture/Extract
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
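The difference between a static and an incremental extract can be sketched in a few lines. This is only an illustration: the `SOURCE` rows and the `updated_at` change-tracking column are assumptions, not part of the slides, and real systems often detect changes from logs rather than timestamps.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
SOURCE = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2019, 8, 1)},
    {"id": 2, "name": "Bob", "updated_at": datetime(2019, 8, 5)},
    {"id": 3, "name": "Cy", "updated_at": datetime(2019, 8, 9)},
]


def static_extract(rows):
    """Static extract: copy every source row as a point-in-time snapshot."""
    return [dict(r) for r in rows]


def incremental_extract(rows, last_extract_time):
    """Incremental extract: copy only rows changed after the previous extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]


snapshot = static_extract(SOURCE)
changes = incremental_extract(SOURCE, datetime(2019, 8, 4))
```

Here `snapshot` holds all three rows, while `changes` holds only the rows modified after 4 August.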
Scrub or data cleansing
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
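A minimal sketch of a few of these cleansing tasks (reformatting, erroneous dates, duplicate data, error logging). The record layout (`name`, `signup`) and the `scrub` helper are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def scrub(records):
    """Return (clean_rows, error_log) for a list of dict records."""
    clean, errors, seen = [], [], set()
    for rec in records:
        # Reformatting: collapse stray whitespace and normalize capitalization.
        name = " ".join(rec.get("name", "").split()).title()
        try:
            # Erroneous or missing dates fail to parse and are logged.
            signup = datetime.strptime(rec.get("signup", ""), "%Y-%m-%d").date()
        except ValueError:
            errors.append((rec, "erroneous or missing date"))
            continue
        key = (name.lower(), signup)
        if key in seen:
            # Duplicate data: the same person and date was already accepted.
            errors.append((rec, "duplicate"))
            continue
        seen.add(key)
        clean.append({"name": name, "signup": signup})
    return clean, errors


raw = [
    {"name": "  ann   lee ", "signup": "2019-08-01"},
    {"name": "Ann Lee", "signup": "2019-08-01"},  # duplicate after cleanup
    {"name": "Bob Ray", "signup": "2019-02-30"},  # impossible date
]
clean_rows, error_log = scrub(raw)
```

Rejected rows are kept in `error_log` with a reason rather than silently dropped, which matches the error detection/logging item in the list.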
Transform
Record-level:
    Selection: data partitioning
    Joining: data combining
    Aggregation: data summarization
Field-level:
    Single-field: from one field to one field
    Multi-field: from many fields to one, or one field to many
Load and Index
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
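A rough sketch of the two load modes, using a dictionary keyed by `id` as a stand-in for the target table. The function names are hypothetical, and for brevity the update overwrites by key; a warehouse that keeps periodic data would more likely append a new dated row instead.

```python
def refresh_load(source_rows):
    """Refresh mode: rebuild the whole target from the source."""
    return {row["id"]: dict(row) for row in source_rows}


def update_load(target, changed_rows):
    """Update mode: write only the changed rows into the existing target."""
    for row in changed_rows:
        target[row["id"]] = dict(row)
    return target


source = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
warehouse = refresh_load(source)  # periodic bulk rewrite
warehouse = update_load(warehouse, [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}])
```

After the update, row 1 is untouched, row 2 carries the new quantity, and row 3 has been added.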
Data Warehouse Architectures
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and @ctive Warehouse
Three-Layer Architecture
Independent data mart
Data marts: mini-warehouses, limited in scope
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Dependent data mart with operational data store
Single ETL for enterprise data warehouse (EDW)
ODS provides option for obtaining current data
Dependent data marts loaded from EDW
Logical data mart and @ctive warehouse
Near real-time ETL for @active Data Warehouse
ODS and data warehouse are one and the same
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
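One way to picture a logical data mart is as a database view over the warehouse tables. The sketch below uses Python's built-in `sqlite3` module; the `sales` table, its columns, and the `marketing_mart` view are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for a warehouse table.
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, dept TEXT, amount REAL);
    INSERT INTO sales (dept, amount) VALUES
        ('marketing', 100.0), ('finance', 250.0), ('marketing', 40.0);

    -- The "data mart" is not a separate database, only a view.
    CREATE VIEW marketing_mart AS
        SELECT sale_id, amount FROM sales WHERE dept = 'marketing';
""")

mart_rows = conn.execute(
    "SELECT sale_id, amount FROM marketing_mart ORDER BY sale_id"
).fetchall()
```

Creating another mart under this approach is one more `CREATE VIEW` statement, with no extra ETL or copy of the data.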
Three-layer data architecture
Data Characteristics: Status vs. Event Data
Status
Event: a database action (create/update/delete) that results from a transaction
Status
Data Characteristics: Transient vs. Periodic Data
Transient data: changes to existing records are written over previous records, thus destroying the previous data content
Periodic data: data are never physically altered or deleted once they have been added to the store
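The contrast between overwriting records and never altering them can be sketched with two small stores. The function names and the `acct-1` balance example are assumptions for illustration only.

```python
from datetime import date


def transient_update(store, key, value):
    """Overwrite in place: the previous value is destroyed."""
    store[key] = value


def periodic_update(store, key, value, as_of):
    """Append a dated version: earlier rows are never altered or deleted."""
    store.append({"key": key, "value": value, "as_of": as_of})


transient = {}
transient_update(transient, "acct-1", 500)
transient_update(transient, "acct-1", 750)  # the 500 balance is gone

periodic = []
periodic_update(periodic, "acct-1", 500, date(2019, 8, 1))
periodic_update(periodic, "acct-1", 750, date(2019, 8, 10))
history = [row["value"] for row in periodic if row["key"] == "acct-1"]
```

Only the append-only store can answer a question such as "what was the balance on 1 August?".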
Star schema example
Fact table provides statistics for sales broken down by product, period and store dimensions
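A small star schema along these lines can be sketched with Python's built-in `sqlite3` module. All table names, column names, and sample rows are hypothetical; the slide's own figure is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables, each with its own key.
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: one foreign key per dimension plus the numeric measures.
    CREATE TABLE sales (
        product_key  INTEGER REFERENCES product(product_key),
        period_key   INTEGER REFERENCES period(period_key),
        store_key    INTEGER REFERENCES store(store_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );

    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO period  VALUES (1, 2019, 1), (2, 2019, 2);
    INSERT INTO store   VALUES (1, 'Salt Lake City'), (2, 'Provo');
    INSERT INTO sales   VALUES
        (1, 1, 1, 10, 100.0), (1, 2, 1, 5, 50.0),
        (2, 1, 2, 3, 90.0),   (1, 1, 2, 2, 20.0);
""")

# Sales broken down by the product dimension.
by_product = conn.execute("""
    SELECT p.description, SUM(f.dollars_sold)
    FROM sales AS f
    JOIN product AS p ON p.product_key = f.product_key
    GROUP BY p.description
    ORDER BY p.description
""").fetchall()
```

Breaking the same facts down by period or store only changes which dimension table is joined and grouped on.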
Modeling dates
Fact tables contain time-period data
Date dimensions are important
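A date dimension is often pre-populated with one row per calendar day. The sketch below shows one possible layout; the attribute names (`date_key`, `quarter`, `day_of_week`) are assumptions, not taken from the slides.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows, day, key = [], start, 1
    while day <= end:
        rows.append({
            "date_key": key,                     # sequential key for the row
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day_of_week": day.isoweekday(),     # 1 = Monday ... 7 = Sunday
        })
        day += timedelta(days=1)
        key += 1
    return rows


date_dim = build_date_dimension(date(2019, 8, 9), date(2019, 8, 11))
```

Fact rows can then reference `date_key`, and queries can group by year, quarter, or weekday without recomputing them.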
Issues Regarding Star Schema
Dimension table keys must be surrogate (non-intelligent and non-business related), because:
    Keys may change over time
    Length/format consistency
Granularity of fact table: what level of detail do you want?
    Transactional grain: finest level
    Aggregated grain: more summarized
    Finer grains: better market basket analysis capability
    Finer grain: more dimension tables, more rows in fact table
Duration of the database: how much history should be kept?
    Natural duration: 13 months or 5 quarters
    Financial institutions may need longer duration
    Older data is more difficult to source and cleanse
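The case for surrogate keys can be illustrated with a small key-assignment helper. `SurrogateKeyMap` and the sample business keys are hypothetical; production warehouses usually rely on database sequences or identity columns instead.

```python
class SurrogateKeyMap:
    """Assigns sequential, meaning-free integer keys to business keys."""

    def __init__(self):
        self._next = 1
        self._by_business_key = {}

    def key_for(self, business_key):
        """Return the surrogate key, creating one on first sight."""
        if business_key not in self._by_business_key:
            self._by_business_key[business_key] = self._next
            self._next += 1
        return self._by_business_key[business_key]

    def rename(self, old_business_key, new_business_key):
        """The business key changed at the source; the surrogate stays the same."""
        self._by_business_key[new_business_key] = self._by_business_key.pop(
            old_business_key
        )


keys = SurrogateKeyMap()
a = keys.key_for("PROD-0017")
b = keys.key_for("SKU/99-B")           # different length and format, same key type
keys.rename("PROD-0017", "P-17-2019")  # business key changes over time
c = keys.key_for("P-17-2019")
```

Because `c` equals `a`, fact rows loaded before the rename still point at the right dimension row.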
On-Line Analytical Processing (OLAP)
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Relational OLAP (ROLAP)
    Traditional relational representation
Multidimensional OLAP (MOLAP)
    Cube structure
OLAP Operations
    Cube slicing: come up with 2-D view of data
    Drill-down: going from summary to more detailed views
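Slicing and drill-down can be sketched on a tiny three-dimensional cube held in a dictionary. The dimensions (product, period, store), the sample values, and the helper names are assumptions for illustration; OLAP tools do this interactively over far larger data.

```python
from collections import defaultdict

# Cells keyed by (product, period, store); each value is a sales measure.
cube = {
    ("Widget", "Q1", "North"): 100,
    ("Widget", "Q2", "North"): 120,
    ("Widget", "Q1", "South"): 80,
    ("Gadget", "Q1", "North"): 60,
    ("Gadget", "Q2", "South"): 90,
}


def slice_cube(cells, product):
    """Fix one product, leaving a 2-D (period, store) view."""
    return {(per, st): v for (prod, per, st), v in cells.items() if prod == product}


def summarize(cells, dims):
    """Total the measure by the chosen dimension positions
    (0 = product, 1 = period, 2 = store)."""
    out = defaultdict(int)
    for cell, v in cells.items():
        out[tuple(cell[i] for i in dims)] += v
    return dict(out)


widget_slice = slice_cube(cube, "Widget")
summary = summarize(cube, [0])    # summary view: totals by product
detail = summarize(cube, [0, 2])  # drill-down: product broken out by store
```

Drill-down here just adds a dimension to the grouping, so each summary total splits into its more detailed parts.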
Slicing a data cube
Example: Drill-down
Summary report
Drill-down with color added
Data Mining and Visualization
Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
Goals:
    Explain observed events or conditions
    Confirm hypotheses
    Explore data for new or unexpected relationships
Data mining techniques:
    Statistical regression
    Association rules
    Classification
    Clustering
Data visualization: representing data in graphical/multimedia formats for analysis