H.Lu/HKUST MTMI519: Data Warehousing & OLAP -- 2
Evolution of Database Technology
1960s: Hierarchical (IMS) & network (CODASYL) DBMS.
1970s: Relational data model, relational DBMS implementation.
1980s: RDBMS rules the earth.
1985-: Advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.).
1990s: ORDB, OLAP, Data mining, data warehousing, multimedia databases, and network databases.

New Business Environment
• Economic crisis
  • Importance of market & credit risk management for banks
• Deregulation
  • Intensifying competition heightened interest in retaining & acquiring good customers
• Mergers & acquisitions
  • Need for a consolidated view of the business
  • Created diverse computer systems within large corporations
• E-business
  • New way of reaching customers
  • Opportunity for 1:1 marketing

What This Means
• Increasing competition
  • customers have more choices
  • price wars, such as the one in HK
  • each operator finds its own niche (value, coverage, customer service)
• Increasing “churn”
  • focus on loyalty, customer relationship management
• With similar technology, customers become more important to the business
  • turn from product-oriented to customer-oriented

An Example
Customer service hotline of a mobile phone company
• Cust A: You made a mistake in my last month's statement …
  Receptionist: Let me check… Oh, you are right… As a token of apology, we offer you one month of free service.
• Cust B: You made a mistake in my last month's statement …
  Receptionist: Let me check… Oh, you are right… As a token of apology, we will send you two free movie tickets.
On-line decision making

Changes in Business Strategy
• Old-style incumbent focuses on
  • reducing cost
  • improving product penetration (find a customer for a product, not vice versa)
• New-style aggressive one focuses on
  • getting closer to customers
  • finding new ways to increase revenue from customers
  • satisfaction -> loyalty -> more customers -> revenue

How Does IT Play Its Role?
• From traditional to OLTP to OLAP
  • OLTP: on-line transaction processing
  • OLAP: on-line analytical processing
• To better support OLAP, warehouse your business data
  • querying one clean, integrated data warehouse rather than dozens of operational databases
• To do more and better than OLAP, consider data mining
  • discovering knowledge from operational data
  • turning the huge volume of data into a mine of gold/diamond

How Can IT Play the Role?
• Current situation
  • In most organizations, data about specific parts of the business is there -- lots and lots of data, somewhere, in some form.
  • Data is available but not information -- and not the right information at the right time.
• What should we do?
  • Bring together information from multiple sources so as to provide a consistent database source for decision support queries.
  • Off-load decision support applications from the on-line transaction system.
Decision Support
Decision Support is a term used to describe the capability of a system to support the formulation of business decisions through complex queries against a database.
It can also specifically refer to a database which is intended for this purpose, as opposed to one which primarily supports on-line transaction processing operations.

Evolution of Electronic Data Processing
• 60’s: Batch reports
  • hard to find and analyze information
  • inflexible & expensive, reprogram for every new request
• 70’s: Terminal-based DSS and EIS
  • still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
  • query tools, spreadsheets, GUIs
  • easier to use, but only access operational DB
• 90’s: Data warehouse with integrated OLAP engines and tools

OLTP vs. Decision Support Queries
• Traditionally, DBMS have been used for on-line transaction processing (OLTP)
  • order entry: pull up order 990101 and update status field
  • banking: transfer $1000 from account X to account Y
• DSS: Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions
  • What were the sales volumes by region and product category for the last year?
  • How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
  • Will a 10% discount increase sales volume sufficiently?
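
The contrast between the two query styles can be sketched with Python's built-in sqlite3 module. The `orders` table, its columns and its rows are hypothetical and exist only for illustration; the order number 990101 is borrowed from the slide.

```python
import sqlite3

# Hypothetical toy schema; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        region   TEXT,
        category TEXT,
        amount   REAL,
        status   TEXT
    );
    INSERT INTO orders VALUES
        (990101, 'East', 'TV',  500.0, 'open'),
        (990102, 'East', 'VCR', 200.0, 'open'),
        (990103, 'West', 'TV',  700.0, 'shipped'),
        (990104, 'West', 'TV',  300.0, 'open');
""")

# OLTP style: touch one record, located through its primary key.
conn.execute("UPDATE orders SET status = 'shipped' WHERE order_id = ?", (990101,))
conn.commit()
oltp_row = conn.execute(
    "SELECT status FROM orders WHERE order_id = ?", (990101,)
).fetchone()

# Decision-support style: scan and aggregate the whole table.
dss_rows = conn.execute("""
    SELECT region, category, SUM(amount)
    FROM orders
    GROUP BY region, category
    ORDER BY region, category
""").fetchall()
```

The first statement reads and writes a single row; the second has to look at every row, which is why the two workloads stress a DBMS so differently.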
TPC-D Benchmark Query #16
Counts the number of Suppliers who can supply Parts that satisfy a particular customer's requirements. The Customer is interested in Parts of eight different sizes as long as they are not a given type, not of a given brand, and not from a Supplier who has had complaints registered at the Better Business Bureau. Results must be presented in descending count and ascending brand, type and size.

TPC-H Benchmark Query #16 (SQL)
SELECT P_BRAND, P_TYPE, P_SIZE,
       COUNT(DISTINCT PS_SUPPKEY) (NAMED SUPPLIER_CNT)
FROM PARTSUPP, PARTTBL
WHERE P_PARTKEY = PS_PARTKEY
  AND P_BRAND <> 'Brand#45'
  AND P_TYPE NOT LIKE 'MEDIUM POLISHED%'
  AND P_SIZE IN (49, 14, 23, 45, 19, 3, 36, 9)
  AND PS_SUPPKEY NOT IN (
        SELECT S_SUPPKEY
        FROM SUPPLIER
        WHERE S_COMMENT LIKE '%Better Business Bureau%Complaints%')
GROUP BY P_BRAND, P_TYPE, P_SIZE
ORDER BY SUPPLIER_CNT DESC, P_BRAND, P_TYPE, P_SIZE;

OLTP Applications
• clerical data processing tasks
• update-intensive
• detailed, up-to-date data
• structured, repetitive tasks
• short transactions are the unit of work
• read and/or update a few records
• isolation, recovery and integrity are critical

Decision Support & OLAP
• Decision support applications typically consist of long and often complex read-only queries that access large portions of the database.
Databases for Decision Support
• Decision support databases
  • are updated relatively infrequently, either by periodic batch runs or by background "trickle" update streams.
  • need not contain real-time or up-to-the-minute information, as decision support applications tend to process large amounts of data which usually would not be affected significantly by individual transactions.

OLTP vs. OLAP
                     OLTP                              OLAP
users                clerk, IT professional            knowledge worker
function             day-to-day operations             decision support
DB design            application-oriented              subject-oriented
data                 current, up-to-date, detailed,    historical, summarized, multidimensional,
                     flat relational, isolated         integrated, consolidated
usage                repetitive                        ad-hoc
access               read/write,                       lots of scans
                     index/hash on prim. key
unit of work         short, simple transaction         complex query
# records accessed   tens                              millions
# users              thousands                         hundreds
DB size              100MB-GB                          100GB-TB
metric               transaction throughput            query throughput, response

Why Is Data Warehousing Needed?
• Lack of historical business data
• Data required for analysis often resides in different operational systems
• Query performance is extremely poor when the analysis is done in the operational systems
• Operational DBMS were not designed for decision support

The Architecture of Data
[Figure: layered view of data. Layer labels: Business rules, Metadata, Database schema, Summary data, Operational data. Annotations: what has been learned from data; logical model; physical layout of data; summaries by who, what, when, where, ...; who, what, when, where, ...]
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process. --- W. H. Inmon
A decision support database that is used primarily in organizational decision making.
A collection of data maintained separately from the organization’s operational database

DW: Subject Oriented & Integrated
• Subject oriented
  • Oriented to the major subject areas of the corporation that have been defined in the data model.
    • E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
  • Operational DB and applications may be organized differently.
    • E.g. based on type of insurance: auto, life, medical, fire, ...
• Integrated
  • There is no consistency in encoding, naming conventions, … among different data sources.
  • When data is moved to the warehouse, it is converted.
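
A minimal sketch of the conversion step, assuming three hypothetical source systems that encode the same attribute (gender) differently. The source names, the codes and the target encoding are all invented for illustration.

```python
# Hypothetical per-source encodings for the same attribute (keys in lower case).
SOURCE_ENCODINGS = {
    "billing": {"m": "male", "f": "female"},
    "crm": {"1": "male", "0": "female"},
    "web": {"male": "male", "female": "female"},
}

def to_warehouse_code(source, raw_value):
    """Map a source-specific code to the single warehouse encoding."""
    key = str(raw_value).strip().lower()
    # Unknown codes are kept visible as 'unknown' rather than guessed.
    return SOURCE_ENCODINGS[source].get(key, "unknown")

records = [("billing", "M"), ("crm", 1), ("web", "Female"), ("crm", "x")]
converted = [to_warehouse_code(source, value) for source, value in records]
```

Keeping one mapping table per source makes each inconsistency explicit, which is also the kind of rule a metadata repository would record.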

DW: Non-Volatile & Time Variant
• Non-volatile
  • Operational data is regularly accessed and manipulated a record at a time, and updates are made to data in the operational environment.
  • Warehouse data is loaded and accessed. Update of data does not occur in the data warehouse environment.
• Time variant
  • The time horizon for the data warehouse is significantly longer than that of operational systems.
  • Operational databases contain current-value data. Data warehouse data is nothing more than a sophisticated series of snapshots, taken as of some moment in time.
  • The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.
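
The two properties can be illustrated with a small sqlite3 sketch. The `account` and `account_snapshot` tables and the dates are assumptions made for the example: the operational table is updated in place, while the warehouse table only receives appended snapshots whose key includes a date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational table: one row per account, holding the current value only.
    CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL);
    -- Warehouse table: the key always includes a time element.
    CREATE TABLE account_snapshot (
        account_id    INTEGER,
        snapshot_date TEXT,
        balance       REAL,
        PRIMARY KEY (account_id, snapshot_date)
    );
    INSERT INTO account VALUES (1, 100.0);
""")

def take_snapshot(as_of):
    """Append the current operational values to the warehouse, stamped with a date."""
    conn.execute(
        "INSERT INTO account_snapshot SELECT account_id, ?, balance FROM account",
        (as_of,),
    )

take_snapshot("1999-01-31")
# The operational system overwrites the old value...
conn.execute("UPDATE account SET balance = 250.0 WHERE account_id = 1")
# ...while the warehouse keeps the earlier snapshot and adds a new one.
take_snapshot("1999-02-28")

current = conn.execute("SELECT balance FROM account").fetchall()
history = conn.execute(
    "SELECT snapshot_date, balance FROM account_snapshot ORDER BY snapshot_date"
).fetchall()
```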

Why Separate Data Warehouse
• Performance
  • special data organization, access methods, and implementation methods are needed to support multidimensional views and operations typical of OLAP
  • complex OLAP queries would degrade performance for operational transactions
• Function
  • missing data: decision support requires historical data which operational DBs do not typically maintain
  • data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
  • data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

The Reference Architecture
[Figure: data warehouse reference architecture. Components: Data Sources (Operational DBs, Other Sources); Monitor & Integrator; Extract, Transform, Load, Refresh; Data Warehouse; Data Marts; Metadata repository; OLAP Servers (Serve); Tools (Analysis, Query, Reports, Data mining).]
Data Sources
Data sources are often the operational systems, providing the lowest level of data.
Data sources are designed for operational use, not for decision support, and the data reflect this fact.
Multiple data sources are often from different systems run on a wide range of hardware and much of the software is built in-house or highly customized.
Multiple data sources introduce a large number of issues -- semantic conflicts.

Data Extraction, Cleaning and Integration
• Important to warehouse clean data (operational data from multiple sources are often dirty).
• Three classes of tools
  • Data migration: allows simple data transformation
  • Data scrubbing: uses domain-specific knowledge to scrub data
  • Data auditing: discovers rules and relationships by scanning data (detect outliers)
• Data cleaning and integration may use up to 50-70% of the effort and budget

Load and Refresh
• Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, building indexes, etc.
• Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse
  • when to refresh
    • determined by usage, types of data source, etc.
  • how to refresh
    • data shipping: using triggers to update snapshot log table and propagate the updated data to the warehouse
    • transaction shipping: shipping the updates in the transaction log
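
The data shipping option can be sketched with sqlite3 triggers. The tables `src_sales`, `snapshot_log` and `wh_sales` and the function `refresh_warehouse` are hypothetical names; the sketch covers inserts and updates only and ignores deletes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE snapshot_log (seq INTEGER PRIMARY KEY AUTOINCREMENT,
                               op TEXT, id INTEGER, amount REAL);
    CREATE TABLE wh_sales (id INTEGER PRIMARY KEY, amount REAL);

    -- Triggers on the source table record every change in the snapshot log.
    CREATE TRIGGER log_insert AFTER INSERT ON src_sales
    BEGIN
        INSERT INTO snapshot_log (op, id, amount) VALUES ('I', NEW.id, NEW.amount);
    END;
    CREATE TRIGGER log_update AFTER UPDATE ON src_sales
    BEGIN
        INSERT INTO snapshot_log (op, id, amount) VALUES ('U', NEW.id, NEW.amount);
    END;
""")

def refresh_warehouse():
    """Apply logged changes to the warehouse copy in order, then clear the log."""
    changes = conn.execute(
        "SELECT op, id, amount FROM snapshot_log ORDER BY seq"
    ).fetchall()
    for _op, row_id, amount in changes:
        # INSERT OR REPLACE covers both new rows and changed rows.
        conn.execute("INSERT OR REPLACE INTO wh_sales VALUES (?, ?)", (row_id, amount))
    conn.execute("DELETE FROM snapshot_log")
    return len(changes)

conn.execute("INSERT INTO src_sales VALUES (1, 10.0)")
conn.execute("INSERT INTO src_sales VALUES (2, 20.0)")
conn.execute("UPDATE src_sales SET amount = 15.0 WHERE id = 1")
applied = refresh_warehouse()
warehouse = conn.execute("SELECT id, amount FROM wh_sales ORDER BY id").fetchall()
```

How often `refresh_warehouse` runs is the "when to refresh" decision from the slide; the triggers only decide what gets shipped.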

Integrator
• receive changes from the monitors
  • make the data conform to the conceptual schema used by the warehouse
• integrate the changes into the warehouse
  • merge the data with existing data already present
  • resolve possible update anomalies

Metadata
• Structure of the data in DW (data models)
• Metrics (algorithms for summarization and aggregation)
• Mapping from legacy systems to the data warehouse
• Data usage statistics
• Performance statistics

Metadata Repository (I)
• Administrative metadata
  • source databases and their contents
  • gateway descriptions
  • warehouse schema, view and derived data definitions
  • dimensions and hierarchies
  • pre-defined queries and reports
  • data mart locations and contents
  • data partitions
  • data extraction, cleansing, transformation rules, defaults
  • data refresh and purge rules
  • user profiles, user groups
  • security: user authorization, access control

Metadata Repository (II)
• Business data
  • business terms and definitions
  • ownership of data
  • charging policies
• Operational metadata
  • data lineage: history of migrated data and sequence of transformations applied
  • currency of data: active, archived, purged
  • monitoring information: warehouse usage statistics, error reports, audit trails

Data Marts
• A data mart (departmental data warehouse) is a specialized system that brings together the data needed for a department or related applications.
• Data marts can be implemented within the data warehouse by creating special, application-specific views.
• Data marts can also be implemented as materialized views: departmental subsets that focus on selected subjects.
• More sophisticated data marts may use different data representations and include their own OLAP engines.

Other Tools
• User interface that allows users to interact with the warehouse
  • query and reporting tools
  • analysis tools
  • data mining tools

System Design
• Capacity planning -- define architecture
• Integrate servers, storage, clients
• Design warehouse schema, views
• Design physical warehouse organization: data placement, partitioning, access methods
• Connect sources: gateways, ODBC drivers
• Design and implement scripts for data extract, load and refresh
• Define metadata and populate repository
• Design and implement end-user applications
• Roll out warehouse and applications

Technologies Involved
• conceptual data modeling
  • design warehouse schema
• integration of data from heterogeneous sources
  • for monitor and integrator
• extending relational database techniques
  • multidimensional database and MOLAP
• distributed and parallel processing
  • warehouse and OLAP server

Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measurements
  • Star schema: a single object (fact table) in the middle connected to a number of objects (dimension tables).
  • Snowflake schema: a refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
  • Fact constellations: multiple fact tables share dimension tables.

Example of Star Schema
[Figure: star schema]
Sales Fact Table: Date, Product, Store, Customer
  Measurements: unit_sales, dollar_sales, Yen_sales
Dimension tables:
  Date: Date, Month, Year
  Cust: CustId, CustName, CustCity, CustCountry
  Product: ProductNo, ProdName, ProdDesc, Category, QOH
  Store: StoreID, City, State, Country, Region
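
A star schema along these lines can be sketched in sqlite3. The table names (`date_dim`, `product_dim`, `store_dim`, `sales_fact`), the trimmed column lists and the sample rows are assumptions for illustration, and the customer dimension is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim    (date_key TEXT PRIMARY KEY, month INTEGER, year INTEGER);
    CREATE TABLE product_dim (product_no INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
    CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT, region TEXT);
    -- The fact table holds one foreign key per dimension plus the measurements.
    CREATE TABLE sales_fact (
        date_key     TEXT    REFERENCES date_dim(date_key),
        product_no   INTEGER REFERENCES product_dim(product_no),
        store_id     INTEGER REFERENCES store_dim(store_id),
        unit_sales   INTEGER,
        dollar_sales REAL
    );
    INSERT INTO date_dim VALUES ('1998-01-15', 1, 1998), ('1998-02-10', 2, 1998);
    INSERT INTO product_dim VALUES (1, 'TV-21', 'TV'), (2, 'VCR-X', 'VCR');
    INSERT INTO store_dim VALUES (10, 'Hong Kong', 'China', 'Asia'), (20, 'Tokyo', 'Japan', 'Asia');
    INSERT INTO sales_fact VALUES
        ('1998-01-15', 1, 10, 2, 800.0),
        ('1998-01-15', 2, 20, 1, 150.0),
        ('1998-02-10', 1, 20, 3, 1200.0);
""")

# A typical star join: the fact table joined to the dimensions it is grouped by.
star_rows = conn.execute("""
    SELECT p.category, s.country, SUM(f.dollar_sales)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_no = f.product_no
    JOIN store_dim   AS s ON s.store_id   = f.store_id
    GROUP BY p.category, s.country
    ORDER BY p.category, s.country
""").fetchall()
```

Every dimension is one join away from the fact table, which is what keeps star queries simple compared with a normalized layout.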

Example of Snowflake Schema
[Figure: snowflake schema]
Sales Fact Table: Date, Product, Store, Customer
  Measurements: unit_sales, dollar_sales, Yen_sales
Dimension tables:
  Date: Date, Month
    Month: Month, Year
      Year: Year
  Cust: CustId, CustName, CustCity, CustCountry
  Product: ProductNo, ProdName, ProdDesc, Category, QOH
  Store: StoreID, City
    City: City, State
      State: State, Country
        Country: Country, Region

Star Schema versus Snowflake Schema
• Star Schema
  • De-normalized
  • Few attribute tables
  • Simple attribute relationship
  • Bigger attribute tables
  • Fewer joins
• Snowflake Schema
  • Normalized
  • More attribute tables
  • Complex attribute relationship
  • Smaller attribute tables
  • More joins
• Real data warehouses are rarely designed in pure star or snowflake schema because of the complex relationships among the modeled objects.

Summary Tables
• A data warehouse may store some selected summary data, i.e. pre-aggregated data.
• Summary data can be stored as separate fact tables sharing the same dimension tables with the base fact table.
• Summary data can be encoded in the original fact table and dimension tables.

id  level  date  month  year
0   1      1     1      1998
1   2      NULL  1      1998
2   2      NULL  2      1998
3   3      NULL  NULL   1998

DateID  ProdID  Sales
0       1       1000
1       1       20000
1       2       40000
3       1       300000
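
The encoded variant can be queried by filtering on the level column. The sqlite3 sketch below loads the two tables from the slide; the table names and the reading of the levels (1 = day, 2 = month, 3 = year) are assumptions, inferred from which columns are NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (id INTEGER PRIMARY KEY, level INTEGER,
                           date INTEGER, month INTEGER, year INTEGER);
    CREATE TABLE sales_fact (date_id INTEGER, prod_id INTEGER, sales INTEGER);
    INSERT INTO date_dim VALUES
        (0, 1, 1,    1,    1998),
        (1, 2, NULL, 1,    1998),
        (2, 2, NULL, 2,    1998),
        (3, 3, NULL, NULL, 1998);
    INSERT INTO sales_fact VALUES
        (0, 1, 1000), (1, 1, 20000), (1, 2, 40000), (3, 1, 300000);
""")

def sales_at_level(level):
    """Return (month, year, prod_id, sales) rows stored at one aggregation level."""
    return conn.execute("""
        SELECT d.month, d.year, f.prod_id, f.sales
        FROM sales_fact AS f
        JOIN date_dim AS d ON d.id = f.date_id
        WHERE d.level = ?
        ORDER BY f.prod_id
    """, (level,)).fetchall()

monthly = sales_at_level(2)  # pre-aggregated monthly rows
yearly = sales_at_level(3)   # pre-aggregated yearly rows
```

Forgetting the level filter would mix detail rows with their own summaries and double count, which is the main hazard of storing summaries in the base fact table.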

OLAP Servers
• Relational OLAP: extended relational DBMS that maps operations on multidimensional data to standard relational operations
• Multidimensional OLAP: special-purpose server that directly implements multidimensional data and operations

ROLAP versus MOLAP
• ROLAP
  • exploits services of relational engine effectively
  • provides additional OLAP services
    • design tools for DSS schema
    • performance analysis tool to pick aggregates to materialize
  • SQL gets in the way of sequential processing and columnar aggregation
  • some queries are hard to formulate and can often be time consuming to execute

ROLAP versus MOLAP
• MOLAP
  • the storage model is an n-dimensional array
  • front-end multidimensional queries map to server capabilities in a straightforward way
  • direct addressing abilities
  • handling sparse data in array representation is expensive
  • poor storage utilization when the data is sparse

Multidimensional View of Data
• Sales volume as a function of product, time, and geography
[Figure: data cube with axes Product, Region, Month]
• Dimensions: Product, Region, Week
• Hierarchical summarization paths (most summarized level at the top):
    Industry    Country    Year
    Category    Region     Quarter
    Product     City       Month, Week
                Office     Day

A Sample Data Cube
[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (China, India, Japan, sum). Highlighted cell: total annual sales of TV in China.]

Sample Operations
• Roll up: summarize data
  • total sales volume last year by product category by region
• Roll down, drill down, drill through: go from higher level summary to lower level summary or detailed data
  • for a particular product category, find the detailed sales data for each salesperson by date
• Slice and dice: select and project
  • sales of beverages in the West over the last 6 months
• Pivot: reorient cube
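
Roll up and slice can be sketched in plain Python over a handful of detail records. The records, the dimension names and the helper functions `roll_up` and `slice_rows` are hypothetical; an OLAP server would run the same operations against the cube.

```python
from collections import defaultdict

# Hypothetical detail records: (category, region, month, sales).
SALES = [
    ("beverages", "West", 1, 100),
    ("beverages", "West", 2, 150),
    ("beverages", "East", 1, 80),
    ("snacks",    "West", 1, 60),
]
DIMS = ("category", "region", "month")

def roll_up(rows, keep):
    """Summarize sales, grouping only by the dimensions named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def slice_rows(rows, **fixed):
    """Select the rows whose dimensions equal the given values."""
    return [r for r in rows
            if all(r[DIMS.index(d)] == v for d, v in fixed.items())]

# Roll up: drop the month dimension.
by_category_region = roll_up(SALES, ("category", "region"))
# Slice: fix category and region, keep the remaining detail.
west_beverages = slice_rows(SALES, category="beverages", region="West")
```

Drill down is the reverse of `roll_up`: calling it with more dimensions in `keep` returns the finer-grained totals.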
Cube Operation
SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer
Need to compute the following group-bys:
(date, product, customer),
(date, product), (date, customer), (product, customer),
(date), (product), (customer),
( )
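
The set of group-bys is simply every subset of the CUBE BY attributes, so n attributes give 2^n groupings once the empty grouping (the grand total) is counted. A short sketch that enumerates them; the function name `cube_group_bys` is an illustrative choice.

```python
from itertools import combinations

def cube_group_bys(attributes):
    """Return every subset of the CUBE BY attributes, largest first."""
    return [combo
            for size in range(len(attributes), -1, -1)
            for combo in combinations(attributes, size)]

group_bys = cube_group_bys(("date", "product", "customer"))
```

For the three attributes of the query this yields eight groupings, ending with the empty tuple.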

Cuboid Lattice
[Figure: lattice of cuboids over relation R with attributes A, B, C, D]
  ( all )
  (A) (B) (C) (D)
  (A,B) (A,C) (A,D) (B,C) (B,D) (C,D)
  (A,B,C) (A,B,D) (A,C,D) (B,C,D)
  (A,B,C,D)
• Data cube can be viewed as a lattice of cuboids
• The bottom-most cuboid is the base cube.
• The top-most cuboid contains only one cell.

Cuboid -- A Formal Definition
• Let R be a relation with k+1 attributes $X = \{A_1, A_2, \ldots, A_k, V\}$.
• A cuboid on j attributes $S = \{A_{i_1}, A_{i_2}, \ldots, A_{i_j}\}$ is defined as a group-by on attributes $A_{i_1}, A_{i_2}, \ldots, A_{i_j}$ using aggregate function $F(\cdot)$ applied on attribute V. This cuboid can be represented as a k+1 attribute relation by using the special value ALL for the remaining k-j attributes.
• The CUBE on attribute set X is the union of cuboids on all subsets of attributes of X. The cuboid on all attributes in X is called the base cuboid.
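
The definition translates almost directly into code. The sketch below assumes rows are tuples whose last position is the measure V and whose first k positions are the grouping attributes; the names `cuboid`, `cube` and the sample relation `R` are illustrative, and SUM stands in for F.

```python
from itertools import combinations
from collections import defaultdict

ALL = "ALL"  # special value standing for "every value of this attribute"

def cuboid(rows, k, group_idx, agg=sum):
    """Group-by on the attribute positions in group_idx; other attributes become ALL."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] if i in group_idx else ALL for i in range(k))
        groups[key].append(row[k])  # row[k] is the measure V
    return {key: agg(values) for key, values in groups.items()}

def cube(rows, k, agg=sum):
    """Union of the cuboids over every subset of the k grouping attributes."""
    result = {}
    for j in range(k + 1):
        for group_idx in combinations(range(k), j):
            result.update(cuboid(rows, k, group_idx, agg))
    return result

R = [("a1", "b1", 10), ("a1", "b2", 20), ("a2", "b1", 5)]
C = cube(R, 2)
```

Because each cuboid pads its keys with ALL, the cuboids never collide and their union is a single relation with k+1 attributes, as the definition states.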

Cube Computation -- Array Based Algorithm
• A MOLAP approach: the base cuboid is stored as a multidimensional array.
• Read in a number of cells to compute partial cuboids
[Figure: 3-D array with axes A, B, C, and the cuboids {ABC}; {AB}, {AC}, {BC}; {A}, {B}, {C}; { }]
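
A simplified sketch of the idea, assuming a small dense 3-D array held as nested Python lists: one scan over the base cells updates every coarser cuboid at once. The function `array_cube` is a hypothetical name, and the chunking and dimension ordering that a real array-based algorithm uses to limit memory are left out.

```python
def array_cube(base):
    """Compute all cuboids of a dense 3-D array base[a][b][c] in one scan."""
    na, nb, nc = len(base), len(base[0]), len(base[0][0])
    ab = [[0] * nb for _ in range(na)]
    ac = [[0] * nc for _ in range(na)]
    bc = [[0] * nc for _ in range(nb)]
    a, b, c = [0] * na, [0] * nb, [0] * nc
    total = 0
    for i in range(na):
        for j in range(nb):
            for k in range(nc):
                v = base[i][j][k]
                # Each base cell contributes to one cell of every coarser cuboid.
                ab[i][j] += v
                ac[i][k] += v
                bc[j][k] += v
                a[i] += v
                b[j] += v
                c[k] += v
                total += v
    return {"AB": ab, "AC": ac, "BC": bc, "A": a, "B": b, "C": c, "ALL": total}

base = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
result = array_cube(base)
```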

View and Materialized Views
• View
  • derived relation defined in terms of base (stored) relations
• Materialized views
  • a view can be materialized by storing the tuples of the view in the database
  • index structures can be built on the materialized view
• Maintenance is an issue for materialized views
  • recomputation
  • incremental updating
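
The two maintenance options can be compared on a toy view. The class `SumView` below is a hypothetical stand-in for a materialized `SUM(amount) GROUP BY key` view: `recompute` rebuilds it from the base rows, while `apply_insert` and `apply_delete` adjust only the affected group.

```python
from collections import defaultdict

class SumView:
    """A materialized SUM(amount) GROUP BY key view over a base table."""

    def __init__(self, base_rows):
        self.totals = defaultdict(int)
        self.recompute(base_rows)

    def recompute(self, base_rows):
        # Recomputation: scan the whole base table again.
        self.totals.clear()
        for key, amount in base_rows:
            self.totals[key] += amount

    def apply_insert(self, key, amount):
        # Incremental update: touch one group only.
        self.totals[key] += amount

    def apply_delete(self, key, amount):
        self.totals[key] -= amount

base = [("TV", 100), ("VCR", 50)]
view = SumView(base)

# A new base row arrives; the view is patched instead of rebuilt.
base.append(("TV", 25))
view.apply_insert("TV", 25)
```

SUM can be maintained from the changes alone. A delete that removes the last row of a group would also need a row count per group to know when to drop it, which this sketch omits.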

Issues Related to Materialized Views
• Select a set of views to be materialized
  • limited by resources, cannot materialize all the views
  • issues to consider: available resources, overhead with respect to the workload
  • simple algorithm works reasonably well
• Exploit the materialized views to answer queries
  • query optimization using views
• Efficiently update materialized views during loading and refresh
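
One simple way to exploit materialized views, sketched under the assumption that the views are cuboids and the aggregate is distributive (such as SUM): answer a group-by from the smallest materialized cuboid that contains all the requested attributes. The catalogue `MATERIALIZED`, its row counts and the function `best_view` are invented for illustration and are not the selection algorithm the slide refers to.

```python
# Hypothetical materialized cuboids: attribute set -> number of stored rows.
MATERIALIZED = {
    frozenset({"date", "product", "customer"}): 1_000_000,
    frozenset({"date", "product"}): 50_000,
    frozenset({"product"}): 200,
}

def best_view(query_attrs):
    """Pick the smallest materialized cuboid that can answer the group-by."""
    needed = frozenset(query_attrs)
    candidates = [(size, attrs)
                  for attrs, size in MATERIALIZED.items()
                  if needed <= attrs]  # the view must contain every requested attribute
    if not candidates:
        return None  # no view applies
    return min(candidates, key=lambda c: c[0])[1]

by_product = best_view({"product"})
by_date = best_view({"date"})
```

A group-by on date alone is answered from the (date, product) cuboid rather than the base cuboid, since it is the smaller of the two views that cover it.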