Upload
sharada1406
View
4
Download
0
Tags:
Embed Size (px)
Citation preview
Data Warehousing: Data Data Warehousing: Data Models and OLAP Models and OLAP
operationsoperations
ByBy
Kishore JaladiKishore Jaladi
[email protected]@yahoo.com
Topics CoveredTopics Covered1. Understanding the term “Data Warehousing”1. Understanding the term “Data Warehousing”
2. Three-tier Decision Support Systems2. Three-tier Decision Support Systems
3. Approaches to OLAP servers3. Approaches to OLAP servers
4. Multi-dimensional data model4. Multi-dimensional data model
5. ROLAP5. ROLAP
6. MOLAP6. MOLAP
7. HOLAP7. HOLAP
8. Which to choose: Compare and Contrast8. Which to choose: Compare and Contrast
9. Conclusion9. Conclusion
Understanding the term Data Understanding the term Data WarehousingWarehousing• Data Warehouse: Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990, which he The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in of management's decision making process". He defined the terms in the sentence as follows:the sentence as follows:
• Subject Oriented:Subject Oriented:Data that gives information about a particular subject instead of Data that gives information about a particular subject instead of about a company's ongoing operations. about a company's ongoing operations.
• Integrated: Integrated: Data that is gathered into the data warehouse from a variety of Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.sources and merged into a coherent whole.
• Time-variant: Time-variant: All data in the data warehouse is identified with a particular time All data in the data warehouse is identified with a particular time period. period.
• Non-volatileNon-volatileData is stable in a data warehouse. More data is added but data is Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent never removed. This enables management to gain a consistent picture of the business. picture of the business.
Other important Other important terminologyterminology
• Enterprise Data warehouseEnterprise Data warehousecollects all information about subjects collects all information about subjects ((customers,products,sales,assets, personnelcustomers,products,sales,assets, personnel) that span the ) that span the entire organizationentire organization
• Data MartData MartDepartmental subsets that focus on selected subjectsDepartmental subsets that focus on selected subjects
• Decision Support System (DSS)Decision Support System (DSS)Information technology to help the knowledge worker Information technology to help the knowledge worker (executive, manager, analyst) make faster & better decisions(executive, manager, analyst) make faster & better decisions
• Online Analytical Processing (OLAP)Online Analytical Processing (OLAP)an element of decision support systems (DSS)an element of decision support systems (DSS)
Three-Tier Decision Support Three-Tier Decision Support SystemsSystems
• Warehouse database serverWarehouse database server– Almost always a relational DBMS, rarely flat filesAlmost always a relational DBMS, rarely flat files
• OLAP serversOLAP servers– Relational OLAP (ROLAP): extended relational Relational OLAP (ROLAP): extended relational
DBMS that maps operations on multidimensional DBMS that maps operations on multidimensional data to standard relational operatorsdata to standard relational operators
– Multidimensional OLAP (MOLAP): special-purpose Multidimensional OLAP (MOLAP): special-purpose server that directly implements multidimensional server that directly implements multidimensional data and operationsdata and operations
• ClientsClients– Query and reporting toolsQuery and reporting tools– Analysis toolsAnalysis tools– Data mining toolsData mining tools
The Complete Decision Support The Complete Decision Support SystemSystem
Information Sources Data Warehouse Server(Tier 1)
OLAP Servers(Tier 2)
Clients(Tier 3)
OperationalDB’s
SemistructuredSources
extracttransformloadrefreshetc.
Data Marts
DataWarehouse
e.g., MOLAP
e.g., ROLAP
serve
OLAP
Query/Reporting
Data Mining
serve
serve
Approaches to OLAP Approaches to OLAP ServersServers
Three possibilities for OLAP serversThree possibilities for OLAP servers(1) Relational OLAP (ROLAP)(1) Relational OLAP (ROLAP)
– Relational and specialized relational DBMS to Relational and specialized relational DBMS to store and manage warehouse datastore and manage warehouse data
– OLAP middleware to support missing piecesOLAP middleware to support missing pieces(2) Multidimensional OLAP (MOLAP)(2) Multidimensional OLAP (MOLAP)
– Array-based storage structuresArray-based storage structures– Direct access to array data structuresDirect access to array data structures
(3) Hybrid OLAP (HOLAP)(3) Hybrid OLAP (HOLAP)
– Storing detailed data in RDBMSStoring detailed data in RDBMS– Storing aggregated data in MDBMSStoring aggregated data in MDBMS– User access via MOLAP toolsUser access via MOLAP tools
The Multi-Dimensional Data The Multi-Dimensional Data ModelModel
““Sales by product line over the past six months”Sales by product line over the past six months”
““Sales by store between 1990 and 1995”Sales by store between 1990 and 1995”
Prod Code Time Code Store Code Sales Qty
Store Info
Product Info
Time Info
. . .
Numerical MeasuresKey columns joining fact table
to dimension tables
Fact table for measures
Dimension tables
ROLAP: Dimensional Modeling Using ROLAP: Dimensional Modeling Using Relational DBMSRelational DBMS
• Special schema design: Special schema design: star, snowflakestar, snowflake
• Special indexes: bitmap, multi-table joinSpecial indexes: bitmap, multi-table join
• Proven technology (relational model, DBMS), Proven technology (relational model, DBMS), tend to outperform specialized MDDB tend to outperform specialized MDDB especially on large data setsespecially on large data sets
• ProductsProducts– IBM DB2, Oracle, Sybase IQ, RedBrick, IBM DB2, Oracle, Sybase IQ, RedBrick,
InformixInformix
The “Classic” Star SchemaThe “Classic” Star Schema
A single fact table, with A single fact table, with detail and summary datadetail and summary data
Fact table primary key Fact table primary key has only one key column has only one key column per dimensionper dimension
Each key is generatedEach key is generated Each dimension is a Each dimension is a
single table, highly de-single table, highly de-normalizednormalized
Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata
PERIOD KEY
Store Dimension Time Dimension
Product Dimension
STORE KEYPRODUCT KEYPERIOD KEY
DollarsUnitsPrice
Period DescYearQuarterMonthDayCurrent FlagResolutionSequence
Fact Table
PRODUCT KEY
Store DescriptionCityStateDistrict IDDistrict Desc.Region_IDRegion Desc.Regional Mgr.Level
Product Desc.BrandColorSizeManufacturerLevel
STORE KEY
The “Snowflake” SchemaThe “Snowflake” Schema
STORE KEY
Store Dimension
Store DescriptionCityStateDistrict IDRegion_IDRegional Mgr.
District_IDDistrict Desc.Region_ID
Region_ID
Region Desc.Regional Mgr.
STORE KEYPRODUCT KEYPERIOD KEY
DollarsUnitsPrice
Store Fact Table
Aggregation in a Single Fact TableAggregation in a Single Fact Table
Drawbacks: Summary data in the fact table yields poorer performance for summary levels, huge dimension tables a problem
PERIOD KEY
Store Dimension Time Dimension
Product Dimension
STORE KEYPRODUCT KEYPERIOD KEY
DollarsUnitsPrice
Period DescYearQuarterMonthDayCurrent FlagResolutionSequence
Fact Table
PRODUCT KEY
Store DescriptionCityStateDistrict IDDistrict Desc.Region_IDRegion Desc.Regional Mgr.Level
Product Desc.BrandColorSizeManufacturerLevel
STORE KEY
PERIOD KEY
Store Dimension Time Dimension
Product Dimension
STORE KEYPRODUCT KEYPERIOD KEY
DollarsUnitsPrice
Period DescYearQuarterMonthDayCurrent FlagSequence
Fact Table
PRODUCT KEY
Store DescriptionCityStateDistrict IDDistrict Desc.Region_IDRegion Desc.Regional Mgr.
Product Desc.BrandColorSizeManufacturer
STORE KEY
The “Fact Constellation” The “Fact Constellation” SchemaSchema
DollarsUnitsPrice
District Fact Table
District_IDPRODUCT_KEYPERIOD_KEY
DollarsUnitsPrice
Region Fact Table
Region_IDPRODUCT_KEYPERIOD_KEY
TheThe
Aggregations using Aggregations using “Snowflake” Schema and “Snowflake” Schema and Multiple Fact TablesMultiple Fact Tables
• No No LEVELLEVEL in dimension tables in dimension tables
• Dimension tables are normalized by Dimension tables are normalized by decomposing at the attribute leveldecomposing at the attribute level
• Each dimension table has one key for Each dimension table has one key for each level of the dimensionís each level of the dimensionís hierarchy hierarchy
• The lowest level key joins the The lowest level key joins the dimension table to both the fact table dimension table to both the fact table and the lower level attribute tableand the lower level attribute table
How does it work? The best way is for the query to be built by understanding which summary levels exist, and finding the proper snowflaked attribute tables, constraining there for keys, then selecting from the fact table.
STORE KEY
Store Dimension
Store DescriptionCityStateDistrict IDDistrict Desc.Region_ IDRegion Desc.Regional Mgr.
District_ IDDistrict Desc.Region_ ID
Region_ ID
Region Desc.Regional Mgr.
STORE KEYPRODUCT KEYPERIOD KEY
DollarsUnitsPrice
Store Fact Table
DollarsUnitsPrice
District Fact Table
District_IDPRODUCT_KEYPERIOD_KEY Dollars
UnitsPrice
RegionFact Table
Region_IDPRODUCT_KEYPERIOD_KEY
Aggregation Contd …Aggregation Contd …
Advantage: Best performance when queries involve aggregation
Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database
STORE KEY
Store Dimension
Store DescriptionCityStateDistrict IDDistrict Desc.Region_ IDRegion Desc.Regional Mgr.
District_ IDDistrict Desc.Region_ ID
Region_ ID
Region Desc.Regional Mgr.
STORE KEYPRODUCT KEYPERIOD KEY
DollarsUnitsPrice
Store Fact Table
DollarsUnitsPrice
District Fact Table
District_IDPRODUCT_KEYPERIOD_KEY Dollars
UnitsPrice
RegionFact Table
Region_IDPRODUCT_KEYPERIOD_KEY
AggregatesAggregates
sale prodId storeId date amtp1 s1 1 12p2 s1 1 11p1 s3 1 50p2 s2 1 8p1 s1 2 44p1 s2 2 4
Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
81
AggregatesAggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans date sum1 812 48
sale prodId storeId date amtp1 s1 1 12p2 s1 1 11p1 s3 1 50p2 s2 1 8p1 s1 2 44p1 s2 2 4
Another ExampleAnother Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId
sale prodId date amtp1 1 62p2 1 19p1 2 48
drill-down
rollup
sale prodId storeId date amtp1 s1 1 12p2 s1 1 11p1 s3 1 50p2 s2 1 8p1 s1 2 44p1 s2 2 4
Points to be noticed about ROLAPPoints to be noticed about ROLAP
• Defines complex, multi-dimensional data with Defines complex, multi-dimensional data with simple modelsimple model
• Reduces the number of joins a query has to Reduces the number of joins a query has to processprocess
• Allows the data warehouse to evolve with rel. Allows the data warehouse to evolve with rel. low maintenancelow maintenance
• Can contain both detailed and summarized Can contain both detailed and summarized data.data.
• ROLAP is based on familiar, proven, and ROLAP is based on familiar, proven, and already selected technologies.already selected technologies.
BUT!!!BUT!!!
• SQL for multi-dimensional manipulation of SQL for multi-dimensional manipulation of calculations.calculations.
MOLAP: Dimensional Modeling MOLAP: Dimensional Modeling Using the Multi Dimensional ModelUsing the Multi Dimensional Model
• MDDB: a special-purpose data modelMDDB: a special-purpose data model
• Facts stored in multi-dimensional Facts stored in multi-dimensional arraysarrays
• Dimensions used to index arrayDimensions used to index array
• Sometimes on top of relational DBSometimes on top of relational DB
• ProductsProducts– Pilot, Arbor Essbase, GentiaPilot, Arbor Essbase, Gentia
The MOLAP CubeThe MOLAP Cube
sale prodId storeId amtp1 s1 12p2 s1 11p1 s3 50p2 s2 8
s1 s2 s3p1 12 50p2 11 8
Fact table view: Multi-dimensional cube:
dimensions = 2
3-D Cube3-D Cube
dimensions = 3
Multi-dimensional cube:Fact table view:
sale prodId storeId date amtp1 s1 1 12p2 s1 1 11p1 s3 1 50p2 s2 1 8p1 s1 2 44p1 s2 2 4
day 2 s1 s2 s3p1 44 4p2 s1 s2 s3
p1 12 50p2 11 8
day 1
ExampleExample
Store
Product
Time
M T W Th F S S
Juice
Milk
Coke
Cream
Soap
Bread
NYSF
LA
10
34
56
32
12
56
56 units of bread sold in LA on M
Dimensions:Time, Product, Store
Attributes:Product (upc, price, …)Store ……
Hierarchies:Product Brand …Day Week QuarterStore Region Country
roll-up to week
roll-up to brand
roll-up to region
Cube Aggregation: Roll-upCube Aggregation: Roll-up
day 2 s1 s2 s3p1 44 4p2 s1 s2 s3
p1 12 50p2 11 8
day 1
s1 s2 s3p1 56 4 50p2 11 8
s1 s2 s3sum 67 12 50
sump1 110p2 19
129
. . .
drill-down
rollup
Example: computing sums
Cube Operators for Roll-upCube Operators for Roll-up
day 2 s1 s2 s3p1 44 4p2 s1 s2 s3
p1 12 50p2 11 8
day 1
s1 s2 s3p1 56 4 50p2 11 8
s1 s2 s3sum 67 12 50
sump1 110p2 19
129
. . .
sale(s1,*,*)
sale(*,*,*)sale(s2,p2,*)
s1 s2 s3 *p1 56 4 50 110p2 11 8 19* 67 12 50 129
Extended CubeExtended Cube
day 2 s1 s2 s3 *p1 44 4 48p2* 44 4 48s1 s2 s3 *
p1 12 50 62p2 11 8 19* 23 8 50 81
day 1
*
sale(*,p2,*)
Aggregation Using Aggregation Using HierarchiesHierarchies
region A region Bp1 56 54p2 11 8
store
region
country
(store s1 in Region A;stores s2, s3 in Region B)
day 2 s1 s2 s3p1 44 4p2 s1 s2 s3
p1 12 50p2 11 8
day 1
Points to be noticed about MOLAPPoints to be noticed about MOLAP
• Pre-calculating or pre-consolidating transactional data Pre-calculating or pre-consolidating transactional data improves speed. improves speed.
BUTBUTFully pre-consolidating incoming data, MDDs require an Fully pre-consolidating incoming data, MDDs require an enormous amount of overhead both in processing time and in enormous amount of overhead both in processing time and in storage. An input file of 200MB can easily expand to 5GBstorage. An input file of 200MB can easily expand to 5GB
MDDs are great candidates for the MDDs are great candidates for the <<50GB department data 50GB department data marts.marts.
• Rolling up and Drilling down through aggregate data.Rolling up and Drilling down through aggregate data.
• With MDDs, application design is essentially the definition of With MDDs, application design is essentially the definition of dimensions and calculation rules, while the RDBMS requires dimensions and calculation rules, while the RDBMS requires that the database schema be a star or snowflake.that the database schema be a star or snowflake.
Hybrid OLAP (HOLAP)Hybrid OLAP (HOLAP)
• HOLAP = Hybrid OLAP:HOLAP = Hybrid OLAP:
– Best of both worldsBest of both worlds
– Storing detailed data in RDBMSStoring detailed data in RDBMS
– Storing aggregated data in MDBMSStoring aggregated data in MDBMS
– User access via MOLAP toolsUser access via MOLAP tools
Multi-dimensional
accessMultidimensional
Viewer
RelationalViewer
ClientMDBMS Server
Multi-dimensional
data
SQL-Read
RDBMS Server
Userdata Meta data
Deriveddata
SQL-Reach Through
SQL-Read
Data Flow in HOLAPData Flow in HOLAP
When deciding which When deciding which technology to go for, consider:technology to go for, consider:
1) Performance: 1) Performance:
• How fast will the system appear to the end-user? How fast will the system appear to the end-user?
• MDD server vendors believe this is a key point in their favor. MDD server vendors believe this is a key point in their favor.
2) Data volume and scalability: 2) Data volume and scalability:
• While MDD servers can handle up to 50GB of storage, RDBMS While MDD servers can handle up to 50GB of storage, RDBMS servers can handle hundreds of gigabytes and terabytes. servers can handle hundreds of gigabytes and terabytes.
An experiment with Relational and the Multidimensional models on a data set
The analysis of the author’s example illustrates the following differences between the best Relational alternative and the Multidimensional approach.
* This may include the calculation of many other derived data without any additional I/O.
Reference: http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf
relationrelationalal
Multi-Multi-dimensiondimensionalal
ImprovemeImprovementnt
Disk space requirementDisk space requirement
(Gigabytes)(Gigabytes)1717 1010 1.71.7
Retrieve the corporate Retrieve the corporate measuresmeasures
Actual Vs Budget, by month Actual Vs Budget, by month (I/O’s)(I/O’s)
240240 11 240240
Calculation of Variance Calculation of Variance Budget/Actual for the whole Budget/Actual for the whole database (I/O time in hours)database (I/O time in hours)
237237 2*2* 110*110*
What-if analysisWhat-if analysisIFIF
A. You require write access A. You require write access B. Your data is under 50 GBB. Your data is under 50 GBC. Your timetable to implement is 60-90 daysC. Your timetable to implement is 60-90 daysD. Lowest level already aggregatedD. Lowest level already aggregatedE. Data access on aggregated levelE. Data access on aggregated levelF. You’re developing a general-purpose application for inventory movement or assets F. You’re developing a general-purpose application for inventory movement or assets managementmanagement
THENTHENConsider an Consider an MDD /MOLAP MDD /MOLAP solution for your data mart solution for your data mart
IFIF
A. Your data is over 100 GBA. Your data is over 100 GBB. You have a "read-only" requirementB. You have a "read-only" requirementC. Historical data at the lowest level of granularityC. Historical data at the lowest level of granularityD. Detailed access, long-running queriesD. Detailed access, long-running queriesE. Data assigned to lowest level elementsE. Data assigned to lowest level elements
THENTHENConsider an Consider an RDBMS/ROLAPRDBMS/ROLAP solution for your data mart. solution for your data mart.
IFIFA. OLAP on aggregated and detailed dataA. OLAP on aggregated and detailed dataB. Different user groupsB. Different user groupsC. Ease of use and detailed dataC. Ease of use and detailed data
THENTHENConsider an Consider an HOLAP HOLAP for your data martfor your data mart
ExamplesExamples
• ROLAPROLAP– Telecommunication startup: call data records Telecommunication startup: call data records
(CDRs) (CDRs) – ECommerce SiteECommerce Site– Credit Card CompanyCredit Card Company
• MOLAPMOLAP– Analysis and budgeting in a financial departmentAnalysis and budgeting in a financial department– Sales analysisSales analysis
• HOLAPHOLAP– Sales department of a multi-national companySales department of a multi-national company– Banks and Financial Service ProvidersBanks and Financial Service Providers
Tools availableTools available• ROLAP:ROLAP:
– ORACLE 8iORACLE 8i– ORACLE Reports; ORACLE DiscovererORACLE Reports; ORACLE Discoverer– ORACLE Warehouse BuilderORACLE Warehouse Builder– Arbors Software’s EssbaseArbors Software’s Essbase
• MOLAP:MOLAP:– ORACLE Express ServerORACLE Express Server– ORACLE Express Clients (C/S and Web)ORACLE Express Clients (C/S and Web)– MicroStrategy’s DSS serverMicroStrategy’s DSS server– Platinum Technologies’ Plantinum InfoBeaconPlatinum Technologies’ Plantinum InfoBeacon
• HOLAP:HOLAP:– ORACLE 8iORACLE 8i– ORACLE Express ServeORACLE Express Serve– ORACLE Relational Access ManagerORACLE Relational Access Manager– ORACLE Express Clients (C/S and Web)ORACLE Express Clients (C/S and Web)
ConclusionConclusion• ROLAP: RDBMS -> star/snowflake schemaROLAP: RDBMS -> star/snowflake schema
• MOLAP: MDD -> Cube structuresMOLAP: MDD -> Cube structures
• ROLAP or MOLAP: Data models used play major role in performance ROLAP or MOLAP: Data models used play major role in performance differencesdifferences
• MOLAP: for summarized and relatively lesser volumes of data (10-MOLAP: for summarized and relatively lesser volumes of data (10-50GB)50GB)
• ROLAP: for detailed and larger volumes of dataROLAP: for detailed and larger volumes of data
• Both storage methods have strengths and weaknessesBoth storage methods have strengths and weaknesses
• The choice is requirement specific, though currently data The choice is requirement specific, though currently data warehouses are predominantly built using RDBMSs/ROLAP.warehouses are predominantly built using RDBMSs/ROLAP.
ReferencesReferences• http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf
– OLAP, Relational, and Multidimensional Database Systems, by George Colliat, Arbor Software Corporation
• http://www.donmeyer.com/art3.html– Data warehousing ServicesData warehousing Services, , Data Mining & Analysis, LLCData Mining & Analysis, LLC
• http://www.cs.man.ac.uk/~franconi/teaching/2001/CS636/CS636-olap.ppt– Data Warehouse Models and OLAP OperationsData Warehouse Models and OLAP Operations, by Enrico , by Enrico
FranconiFranconi
• http://www.promatis.com/mediacenter/papers- ROLAP, MOLAP, HOLAP: How to determine which to - ROLAP, MOLAP, HOLAP: How to determine which to technology is appropriate, by Holger Frietch, PROMATIS technology is appropriate, by Holger Frietch, PROMATIS CorporationCorporation