Are OLAP cubes "large monsters" that deliver quick data retrieval at the expense of long load times? This presentation shows one way to kill this myth.
Case study:
Quasi real-time OLAP cubes
by Ziemowit Jankowski, Database Architect
OLAP Cubes - what are they?
• Used to quickly analyze and retrieve data from different perspectives
• Numeric data
• Structured data:
  o can be represented as numeric values (or sets thereof) accessed by a composite key
  o each part of the composite key belongs to a well-defined set of values
• Facts = numeric values
• Dimensions = parts of the composite key
• Source = usually a star or snowflake schema in a relational DB (other sources possible)
OLAP Cubes - data sources
(Diagrams: star schema and snowflake schema)
OLAP Facts and dimensions
• Every "cell" in an OLAP cube contains numeric data, a.k.a. "measures".
• Every "cell" may contain more than one measure, e.g. forecast and outcome.
• Every "cell" has a unique combination of dimension values.
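As a language-neutral illustration of this cell model (not specific to MS OLAP), a cube can be thought of as a mapping from a composite dimension key to a set of named measures. All dimension and measure names below are invented:

```python
# A toy cube: composite key (year, region, product) -> named measures.
cube = {
    (2009, "North", "Widget"): {"outcome": 120.0, "forecast": 110.0},
    (2009, "South", "Widget"): {"outcome":  80.0, "forecast":  95.0},
    (2010, "North", "Widget"): {"outcome": 130.0, "forecast": 125.0},
}

# Every cell is addressed by a UNIQUE combination of dimension values,
# and a cell may hold more than one measure (here: outcome and forecast).
cell = cube[(2009, "North", "Widget")]
print(cell["outcome"], cell["forecast"])
```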
OLAP Cubes - operations
• Slice = select the values corresponding to ONE value on one or more dimensions
• Dice = select the values corresponding to one slice, or to a number of consecutive slices, on more than 2 dimensions of the cube
OLAP Cubes - operations (cont'd)
• Drill down/up = choose lower/higher level details. Used in context of hierarchical dimensions.
• Pivot = rotate the orientation of the data for reporting purposes
• Roll-up = aggregate data to a higher level of a dimension hierarchy, e.g. daily figures summed into monthly totals
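The slice and dice operations can be sketched on a toy in-memory cube keyed by (year, region, product); all names and numbers are invented for illustration:

```python
cube = {
    (2009, "North", "Widget"): 120.0,
    (2009, "South", "Widget"):  80.0,
    (2010, "North", "Widget"): 130.0,
    (2010, "South", "Gadget"):  60.0,
}

def slice_cube(cube, dim_index, value):
    """Slice: keep only cells with ONE fixed value on a given dimension."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice_cube(cube, ranges):
    """Dice: keep cells whose values fall inside an allowed set on
    several dimensions at once. `ranges` maps dimension index -> allowed values."""
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in ranges.items())}

year_2009 = slice_cube(cube, 0, 2009)                           # two cells
north_widgets = dice_cube(cube, {1: {"North"}, 2: {"Widget"}})  # two cells
```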
OLAP Cubes - refresh methods
• Incremental:
  o possible when cubes grow "outwards", i.e. no "scattered" changes in data
  o only delta data need to be read
  o refresh may be fast if the delta is small
• Full:
  o possible for all cubes, even when changes are "scattered" all over the data
  o all data need to be re-read with every refresh
  o refresh may take a long time (hours)
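The difference between the two refresh methods can be sketched in a few lines of Python; the data and names are invented, with the source keyed by month:

```python
# "source" is the relational fact data; "cube" is the previously processed state.
source = {"2009-01": 10, "2009-02": 12, "2009-03": 11}
cube = dict(source)

def full_refresh(source):
    # Full: re-read ALL source data on every refresh; may take hours at scale.
    return dict(source)

def incremental_refresh(cube, delta):
    # Incremental: read only the delta; valid only when the cube grows
    # "outwards", i.e. previously loaded cells never change.
    merged = dict(cube)
    merged.update(delta)
    return merged

source["2009-04"] = 14                      # new data arrives at the edge
cube = incremental_refresh(cube, {"2009-04": 14})
assert cube == full_refresh(source)         # same result, far less reading
```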
The situation on hand
• Business operating on a 24*6 basis (Sun-Fri)
• Events from production systems are aggregated into flows and production units
• Production figures may be adjusted manually long after the production date
• Daily production figures are the basis for daily forecasts, with the simplified formula:
  forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm
• Adjustments in production figures will alter forecast figures
• Outcome and forecast should be stored in MS OLAP cubes, as the software architecture demands
• The system should simplify comparisons between forecast and outcome figures
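A worked example of the simplified forecast formula above, with invented numbers, also shows why a late manual adjustment of production figures ripples into the forecast:

```python
production = {2008: 1000.0}   # last year's production outcome (invented)
trend = {2009: 1.05}          # expected trend factor (invented)
manual_adjustment = 20.0      # manualFcastAdjustm (invented)

def forecast(year):
    # forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm
    return round(production[year - 1] * trend[year] + manual_adjustment, 2)

print(forecast(2009))         # 1070.0

# A later manual adjustment of the 2008 production figure
# immediately alters the 2009 forecast:
production[2008] = 1100.0
print(forecast(2009))         # 1175.0
```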
Software
• Source of data:
  o relational database
  o Oracle 10g database
  o extensive use of PL/SQL in the database
• Destination of data:
  o OLAP cubes - MS SQL Server Analysis Services (versions 2005 and 2008)
• Other software:
  o MS SQL Server database
QUESTION
Can we get almost real-time reports from MS OLAP cubes?
ANSWER
YES! The answer lies in "cube partitioning".
Cube partitioning - the basics
• Cube partitions may be updated independently
• Cube partitions must not overlap (otherwise duplicate values may occur)
• Time is a good dimension to partition on
MS OLAP cube partitioning - details
• Every cube partition has its own query to define the data set fetched from the data source
• The SQL statements define the non-overlapping data sets
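As an illustration of how such non-overlapping partition queries can be defined, here is a Python sketch that generates one query per partition; the table and column names (`fact_production`, `prod_date`) are invented, and the real definitions live in the SSAS partition configuration:

```python
from datetime import date, timedelta

def partition_queries(today):
    """Build illustrative SQL text for three time-based partitions.

    Half-open ranges [start, end) guarantee that the partitions never
    overlap, so no fact row can be counted twice."""
    recent_start = today - timedelta(days=7)
    template = "SELECT * FROM fact_production WHERE {pred}"
    return {
        "history": template.format(
            pred=f"prod_date <  DATE '{recent_start}'"),
        "recent": template.format(
            pred=f"prod_date >= DATE '{recent_start}' AND prod_date < DATE '{today}'"),
        "now": template.format(
            pred=f"prod_date >= DATE '{today}'"),
    }

queries = partition_queries(date(2009, 3, 14))
```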
How to partition? - theory
• Partitions with different lengths and different update frequencies:
  o current data = very small partition, very short update times, updated often
  o "not very current" data = a bit larger partition, longer update times, updated less often
  o historical data = large partition, long update times, updated seldom
• Operation 24x6 delivers the "seldom" window
How to partition? - theory cont'd
• One cube for both forecast and outcome
Solution - approach one
Decisions:
• Cubes partitioned on date boundaries
• MOLAP cubes (for better query performance)
• Use SSIS to populate cubes
  o dimensions populated by incremental processing
  o facts populated by full processing
  o jobs for historical data must be run after midnight to compensate for the date change

Actions:
• Cubes built
• SSIS packages deployed inside SQL Server (not on the filesystem)
• SSIS packages set up as scheduled database jobs
Did it work?
No!
Malfunctions:
• Simultaneous updates of cube partitions could lead to deadlocks
• Deadlocks left cube partitions in an unprocessed state

Amendment:
• Cube partitions must not be updated simultaneously
Solution - approach two
Decisions:
• Cube processing must be ONE partition at a time
• Scheduling done by an SSIS "super package":
  o a SQL Server table contains approximate frequencies and package names
  o the "super package" executes SSIS packages as indicated by the table

Actions:
• Scheduling table created
• "Super package" created; it is self-modifying
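The core idea of the "super package" can be sketched in Python: a scheduling table lists packages with an approximate run frequency, and the loop executes at most ONE due package at a time, so partitions are never processed concurrently. All names and fields below are invented for illustration:

```python
schedule = [
    # package name,                 frequency (s),  last run (epoch s)
    {"name": "process_now_partition",     "freq": 120,    "last_run": 0},
    {"name": "process_recent_partition",  "freq": 3600,   "last_run": 0},
    {"name": "process_history_partition", "freq": 604800, "last_run": 0},
]

def next_due(schedule, now):
    """Pick at most ONE package that is due, the most overdue first."""
    due = [p for p in schedule if now - p["last_run"] >= p["freq"]]
    return max(due, key=lambda p: now - p["last_run"] - p["freq"], default=None)

def run_one(schedule, now):
    pkg = next_due(schedule, now)
    if pkg is not None:
        # Here the real system would execute the SSIS package.
        pkg["last_run"] = now
    return pkg

# All three are due; the 2-minute package is the most overdue by this metric.
ran = run_one(schedule, now=604800)
```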
Did it work?
Not really!
Malfunctions:
• Historical data had to be updated after midnight, and real-time updates for the "Now" partition were postponed. This was done to avoid "gaps" in outcome data and "overlaps" in forecast data.
• Real-time updates ended soon after midnight and were resumed a few hours later. (That was NOT acceptable.)

Amendment:
• Re-think!
Solution - approach three
Decisions:
• Take advantage of the 6*24 cycle (as opposed to 7*24)
• Switch dates on Saturdays only
  o the "Now" partition had to stretch from Saturday to Saturday
  o all other partitions had to stretch from one Saturday to another Saturday
• Re-process all time-consuming partitions on Saturday, after the switch of date
Solution - approach three cont'd
Actions:
• Create logic in the Oracle database to do date calculations "modulo week", i.e. based on Saturday. The logic is implemented as a function.
• Rewrite the SQL statements for the cube partitions so that they use this Oracle function instead of the current date +/- a given number of days.
• Reschedule the time-consuming updates so they run every 7th day.
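The "modulo week" date logic can be illustrated in Python (the original lives as an Oracle PL/SQL function, whose exact code is not shown in the deck): map any date to the most recent Saturday, so partition boundaries always fall on Saturdays and only move once a week.

```python
from datetime import date, timedelta

def last_saturday(d):
    """Most recent Saturday on or before d (Python: Monday=0 ... Saturday=5)."""
    days_since_saturday = (d.weekday() - 5) % 7
    return d - timedelta(days=days_since_saturday)

def now_partition_bounds(d):
    """The "Now" partition stretches from Saturday to Saturday."""
    start = last_saturday(d)
    return start, start + timedelta(days=7)

start, end = now_partition_bounds(date(2009, 3, 11))  # a Wednesday
print(start, end)  # prints 2009-03-07 2009-03-14
```

Because the boundaries only change on Saturdays, the partition queries stay stable all week, and re-processing the large partitions can wait for the weekly maintenance window.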
Did it work?
Yes!
Malfunctions:
• None, really.
Lessons learned
• It is possible to build real-time OLAP cubes in MS technology
• It is possible to make the partitions self-maintaining in terms of partition boundaries
• The concept needs careful engineering, as there are pitfalls along the way.
Omitted details
Some details have been omitted:
• the quasi real-time updates are scheduled to occur every 2nd or 3rd minute
• scheduling is not exact: the super package keeps track of what is to be run and when, and executes SSIS packages based on their "scheduled-to-run" state, their priority and a few other criteria
• the source of data is not a proper star schema; rather, it is an emulation of facts and dimensions by means of data tables and views in Oracle