Upload
allayna
View
65
Download
2
Tags:
Embed Size (px)
DESCRIPTION
Building Scalable OLAP Applications with Mondrian and MySQL. Julian Hyde: Chief Architect, OLAP, at Pentaho and Mondrian Project Founder Tuesday, April 24th 2007. Agenda. Pentaho Introduction What is OLAP? Key features and architecture Getting started Schemas & queries - PowerPoint PPT Presentation
Citation preview
Building Scalable OLAP Applications with Mondrian and MySQL
Julian Hyde: Chief Architect, OLAP, at Pentahoand Mondrian Project Founder
Tuesday, April 24th 2007
Agenda
Pentaho Introduction
What is OLAP?
Key features and architecture
Getting started
Schemas & queries
Tuning & scalability
Active Data Warehousing
Case Studies
Business Intelligence suite
Q & A
Pentaho Introduction
World’s most popular enterprise open source BI Suite2 million lifetime downloads, averaging 100K / month
Founded in 2004: Pioneer in professional open source BI
Management - proven BI and open source veterans from Business Objects, Cognos, Hyperion, JBoss, Oracle, Red Hat, SAS
Board of Directors – deep expertise and proven success in open sourceLarry Augustin - founder, VA Software, helped coin the phrase “open source”
New Enterprise Associates – investors in SugarCRM, Xensource, others
Index Ventures – investors in MySQL, Zend, others
Widely recognized as the leader in open source BIDistributed worldwide by Red Hat via the Red Hat Exchange
Embedded in next release of OpenOffice (40 million users worldwide)
What is OLAP?
View data “dimensionally”i.e. Sales by region, by channel, by time period
Navigate and exploreAd Hoc analysis“Drill-down” from year to quarterPivotSelect specific members for analysis
Interact with high
performanceTechnology optimized for rapid interactive response
A brief history of OLAP
1970
1980
1990
2000
2010
APL
Express
Mondrian
Pentaho
Essbase
Business Objects
Cognos PowerPlay
Mainframe OLAPComshare
MicroStrategy
JPivot
Desktop OLAP
Open-source OLAP
Open-source BI suites OpenIBEE
RDBMS support for DW
Sybase IQ
Informix MetaCube
Oracle partitions,bitmap indexes
JRubikPalo
Microsoft OLAP Services
Codd’s “12Rules of OLAP”
Spreadsheets
MySQL
Mondrian features and architecture
Key Features
On-Line Analytical Processing (OLAP)
cubesautomated aggregationspeed-of-thought response times
Open Architecture100% JavaJ2EESupports any JDBC data sourceMDX and XML/A
Analysis ViewersEnables ad-hoc, interactive data explorationAbility to slice-and-dice, drill-down, and pivot Provides insights into problems or successes
How Mondrian Extends MySQL for OLAP Applications
MySQL Provides
Data storage
SQL query execution
Heavy-duty sorting, correlation,
aggregation
Integration point for all BI tools
Mondrian Provides
Dimensional view of data
MDX parsing
SQL generation
Caching
Higher-level calculations
Aggregate awareness
Cube Schema
XML
Cube Schema
XML
Cube Schema
XML
Open Architecture
Open Standards (Java, XML,
MDX, XML/A, SQL)
Cross Platform (Windows &
Unix/Linux)
J2EE ArchitectureServer ClusteringFault Tolerance
Data SourcesJDBCJNDI
J2EE Application Server
Mondrian
Web Server
JDBC
RDBMS
cube cube cube
File or RDBMS Repository
RDBMS
JDBC JDBC
JPivot servlet
Viewers
JPivot servlet XML/A servlet
Microsoft Excel(via Spreadsheet
Services)
Getting started
Getting started with Mondrian
1. Download mondrian from http://mondrian.pentaho.org and unzip
2. Copy mondrian.war to TOMCAT_HOME/webapps.
3. Create a database:$ mysqladmin create foodmart$ mysqlmysql> grant all privileges on *.* to 'foodmart'@'localhost' identified by 'foodmart';Query OK, 0 rows affected (0.00 sec)
mysql> quitBye
4. Install demo dataset (300,000 rows):$ java -cp
"/mondrian/lib/mondrian.jar:/mondrian/lib/log4j-1.2.9.jar:/mondrian/lib/eigenbase-xom.jar:/mondrian/lib/eigenbase-resgen.jar:/mondrian/lib/eigenbase-properties.jar:/usr/local/mysql/mysql-connector-java-3.1.12-bin.jar" mondrian.test.loader.MondrianFoodMartLoader -verbose -tables -data -indexes -jdbcDrivers=com.mysql.jdbc.Driver -inputFile=/mondrian/demo/FoodMartCreateData.sql -outputJdbcURL= "jdbc:mysql://localhost/foodmart?user=foodmart&password=foodmart“
5. Start tomcat, and hit http://localhost:8080/mondrian
demonstration
Schemas and queries
A Mondrian schema consists of…
A dimensional model (logical)Cubes & virtual cubesShared & private dimensionsCalculated measures in cube and in query languageParent-child hierarchies
… mapped onto a star/snowflake
schema (physical)Fact tableDimension tablesJoined by foreign key relationships
Writing a Mondrian Schema
Regular cubes, dimensions,
hierarchies
Shared dimensions
Virtual cubes
Parent-child hierarchies
Custom readers
Access-control
<!-- Shared dimensions -->
<Dimension name="Region">
<Hierarchy hasAll="true“ allMemberName="All Regions">
<Table name="QUADRANT_ACTUALS"/>
<Level name="Region" column="REGION“ uniqueMembers="true"/>
</Hierarchy>
</Dimension>
<Dimension name="Department">
<Hierarchy hasAll="true“ allMemberName="All Departments">
<Table name="QUADRANT_ACTUALS"/>
<Level name="Department“ column="DEPARTMENT“ uniqueMembers="true"/>
</Hierarchy>
</Dimension>
(Refer to http://mondrian.pentaho.org/documentation/schema.php )
Tools
Schema
Workbench
Pentaho cube
designer
cmdrunner
MDX – Multi-Dimensional Expressions
A language for multidimensional queries
Plays the same role in Mondrian’s API as SQL does in JDBC
SQL-like syntax
… but un-SQL-like semantics
SELECT {[Measures].[Unit Sales]} ON COLUMNS,
{[Store].[USA], [Store].[USA].[CA]} ON ROWS
FROM [Sales]
WHERE [Time].[1997].[Q1]
(Refer to http://mondrian.pentaho.org/documentation/mdx.php )
Scalability & Tuning
Tuning is a Process
Some techniques:
Test performance on a realistic data set!
Use tracing to identify long-running SQL statementsmondrian.trace.level=2
Tune SQL queries
Run the same query several times, to examine cache
effectiveness
When Mondrian starts, prime the cache by running queries
Get to know Mondrian’s system properties
Tuning MySQL for Mondrian
Fact table:Index foreign keysTry indexing multiple combinations of foreign keysUse MyISAM or MERGE (also known as MRG_MyISAM)Partitioned tables
Dimension tables:Index primary and foreign keysUse MyISAM for large dimension tables, MEMORY for smaller dimension tables
Create aggregate tables:Contain pre-computed summariesPrevent full table scan followed by massive GROUP BY
Parent-child hierarchiesCreate closure tables
Scaling to large datasets
Tune MySQLPartitions allow parallelism
Aggregate tables allow you to do the
hard work to load-time, not runtime
Increase cache size
Scaling to large numbers of concurrent users
Tune MySQL
Increase JVM memory
Make sure that
connections are using
the same cache
Multiple web servers
Multiple mondrian
instances
Multiple DBMS
instances with read-
only copies of the
data
J2EE Application Server
Mondrian
cube
RDBMS
JDBC
Viewers
JPivot servlet
J2EE Application Server
Mondrian
Web Server
cube
RDBMS
JDBC
JPivot servlet
Viewers
XML/A servlet
Web Server
JPivot servlet
J2EE Application Server
Mondrian
Web Server
cube
JDBC
JPivot servlet
XML/A servlet
Web Server
JPivot servlet
J2EE Application Server
Mondrian
Web Server
cube
RDBMS
JDBC
JPivot servlet
XML/A servlet
Web Server
JPivot servlet
J2EE Application Server
Mondrian
Web Server
cube
JDBC
JPivot servlet
XML/A servlet
Web Server
JPivot servlet
J2EE Application Server
Mondrian
Web Server
cube
RDBMS
JDBC
JPivot servlet
XML/A servlet
Web Server
JPivot servlet
RDBMS
Viewers
Active Data Warehousing
Active Data Warehousing
An Active Data Warehouse:
Extends the traditional DW to support operational processing
Provides a single view of the business
Is updated in near real-time
Cost/benefit:
High cost (technical challenges)
High reward (time is money)
Technical challenges of Active Data Warehousing
DBMS has mixed work-loadShort, fast queries to query specific recordsLonger queries to look at changing patterns
Emphasis on recent dataBackground DML to keep database up to date
Continuous, incremental ETLContinuous or near real-timeIncrementalConsider message-driven ETL
OLAP cacheOLAP servers use cache and/or MOLAP database for performanceCache consistency
Mondrian and Active Data Warehouse
“ROLAP with caching”No MOLAP data store to maintain
Disable aggregate tablesUnless your ETL process can maintain these incrementally
Either turn off cachemondrian.rolap.star.disableCaching=true
“Pure ROLAP” modeResults are cached for the lifetime of a single MDX statement
Or use new CacheControl API to selectively flush mondrian’s
cache
Cache control
In an Active Warehouse, the most recent data are modified much
more often than historic data
Cache control API allows the application to selectively flush the
cache
year quarter month day unit_sales
2007 Q1 1 1 1500
2007 Q1 1 2 1700
2007 Q1 1 3 1250
2007 Q1 1 4 1800
…
2007 Q2 4 22 1378
2007 Q2 4 23 1920
2007 Q2 4 24 3502007 Q2 4 24 4502007 Q2 4 24 525
(Refer to http://mondrian.pentaho.org/documentation/cache_control.php )
Cache control: A simple example
Let’s run a simple MDX query:
SELECT
{[Time].[1997],
[Time].[1997].[Q1],
[Time].[1997].[Q2]} ON COLUMNS,
{[Customers].[USA],
[Customers].[USA].[OR],
[Customers].[USA].[WA]} ON ROWS
FROM [Sales]
Cache contents after running query
year nation unit_sales
1997 USA 266,773
year nation state unit_sales
1997 USA OR 67,659
1997 USA WA 124,366
year quarter nation unit_sales
1997 Q1 USA 66,291
1997 Q2 USA 62,610
year quarter nation state unit_sales
1997 Q1 USA OR 19,287
1997 Q1 USA WA 30,114
1997 Q2 USA OR 15,079
1997 Q2 USA WA 29,479
year=1997, nation=USA,
measure=unit_sales
year=1997, nation=USA, state={OR, WA},
measure=unit_sales
year=1997, quarter=*, nation=USA,
state={OR, WA}, measure=unit_sales
year=1997, quarter=any, nation=USA,
measure=unit_sales
Flushing part of a segment
year=1997, quarter=*, nation=USA,
state={OR, WA}, measure=unit_sales
year quarter nation state unit_sales
1997 Q1 USA OR 19,287
1997 Q1 USA WA 30,114
1997 Q2 USA OR 15,079
1997 Q2 USA WA 29,479
OR WA
Q
1
19,28
7
30,11
4
Q
2
15,07
9
29,47
9
year=1997, quarter=*, nation=USA,
state={OR, WA}, measure=unit_sales;
exclude: state=WA and quarter=Q2
CA WA
Q
1
16,89
0
30,11
4
Q
2
18,05
2
30,07
9
Cachemanager
year=1997, quarter=*, nation=USA,
state={CA, WA}, measure=unit_sales
Code using CacheControl to flush part of a segment
Connection connection;
CacheControl cacheControl = connection.getCacheControl(null);
Cube salesCube = connection.getSchema().lookupCube("Sales", true);
SchemaReader schemaReader = salesCube.getSchemaReader(null);
// Create region containing [Customer].[USA].[OR].
Member memberOregon = schemaReader.getMemberByUniqueName(
new String[]{“Customer", “USA", “OR"}, true);
CacheControl.CellRegion regionOregon =
cacheControl.createMemberRegion(memberOregon, true);
// Create region containing [Customer].[USA].[OR].
Member memberTimeQ2 = schemaReader.getMemberByUniqueName(
new String[]{“Time", “1997", “Q2"}, true);
CacheControl.CellRegion regionTimeQ2 =
cacheControl.createMemberRegion(memberTimeQ2, true);
// Create region containing ([Customer].[USA].[OR], [Time].[1997].[Q2])
CacheControl.CellRegion regionOregonQ2 =
cacheControl.createCrossjoinRegion(regionOregon, regionTimeQ2);
cacheControl.flush(regionOregonQ2);
Case studies
Case Study: Frontier AirlinesFrontier Airlines
Key Challenges
Understanding and optimizing fares to ensure
Maximum occupancy (no empty seats)
Maximum profitability (revenue per seat)
Pentaho Solution
Pentaho Analysis (Mondrian)
Chose Open Source RDBMS and Mondrian over Oracle
500 GB of data, 6 server cluster
Results
Comprehensive, integrated analysis to set strategic pricing
Improved per-seat profitability (amount not disclosed)
Why Pentaho
Rich analytical and MDX functionality
Cost of ownership
“The competition is intense in the airline industry and Frontier is committed to staying ahead of the curve by leveraging technology that will help us offer the best prices and the best flight experience…. [the application] fits right in with our philosophy of providing world-class performance at a low price.”
Case Study: U.S.-based, Public Software Company OEM Example
Key Requirements
Embedding analysis within a pricing planning application
Incumbent Vendor
MicroStrategy
Decision Criteria
Embeddability
Extensibility
Cost of ownership
Best Practices
Evaluate replacement costs holistically
Treat migrations as an opportunity to improve a deployment, not just move it
Understand and prioritize requirements – functionality, usability, architecture, etc.
“The old solution (MicroStrategy) was built to be a standalone application. It expected to own the browser session, to have separate security, and its own user interface standards. We were able to make Mondrian a seamless part of the application, and it was much easier to do so.”
The big picture
Business Intelligence Suite
Mondrian OLAP
Analysis tools:Pivot tableChartingDashboards
ETL (extract/transform/load)
Integration with operational reporting
Integration with data mining
Actions on operational data
Design/tuning tools
Pentaho Open Source BI Offerings
All available in a Free Open Source license
A Sample of Joint MySQL-Pentaho Users
“Pentaho provided a robust, open source platform for our sales reporting application, and the ongoing support we needed. The experts at OpenBI provided outstanding services and training, and allowed us to deploy and start generating results very quickly.”
“We selected Pentaho for its ease-of-use. Pentaho addressed many of our requirements -- from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted.”
Next Steps and Resources
Contact InformationJulian Hyde, Chief Architect, [email protected]
More information http://www.pentaho.org and
http://mondrian.pentaho.org
Pentaho Community Forum http://community.pentaho.orgGo to Developer ZoneDiscussions
Pentaho BI Platform including Mondrian
http://www.pentaho.org/download/latest
Mondrian OLAP Library onlyhttp://sourceforge.net/project/showfiles.php?group_id=35302
Birds of a Feather session:MySQL Data Warehousing and BI
Speakers:
James Dixon, Senior Architect and Chief Technology (a.k.a. "Chief Geek"),
Pentaho Corporation
Roland Bouman, Certification Developer, MySQL A.B.
Matt Casters, Data Integration Architect and Kettle Project Founder,
Pentaho
Julian Hyde, OLAP Architect and Mondrian Project Founder, Pentaho
Brian Miezejewski, Principal Consultant, MySQL inc.
Track: Data Warehousing and Business Intelligence
Date: TONIGHT!!
Time: 7:30pm - 9:00pm
Location: Camino Real
Thank you for attending!