Tiber Solutions Understanding the Current & Future Landscape of BI and Data Storage
Jim Hadley
Tiber Solutions
• Founded in 2005 to provide Business Intelligence / Data Warehousing / Big Data thought leadership to corporations and government agencies.
• Deeply skilled in all facets of BI/DW/Big Data solutions – star schema, ETL, BI, data visualization, data analytics, data architecture, information architecture, BI agile development methodology, and MDM/governance.
• Provide hands-on architecture, implementation, and coaching expertise within IT organizations from the CIO to the developers.
• Partner with business executives to co-invent optimal BI/DW applications to dramatically improve their business.
Tiber Solutions
Customers
• Amethyst Technologies
• Amtrak
• Census Bureau
• Cognosante
• Defense Logistics Agency
• Department of Health and Human Services
• Department of the Treasury
• Fannie Mae
• Federal Deposit Insurance Corporation
• Frontpoint Security
• Freddie Mac
• Graduate Management Admission Council
• Internal Revenue Service
• Military Health System
• National Institutes of Health
• Occupational Safety and Health Administration
• Office of the Comptroller of the Currency
• SAP Business Objects
• Securities and Exchange Commission
Agenda
• Business Intelligence Landscape - Concepts/Architectures
- BI Tool vs. Data Visualization Tool Comparison
• Data Storage Landscape - Concepts/Architectures
- Product Group Comparison
Business Intelligence Landscape
Data Retrieval
Success Factors:
• Retrieval Speed
• Ease of Access
Data Presentation
Success Factors:
• Visualization Richness and Diversity
• Delivery Options (e.g., Mobile, Push)
Business Intelligence Landscape
Characteristics | Business Intelligence Tools | Data Visualization Tools
Product Examples | SAP Web Intelligence, Cognos, MicroStrategy | Tableau, Qliktech Qlikview, TIBCO Spotfire, Microsoft BI Stack
Strengths | Data Retrieval; Dynamic, Complex Ad Hoc Queries | Data Presentation; Rich and Diverse Visualizations
Limitations | Limited Visualizations | Limited Ad Hoc Capabilities
Primary Use | Ad Hoc Query; Canned Reports | Data Visualization; Data Exploration
Ad Hoc Query Capabilities | Yes | No (must be in cube)
Leverages Semantic Layer For Data Retrieval | Yes | Partially
Queries Data In Database Real-Time | Yes | No
Requires Persisting Data Set In Cubes or Files | No | Yes
Requires Developer Skills | Semantic Layer (Universe) – Yes; Reports – Some | Cubes – Yes; Reports/Dashboards – No
SAP Products | SAP Web Intelligence | SAP Dashboards – Requires Developer; SAP Lumira – Not nearly as mature; SAP Explorer – Limited visualizations
Business Intelligence Tool Architecture
Assumptions:
• Data warehouse/data mart exists in which ETL processing has harmonized and combined data from multiple data sources.
Semantic Layer (Universe)
Business Layer (Business Terms)
• Folders – Used to organize objects into logical groups (e.g., Customer Dim, Sales Measures)
• Objects – Business terms are used to represent database columns (e.g., CUST_NM) or SQL formulas (e.g., SUM(REVENUE_AMT)-SUM(COST_AMT))
Technical Layer (SQL)
• Connections – Database connection parameters
• Tables/Columns – Fact and Dimension tables and columns
• Joins – Predefined joins between fact tables and dimension tables
• Contexts – A group of joins. Each fact table should have a context
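As a rough illustration of how a semantic layer turns selected business terms into SQL, here is a minimal Python sketch. The BusinessObject class, the UNIVERSE dictionary, the table names d_customer and f_sales, and the join key customer_key are all hypothetical; only CUST_NM and the SUM(REVENUE_AMT)-SUM(COST_AMT) formula come from the slide. Real BI tools do far more (contexts, prompts, security), so treat this as a sketch of the idea only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessObject:
    """A business term mapped to a column or SQL formula (hypothetical structure)."""
    name: str         # business term shown to the end user
    sql: str          # column reference or SQL formula behind the term
    is_measure: bool  # measures are aggregated; dimensions are grouped


# Business layer: hypothetical universe content (table names are assumptions).
UNIVERSE = {
    "Customer Name": BusinessObject("Customer Name", "d_customer.cust_nm", False),
    "Gross Margin": BusinessObject(
        "Gross Margin", "SUM(f_sales.revenue_amt) - SUM(f_sales.cost_amt)", True
    ),
}

# Technical layer: tables and one predefined join (a single-context universe).
TABLES = ["d_customer", "f_sales"]
JOINS = ["f_sales.customer_key = d_customer.customer_key"]


def build_query(selected_terms):
    """Generate SQL from the business terms an end user selected."""
    objs = [UNIVERSE[term] for term in selected_terms]
    select_list = ", ".join(f'{o.sql} AS "{o.name}"' for o in objs)
    dims = [o.sql for o in objs if not o.is_measure]
    sql = f"SELECT {select_list} FROM {', '.join(TABLES)} WHERE {' AND '.join(JOINS)}"
    # Group by the dimensions only when at least one measure is aggregated.
    if dims and any(o.is_measure for o in objs):
        sql += " GROUP BY " + ", ".join(dims)
    return sql


query = build_query(["Customer Name", "Gross Margin"])
print(query)
```

The end user only ever sees "Customer Name" and "Gross Margin"; the column names, joins, and GROUP BY clause stay inside the universe definition.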
Business Intelligence Tool Architecture
Objects Selected by End User
• Dims - Fiscal Year, Fiscal Quarter, Product Group
• Measures - Net Sales Amount, Forecast Amount
Related Tables and Columns
• Fiscal Year – d_date.fiscal_yr
• Fiscal Quarter – d_date.fiscal_qtr
• Product Group – d_product.product_grp
• Net Sales Amount – f_sales.net_sales_amt
• Forecast Amount – f_forecast.forecast_amt
Sales Context
Sales Query:
SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.net_sales_amt)
FROM d_date d, d_product p, f_sales f
WHERE f.date_key = d.date_key AND f.product_key = p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
Forecast Context
Forecast Query:
SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.forecast_amt)
FROM d_date d, d_product p, f_forecast f
WHERE f.date_key = d.date_key AND f.product_key = p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
Full Outer Join: the two result sets are combined on the common dimensions (Fiscal Year, Fiscal Quarter, Product Group).
Assumptions:
• Fact tables are at different levels of granularity (detail).
• 1-to-N fact tables can be queried with common dimensions.
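The two-context pattern can be tried end to end with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the table and column names follow the slide, but the sample rows are invented, and the full outer join is done in Python on the shared dimension key rather than in SQL, which is only one possible way a BI tool might merge the result sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_date (date_key INTEGER, fiscal_yr INTEGER, fiscal_qtr TEXT);
CREATE TABLE d_product (product_key INTEGER, product_grp TEXT);
CREATE TABLE f_sales (date_key INTEGER, product_key INTEGER, net_sales_amt REAL);
CREATE TABLE f_forecast (date_key INTEGER, product_key INTEGER, forecast_amt REAL);
-- Invented sample data for illustration only.
INSERT INTO d_date VALUES (1, 2014, 'Q1'), (2, 2014, 'Q2');
INSERT INTO d_product VALUES (10, 'Widgets'), (20, 'Gadgets');
INSERT INTO f_sales VALUES (1, 10, 100.0), (1, 10, 50.0), (2, 10, 70.0);
INSERT INTO f_forecast VALUES (1, 10, 120.0), (1, 20, 80.0);
""")

SALES_SQL = """
SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.net_sales_amt)
FROM d_date d, d_product p, f_sales f
WHERE f.date_key = d.date_key AND f.product_key = p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
"""

FORECAST_SQL = """
SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.forecast_amt)
FROM d_date d, d_product p, f_forecast f
WHERE f.date_key = d.date_key AND f.product_key = p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
"""


def run_context(sql):
    """Run one context's query; key each row by its three dimension values."""
    return {tuple(row[:3]): row[3] for row in conn.execute(sql)}


sales = run_context(SALES_SQL)
forecast = run_context(FORECAST_SQL)

# Full outer join on the common dimensions: keep keys found in either result.
combined = {
    key: (sales.get(key), forecast.get(key))
    for key in sorted(set(sales) | set(forecast))
}

for key, (net_sales, forecast_amt) in combined.items():
    print(key, net_sales, forecast_amt)
```

Each fact table is aggregated separately before the merge, so rows in one fact table are never multiplied by rows in the other.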
Data Visualization Tool Architecture
[Diagram: OLTP – Nightly SQL Load; DW/DM – Nightly SQL Load]
Data Visualization Experience
• OLAP/File column names can be renamed to business terms.
• Easy for end users to drag/drop/visualize data using multiple visualization styles.
• Data across cubes can be combined.
Data Retrieval Observations:
• There is an assumption that the data is available, combinable, and clean (without any ETL or DQ).
• Data can be sourced from any database or file.
• Most products use OLAP cube technology to improve performance.
• OLAP cubes can be “linked” (joined) together, but they must have shared common dimensions and granularity.
• Data retrieval across OLAP cubes can be difficult.
• OLAP cubes are refreshed at night.
• Does not support dynamic ad hoc queries.
• IT is usually required to set up OLAP cubes on servers.
• OLAP cubes have practical size limits.
Data Presentation Observations:
• Data visualization products support 100s of visualization styles.
• Tools are good at “recommending” visualizations based on data result set.
• Tools are very interactive.
• Easy to “integrate” visualizations together.
• Business users can successfully use the client tools without IT – really.
Federated BI Architecture
[Diagram: Traditional BI/DW – Travel and Reservations sources, ETL, Batch (Nightly), Data Warehouse, Semantic Layer (Universe). Federated BI – Travel and Reservations sources, Federated Architecture, Real-time, Semantic Layer (Universe).]
Use Case: How many passengers made refundable reservations and never traveled in 2014?
Step | Traditional BI/EDW | Federated BI
1. Query 2014 refundable reservation rows – 25 million. | Batch | Real-time
2. Query 2014 travel rows – 15 million. | Batch | Real-time
3. Left outer join the reservation query result set with the travel query result set based on common dimension data – travel date, customer information, originating city, destination city, and flight number. | Batch | Real-time
4. Aggregate the joined result set rows counting all rows where travel information is null. | Real-time | Real-time
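The four steps of the use case can be sketched in a few lines of Python. This is only an illustration of the join-then-count logic, not of either architecture: the Trip tuple and every sample row are invented, and a real system would run this over millions of rows in a database or federation engine.

```python
from collections import namedtuple

# The common dimension data named in step 3 (field names are assumptions).
Trip = namedtuple("Trip", "travel_date customer origin destination flight_no")

# Step 1: 2014 refundable reservation rows (invented sample data).
reservations = [
    Trip("2014-03-01", "C1", "IAD", "ORD", "UA100"),
    Trip("2014-03-02", "C2", "IAD", "BOS", "UA200"),
    Trip("2014-04-10", "C3", "DCA", "ATL", "DL300"),
    Trip("2014-05-05", "C1", "ORD", "IAD", "UA101"),
]

# Step 2: 2014 travel rows (invented sample data).
travel = [
    Trip("2014-03-01", "C1", "IAD", "ORD", "UA100"),
    Trip("2014-04-10", "C3", "DCA", "ATL", "DL300"),
]

# Step 3: left outer join on the common dimensions. Every reservation is
# kept; the travel side is None when no matching travel row exists.
traveled = set(travel)
joined = [(res, res if res in traveled else None) for res in reservations]

# Step 4: count the joined rows where the travel information is null.
never_traveled = sum(1 for _, trip in joined if trip is None)
print(never_traveled)
```

In the traditional architecture steps 1 to 3 are done ahead of time by batch ETL, while the federated architecture has to do all four steps at query time.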
Data Storage Concepts/Architectures
• Columnar Data Storage
• Compression/Tokenization
• Parallelization
• In-Memory
Performance Bottleneck: Reading data off of disk.
Columnar Data Storage
[Figure: a table with columns 1 through 10; the query selects 3 of the 10 columns.]
Traditional RDBMS
SELECT col1, col2, col3
FROM table
• Data is stored row-oriented on disk.
• All columns are read off of disk, even if only a subset of columns is selected.
• Unselected columns are pruned after the disk read.
• Optimized for row inserts.
Columnar Data Storage
SELECT col1, col2, col3
FROM table
• Data is stored column-oriented on disk.
• Only selected columns are read off of disk.
• Unselected columns are not read off of disk.
• Optimized for data retrieval.
Results: Fewer columns to read = Less disk to read = Faster data retrieval speeds
Quantitative Results: about 3 times faster (illustrative: 3 of 10 equal-width columns are read)
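The "3 times faster" figure can be reproduced by counting how many values each layout has to read for the same query. The sketch below is a toy model under stated assumptions: 10 equal-width columns, 1,000 invented rows, and "values read" standing in for disk I/O. Real engines read pages and blocks, so actual ratios will differ.

```python
NUM_COLS = 10
NUM_ROWS = 1000

# Row-oriented layout: one tuple per row holding all 10 column values.
rows = [tuple(r * NUM_COLS + c for c in range(NUM_COLS)) for r in range(NUM_ROWS)]

# Column-oriented layout: one list per column.
columns = [[row[c] for row in rows] for c in range(NUM_COLS)]


def scan_row_store(rows, wanted):
    """Read every full row, then prune the unselected columns."""
    values_read = 0
    result = []
    for row in rows:
        values_read += len(row)  # the whole row comes off disk
        result.append(tuple(row[c] for c in wanted))
    return result, values_read


def scan_column_store(columns, wanted):
    """Read only the selected columns and stitch them back into rows."""
    values_read = sum(len(columns[c]) for c in wanted)
    result = list(zip(*(columns[c] for c in wanted)))
    return result, values_read


wanted = [0, 1, 2]  # SELECT col1, col2, col3
row_result, row_read = scan_row_store(rows, wanted)
col_result, col_read = scan_column_store(columns, wanted)
ratio = row_read / col_read

print(row_read, col_read, round(ratio, 2))
```

Both scans return the same result set; only the amount of data read differs, which is where the speedup in this model comes from.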
Compression/Tokenization
Traditional RDBMS
State column (10 million rows, 50 bytes wide): Alabama, Alabama, Alabama, Alabama, Alabama, Alaska, Alaska, . . . Wyoming
• Data is stored on disk as it appears to the end user.
• Columns are byte-bound.
Example: 50 bytes x 10 million rows = 500MB to read from disk.
Compressed Databases
State column (10 million rows, 6 bits or 0.75 bytes wide): 1, 1, 1, 1, 1, 2, 2, . . . 50
V-List: 1 = Alabama, 2 = Alaska, 3 = Arizona, 4 = Arkansas, 5 = California, 6 = Colorado, 7 = Connecticut, . . . 50 = Wyoming
• All distinct values are given a token representation.
• Tokens are stored on disk and not the actual data values.
• Columns are not byte-bound.
Example: 2^6 = 64 values (50 values required), so 6 bits or 0.75 bytes are required. 0.75 bytes x 10M rows = 7.5MB of disk read.
• Results: Narrower columns = Less disk to read = Faster data retrieval speeds
• Quantitative Results: 66 times faster (500MB / 7.5MB)
• Total Quantitative Results: 3 (columnar) x 66 (compression) ≈ 200 times faster
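The tokenization arithmetic can be checked with a short Python sketch. The tokenize helper is hypothetical and only mirrors the V-List idea (tokens start at 1, as on the slide); actual products use their own encodings and usually add further compression on top.

```python
def tokenize(values):
    """Replace each value with a 1-based token; return (v_list, tokens)."""
    v_list = sorted(set(values))
    token_of = {value: i + 1 for i, value in enumerate(v_list)}
    return v_list, [token_of[value] for value in values]


states = ["Alabama", "Alabama", "Alaska", "Wyoming", "Alaska"]
v_list, tokens = tokenize(states)
decoded = [v_list[token - 1] for token in tokens]  # tokens map back losslessly

# Size arithmetic from the slide: 50 distinct states, 10 million rows.
ROWS = 10_000_000
raw_bytes = 50 * ROWS               # byte-bound 50-byte column
token_bits = (50).bit_length()      # bits needed to store the largest token, 50
token_bytes = token_bits * ROWS / 8
ratio = raw_bytes / token_bytes

print(tokens, token_bits, token_bytes, int(ratio))
```

The ratio works out to roughly 66.7, which the slide rounds down to 66.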
Parallelization
Full-Table Scan
[Figure: Sales Table, 20 million rows, covering 2005 through 2014.]
• The entire table is read sequentially.
• Example: 20 million rows are read sequentially in 200 seconds.
Parallelized Full-Table Scan
[Figure: Sales Partition - 1 through Sales Partition - 10, 2 million rows each.]
• The table’s 10 partitions are read in parallel.
• Example: 20 million rows are read in 10 parallel processes (2 million rows each) in 20 seconds.
Parallelized Partition Scan
[Figure: Sales Partition - 2005 through Sales Partition - 2014, 2 million rows each.]
• One partition is read (Where Year = 2012).
• Example: 2 million rows are read by one process in 20 seconds.
• Results: Parallel partition reads = Faster data retrieval speeds
• Quantitative Results: 10 times faster
• Total Quantitative Results: 3 (columnar) x 66 (compression) x 10 (parallel) ≈ 2,000 times faster
• Total quantitative results are rarely this significant and are for illustrative purposes only.
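The three scan styles can be sketched in Python. Everything here is a toy under stated assumptions: the partitions dictionary holds five invented rows per year, and a thread pool only shows the shape of a parallel scan (CPython threads will not speed up CPU-bound work). The timing figures are derived purely from the slide's numbers, 20 million rows in 200 seconds.

```python
from concurrent.futures import ThreadPoolExecutor

# Sales table partitioned by year, 2005 to 2014 (invented toy rows).
partitions = {year: [year % 100 + i for i in range(5)] for year in range(2005, 2015)}


def scan(partition_rows):
    """Scan one partition; here that just totals its sales amounts."""
    return sum(partition_rows)


def full_scan():
    """Full-table scan: read every partition one after another."""
    return sum(scan(rows) for rows in partitions.values())


def parallel_scan(max_workers=10):
    """Parallelized full-table scan: one worker per partition."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(scan, partitions.values()))


def pruned_scan(year):
    """Partition scan: WHERE Year = <year> touches a single partition."""
    return scan(partitions[year])


# Timing arithmetic from the slide.
rows_per_second = 20_000_000 / 200              # sequential read rate
sequential_seconds = 20_000_000 / rows_per_second
parallel_seconds = 2_000_000 / rows_per_second  # 10 processes, 2M rows each
pruned_seconds = 2_000_000 / rows_per_second    # 1 process, 1 partition

print(full_scan(), parallel_scan(), pruned_scan(2012))
print(sequential_seconds, parallel_seconds, pruned_seconds)
```

The parallel scan and the pruned scan finish in the same time in this model, but the pruned scan uses one process instead of ten because it reads a tenth of the data.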
In-Memory
• In-memory processing is the trump card.
• However, in-memory processing is not cheap.
• Column-oriented data storage and compression/tokenization techniques can allow significantly more data to fit into memory.
• Don’t assume in-memory is the only solution.
Example:
• Perceived Problem: “My Honda is too slow”
• Actual Problem: Driver only drives the car in first gear.
• Solution 1: Buy a Ferrari and drive it in first gear.
• Solution 2: Keep your Honda and learn how to use a clutch.
Data Storage Product Group Comparison
Characteristics | Traditional RDBMS | Columnar | In-Memory | Hadoop Ecosystem
Columnar Data Storage | No | Yes | Sometimes | No
Compression/Tokenization | No | Yes | Sometimes | No
Parallelization | Yes | Yes | Yes | Yes
In-Memory | No | No | Yes | No
Product Examples | Oracle, IBM DB2, SQL Server | Amazon Redshift, Vertica, HBase, EMC GreenPlum, IBM DB2 BLU | SAP HANA, MemSQL | HDFS/MapReduce, HCatalog, Cassandra
Data Storage – Final Thoughts
• Columnar data storage, compression, parallelization, and in-memory processing ONLY address data retrieval performance.
• These techniques DO NOT address:
- Harmonization of data sources (e.g., VA = Virginia = VIRGINIA, missing DC and Guam)
- Data quality issues
- Complexity of different data sets (e.g., many-to-many relationships, ratios, timing of data capture, etc.)
- End users’ ability to intuitively and easily access, present, and understand information.
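To make the harmonization point concrete, here is a minimal Python sketch of the kind of work that still has to happen outside the storage engine, typically in ETL. The CANONICAL lookup and the harmonize function are hypothetical and cover only the values named on the slide.

```python
# Hypothetical lookup; a real one would cover every state and territory.
CANONICAL = {
    "VA": "Virginia",
    "VIRGINIA": "Virginia",
    "DC": "District of Columbia",
    "DISTRICT OF COLUMBIA": "District of Columbia",
    "GU": "Guam",
    "GUAM": "Guam",
}


def harmonize(raw_value):
    """Map a source value to its canonical name, or None if it is unmapped."""
    return CANONICAL.get(raw_value.strip().upper())


source_values = ["VA", "Virginia", "VIRGINIA", "Guam", "Puerto Rico"]
harmonized = [harmonize(value) for value in source_values]
unmapped = [v for v, h in zip(source_values, harmonized) if h is None]

print(harmonized)
print(unmapped)
```

Values that return None flag gaps in the lookup, the same kind of gap the slide points to with the missing DC and Guam entries.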
Questions
Jim Hadley, President Email: [email protected] Phone: 703.593.2833