
Tiber Solutions Understanding the Current & Future Landscape of BI and Data Storage

Jim Hadley


Tiber Solutions

•  Founded in 2005 to provide Business Intelligence / Data Warehousing / Big Data thought leadership to corporations and government agencies.

•  Deeply skilled in all facets of BI/DW/Big Data solutions – star schema, ETL, BI, data visualization, data analytics, data architecture, information architecture, BI agile development methodology, and MDM/governance.

•  Provide hands-on architecture, implementation, and coaching expertise within IT organizations from the CIO to the developers.

•  Partner with business executives to co-invent optimal BI/DW applications to dramatically improve their business.


Tiber Solutions Customers

•  Amethyst Technologies

•  Amtrak

•  Census Bureau

•  Cognosante

•  Defense Logistics Agency

•  Department of Health and Human Services

•  Department of the Treasury

•  Fannie Mae

•  Federal Deposit Insurance Corporation

•  Frontpoint Security

•  Freddie Mac

•  Graduate Management Admission Council

•  Internal Revenue Service

•  Military Health System

•  National Institutes of Health

•  Occupational Safety and Health Administration

•  Office of the Comptroller of the Currency

•  SAP Business Objects

•  Securities and Exchange Commission


Agenda

•  Business Intelligence Landscape - Concepts/Architectures

- BI Tool vs. Data Visualization Tool Comparison

•  Data Storage Landscape - Concepts/Architectures

- Product Group Comparison


Business Intelligence Landscape

Data Retrieval

Success Factors:

•  Retrieval Speed

•  Ease of Access

Data Presentation

Success Factors:

•  Visualization Richness and Diversity

•  Delivery Options (e.g., Mobile, Push)


Business Intelligence Landscape

Characteristics | Business Intelligence Tools | Data Visualization Tools
Product Examples | SAP Web Intelligence, Cognos, MicroStrategy | Tableau, Qliktech Qlikview, TIBCO Spotfire, Microsoft BI Stack
Strengths | Data Retrieval; Dynamic, Complex Ad Hoc Queries | Data Presentation; Rich and Diverse Visualizations
Limitations | Limited Visualizations | Limited Ad Hoc Capabilities
Primary Use | Ad Hoc Query; Canned Reports | Data Visualization; Data Exploration
Ad Hoc Query Capabilities | Yes | No (must be in cube)
Leverages Semantic Layer For Data Retrieval | Yes | Partially
Queries Data In Database Real-Time | Yes | No
Requires Persisting Data Set In Cubes or Files | No | Yes
Requires Developer Skills | Semantic Layer (Universe) – Yes; Reports – Some | Cubes – Yes; Reports/Dashboards – No

SAP Products

•  SAP Web Intelligence

•  SAP Dashboards – Requires Developer

•  SAP Lumira – Not nearly as mature

•  SAP Explorer – Limited visualizations


Business Intelligence Tool Architecture

Assumptions:

•  Data warehouse/data mart exists in which ETL processing has harmonized and combined data from multiple data sources.

Semantic Layer (Universe): Business Terms on one side, SQL on the other

Business Layer

•  Folders – Used to organize objects into logical groups (e.g., Customer Dim, Sales Measures)

•  Objects – Business terms are used to represent database columns (e.g., CUST_NM) or SQL formulas (e.g., SUM(REVENUE_AMT)-SUM(COST_AMT))

Technical Layer

•  Connections – Database connection parameters

•  Tables/Columns – Fact and Dimension tables and columns

•  Joins – Predefined joins between fact tables and dimension tables

•  Contexts – A group of joins. Each fact table should have a context.
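The mapping from business terms to SQL can be illustrated with a minimal Python sketch. It is not how any particular BI product implements its universe; the `BusinessObject` class, the `build_sql` helper, the table aliases, and the `CUST_KEY` join column are all assumptions made for the example. Only `CUST_NM` and the `SUM(REVENUE_AMT)-SUM(COST_AMT)` formula come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessObject:
    name: str         # business term shown to the end user
    sql: str          # database column or SQL formula behind the term
    is_measure: bool  # measures are aggregated; dimensions are grouped


# Business layer: hypothetical objects keyed by business term.
UNIVERSE = {
    "Customer Name": BusinessObject("Customer Name", "c.CUST_NM", False),
    "Margin Amount": BusinessObject(
        "Margin Amount", "SUM(f.REVENUE_AMT)-SUM(f.COST_AMT)", True
    ),
}

# Technical layer: hypothetical tables and one predefined join.
TABLES = "D_CUSTOMER c, F_SALES f"
JOINS = ["f.CUST_KEY=c.CUST_KEY"]


def build_sql(selected):
    """Translate the business terms a user picked into one SQL statement."""
    objs = [UNIVERSE[name] for name in selected]
    select_list = ", ".join(o.sql for o in objs)
    dims = [o.sql for o in objs if not o.is_measure]
    sql = f"SELECT {select_list} FROM {TABLES} WHERE {' AND '.join(JOINS)}"
    if dims:
        sql += " GROUP BY " + ", ".join(dims)
    return sql


sql = build_sql(["Customer Name", "Margin Amount"])
print(sql)
```

The end user only ever sees "Customer Name" and "Margin Amount"; the joins and the GROUP BY clause are supplied by the layer.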


Business Intelligence Tool Architecture

Objects Selected by End User

•  Dims – Fiscal Year, Fiscal Quarter, Product Group

•  Measures – Net Sales Amount, Forecast Amount

Related Tables and Columns

•  Fiscal Year – d_date.fiscal_yr

•  Fiscal Quarter – d_date.fiscal_qtr

•  Product Group – d_product.product_grp

•  Net Sales Amount – f_sales.net_sales_amt

•  Forecast Amount – f_forecast.forecast_amt

Sales Query (Sales Context):

SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.net_sales_amt)
FROM d_date d, d_product p, f_sales f
WHERE f.date_key=d.date_key AND f.product_key=p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp

Forecast Query (Forecast Context):

SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.forecast_amt)
FROM d_date d, d_product p, f_forecast f
WHERE f.date_key=d.date_key AND f.product_key=p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp

Full Outer Join: the two query result sets are combined on the common dimensions.

Assumptions:

•  Fact tables are at different levels of granularity (detail).

•  1-to-N fact tables can be queried with common dimensions.
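The one-query-per-context pattern followed by a full outer join can be sketched in Python with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not the BI tool's actual implementation: the sample rows and the `run_context` helper are invented for the example, and the full outer join is done in Python over the two result sets rather than in the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_date (date_key INTEGER, fiscal_yr INTEGER, fiscal_qtr TEXT);
CREATE TABLE d_product (product_key INTEGER, product_grp TEXT);
CREATE TABLE f_sales (date_key INTEGER, product_key INTEGER, net_sales_amt REAL);
CREATE TABLE f_forecast (date_key INTEGER, product_key INTEGER, forecast_amt REAL);
-- Hypothetical sample rows, invented for this example.
INSERT INTO d_date VALUES (1, 2014, 'Q1'), (2, 2014, 'Q1'), (3, 2014, 'Q2');
INSERT INTO d_product VALUES (1, 'Widgets'), (2, 'Gadgets');
INSERT INTO f_sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0);
INSERT INTO f_forecast VALUES (1, 1, 120.0), (3, 1, 90.0);
""")

# One query template; each context supplies its own fact table and measure.
QUERY = """
SELECT d.fiscal_yr, d.fiscal_qtr, p.product_grp, SUM(f.{measure})
FROM d_date d, d_product p, {fact} f
WHERE f.date_key = d.date_key AND f.product_key = p.product_key
GROUP BY d.fiscal_yr, d.fiscal_qtr, p.product_grp
"""


def run_context(fact, measure):
    """Run one context's query and key the result by the common dimensions."""
    rows = conn.execute(QUERY.format(fact=fact, measure=measure)).fetchall()
    return {row[:3]: row[3] for row in rows}


sales = run_context("f_sales", "net_sales_amt")
forecast = run_context("f_forecast", "forecast_amt")

# Full outer join: keep every dimension combination seen in either result.
merged = {
    key: (sales.get(key), forecast.get(key))
    for key in sorted(set(sales) | set(forecast))
}
for key, (net_sales, fcst) in merged.items():
    print(key, net_sales, fcst)
```

Running each fact table in its own query avoids joining the two fact tables directly, which would multiply rows when their grains differ.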


Data Visualization Tool Architecture

Sources: OLTP (Nightly SQL Load) and DW/DM (Nightly SQL Load)

Data Visualization Experience

•  OLAP/File column names can be renamed to business terms.

•  Easy for end users to drag/drop/visualize data using multiple visualization styles.

•  Data across cubes can be combined.

Data Retrieval Observations:

•  There is an assumption that the data is available, combinable, and clean (without any ETL or DQ).

•  Data can be sourced from any database or file.

•  Most products use OLAP cube technology to improve performance.

•  OLAP cubes can be “linked” (joined) together, but they must have shared common dimensions and granularity.

•  Data retrieval across OLAP cubes can be difficult.

•  OLAP cubes are refreshed at night.

•  Does not support dynamic ad hoc queries.

•  IT is usually required to set up OLAP cubes on servers.

•  OLAP cubes have practical size limits.

Data Presentation Observations:

•  Data visualization products support 100s of visualization styles.

•  Tools are good at “recommending” visualizations based on data result set.

•  Tools are very interactive.

•  Easy to “integrate” visualizations together.

•  Business users can successfully use the client tools without IT – really.


Federated BI Architecture

Traditional BI/DW: Travel and Reservations sources → ETL, Batch (Nightly) → Data Warehouse → Semantic Layer (Universe)

Federated BI (Federated Architecture): Travel and Reservations sources → Real-time → Semantic Layer (Universe)

Use Case: How many passengers made refundable reservations and never traveled in 2014?

Step | Traditional BI/EDW | Federated BI
1. Query 2014 refundable reservation rows – 25 million. | Batch | Real-time
2. Query 2014 travel rows – 15 million. | Batch | Real-time
3. Left outer join the reservation query result set with the travel query result set based on common dimension data – travel date, customer information, originating city, destination city, and flight number. | Batch | Real-time
4. Aggregate the joined result set rows counting all rows where travel information is null. | Real-time | Real-time
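The four steps of the use case reduce to a left outer join followed by a count of unmatched rows. The Python sketch below uses the standard-library `sqlite3` module to show the logic on a handful of rows; the table names, column names, and sample data are hypothetical, and a single in-memory database stands in for what would be two separate source systems in a federated setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reservations (
    travel_date TEXT, customer_id INTEGER, origin TEXT,
    destination TEXT, flight_no TEXT, refundable INTEGER);
CREATE TABLE travel (
    travel_date TEXT, customer_id INTEGER, origin TEXT,
    destination TEXT, flight_no TEXT);
-- Hypothetical sample rows, invented for this example.
INSERT INTO reservations VALUES
    ('2014-03-01', 1, 'IAD', 'SFO', 'UA100', 1),  -- refundable, traveled
    ('2014-05-10', 2, 'DCA', 'BOS', 'AA200', 1),  -- refundable, never traveled
    ('2014-07-04', 3, 'BWI', 'ORD', 'WN300', 1),  -- refundable, never traveled
    ('2014-08-15', 4, 'IAD', 'LAX', 'UA400', 0),  -- non-refundable
    ('2013-12-31', 5, 'IAD', 'SFO', 'UA100', 1);  -- outside 2014
INSERT INTO travel VALUES
    ('2014-03-01', 1, 'IAD', 'SFO', 'UA100');
""")

# Steps 1-3: left outer join 2014 refundable reservations to travel rows
# on the common dimensions. Step 4: count rows with no matching travel.
NO_SHOW_SQL = """
SELECT COUNT(*)
FROM reservations r
LEFT OUTER JOIN travel t
  ON  r.travel_date = t.travel_date
  AND r.customer_id = t.customer_id
  AND r.origin      = t.origin
  AND r.destination = t.destination
  AND r.flight_no   = t.flight_no
WHERE r.refundable = 1
  AND r.travel_date BETWEEN '2014-01-01' AND '2014-12-31'
  AND t.customer_id IS NULL
"""

no_shows = conn.execute(NO_SHOW_SQL).fetchone()[0]
print(no_shows)
```

In the traditional architecture the join inputs are already co-located by the nightly batch; in the federated architecture the same join has to pull both row sets from the source systems at query time.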


Data Storage Concepts/Architectures

•  Columnar Data Storage

•  Compression/Tokenization

•  Parallelization

•  In-Memory

Performance Bottleneck: Reading data off of disk.


Columnar Data Storage

Example table: 10 columns (numbered 1 to 10)

Example query: SELECT col1, col2, col3 FROM table

Traditional RDBMS

•  Data is stored row-oriented on disk.

•  All columns are read off of disk – even if only a subset of columns is selected.

•  Unselected columns are pruned after disk read.

•  Optimized for row inserts.

Columnar Data Storage

•  Data is stored column-oriented on disk.

•  Only selected columns are read off of disk.

•  Unselected columns are not read off of disk.

•  Optimized for data retrieval.

Results: Fewer columns to read = Less disk to read = Faster data retrieval speeds

Quantitative Results: roughly 3 times faster (3 of 10 columns read)
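The difference in disk reads can be made concrete with a small Python sketch. It models "disk" as in-memory lists and counts cells touched, so it illustrates the access pattern only, not real I/O; the function names, the 1,000-row table size, and the cell values are assumptions for the example.

```python
NUM_ROWS, NUM_COLS = 1_000, 10
WANTED = [0, 1, 2]  # SELECT col1, col2, col3

# Row-oriented layout: one list per row.
rows = [[f"r{r}c{c}" for c in range(NUM_COLS)] for r in range(NUM_ROWS)]
# Column-oriented layout: one list per column, same data.
columns = [list(col) for col in zip(*rows)]


def select_row_store(rows, wanted):
    """Read every row in full, then prune the unselected columns."""
    cells_read = 0
    result = []
    for row in rows:
        cells_read += len(row)                   # whole row comes off "disk"
        result.append([row[c] for c in wanted])  # prune after the read
    return result, cells_read


def select_column_store(columns, wanted):
    """Read only the selected columns, then stitch them back into rows."""
    cells_read = 0
    picked = []
    for c in wanted:
        cells_read += len(columns[c])            # only this column is read
        picked.append(columns[c])
    return [list(r) for r in zip(*picked)], cells_read


row_result, row_cells = select_row_store(rows, WANTED)
col_result, col_cells = select_column_store(columns, WANTED)
print(row_cells, col_cells, round(row_cells / col_cells, 2))
```

Both layouts return the same result set; the row store touches all 10 columns per row while the column store touches 3, which is where the roughly threefold reduction comes from when columns are of similar width.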


Compression/Tokenization

Traditional RDBMS

State column (50 bytes wide, 10 million rows): Alabama, Alabama, Alabama, Alabama, Alabama, Alaska, Alaska, . . . Wyoming

•  Data is stored on disk as it appears to the end user.

•  Columns are byte-bound.

Example: 50 bytes x 10 million rows = 500MB to read from disk.

Compressed Databases

State column (6 bits or 0.75 bytes wide, 10 million rows): 1, 1, 1, 1, 1, 2, 2, . . . 50

V-List: 1 = Alabama, 2 = Alaska, 3 = Arizona, 4 = Arkansas, 5 = California, 6 = Colorado, 7 = Connecticut, . . . 50 = Wyoming

•  All distinct values are given a token representation.

•  Tokens are stored on disk and not the actual data values.

•  Columns are not byte-bound.

Example: 2^6 = 64 values (50 values required), so 6 bits or 0.75 bytes are required. 0.75 bytes x 10M rows = 7.5MB of disk read.

•  Results: Narrower columns = Less disk to read = Faster data retrieval speeds

•  Quantitative Results: 66 times faster (500MB / 7.5MB)

•  Total Quantitative Results: 3 (columnar) x 66 (compression) = approximately 200 times faster
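The tokenization arithmetic can be reproduced with a short Python sketch. It is a simplified model of dictionary encoding, not any vendor's storage format; the placeholder state names, the scaled-down sample column, and the variable names are assumptions for the example, while the 50-byte width and 10 million rows come from the slide.

```python
ROW_COUNT = 10_000_000  # rows in the slide's example
RAW_WIDTH_BYTES = 50    # byte-bound column width in the traditional RDBMS

# Placeholder distinct values standing in for the 50 state names.
distinct_values = [f"State {i:02d}" for i in range(1, 51)]

# A small sample column (scaled down) to exercise encode and decode.
sample_column = [distinct_values[i % 50] for i in range(10_000)]

# V-List: token -> value, plus the reverse lookup used while encoding.
v_list = {token: value for token, value in enumerate(distinct_values, start=1)}
token_of = {value: token for token, value in v_list.items()}

tokens = [token_of[value] for value in sample_column]  # what goes on disk
decoded = [v_list[token] for token in tokens]          # what the user sees

# Smallest bit width that can distinguish all distinct values.
bits_per_token = (len(v_list) - 1).bit_length()

raw_bytes = RAW_WIDTH_BYTES * ROW_COUNT
tokenized_bytes = bits_per_token * ROW_COUNT // 8
print(bits_per_token, raw_bytes, tokenized_bytes, raw_bytes / tokenized_bytes)
```

The ratio works out to about 66.7, which the slide truncates to 66; the V-List itself adds a small fixed cost that this sketch ignores.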


Parallelization

Sales Table: 20 million rows, partitioned by year (2005 to 2014) into 10 partitions (Sales Partition - 1 through Sales Partition - 10) of 2 million rows each.

Full-Table Scan

•  The entire table is read sequentially.

•  Example: 20 million rows are read sequentially in 200 seconds.

Parallelized Full-Table Scan

•  The table’s 10 partitions are read in parallel.

•  Example: 20 million rows are read in 10 parallel processes (2 million rows each) in 20 seconds.

Parallelized Partition Scan

•  One partition is read (Where Year = 2012).

•  Example: 2 million rows are read by one process in 20 seconds.

•  Results: Parallel partition reads = Faster data retrieval speeds

•  Quantitative Results: 10 times faster

•  Total Quantitative Results: 3 (columnar) x 66 (compression) x 10 (parallel) = approximately 2,000 times faster

•  Total quantitative results are rarely this significant and are for illustrative purposes only.
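The three scan styles can be sketched in Python with the standard-library `concurrent.futures` module. This is an illustration of the access patterns, not a benchmark: the partition contents are generated values, the sizes are scaled down from 2 million to 2,000 rows per partition, and Python threads will not reproduce a database engine's tenfold speedup on CPU-bound work. The `scan_partition` helper is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

ROWS_PER_PARTITION = 2_000  # scaled down from the slide's 2 million rows
YEARS = range(2005, 2015)   # one partition per year, 10 partitions

# Hypothetical sales amounts, generated deterministically for the example.
partitions = {
    year: [(year % 100) + (i % 7) for i in range(ROWS_PER_PARTITION)]
    for year in YEARS
}


def scan_partition(year):
    """Read every row of one partition and return (rows_read, total_amount)."""
    rows = partitions[year]
    return len(rows), sum(rows)


# Full-table scan: one process reads the partitions one after another.
sequential = [scan_partition(year) for year in YEARS]

# Parallelized full-table scan: one worker per partition.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    parallel = list(pool.map(scan_partition, YEARS))

# Parallelized partition scan: WHERE Year = 2012 touches a single partition.
pruned_rows, pruned_total = scan_partition(2012)

full_rows = sum(rows for rows, _ in sequential)
print(full_rows, pruned_rows)
```

The parallel scan returns the same totals as the sequential one while each worker reads only one tenth of the rows; the pruned scan skips nine partitions entirely.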


In-Memory

•  In-memory processing is the trump card.

•  However, in-memory processing is not cheap.

•  Using column-oriented data storage and compression/tokenization techniques can allow significantly more data to fit into memory.

•  Don’t assume in-memory is the only solution.

Example:

•  Perceived Problem: “My Honda is too slow” •  Actual Problem: Driver only drives the car in first gear.

•  Solution 1: Buy a Ferrari and drive it in first gear.

•  Solution 2: Keep your Honda and learn how to use a clutch.


Data Storage Product Group Comparison

Characteristics | Traditional RDBMS | Columnar | In-Memory | Hadoop Ecosystem
Columnar Data Storage | No | Yes | Sometimes | No
Compression/Tokenization | No | Yes | Sometimes | No
Parallelization | Yes | Yes | Yes | Yes
In-Memory | No | No | Yes | No
Product Examples | Oracle, IBM DB2, SQL Server | Amazon Redshift, Vertica, HBase, EMC GreenPlum, IBM DB2 BLU | SAP HANA, MemSQL | HDFS/MapReduce, HCatalog, Cassandra


Data Storage – Final Thoughts

•  Columnar data storage, compression, parallelization, and in-memory processing ONLY address data retrieval performance.

•  These techniques DO NOT address:

- Harmonization of data sources (e.g., VA = Virginia = VIRGINIA, missing DC and Guam)

- Data quality issues

- Complexity of different data sets (e.g., many-to-many relationships, ratios, timing of data capture, etc.)

- End users’ ability to intuitively and easily access, present, and understand information.


Questions

Jim Hadley, President Email: [email protected] Phone: 703.593.2833
