remy-rosenbaum
Accelerating BI on Big Data
Topics
• BI on Big Data Trade-Off
• SQL-on-Hadoop Performance Challenges
• Live Demo: Tableau on Hadoop
– Impala / Redshift / Jethro
• Jethro Technology Overview
• What Does Jethro Do?
– Acceleration server for BI on Big Data
• How It Works?
– Full indexing and cube caching
– Combines columnar SQL DB design with search-indexing technology
• When to Use It?
– Reporting, dashboards, discovery, ad-hoc
• How to Get It
– Download & free evaluation
• Partnerships
– BI & Hadoop vendors
About Us
• Typical usage based on extracting selective data from remote data sources
• Extracted data then dynamically loaded into memory for interactive analysis
• Challenges:
– Size: performance typically degrades at ~250M rows
– Refresh lag time
BI & Big Data: Extract (In-Memory)
Tableau & Big Data
• For every user interaction Tableau issues SQL queries to the target DB
• DB retrieves requested data, processes SQL aggregations and returns to Tableau
• Challenges:
– DB performance is significantly slower than in-memory speed
BI & Big Data: Live-Connect (In-DB)
SQL enables the change of data platform while keeping the analytic apps intact
Analytics: ETL, Predictive, Reporting, BI
SQL
10x-100x Data, 1/10 HW $ cost, Open Platform
Big Data Platforms: Hadoop vs. EDW Appliances
SQL-on-Hadoop Performance Challenges
SQL-on-Hadoop for ETL, Predictive, Reporting and BI: BI is too SLOW on Hadoop
It’s unrealistic to expect the same performance when the data is much larger and highly optimized hardware is replaced with commodity boxes
The Hadoop Trade-Off: Scale & Cost vs. Performance
More Hardware
– Add nodes, RAM, CPU, SSD, network
Different SQL-on-Hadoop engines
– Hive, Impala, Drill, SparkSQL, HAWQ, Presto, Actian, etc.
Rigid Data Model
– Less granularity, more pre-aggregations
– Pre-defined OLAP cubes
– De-normalize into single large table
– Multiple partition keys (replication)
Replicate from Hadoop to EDW
– Traditional: Teradata, Vertica, Netezza, …
– Cloud: Redshift
– As-a-Svc: BigQuery, Snowflake, Qubole
No Hadoop, No EDW
– Search: Elastic + Kibana
– NoSQL: HBase, Cassandra, MongoDB
BI & Data Combined
– Full-stack Hadoop: Platfora, Arcadia
– As-a-Svc: DOMO, QuickSight, PowerBI, …
BI on Big Data: Technology Alternatives
A Library Analogy: Billions of books, thousands of racks
Query: List books by author “Stephen King”
Process: Every librarian pulls out book after book from their rack and checks the author
• Hive
• Impala
• Presto
• SparkSQL
• Drill
• Pivotal/HAWQ
• IBM/Big SQL
• Actian
• …
SQL-on-Hadoop: MPP/Full-Scan Architecture
Unsuitable for BI
Query: List books by author “Stephen King”
Process: Access Author index, entry of “Stephen King”, get list of books, fetch only these books
Result: Fast, minimal resources, scalable
SQL-on-Hadoop: Index-Access Architecture
Optimal for BI
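The library analogy can be sketched in a few lines of Python. This is only an illustration of full scan versus index access: the catalogue, the function names and the "books touched" counter are hypothetical and are not taken from any of the engines listed in the deck.

```python
from collections import defaultdict

# Hypothetical toy catalogue: (book_id, author, title). Book ids equal list positions.
BOOKS = [
    (0, "Stephen King", "It"),
    (1, "Jane Austen", "Emma"),
    (2, "Stephen King", "Misery"),
    (3, "Frank Herbert", "Dune"),
]

def full_scan(books, author):
    """Full-scan style: inspect every book and count how many were touched."""
    touched = 0
    hits = []
    for _book_id, book_author, title in books:
        touched += 1
        if book_author == author:
            hits.append(title)
    return hits, touched

def build_author_index(books):
    """Inverted index: author -> list of book ids."""
    index = defaultdict(list)
    for book_id, book_author, _title in books:
        index[book_author].append(book_id)
    return index

def index_access(books, index, author):
    """Index-access style: read the author entry, then fetch only those books."""
    ids = index.get(author, [])
    hits = [books[i][2] for i in ids]
    return hits, len(ids)

AUTHOR_INDEX = build_author_index(BOOKS)
scan_hits, scan_touched = full_scan(BOOKS, "Stephen King")
index_hits, index_touched = index_access(BOOKS, AUTHOR_INDEX, "Stephen King")
print(scan_hits, scan_touched)
print(index_hits, index_touched)
```

Both paths return the same titles; the difference is that the scan touches every book while the index path touches only the matching ones, which is the point the two slides make.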
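Hardware
• Jethro
– Data format: Jethro indexes
– Hadoop cluster: 3x m1.xlarge
– Compute cluster: 2x r3.4xlarge (spot)
– Total RAM, CPU: 290GB, 44 cores
– AWS $ per hr.: $0.75
• Impala
– Data format: Parquet
– Cluster: 6x r3.2xlarge, 1x r3.xlarge
– Total RAM, CPU: 390GB, 52 cores
– AWS $ per hr.: $4.25
• Redshift
– Data format: Redshift
– Cluster: 6x dc1.large
– Total RAM, CPU: 90GB, 12 cores
– AWS $ per hr.: $1.50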
• Point browser to: tableau.jethrodata.com
– Login: demo / demo
• Choose workbook: Jethro, Impala, Redshift
• Dashboard interaction: choose year, category or any other filters to drill down
• Data
– Based on TPC-DS benchmark
– 1TB raw data (400GB fact)
– Fact table: ~2.9B rows
– 7 dimensions
LIVE Benchmark: Tableau on Hadoop (and Redshift)
Live Benchmark
Indexing Data for Jethro Acceleration
• Identify BI-worthy datasets
– Not all data in Hadoop should have Jethro
• Jethro “loader” creates an indexed version
– Stores back in same HDFS
– If no Hadoop is used, it can also be stored in a local filesystem, network storage or cloud storage (e.g. S3)
– Highly efficient: ~1B rows/hour, 3x compression
• Incremental refresh
– As frequently as every min, hour, day, …
– Does not require a full rebuild of the index
[Diagram: raw and indexed data on Data Nodes; Jethro Query Nodes (QN)]
1. Index access
2. Read data only for required rows
Performance and resources depend on the size of the working set
Storage
– HDFS
– Cloud (S3, EFS)
– NAS/SAN
– Local FS
SELECT date, SUM(sales) FROM T1 WHERE product='Books' AND state='NY' GROUP BY date
Index-Access: How it Works
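The two steps can be illustrated on the example query over T1. The sketch below is a simplification under stated assumptions: the sample rows, the `build_index` helper and the use of plain Python sets for row ids are all hypothetical, and Jethro's real index structures are not shown in the deck at this level of detail.

```python
from collections import defaultdict

# Hypothetical rows of table T1: (date, product, state, sales).
T1 = [
    ("2016-01-01", "Books", "NY", 10.0),
    ("2016-01-01", "Books", "CA", 7.0),
    ("2016-01-01", "Toys", "NY", 3.0),
    ("2016-01-02", "Books", "NY", 5.0),
    ("2016-01-02", "Books", "NY", 2.5),
]
COLUMNS = {"date": 0, "product": 1, "state": 2, "sales": 3}

def build_index(rows, column):
    """Map each value of one column to the set of row ids that contain it."""
    index = defaultdict(set)
    position = COLUMNS[column]
    for row_id, row in enumerate(rows):
        index[row[position]].add(row_id)
    return index

PRODUCT_INDEX = build_index(T1, "product")
STATE_INDEX = build_index(T1, "state")

def sales_by_date(product, state):
    # Step 1: index access. Intersect the row-id sets of both predicates.
    working_set = PRODUCT_INDEX.get(product, set()) & STATE_INDEX.get(state, set())
    # Step 2: read data only for the required rows, then aggregate by date.
    totals = defaultdict(float)
    for row_id in sorted(working_set):
        date, _product, _state, sales = T1[row_id]
        totals[date] += sales
    return dict(totals), len(working_set)

result, rows_read = sales_by_date("Books", "NY")
print(result, rows_read)
```

Only the rows in the working set are read, so the cost of the query follows the number of matching rows rather than the size of the table.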
Jethro Indexes – Superior Technology
Patent Pending: http://www.google.com/patents/WO2013001535A3?cl=en
Complete
– Every column is indexed
Simple
– Inverted-list indexes map each column value to a list of rows
Fast to read
– Index-of-index provides direct access to a value entry
– No need to scan entire index, or load index to memory
Scalable
– Distributed, highly hierarchical compressed bitmaps
Fast to write
– Appendable index structure for fast incremental refresh
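A rough sketch of the "index-of-index" and "appendable" ideas follows. It is not Jethro's on-disk format and does not reproduce the patent: the segment layout, the function names and the uncompressed row-id lists (instead of compressed bitmaps) are assumptions made only for illustration. A small sorted directory of values and byte offsets plays the role of the index-of-index, so a lookup jumps straight to one entry without scanning the rest.

```python
import bisect
import struct

def build_segment(column_values, first_row_id=0):
    """Build one index segment: (sorted values, byte offsets, packed row-id lists)."""
    lists = {}
    for i, value in enumerate(column_values):
        lists.setdefault(value, []).append(first_row_id + i)
    values = sorted(lists)
    offsets = []
    blob = bytearray()
    for value in values:
        rows = lists[value]
        offsets.append(len(blob))
        # Entry layout (hypothetical): row count, then the row ids, as uint32.
        blob += struct.pack(f"<I{len(rows)}I", len(rows), *rows)
    return values, offsets, bytes(blob)

def lookup(segment, value):
    """Binary-search the directory, then decode only that value's entry."""
    values, offsets, blob = segment
    pos = bisect.bisect_left(values, value)
    if pos == len(values) or values[pos] != value:
        return []
    start = offsets[pos]
    (count,) = struct.unpack_from("<I", blob, start)
    return list(struct.unpack_from(f"<{count}I", blob, start + 4))

def lookup_all(segments, value):
    """Combine the row ids found in every segment, oldest segment first."""
    rows = []
    for segment in segments:
        rows.extend(lookup(segment, value))
    return rows

# Initial load of a hypothetical "state" column.
segments = [build_segment(["NY", "CA", "NY", "TX"])]
# Incremental refresh: index only the new rows and append a segment.
segments.append(build_segment(["CA", "NY"], first_row_id=4))
print(lookup_all(segments, "NY"))
```

Appending a segment leaves the existing ones untouched, which is one simple way to avoid a full rebuild on refresh.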
Automated Cube & Query Caching
• Every query is cached
– Based on result-set size vs. execution time
• Cubes generated automatically
– Identify repeat query patterns
– For example: adding the filter as a column to a GROUP BY
• All stored in HDFS
– 10,000s of cached cubes and queries
• Incremental refresh
– Query executes ONLY on the incremental data and then merges with cached results
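The incremental-refresh bullet can be sketched as follows, assuming a cached `SUM(sales) ... GROUP BY date` result. The `CachedQuery` class and its fields are hypothetical names for illustration; the merge works by simple addition because SUM is additive, and other aggregates (AVG, for example) would need to cache a sum and a count instead.

```python
from collections import Counter

def aggregate(rows):
    """SUM(sales) GROUP BY date over (date, sales) rows."""
    totals = Counter()
    for date, sales in rows:
        totals[date] += sales
    return totals

class CachedQuery:
    """Hypothetical cache entry that remembers how many rows it already covers."""

    def __init__(self):
        self.totals = Counter()
        self.rows_seen = 0

    def refresh(self, table):
        # Execute the query ONLY on rows added since the last refresh...
        new_rows = table[self.rows_seen:]
        # ...then merge the partial result into the cached result.
        self.totals.update(aggregate(new_rows))
        self.rows_seen = len(table)
        return len(new_rows)

table = [("2016-01-01", 10), ("2016-01-02", 5)]
cache = CachedQuery()
first_pass = cache.refresh(table)    # processes the two initial rows
table.append(("2016-01-01", 3))      # new data arrives
second_pass = cache.refresh(table)   # processes only the one new row
print(dict(cache.totals), first_pass, second_pass)
```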
What Is Jethro for Tableau?
An indexing & caching server
1. Tableau uses live connect (ODBC) to send SQL queries
2. Jethro checks if the query can be served from existing cubes
– Yes: reply to Tableau
3. Jethro uses the indexed table to access only necessary data
– Auto-creates a cube based on this and similar queries
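The flow in steps 2 and 3 can be approximated with a small sketch. Everything here is an assumption for illustration: the sample rows, the `cubes` dictionary, the `stats` counters, and the choice to key a cube by (group column, filter column) so that the filter becomes an extra GROUP BY column. On a miss, the sketch simply scans the toy table where Jethro would use its indexes.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, category, sales).
ROWS = [
    (2015, "Books", 10),
    (2015, "Toys", 4),
    (2016, "Books", 6),
    (2016, "Books", 1),
]
COL = {"year": 0, "category": 1}

cubes = {}  # (group_col, filter_col) -> {(group_value, filter_value): SUM(sales)}
stats = {"cube_hits": 0, "table_reads": 0}

def build_cube(group_col, filter_col):
    """Aggregate with the filter column added to the GROUP BY."""
    cube = defaultdict(int)
    for row in ROWS:
        cube[(row[COL[group_col]], row[COL[filter_col]])] += row[2]
    return dict(cube)

def query(group_col, filter_col, filter_value):
    """SELECT group_col, SUM(sales) WHERE filter_col = filter_value GROUP BY group_col."""
    key = (group_col, filter_col)
    if key in cubes:
        stats["cube_hits"] += 1        # step 2: served from an existing cube
    else:
        stats["table_reads"] += 1      # step 3: go to the data, then create a cube
        cubes[key] = build_cube(group_col, filter_col)
    return {g: total for (g, f), total in cubes[key].items() if f == filter_value}

books = query("year", "category", "Books")  # miss: builds the cube
toys = query("year", "category", "Toys")    # hit: same cube, different filter value
print(books, toys, stats)
```

Because the cube keeps the filter column as a dimension, a later query with a different filter value on the same column is answered from the cube without going back to the table.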
Why Is Jethro the Right Technology for BI on Big Data?
Limitless BI on Big Data: supporting the full range of BI use cases. Jethro’s technology is a unique and optimal fit.
1. Full indexing enables interactive discovery and fast drill-down
– Eliminates the need to repeatedly read unnecessary data. The deeper you go the faster it gets!
2. Auto cubes & cache enable interactive dashboards and fast reports
– Optimize repeat query performance
3. Incremental refresh enables LIVE BI over streaming data
– Reduces maintenance and cuts lag time
Ready to Try Jethro?
1. Register: jethro.io
– Download and install on-prem or cloud
2. Schedule a 30min POC review with Jethro SA (free!)
3. Index BI-worthy datasets
4. Use Tableau
5. Train Jethro with BI apps
– Continuous performance improvement
That’s It!