Mark Rittman, Oracle ACE Director
THE FUTURE OF ANALYTICS, DATA INTEGRATION
AND BI ON BIG DATA PLATFORMS
HADOOP USER GROUP IRELAND (HUG IRL)
Dublin, September 2016
•Mark Rittman, Co-Founder of Rittman Mead
•Oracle ACE Director, specialising in Oracle BI & DW
•14 Years Experience with Oracle Technology
•Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
•Oracle Business Intelligence Developers Guide
•Oracle Exalytics Revealed
•Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog
•Email : [email protected]
•Twitter : @markrittman
About the Speaker
2
OR AS I SAY AT PARTIES…
3
4
BUT SERIOUSLY…
5
•Started back in 1996 on a bank Oracle DW project
•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
•Went on to use Oracle Developer/2000 and Designer/2000
•Our initial users queried the DW using SQL*Plus
•And later on, we rolled-out Discoverer/2000 to everyone else
•And life was fun…
20 Years in Old-school BI & Data Warehousing
6
•Data warehouses provided a unified view of the business
•Single place to store key data and metrics
•Joined-up view of the business
•Aggregates and conformed dimensions
•ETL routines to load, cleanse and conform data
•BI tools for simple, guided access to information
•Tabular data access using SQL-generating tools
•Drill paths, hierarchies, facts, attributes
•Fast access to pre-computed aggregates
•Packaged BI for fast-start ERP analytics
Data Warehouses and Enterprise BI Tools
7
[Diagram: typical enterprise BI architecture - source systems (Core ERP Platform, Retail, Banking, Call Center, E-Commerce, CRM) running on Oracle, MongoDB, Sybase, IBM DB/2 and MS SQL Server feed an ODS/Foundation Layer and an Access & Performance Layer in the Data Warehouse, accessed by Business Intelligence Tools]
•Examples were Crystal Reports, Oracle Reports, Cognos Impromptu, Business Objects
•Report written against carefully-curated BI dataset, or directly connecting to ERP/CRM
•Adding data from external sources, or other RDBMSs, was difficult and involved IT resources
•Report-writing was a skilled job
•High ongoing cost for maintenance and changes
•Little scope for analysis, predictive modeling
•Often user frustration with the pace of delivery
Reporting Back Then…
8
•For example Oracle OBIEE, SAP Business Objects, IBM Cognos
•Full-featured, IT-orientated enterprise BI platforms
•Metadata layers, integrated security, web delivery
•Pre-built ERP metadata layers, dashboards + reports
•Federated queries across multiple sources
•Single version of the truth across the enterprise
•Mobile, web dashboards, alerts, published reports
•Integration with SOA and web services
Then Came Enterprise BI Tools
10
THEN CAME … BIG DATA
11
AND HADOOP
13
BIG, FAST AND FAULT-TOLERANT
14
•Data from new-world applications is not like historic data
•Typically comes in non-tabular form
•JSON, log files, key/value pairs
•Users often want it speculatively
•Haven’t thought it through
•Schema can evolve
•Or maybe there isn’t one
•But the end-users want it now
•Not when you’re ready
But Why Hadoop? Reason #1 - Flexible Storage
16
[Diagram: Big Data Management Platform - Discovery & Development Labs provide a safe and secure discovery and development environment holding datasets and samples plus models and programs, feeding a Single Customer View / Enriched Customer Profile through correlating, modeling, machine learning, scoring and schema-on-read analysis]
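Schema-on-read can be sketched in a few lines: the schema lives in the reading code, not the storage, so old and new record shapes coexist in the same store. A minimal Python illustration (the event records and field names are invented for the example):

import json

# Hypothetical event log whose schema evolves over time:
# early events lack the "channel" field, later ones add it
raw_events = [
    '{"user": "alice", "action": "login"}',
    '{"user": "bob", "action": "purchase", "channel": "mobile"}',
]

def parse_event(line):
    """Apply the schema at read time, defaulting missing fields."""
    record = json.loads(line)
    return {
        "user": record["user"],
        "action": record["action"],
        "channel": record.get("channel", "unknown"),  # tolerate schema evolution
    }

events = [parse_event(line) for line in raw_events]

Because nothing is rejected at load time, the raw data lands immediately and the interpretation can change later - exactly the property the bullets above describe.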
•Enterprise High-End RDBMSs such as Oracle can scale
•Clustering for single-instance DBs can scale to >PB
•Exadata scales further by offloading queries to storage
•Sharded databases (e.g. Netezza) can scale further
•But cost (and complexity) become limiting factors
•$1m/node is not uncommon
But Why Hadoop? Reason #2 - Massive Scalability
17
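The horizontal-scalability idea behind these platforms can be sketched with simple hash partitioning: each key deterministically maps to a node, and adding nodes spreads the data further. Illustrative Python only, with made-up node names, not any real cluster API:

import hashlib

# Hypothetical cluster membership - names are invented
NODES = ["node1", "node2", "node3"]

def node_for(key):
    """Hash-partition keys across nodes: the basic idea behind
    horizontally scalable storage - add nodes and the data spreads out."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Deterministic placement of some sample keys
placement = {k: node_for(k) for k in ["cust-1", "cust-2", "cust-3", "cust-4"]}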
•Hadoop started by being synonymous with MapReduce, and Java coding
•But YARN (Yet Another Resource Negotiator) broke this dependency
•Modern Hadoop platforms provide overall cluster resource management, but support multiple processing frameworks
•General-purpose (e.g. MapReduce)
•Graph processing
•Machine Learning
•Real-Time Processing (Spark Streaming, Storm)
•Even the Hadoop resource management framework can be swapped out
•Apache Mesos
Reason #3 - Processing Frameworks
18
[Diagram: Big Data Platform, all running natively under Hadoop - YARN (cluster resource management) on top of HDFS (cluster filesystem holding raw data), supporting batch (MapReduce), interactive (Impala, Drill, Tez, Presto), streaming + in-memory (Spark, Storm) and graph + search (Solr, Giraph) processing frameworks]
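The MapReduce model that these frameworks generalise can be illustrated with the classic word count, writing the map, shuffle and reduce phases out in plain Python (sample documents invented):

from collections import defaultdict
from itertools import chain

docs = ["big data on hadoop", "hadoop is big"]

# Map phase: emit (word, 1) pairs from each document
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle phase: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}

In a real cluster the map and reduce phases run in parallel across nodes and the shuffle moves data between them; the logic per phase is this simple.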
•Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
•Flexible data storage platform with cheap storage, flexible schema support + compute
•Data lands in the data lake or reservoir in raw form, then minimally processed
•Data then accessed directly by “data scientists”, or processed further into DW
Meet the New Data Warehouse : The “Data Lake”
19
[Diagram: Data Reservoir architecture on the Hadoop platform - file-based, stream-based and ETL-based integration transfer operational data (transactions, customer master data) and unstructured data (voice + chat transcripts) into the reservoir as Raw Customer Data, stored in the original format (usually files) such as SS7, ASN.1, JSON etc.; a Data Factory maps and transforms it into Mapped Customer Data; Discovery & Development Labs provide a safe and secure discovery and development environment with datasets and samples plus models and programs; machine-learning models and segments feed Marketing/Sales Applications, with Business Intelligence Tools providing data access]
NEW STARTUPS ENABLING A HYBRID “OLD WORLD/NEW WORLD” APPROACH
20
AND PERFECT FOR ANALYTICS
22
•Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering
•Sharded databases (e.g. Netezza) can scale further but with complexity / single workload trade-offs
•Hadoop was designed from the outset for massive horizontal scalability - using cheap hardware
•Anticipates hardware failure and makes multiple copies of data as protection
•The more nodes you add, the more stable it becomes
•And at a fraction of the cost of traditional RDBMS platforms
Hadoop : The Default Platform Today for Analytics
23
BI INNOVATION IS HAPPENING AROUND HADOOP
24
“WE’RE WINNING!”
27
BUT…
29
Isn't Hadoop slow?
Too slow for ad-hoc querying?
WELCOME TO 2016
32
(HADOOP 2.0)
35
HADOOP IS NOW FAST
37
Hadoop 2.0 Processing Frameworks + Tools
38
•Cloudera’s answer to Hive query response time issues
•MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access
•Mostly in-memory, but spills to disk if required
•Uses Hive metastore to access Hive table metadata
•SQL dialect similar to Hive's - though not as rich, and no support for Hive SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
39
•Beginners usually store data in HDFS using text file formats (CSV) but these have limitations
•Apache AVRO often used for general-purpose processing
•Splittability, schema evolution, in-built metadata, support for block compression
•Parquet now commonly used with Impala due to column-orientated storage
•Mirrors work in RDBMS world around column-store
•Only return (project) the columns you require across a wide table
Parquet - Column-Orientated Storage for Analytics
40
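The column-projection benefit is easy to see in miniature: a row layout stores whole records together, so projecting one column still touches every record, while a column layout keeps each column contiguous and a query reads only the columns it names. A toy Python sketch (data invented):

# Row-oriented layout: each record stored together
rows = [
    {"id": 1, "name": "sarah", "country": "IE", "spend": 120.0},
    {"id": 2, "name": "bob",   "country": "UK", "spend": 80.0},
]

# Column-oriented layout: each column stored contiguously,
# which is why Parquet suits wide analytic tables
columns = {
    "id":      [1, 2],
    "name":    ["sarah", "bob"],
    "country": ["IE", "UK"],
    "spend":   [120.0, 80.0],
}

# SELECT sum(spend): the row layout scans whole records...
total_rowstore = sum(r["spend"] for r in rows)
# ...while the column layout reads just the "spend" column
total_colstore = sum(columns["spend"])

Same answer either way; the difference on disk is how many bytes had to be read to get it.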
•But Parquet (and HDFS) have significant limitations for real-time analytics applications
•Append-only orientation, focus on column-store makes streaming ingestion harder
•Cloudera Kudu aims to combine best of HDFS + HBase
•Real-time analytics-optimised
•Supports updates to data
•Fast ingestion of data
•Accessed using SQL-style tables and get/put/update/delete API
Cloudera Kudu - Best of HBase and Column-Store
41
•Kudu storage used with Impala - create tables using Kudu storage handler
•Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA
Example Impala DDL + DML Commands with Kudu
42
CREATE TABLE `my_first_table` (
  `id` BIGINT,
  `name` STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'my_first_table',
  'kudu.master_addresses' = 'kudu-master.example.com:7051',
  'kudu.key_columns' = 'id'
);

INSERT INTO my_first_table VALUES (99, "sarah");
INSERT IGNORE INTO my_first_table VALUES (99, "sarah");

UPDATE my_first_table SET name="bob" WHERE id = 3;

DELETE FROM my_first_table WHERE id < 3;
DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;
AND IT’S NOW IN-MEMORY
43
Accompanied by Innovations in Underlying Platform
45
Cluster Resource Management to support multi-tenant distributed services
In-Memory Distributed Storage, to accompany In-Memory Distributed Processing
DATAFLOW PIPELINES ARE THE NEW ETL
46
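A dataflow pipeline chains transformations, each producing a new dataset rather than mutating the previous one - the model behind Spark-style ETL. A minimal standard-library sketch (the records and field names are invented):

from collections import defaultdict

# Invented sample input: "date,country,amount" lines, one with a bad amount
raw = ["2016-09-01,IE,12.50", "2016-09-01,UK,", "2016-09-02,IE,7.25"]

def parse(line):
    date, country, amount = line.split(",")
    return {"date": date, "country": country,
            "amount": float(amount) if amount else None}

# Each step yields a new dataset, mirroring how dataflow engines
# chain transformations instead of updating rows in place
parsed = map(parse, raw)
cleaned = filter(lambda r: r["amount"] is not None, parsed)

# Terminal aggregation step: spend per country
by_country = defaultdict(float)
for r in cleaned:
    by_country[r["country"]] += r["amount"]

In an engine like Spark the same chain would be distributed and lazy, but the parse → filter → aggregate shape is the same.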
New ways to do BI
HADOOP IS THE NEW ETL ENGINE
49
50
“Proprietary ETL engines die circa 2015 – folded into big data”
- Oracle Open World 2015
Proprietary ETL is Dead. Apache-based ETL is What’s Next
[Timeline: the 1990s eon of scripts and PL/SQL (scripted SQL, stored procs); the period of proprietary batch ETL engines from 1994 (Informatica, Ascential/IBM, Ab Initio, Acta/SAP, SyncSort, Oracle Warehouse Builder); the era of SQL E-LT/pushdown with Oracle Data Integrator (ODI for Exadata, ODI for In-Mem, ODI for Columnar, ODI for Hive); big data ETL in batch (ODI for Pig & Oozie, ODI for Spark); streaming ETL (ODI for Spark Streaming)]
MACHINE LEARNING & SEARCH FOR “AUTOMAGIC” SCHEMA DISCOVERY
51
New ways to do BI
•By definition there's lots of data in a big data system ... so how do you find the data you want?
•Google's own internal solution - GOODS ("Google Dataset Search")
•Uses crawler to discover new datasets
•ML classification routines to infer domain
•Data provenance and lineage
•Indexes and catalogs 26bn datasets
•Other users, vendors also have solutions
•Oracle Big Data Discovery
•Datameer
•Platfora
•Cloudera Navigator
Google GOODS - Catalog + Search At Google-Scale
53
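The catalog-and-search idea can be sketched as a toy keyword index over dataset names - nothing like GOODS' scale or its ML-based domain inference, but the same basic shape (catalog entries invented):

from collections import defaultdict

# Hypothetical dataset catalog entries
catalog = {
    "web_logs_2016": {"owner": "ops", "format": "json"},
    "customer_master": {"owner": "crm", "format": "parquet"},
    "customer_web_events": {"owner": "marketing", "format": "avro"},
}

# Build an inverted index from name tokens to dataset names
index = defaultdict(set)
for name in catalog:
    for token in name.split("_"):
        index[token].add(name)

def search(keyword):
    """Return the datasets whose names contain the keyword token."""
    return sorted(index.get(keyword, set()))

A real system like GOODS crawls datasets it was never told about and infers schema and provenance; here the "crawl" is just iterating a dict.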
A NEW TAKE ON BI
54
•Came out of the data science movement, as a way to "show workings"
•A set of reproducible steps that tell a story about the data
•As well as being a better command-line environment for data analysis
•One example is Jupyter, the evolution of the IPython notebook
•Supports PySpark, Pandas etc
•See also Apache Zeppelin
Web-Based Data Analysis Notebooks
55
AND EMERGING OPEN-SOURCE BI TOOLS AND PLATFORMS
57
And Emerging Open-Source BI Tools and Platforms
http://larrr.com/wp-content/uploads/2016/05/paper.pdf
WELCOME TO THE FUTURE
62
Mark Rittman, Oracle ACE Director
THE FUTURE OF ANALYTICS, DATA INTEGRATION
AND BI ON BIG DATA PLATFORMS
HADOOP USER GROUP IRELAND (HUG IRL)
Dublin, September 2016