Upload
mukund-babbar
View
276
Download
1
Embed Size (px)
Citation preview
1 1 Pivotal Confidential–Internal Use Only
SQL & Machine Learning on Hadoop
Mukund Babbar Pivotal Feb, 2015
1986 … 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014
1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015
Journey to Apache
Michael Stonebraker develops Postgres at UCB
Postgres adds support for SQL
Open Source PostgreSQL
PostgreSQL 7.0 released
PostgreSQL 8.0 released
Greenplum forks PostgreSQL
Hadoop 1.0 Released
HAWQ & MADlib go Apache
HAWQ launched
Hadoop 2.0 Released
MADlib launched
Greenplum open sourced
3 3 Pivotal Confidential–Internal Use Only
Apache HAWQ Overview
4
HAWQ – SQL on Hadoop
5
Shared-Nothing Database Architecture
Standby Master
Segment Host with one or more Segment Instances Segment Instances process queries in parallel
High speed interconnect for continuous pipelining of data processing …
Master Host
SQL Master Host and Standby Master Host Master coordinates work with Segment Hosts
Interconnect
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
Segment Hosts have their own CPU, disk and memory (shared nothing)
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
node1
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
node2
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
node3
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
nodeN
6
Key Features of
HAWQ
5
7
5 • Up to 30x SQL-‐on-‐Hadoop performance advantage
• Faster ;me to insight • Massive MPP scalability to petabytes
Benefits: Near real-‐;me latency, complex queries and advanced analy;cs at scale
1. Advanced Analy9cs Performance
Key Features of
HAWQ
8
HAWQ Performance vs Impala
HAWQ Faster
Impala Faster
2 28 46 66 73 76 79 80 88 90 96
HAWQ • Faster on 46 of 62
TPC-DS queries completed*
• 4.55x mean avg. • 12 hrs faster total
* Impala supported 74 of 99 queries, 12 crashed mid-run
9
HAWQ vs Apache Hive w/Tez
HAWQ Faster
Hive Faster
3 7 15 25 27 34 46 48 76 79 89 90 96
HAWQ • Faster on 45 of 60
TPC-DS queries completed*
• 3.44x mean avg. • 9 hrs faster total
* Hive supported 65 of 99 queries, 5 crashed mid-run
10
5 • ANSI SQL-‐92, -‐99, -‐2003 • All 99 TPC-‐DS queries tested, no modifica;ons
• Plus, OLAP extensions • Complete ACID integrity and reliability Benefits: 100% SQL compliant No risk to SQL applica;ons All na;ve on HDP via HAWQ
2. 100% ANSI SQL Compliant
Key Features of
HAWQ
11
5 • Advanced machine learning for big data • Local, in-‐database opera;on • Excep;onal MPP/parallel performance • Open source, Postgres-‐based Benefits: Advanced, highly scalable, machine learning, directly on data in Hadoop
3. Integrated Machine Learning
Key Features of
HAWQ
12
5 • HDP, PHD, other ODPi-‐derived distros • Easily managed via Ambari • On premises, in cloud, or PaaS • HBase, Avro, Parquet and more • Connectors to make HAWQ data available to other SQL query tools Benefits: Flexibility Accessibility Portability
4. Flexible Deployment
Key Features of
HAWQ
13
5 • Cost-‐based query op;miza;on • Robust query plan op;miza;on • Complex big data management
Benefits: Op;mize performance and costs Maximize Hadoop cluster resources Offload EDW w/o compromise
5. Query Op9miza9on Op9ons
Key Features of
HAWQ
14
Advanced MPP: Polymorphic Storage™
� Columnar storage is well suited to scanning a large percentage of the data
� Row storage excels at small lookups � Most systems need to do both � Row and column orienta;on can be
mixed within a table or database
� Both types can be drama;cally more efficient with compression
� Compression is definable column by column: � Blockwise: Gzip1-‐9 & QuickLZ � Streamwise: Run Length Encoding (RLE) (levels 1-‐4)
� Flexible indexing, par;;oning enable more granular control and enable true ILM
TABLE ‘SALES’ Mar Apr May Jun Jul Aug Sept Oct Nov
Row-‐oriented for Small Scans Column-‐oriented for Full Scans
15
PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}
• Allows users to write HAWQ functions in R, Perl, Java, Perl, pgsql or C languages
• The interpreter/VM of the language ‘X’ is installed on each node of the HAWQ Cluster
• Data Parallelism: – PL/X piggybacks on
HAWQ’s MPP architecture
16
Apache HAWQ
● Discover New Rela9onships ● Enable Data Science ● Analyze External Sources ● Query All Data Types!
Mul9-‐level Fault Tolerance
Granular Authoriza9on
Resource Mgmt (+ YARN)
high mul(-‐tenancy
ANSI SQL Standard
OLAP Extensions
JDBC ODBC Connec9vity
Parallel Processing
Online Expansion
HDFS
Petabyte Scale
Cost Based Op9mizer
Dynamic Pipelining
ACID + Transac9onal
Mul9-‐Language UDF Support
Built-‐in Data Science Library
Extensible (PXF)
Query External Sources
Hardened, 10+ Years Investment, Produc9on Proven
Accessibility + Usability
HDFS Na9ve File Formats
● Manage Mul9ple Workloads ● Petabyte Scale Analy9cs ● Security controls
● Leverage Exis9ng SQL Skills & BI Tools
● Easily Integrate with Other Tools
● Sub-‐second Performance
Compression + Par99oning
core
compliance
● Hadoop-‐Na9ve ● Supports Pivotal HD
and Hortonworks Data Pladorm
● Ambari-‐Integrated
17
Apache HAWQ 2.0 (new features..) Areas of Enhancement New Features
Elas;c & Scalable Architecture
Hadoop-‐Na;ve Integra;ons
Simplified External Data Access/Queries
Performance & Op;miza;ons
On-‐Demand Virtual Segments
Flexible Query Dispatch on subset nodes
3 Tier RM: YARN level>User>Query-‐Operator
Dynamic Cluster Expansion (no redistribute)
New Fault Tolerance Service
HCatalog integra;on -‐ Read Access
HDFS Catalog Cache
Per Table Directory storage (user friendly)
Single physical segment per node
Easier Administra;on/Usage
Cloud-‐Ready Simpler Management Commands
18
HAWQ Segments
HAWQ Masters
Yarn
Physical Segment
Client
Parser/ Analyzer
Op;mizer
Dispatcher
DataNode
NodeManager
NameNode NameNode
External Data Stores via Xtension Framework (Hive/HBase/etc)
Resource Manager
Fault Tolerance Service
Catalog Service
Virtual Segment
Virtual Segment
Physical Segment
DataNode
NodeManager
Virtual Segment
Virtual Segment
Physical Segment
DataNode
NodeManager
Virtual Segment
Virtual Segment
Resource Broker
libYARN
HDFS Catalog Cache
Interconnect Interconnect
Apache HAWQ 2.0 Architecture
19 19 Pivotal Confidential–Internal Use Only
Apache MADlib Overview
20
Scalable, In-Database Machine Learning
• Open Source https://github.com/apache/incubator-madlib • Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL • Downloads and Docs: http://madlib.incubator.apache.org/
Apache (incubating)
21
Functions Predictive Modeling Library
Linear Systems • Sparse and Dense Solvers • Linear Algebra
Matrix Factorization • Singular Value Decomposition (SVD) • Low Rank
Generalized Linear Models • Linear Regression • Logistic Regression • Multinomial Logistic Regression • Cox Proportional Hazards Regression • Elastic Net Regularization • Robust Variance (Huber-White), Clustered
Variance, Marginal Effects
Other Machine Learning Algorithms • Principal Component Analysis (PCA) • Association Rules (Apriori) • Topic Modeling (Parallel LDA) • Decision Trees • Random Forest • Support Vector Machines • Conditional Random Field (CRF) • Clustering (K-means) • Cross Validation • Naïve Bayes • Support Vector Machines (SVM)
Descriptive Statistics
Sketch-Based Estimators • CountMin (Cormode-Muth.) • FM (Flajolet-Martin) • MFV (Most Frequent Values) Correlation Summary
Support Modules
Array Operations Sparse Vectors Random Sampling Probability Functions Data Preparation PMML Export Conjugate Gradient
Inferential Statistics
Hypothesis Tests
Time Series • ARIMA
Oct 2014
22
MADlib Advantages
� Better parallelism – Algorithms designed to leverage MPP and
Hadoop architecture
� Better scalability – Algorithms scale as your data set scales
� Better predictive accuracy – Can use all data, not a sample
� ASF open source (incubating) – Available for customization and optimization
23
Calling MADlib Functions: Fast Training & Scoring • MADlib allows users to easily create
models without moving data out of the systems
– Model generation – Model validation – Scoring (evaluation of) new data
• All the data can be used in one model • Built-in functionality to create multiple
smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own
24
Challenges in computing OLS solution
a b c d e f g h
X
Segment 1
Segment 2
25
Challenges in computing OLS solution
a b c d e f g h
X
Segment 1
Segment 2
a c e g b d f h
Segm
ent 1
Segm
ent 2
XT
26
Challenges in computing OLS solution
a b c d e f g h
X
a c e g b d f h
XT
a2+c2+e2+g2 =
Data across nodes are multiplied
27
Challenges in computing OLS solution
a b c d e f g h
X
a c e g b d f h
XT
a2+c2+e2+g2 =
Looks like the result can be decomposed
ab+cd+ef+gh b2+d2+f2+h2
ab+cd+ef+gh
28
Challenges in computing OLS solution
a b c d e f g h
X
a c e g b d f h
XT
a2+c2+e2+g2 =
Data across nodes are multiplied!
ab+cd+ef+gh b2+d2+f2+h2
ab+cd+ef+gh
= + a b e f
e f a b + c d g
h g h c
d +
29
Linear Regression on 10 Million Rows in Seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
30
Contributors Welcome!
• Web sites – http://hawq.incubator.apache.org/ – http://madlib.incubator.apache.org/ – https://cran.r-project.org/web/packages/PivotalR/index.html
• Github – https://github.com/apache/incubator-hawq – https://github.com/apache/incubator-madlib – https://github.com/pivotalsoftware/PivotalR
31
?