© 2015 IBM Corporation
Getting off to a fast start with Big SQL for Big Data analytics: Session #4132 Cynthia M. Saracco Oct. 27, 2015
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in
making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material,
code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a
controlled environment. The actual throughput or performance that any user will experience will vary
depending upon many factors, including considerations such as the amount of multiprogramming in the
user’s job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated
here.
Overview
• Big SQL = industry-standard SQL for Hadoop platforms
Easy on-ramp to Hadoop for SQL professionals
Supports familiar SQL tools / apps (via JDBC and ODBC drivers)
• How to get started?
Create tables / views. Store data in DFS, HBase, or the Hive warehouse
Load data into tables (from local files, remote files, RDBMSs)
Query data (project, restrict, join, union, wide range of sub-queries, wide range of built-in functions, UDFs, . . . )
Explore advanced features
• Collect statistics and inspect the data access plan
• Transparently join/union Hadoop & RDBMS data (query federation)
• Leverage Big Data technologies: SerDes, Spark, . . .
Invocation options
• Command-line interface: Java SQL Shell (JSqsh)
• Web tooling (Data Server Manager)
• Tools that support the IBM JDBC/ODBC driver
Creating a Big SQL table
• Standard CREATE TABLE DDL with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;
Worth noting:
• “Hadoop” keyword creates table in DFS
• Row format delimited and textfile formats are default
• Constraints not enforced (but useful for query optimization; see the statistics sketch below)
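The overview also calls out collecting statistics and inspecting the data access plan. A minimal hedged sketch for the table just created; it assumes the default user schema (shown as myid on the next slide) and that the EXPLAIN tables already exist for your user ID:

-- gather table and column statistics so the cost-based optimizer can build better access plans
analyze table myid.users compute statistics for columns id, office_id, fname, lname;

-- capture the access plan for a query (format it afterwards with a tool such as db2exfmt);
-- the office_id value 290 is made up for illustration
explain plan for
select fname, lname from myid.users where office_id = 290;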
Results from previous CREATE TABLE . . .
• Data stored in subdirectory of Hive warehouse
. . . /hive/warehouse/myid.db/users
Default schema is user ID. Can create new schemas
“Table” is just a subdirectory under schema.db
Table’s data are files within table subdirectory
• Metadata collected (Big SQL & Hive)
SYSCAT.* and SYSHADOOP.* views
• Optionally, use the LOCATION clause of CREATE TABLE to layer a Big SQL schema over existing DFS directory contents (see the sketch after this list)
Useful if table contents already in DFS
Avoids need to LOAD data
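A hedged sketch of that approach; the directory path and column list below are hypothetical, and the EXTERNAL keyword (also used on the SerDe slide later) keeps DROP TABLE from deleting the underlying files:

create external hadoop table weblogs
(
log_ts timestamp,
user_id int,
url varchar(200))
row format delimited
fields terminated by ','
stored as textfile
location '/user/myid/data/weblogs';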
Populating Tables via LOAD
• Typically best runtime performance
• Load data from local or remote file system
load hadoop using file url
'sftp://myID:[email protected]:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
• Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS SQL Server, Informix) via a JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
Querying your Big SQL tables
• Same as ISO-compliant RDBMS
• No special query syntax for Hadoop tables
Projections, restrictions
UNION, INTERSECT, EXCEPT
Wide range of built-in functions (e.g., OLAP; see the window-function sketch after the sample query below)
Full support for subqueries
All standard join operations
. . .
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey
  )
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate
  )
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
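The feature list above also mentions OLAP (window) functions. A minimal hedged sketch over the same TPC-H-style schema, ranking each customer's orders by price (column names follow the standard TPC-H orders table):

SELECT
  o_custkey,
  o_orderkey,
  o_totalprice,
  RANK() OVER (PARTITION BY o_custkey ORDER BY o_totalprice DESC) AS price_rank
FROM
  orders;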
Accessing Big SQL data from Spark shell
// based on BigInsights 4.1, which includes Spark 1.4.1
// establish a Hive context
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// query some Big SQL data
val saleFacts = sqlContext.sql("select * from bigsql.sls_sales_fact")
// action on the data – count # of rows
saleFacts.count()
. . .
// imports needed for the MLlib calls below
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// transform the data as needed (create a Vector with data from 2 cols)
val subset = saleFacts.map {row => Vectors.dense(row.getDouble(16), row.getDouble(17))}
// invoke basic Spark MLlib statistical function over the data
val stats = Statistics.colStats(subset)
// print one of the statistics collected
println(stats.mean)
A word about . . . SerDes
• Custom serializers / deserializers (SerDes)
Read / write complex or “unusual” data formats (e.g., JSON)
Commonly used by Hadoop community
Developed by user or available publicly
• Add the SerDe JAR to the appropriate Big SQL and Hive library directories; reference the SerDe when creating the table
-- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe
-- Location clause points to DFS dir containing JSON data
-- External clause means the DFS dir & data won't be dropped after a DROP TABLE command
create external hadoop table socialmedia-json (Country varchar(20), . . . )
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '</hdfs_path>/myJSON';
select * from socialmedia-json;
Sample JSON input for previous example
(screenshot: JSON-based social media data to load into the Big SQL table socialmedia-json defined with the SerDe)
Sample Big SQL query output for JSON data
(screenshot: output of SELECT * FROM socialmedia-json)
Get started with Big SQL: External resources
• Hadoop Dev: links to videos, white paper, lab, . . . .
https://developer.ibm.com/hadoop/
We Value Your Feedback!
• Don’t forget to submit your Insight session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference.
• Access your surveys at insight2015survey.com to quickly submit your surveys from your smartphone, laptop or conference kiosk.
Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services
available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the
identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.
Notices and Disclaimers (cont'd)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
© 2015 IBM Corporation
Thank You
Supplemental
SQL for Hadoop (Big SQL) architecture
(architecture diagram: an SQL-based application connects through the IBM data server client to the Big SQL engine; the engine's SQL MPP run-time handles data storage in DFS)
Optimization and performance
– IBM MPP engine (C++) replaces the Java MapReduce layer
– Continuously running daemons (no start-up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with the ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
Various storage formats supported
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
Integration with RDBMSs via LOAD, query federation
Populating Tables via INSERT
• INSERT INTO . . . SELECT FROM . . .
Parallel read and write operations
CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL, product_name VARCHAR(150),
Quantity INT, order_method_en VARCHAR(90) ) STORED AS parquetfile;
-- source tables do not need to be in Parquet format
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key
and sales.quantity > 5500;
• INSERT INTO . . . VALUES(. . . )
Not parallelized; one file created per INSERT. Useful only for quick tests (see the sketch below)
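A minimal hedged sketch, reusing the users table created earlier (the values are made up):

-- each INSERT statement writes its own small file under the table's DFS directory,
-- so this pattern is only appropriate for quick tests
insert into users values (1, 290, 'John', 'Smith');
insert into users values (2, 290, 'Jane', 'Doe');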
CREATE . . . TABLE . . . AS SELECT . . .
• Create a Big SQL table based on contents of other table(s)
• Source tables can be in different file formats or use different underlying storage mechanisms
-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90) )
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, extern.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;
A word about . . . query federation
• Data rarely lives in isolation
• Big SQL transparently queries heterogeneous systems
Join Hadoop to RDBMSs
Query optimizer understands capabilities of external system, including available statistics
As much work as possible is pushed down to each system for processing (see the sketch below)
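A hedged sketch of what this looks like once a remote DB2 source has been registered via the federation DDL (CREATE SERVER / CREATE USER MAPPING); the server name, remote schema, and the region/join columns below are hypothetical:

-- expose a remote DB2 table as a local nickname
create nickname db2_region for db2srv.GOSALESDW.GO_REGION_DIM;

-- join Hadoop data with the remote DB2 data in a single statement;
-- the optimizer pushes as much work as possible down to DB2
select f.product_key, r.region_en, sum(f.quantity) as total_qty
from bigsql.sls_sales_fact f, db2_region r
where f.region_key = r.region_key   -- hypothetical join column
group by f.product_key, r.region_en;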
(cluster topology diagram: a Big SQL head node coordinates multiple compute nodes, each running the Big SQL worker alongside a task tracker and data node)
A word about . . . performance
• TPC (Transaction Processing Performance Council)
Formed August 1988
Widely recognized as the most credible, vendor-independent SQL benchmarks
TPC-H and TPC-DS are the most relevant to SQL over Hadoop
• Read/write portions of the workloads are not well suited to HDFS
• Hadoop-DS benchmark: BigInsights, Hive, Cloudera
Run by IBM & reviewed by TPC certified auditor
Based on TPC-DS. Key deviations
• No data maintenance or persistence phases (not supported across all vendors)
Common set of queries across all solutions
• Subset that all vendors can successfully execute at the scale factor
• Queries are not cherry-picked
Most complete TPC-DS-like benchmark executed so far
Analogous to porting a relational workload to SQL on Hadoop
Hadoop-DS Benchmark Topology
3 identical 17-node clusters deployed
Hardware spec (per node):
- RHEL 6.4
- IBM x3650 M4 BD: Intel e5-[email protected] v2, 2 sockets, 10 cores each w/HT = 40 logical CPUs
- 128GB RAM, 1866MHz
- 10 x 3.5" HDD @ 2TB
- Dual-port 10GbE
Workload:
- Hadoop-DS, 10TB scale factor
1) Load data
2) Power run (single stream)
3) Throughput run (4 streams)
Impala 1.4 on CDH 5
- Parquet storage format
Hive 0.13 (Tez) on Hortonworks HDP 2.1
- ORC storage format
Big SQL 3.0 on BigInsights 3.0.0.1
- Parquet storage format
http://public.dhe.ibm.com/common/ssi/ecm/im/en/imw14800usen/IMW14800USEN.PDF
IBM Big SQL – Runs 100% of the queries
Key points
With Impala and Hive, many queries needed to be re-written, some significantly
Owing to various restrictions, some queries could not be re-written or failed at run-time
Re-writing queries in a benchmark scenario where results are known is one thing – doing this against real databases in production is another
Other environments require significant effort at scale
Results for 10TB scale shown here
Hadoop-DS benchmark – Single user performance @ 10TB
Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for a single query stream using 46 common queries
Based on IBM internal tests comparing BigInsights Big SQL, Cloudera Impala and Hortonworks Hive (current versions available as of 9/01/2014) running on identical hardware. The test workload was based on the latest revision of the TPC-DS
benchmark specification at 10TB data size. Successful executions measure the ability to execute queries a) directly from the specification without modification, b) after simple modifications, c) after extensive query rewrites. All minor modifications are
either permitted by the TPC-DS benchmark specification or are of a similar nature. All queries were reviewed and attested by a TPC certified auditor. Development effort measured time required by a skilled SQL developer familiar with each system to
modify queries so they will execute correctly. Performance test measured scaled query throughput per hour of 4 concurrent users executing a common subset of 46 queries across all 3 systems at 10TB data size. Results may not be typical and will
vary based on actual workload, configuration, applications, queries and other variables in a production environment.
Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
Hadoop-DS benchmark – multi-user performance @ 10TB
With 4 concurrent query streams, Big SQL is 2.1x faster than Impala and 8.5x faster than Hive 0.13 using 46 common queries
Audited Results
• Letters of attestation are available for both Hadoop-DS benchmarks at 10TB and 30TB scale
• InfoSizing, a Transaction Processing Performance Council certified auditor, verified the IBM results as well as the results on Cloudera Impala and Hortonworks Hive.
• These results are for a non-TPC benchmark. A subset of the TPC-DS benchmark standard requirements was implemented.