Upload
oreillystrata
View
3.221
Download
4
Embed Size (px)
Citation preview
Strata Conference, February 2013
SQL on Hadoop Defining the New Genera/on of Analy/c Databases
Speaker Bio: Carl Steinbach
2
Currently:
Engineer at Citus Data PMC Chair, Committer -- Apache Hive Project @cwsteinbach on Twitter
Formerly:
Cloudera, Informatica, NetApp, Oracle
This is going to sound strange, but…
3
I used to think databases were boring
Why?
4
Undergrad at MIT 1997-2001 Number of Database Classes: 0 Number of Database Faculty Members: 0 My Conclusion: Databases are a Dead Field
Things Changed Over the Next Couple of Years
5
I got a job! Database Group Formed at MIT (2003)
- Mike Stonebraker - Sam Madden
New Class: 6.830 Database Systems (2005)
What Changed?
6
Web-scale Data New DB Research: Columnar Storage, NoSQL MPP Analytic Databases Gained Market Traction GFS (’03) and MapReduce (‘04) Papers Apache Hadoop – v0.1.0 released in 2006
What’s Good About Hadoop?
7
Commodity Storage Scale-out Flexibility
MapReduce Multi-structured Data
What’s Bad About Hadoop?
8
MapReduce! No Schemas! Missing Features
Optimizer, Indexes, Views Incompatibility with Existing Tools
BI, ETL, IDEs
Hive MetaStore
Apache Hive Solved Many of These Problems
9
HiveServer2
SQL to MapReduce Compiler
Rule Based Op/mizer
MR Plan Execu/on Coordinator
Table to Files
Table to Format
datanode
HDFS
Map/Reduce
Hive Operators
Hive SerDes
datanode
HDFS
Map/Reduce
Hive Operators
Hive SerDes
datanode
HDFS
Map/Reduce
Hive Operators
Hive SerDes
SQL Queries Catalog Metadata
User Client
Hive ODBC/JDBC
Hive CLI
ETL, BI, SQL IDE
But Other Problems Remained
10
MapReduce: Latency Overhead Many Missing Features: • ANSI SQL • Cost Based Optimizer • UDFs • Data Types • Security • …
One Solution: Separate MPP DB Cluster
11
MPP Worker Node MPP Worker Node MPP Worker Node MPP Worker Node
Global Query Executor
MPP Master Node
HDFS datanode
HDFS datanode
HDFS datanode
HDFS datanode
Local Query Executor
Local Query Executor
Local Query Executor
Local Query Executor
MPP Database Cluster
Hadoop Cluster
One Solution: Separate MPP DB Cluster
12
MPP Worker Node MPP Worker Node MPP Worker Node MPP Worker Node
Global Query Executor
MPP Master Node
HDFS datanode
HDFS datanode
HDFS datanode
HDFS datanode
Local Query Executor
Local Query Executor
Local Query Executor
Local Query Executor
Pull Data to Work IO Bo]leneck
Better Solution: A New Architecture for SQL on Hadoop
13
Global Query Executor
MPP Master Node
HDFS datanode
Local Query Executor
HDFS datanode
Local Query Executor
HDFS datanode
Local Query Executor
HDFS datanode
Local Query Executor
Maintain Data Locality
Push Work to
Data
The New Architecture in Detail: CitusDB
14
CitusDB Master Node
Metadata
Distributed Query Planner
Distributed Query Executor
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
PostgreSQL Tools
ODBC/JDBC Clients
Hadoop Metadata
HDFS NameNode
The New Architecture in Detail: CitusDB
15
CitusDB Master Node
Metadata
Distributed Query Planner
Distributed Query Executor
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
PostgreSQL Tools
ODBC/JDBC Clients
Hadoop Metadata
HDFS NameNode
Metadata Sync
Step 1) The CitusDB Master Node retrieves file system metadata from the Hadoop NameNode.
The New Architecture in Detail: CitusDB
16
CitusDB Master Node
Metadata
Distributed Query Planner
Distributed Query Executor
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
PostgreSQL Tools
ODBC/JDBC Clients
Hadoop Metadata
HDFS NameNode
User Query
Step 2) The user submits a SQL query to the CitusDB master node using the PostgreSQL CLI or a JDBC/ODBC app.
The New Architecture in Detail: CitusDB
17
CitusDB Master Node
Metadata
Distributed Query Planner
Distributed Query Executor
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
PostgreSQL Tools
ODBC/JDBC Clients
Hadoop Metadata
HDFS NameNode
Local Queries
Step 3) The Master Node generates an op/mized global query plan and sends fragment queries to the workers.
The New Architecture in Detail: CitusDB
18
CitusDB Master Node
Metadata
Distributed Query Planner
Distributed Query Executor
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
PostgreSQL Tools
ODBC/JDBC Clients
Hadoop Metadata
HDFS NameNode
Local Results
Step 4) The CitusDB worker processes running on each DataNode process the fragment queries and send par/al result sets back to the Master Node.
The New Architecture in Detail: CitusDB
19
CitusDB Master Node
Metadata
Distributed Query Planner
Distributed Query Executor
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrappers
PostgreSQL Tools
ODBC/JDBC Clients
Hadoop Metadata
HDFS NameNode
Query Results
Step 5) The Master Node merges the par/al result sets and returns the final result to the user.
CitusDB: Standing on the Shoulders of Giants
20
Mature, Battle-tested Enterprise Class Features Has an Elephant Mascot
Proven Scalability Cost Effectiveness Has an Elephant Mascot
+
Leveraging PostgreSQL Performance
21
Cost-based Query Optimizer
postgres=# EXPLAIN SELECT customer.c_custkey, sum((lineitem.l_extendedprice * (1::numeric -‐ lineitem.l_discount))) ….
-‐> Sort (cost=282459.19..282599.52 rows=56134 width=182) Sort Key: customer.c_custkey, customer.c_name Sort Method: external merge Disk: 17192kB
…. -‐> Hash Join (cost=39666.61..257246.25 rows=56134 width=16) Hash Cond: (lineitem.l_orderkey = orders.o_orderkey) -‐> Seq Scan on lineitem_102022 lineitem (cost=0.00..190571.11)
Leveraging PostgreSQL Features: More than 300 Built-in Functions
22
REGR_SLOPE ENUM_RANGE NTH_VALUE SIN TRANSACTION_TIMESTAMP REGR_SYY IS FALSE BROADCAST ISCLOSED VAR_SAMP QUERY_TO_XML GET_BIT IS NOT UNKNOWN ENUM_LAST NUMNODE SPLIT_PART REPLACE EVERY IS TRUE RPAD NOW REGR_SXX PATH PERCENT_RANK IN CURRENT_DATE UPPER_INF XMLAGG GENERATE_SERIES REGR_INTERCEPT DIAMETER ROW_NUMBER ARRAY_NDIMS
QUOTE_LITERAL STRING_TO_ARRAY CONVERT_TO DATE_TRUNC TO_DATE LOWER_INC CHAR_LENGTH CURRENT_TIMESTAMP ARRAY_LENGTH REPEAT UPPER CASE BOOL_AND CORR XPATH_EXISTS RADIUS MOD AREA ATAN2 REGEXP_REPLACE DECODE BOOL_OR CUME_DIST OVERLAY POINT JUSTIFY_INTERVAL POWER IS NOT TRUE FORMAT REVERSE IS NOT FALSE ROW_TO_JSON SET_BIT
COS EXTRACT DIV BTRIM LOWER CONCAT ARRAY_FILL JUSTIFY_DAYS VAR_POP OCTET_LENGTH STATEMENT_TIMESTAMP TAN ARRAY_APPEND XMLCOMMENT REGEXP_MATCHES CONVERT_FROM XPATH AT TIME ZONE RANDOM CEILING WIDTH ROUND COVAR_SAMP HOSTMASK ARRAY_DIMS CURSOR_TO_XML GENERATE_SUBSCRIPTS PG_SLEEP STDDEV_SAMP SIMILAR TO NOTNULL SUBSTRING REGEXP_SPLIT_TO_ARRAY
GREATEST REGR_SXY OVERLAPS FLOOR SQRT RTRIM REGR_AVGY IS DISTINCT TIMEOFDAY LN TO_CHAR TRUNC ISNULL SCHEMA_TO_XML COALESCE ENUM_FIRST BIT_AND RADIANS SUM TRIM STDDEV_POP LSEG STRING_AGG PCLOSE MASKLEN LIKE POSITION VAR_POP DENSE_RANK DATABASE_TO_XML HOST SETSEED TO_TIMESTAMP
QUOTE_IDENT REGR_R2 LAG PI TRUNC STRIP XMLAGG UPPER COVAR_POP NETMASK FIRST_VALUE LOWER_INF REGR_COUNT SET_MASKLEN NOW ISOPEN REGR_COUNT NOW MIN TO_HEX GET_BYTE XML_IS_WELL_FORMED LASTVAL HEIGHT DENSE_RANK SETVAL LAST_VALUE STRPOS COT ARRAY_CAT TO_ASCII ISFINITE NOT
SET_BYTE XMLFOREST LAG FORMAT ARRAY_AGG LTRIM BETWEEN BOX CURRVAL LOCALTIME LPAD REGR_AVGX DATE_PART ARRAY_TO_STRING EXTRACT UPPER_INC TRANSLATE SQRT NOT LIKE LOG DATE_TRUNC VARIANCE UNNEST ANY LOCALTIMESTAMP LENGTH INITCAP SIGN SUBSTR STDDEV ABS SOME MD5
Leveraging PostgreSQL Features
23
Extensible, Rich Type System Pluggable Format Handlers Security Internationalization Connectivity: ODBC, JDBC Ecosystem Add-Ons:
PostGIS, XML/JSON, Fuzzy Search, Language Bindings (.NET, Python, etc)
Where are We Headed? Distributed. SQL. Anywhere.
24
CitusDB Master Node
Metadata
Distributed Query Planner
Distributed Query Executor
Hadoop Datanode
HDFS
Local Query Planner
Local Query Executor
Foreign Data Wrapper
RDBMS server
RDBMS
Local Query Planner
Local Query Executor
Foreign Data Wrapper
MongoDB Shard
mongod
Local Query Planner
Local Query Executor
Foreign Data Wrapper
Defining the New Generation of Distributed Analytic Databases
25
SQL à Ease of Use, Increased Productivity Real-time responsiveness à Faster Data Locality à Proven Scalability Schema-on-Read à Flexibility, Lower Cost
Where Are We At?
26
CitusDB SQL on Hadoop is in Open Beta Download our Binary Packages Or Use Our EC2 AMI
http://citusdata.com/docs/sql-on-hadoop
We’re Hiring
27
http://citusdata.com/job