This presentation was delivered at Hadoop Kitchen in Moscow on 27.09.2014. It describes the technology underlying Pivotal HAWQ.
Pivotal Confidential–Internal Use Only
Pivotal HAWQ
A.Grishchenko
HadoopKitchen @ Mail.ru, 27 Sep 2014

SQL-on-Hadoop Solutions
Hive (2008)
Developed by Facebook
– Hive is used for data analysis in their data warehouse
– DWH size is ~300 PB at the moment; ~600 TB of data is loaded daily
– Data is compressed using ORCFile; the compression ratio is ~8x
The HiveQL language is not compatible with ANSI SQL-92
Has many limitations on subqueries
Cost-based optimizer (Optiq) is only in technical preview now

SQL-on-Hadoop Solutions
Impala (10.2012)
Developed by Cloudera
– Open-source solution
– Cloudera sells this solution to enterprise shops
– Was in beta until May 2013
Supports HiveQL; complete ANSI SQL-92 support is planned
Written in C++; does not use MapReduce for running queries
Requires a lot of memory; joins of big tables usually cause OOM errors

SQL-on-Hadoop Solutions
Stinger (02.2013)
Hortonworks initiative
– Consists of a number of steps to make Hive run 100x faster
Tez – runs Hive queries as Tez jobs, which are similar to MapReduce jobs but may have arbitrary topology
Optiq – cost-based query optimizer for Hive (in technical preview at the moment)
ORCFile – columnar storage format with adaptive compression and inline indexes
HIVE-5317 – ACID and UPDATE/DELETE support (release expected ~11.2014)

SQL-on-Hadoop Solutions
HAWQ (02.2013)
Pivotal product
– Greenplum MPP DBMS, ported to store data in HDFS
– Written in C; the query optimizer was rewritten for this solution (ORCA)
Supports ANSI SQL-92 and analytic extensions from SQL-2003
Supports complex queries with correlated subqueries, window functions and different join types
Data is written to disk only if the process does not have enough memory
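The "disk only when memory runs out" behaviour can be illustrated with a toy external sort. This is a minimal sketch and not HAWQ's actual spill logic, which the slides do not describe: the function names, the row-count budget `max_rows_in_memory` and the use of `pickle` temp files are all assumptions made for illustration.

```python
import heapq
import pickle
import tempfile


def _write_run(sorted_rows):
    """Write one sorted run to an anonymous temp file and return the file."""
    run = tempfile.TemporaryFile()
    for row in sorted_rows:
        pickle.dump(row, run)
    return run


def _read_run(run):
    """Stream rows back from a spilled run, one at a time."""
    run.seek(0)
    while True:
        try:
            yield pickle.load(run)
        except EOFError:
            return


def sort_with_spill(rows, max_rows_in_memory):
    """Return (sorted_rows, number_of_spilled_runs).

    Hypothetical helper: the memory budget is counted in rows, not bytes.
    """
    runs, buffer = [], []
    for row in rows:
        buffer.append(row)
        if len(buffer) > max_rows_in_memory:
            # Budget exceeded: sort what we hold and move it to disk.
            runs.append(_write_run(sorted(buffer)))
            buffer = []
    buffer.sort()
    if not runs:
        return buffer, 0  # everything fit in memory, disk was never touched
    try:
        # Merge the in-memory remainder with the runs streamed from disk.
        merged = list(heapq.merge(buffer, *(_read_run(r) for r in runs)))
    finally:
        for run in runs:
            run.close()
    return merged, len(runs)
```

With a generous budget the function never creates a temp file; with a tight one it produces the same result through spilled runs, which is the trade-off the slide describes.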

SQL-on-Hadoop Solutions
Vertica, BigSQL (2014)
HP Vertica
– Supports only the MapR distribution, as it requires updatable storage
– Supports ANSI SQL-92, SQL-2003
– Supports UPDATE/DELETE
– Officially announced as available in July 2014; no implementations yet
IBM BigSQL v3
– IBM DB2 ported to store data in HDFS
– Federated queries, good query optimizer, etc.
Both solutions are similar to Pivotal HAWQ in general idea
Timeline: Hive (2008), Impala (10.2012), Stinger (02.2013), HAWQ (02.2013), Vertica and BigSQL (2014)

Pivotal HAWQ Components
[Diagram: cluster layout]
Server 1: Master
Server 2: Standby Master
Server 3: Segment 1, Segment 2, …, Segment K
Server 4: Segment K+1, Segment K+2, …, Segment 2*K
…
Server M: …, Segment N

Pivotal HAWQ Components
[Diagram: HAWQ and HDFS services per server]
Server 1: HAWQ Master
Server 2: HAWQ Standby Master
Server 3: NameNode
Server 4: SNameNode
ZK and QJM: three instances
Server 5: HAWQ Segment, Datanode
Server 6: HAWQ Segment, Datanode
…
Server M: HAWQ Segment, Datanode

Pivotal HAWQ Components
[Diagram: master components]
HAWQ Master: Query Parser, Query Optimizer, Query Executor, Transaction Manager, Process Manager, Metadata Catalog
HAWQ Standby Master: Query Parser, Query Optimizer, Query Executor, Transaction Manager, Process Manager, Metadata Catalog
WAL replication between the HAWQ Master and the HAWQ Standby Master

Pivotal HAWQ Components
Metadata is stored only on the master servers
Metadata is stored in a modified Postgres instance and replicated to the standby master via WAL
Metadata contains
– Table information: schema, names, files
– Statistics: number of unique values, value ranges, sample values, etc.
– Information about users, groups, priorities, etc.
A master server shutdown causes a switch to the standby master, with the loss of running sessions
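The combination of "metadata is replicated with WAL" and "running sessions are lost on failover" can be shown with a toy model. This is only a sketch under stated assumptions: the class names, the `(lsn, key, value)` record layout and the synchronous shipping are hypothetical and are not taken from HAWQ or Postgres internals.

```python
class CatalogNode:
    """A node that holds a metadata catalog and can replay log records."""

    def __init__(self):
        self.catalog = {}
        self.applied_lsn = 0  # sequence number of the last replayed record

    def apply(self, record):
        lsn, key, value = record
        self.catalog[key] = value
        self.applied_lsn = lsn


class Master(CatalogNode):
    """Master: logs every catalog change, applies it, ships it to the standby."""

    def __init__(self, standby):
        super().__init__()
        self.standby = standby
        self.wal = []          # write-ahead log of catalog changes
        self.sessions = set()  # in-memory session state, never replicated

    def change_metadata(self, key, value):
        record = (len(self.wal) + 1, key, value)
        self.wal.append(record)     # write the log record first
        self.apply(record)          # then change the local catalog
        self.standby.apply(record)  # and replay the same record on the standby


standby = CatalogNode()
master = Master(standby)
master.sessions.add("session-1")
master.change_metadata("table:sales", {"columns": ["id", "amount"]})
master.change_metadata("stats:sales.id", {"distinct": 1000})
```

After these calls the standby holds an identical catalog, but it has no record of `session-1`: only the logged catalog changes travel to the standby, so a client has to reconnect after a switch.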

Pivotal HAWQ Components
[Diagram: segment server components]
HAWQ Segment: Query Executor, libhdfs3, PXF
HDFS Datanode: Segment Data Directory
Local Filesystem (xfs): Spill Data Directory

Pivotal HAWQ Components
Both masters and segments are modified Postgres instances (more precisely, modified Greenplum instances)
When you open a connection to the master server, the postmaster process forks a process that serves your session
When query execution starts, the master connects to the segment instances, and they also fork a process to execute the query
The query execution plan is split into independent blocks (slices); each slice is executed as a separate OS process on the segment server, and data is moved between slices over UDP
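How a plan might be cut into slices can be sketched on a toy plan tree. The slides do not say where the cuts are made; the sketch below assumes the Greenplum convention that a slice boundary sits at each data-movement ("Motion") operator. `PlanNode` and `split_into_slices` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    op: str
    children: list = field(default_factory=list)


def split_into_slices(root):
    """Group plan operators into slices, cutting below every Motion node.

    Assumption: a Motion node receives data from a different slice, so its
    children start a new slice. Each slice is a list of operator names.
    """
    slices = []

    def visit(node, current):
        current.append(node.op)
        for child in node.children:
            if node.op.endswith("Motion"):
                new_slice = []  # the child runs in its own process group
                slices.append(new_slice)
                visit(child, new_slice)
            else:
                visit(child, current)

    top = []
    slices.append(top)
    visit(root, top)
    return slices


plan = PlanNode("Gather Motion", [
    PlanNode("Hash Join", [
        PlanNode("Seq Scan a"),
        PlanNode("Redistribute Motion", [PlanNode("Seq Scan b")]),
    ]),
])
slices = split_into_slices(plan)
```

For this plan the result is three slices: the gather step, the join together with the scan of `a`, and the scan of `b` feeding the redistribution. In the model described above, each of them would run as a separate OS process on every segment server.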

Pivotal HAWQ Components
Tables can be stored as:
– Row-oriented (quicklz, zlib compression)
– Column-oriented (quicklz, zlib, rle compression)
– Parquet tables
Each segment has a separate directory on HDFS where it stores its data shard
With columnar storage, each column is stored as a separate file
Parquet stores the table by columns without loading the NameNode with many files / block location requests
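The NameNode pressure mentioned above is simple arithmetic, sketched here under simplifying assumptions: exactly one file per column per segment for column-oriented tables, and one file per segment for row-oriented and Parquet tables. Real file counts may differ (for example with concurrent writers), and `hdfs_file_count` is a hypothetical helper.

```python
def hdfs_file_count(segments, columns, storage):
    """Estimate how many HDFS files a single table occupies.

    Assumes one file per column per segment for "column" storage and
    one file per segment for "row" and "parquet" storage.
    """
    if storage == "column":
        return segments * columns
    if storage in ("row", "parquet"):
        return segments
    raise ValueError(f"unknown storage type: {storage}")


# A 200-column table on a cluster with 96 segments:
column_files = hdfs_file_count(96, 200, "column")
parquet_files = hdfs_file_count(96, 200, "parquet")
```

Under these assumptions the column-oriented table needs 19200 files against 96 for Parquet, and every one of those files is an object the NameNode must track and resolve block locations for.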

Query Execution in Pivotal HAWQ
[Animated diagram, slides 14 to 34. HAWQ Master: Parser, Query Optimizer, Metadata, Transaction Manager, Process Manager, Query Executor. NameNode. Each HAWQ Segment: Backend, HDFS Datanode, Segment Directory, Local Spill Directory. In later frames a query executor (QE) appears on each segment and runs slices S1, S2 and S3.]

PXF Framework
Gives you the ability to read different data types from HDFS
– Text files, both compressed and uncompressed
– Sequence files
– Avro files
Able to read data from external data sources
– HBase
– Cassandra
– Redis
Extensible API
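The extension points can be pictured as two plug-in roles: a fragmenter that divides a source into pieces, and an accessor that reads one piece. The sketch below is a Python illustration only. The real PXF API is not shown in the slides; every class name, method signature and the round-robin assignment in `scan` are assumptions, and only the Fragmenter and Accessor roles come from the deck.

```python
from abc import ABC, abstractmethod


class Fragmenter(ABC):
    """Splits a data source into fragments that can be read independently."""

    @abstractmethod
    def fragments(self, source):
        ...


class Accessor(ABC):
    """Reads the records of a single fragment."""

    @abstractmethod
    def read(self, fragment):
        ...


class LineFragmenter(Fragmenter):
    """Hypothetical plug-in: fragments are fixed-size ranges of lines."""

    def __init__(self, lines_per_fragment):
        self.n = lines_per_fragment

    def fragments(self, source):
        return [(start, min(start + self.n, len(source)))
                for start in range(0, len(source), self.n)]


class CsvAccessor(Accessor):
    """Hypothetical plug-in: parses comma-separated lines into tuples."""

    def __init__(self, source):
        self.source = source

    def read(self, fragment):
        start, end = fragment
        for line in self.source[start:end]:
            yield tuple(line.split(","))


def scan(source, fragmenter, accessor, n_segments):
    """Assign fragments to segments round-robin and read them."""
    per_segment = [[] for _ in range(n_segments)]
    for i, fragment in enumerate(fragmenter.fragments(source)):
        per_segment[i % n_segments].extend(accessor.read(fragment))
    return per_segment


lines = ["1,a", "2,b", "3,c", "4,d", "5,e"]
rows = scan(lines, LineFragmenter(2), CsvAccessor(lines), n_segments=2)
```

Supporting a new source in this model means writing a new pair of classes, while `scan` stays unchanged. That is the kind of extensibility the last bullet points at.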

PXF Framework
[Animated diagram, slides 36 to 43. NameNode. HAWQ Master: Process Manager, PXF Fragmenter. Each HAWQ Segment: Query Executor, PXF Accessor, PXF Fragmenter, HDFS Datanode, Segment Directory, Local Spill Directory.]
Further Steps
Master server scaling – pool of master servers
New native data storage formats and new native compression algorithms
YARN as resource manager for HAWQ
Dynamic segment allocation / decommission
Questions?