Accumulo Summit 2014: SQL-on-Accumulo with Pivotal HAWQ and PXF

DESCRIPTION

Pivotal Xtension Framework (PXF) support for Accumulo within HAWQ provides a fully featured, native SQL interface to data stored in Accumulo. The Accumulo/PXF module extracts data from Accumulo through iterators and the Accumulo APIs and delivers it to HAWQ's SQL execution engine. Data extraction is fully parallel and uses query predicate pushdown for an additional performance boost. The module also natively supports Accumulo's security labels. PXF is an external table interface in HAWQ, a SQL-on-Hadoop system, that lets you read data stored within the Hadoop ecosystem. External tables can be used to load data into HAWQ from Hadoop or to query Hadoop data without materializing it in HAWQ. PXF enables analysis of HAWQ data and Hadoop data in a single query. It supports a wide range of data sources and formats, such as Text, Avro, Hive, SequenceFile, RCFile, HBase, and now Accumulo.

SQL-on-Accumulo with Pivotal HAWQ and PXF

Agenda

• HAWQ & PXF Overview

• Accumulo Connector - Usage

• Accumulo Connector - Advanced Features

• PXF API

• Demo

HAWQ is…

A parallel SQL query engine on Hadoop

[Diagram: four nodes, each labeled PHD]

PXF is...

A fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API

[Diagram: PHD and PXF, with one path labeled "direct analytics" and one labeled "indirect analytics"]

Usage

CREATE EXTERNAL TABLE <table> (<col list>)
LOCATION ('pxf://rest_host:port/<data source>?<plugin options>')
FORMAT '<type>' (<params>)
[LOG ERRORS INTO <err_t> SEGMENT REJECT LIMIT <n> [ROWS|PERCENT]]

-- direct analytics (external)
SELECT <…> FROM <table> WHERE <…>

-- indirect analytics (internal)
INSERT INTO <hawq table> SELECT <…> FROM <table> WHERE <…>

Any SQL operation (joins, aggregates, sorting, etc.) can be executed.
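The LOCATION string is an ordinary URI, so its parts can be pulled apart with a standard URI parser. The following is a minimal Python sketch for illustration only: the helper name `parse_pxf_location` and the port `51200` are assumptions, not taken from the slides, and the example location reuses the `instance:sales` data source from the Accumulo example.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pxf_location(location: str) -> dict:
    """Split a pxf:// LOCATION URI into its parts (illustrative helper, not part of PXF)."""
    parts = urlsplit(location)
    if parts.scheme != "pxf":
        raise ValueError(f"not a pxf:// location: {location!r}")
    return {
        "host": parts.hostname,
        "port": parts.port,
        # Everything after the first "/" names the data source.
        "data_source": parts.path.lstrip("/"),
        # parse_qs returns lists; keep the first value of each plugin option.
        "options": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }


# Port 51200 is an assumed value used only to make the URI complete.
loc = parse_pxf_location("pxf://rest_host:51200/instance:sales?profile=accumulo")
print(loc)
```

Reading the URI this way makes the roles clear: host and port identify the PXF REST endpoint, the path names the data source, and the query string carries plugin options such as the profile.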

Accumulo Connector - Usage

CREATE EXTERNAL TABLE <table> (<col list>)
LOCATION ('pxf://…/<accumulo table name>?profile=accumulo')
FORMAT 'custom' (formatter='pxfwritable_import')

CREATE EXTERNAL TABLE t (recordkey text, "cf1:date" date, "cf1:price" double precision)
LOCATION ('pxf://…/instance:sales?profile=accumulo')
FORMAT 'custom' (formatter='pxfwritable_import')

-- Example of a simple query
SELECT "cf1:date", max("cf1:price")
FROM t
GROUP BY "cf1:date"

Accumulo Connector - Advanced Features

Smart filtering with predicate pushdown
Excludes irrelevant tablets and filters on values at the source, according to the WHERE clause of the HAWQ query.
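The two halves of this idea, skipping tablets that cannot match and filtering values before they leave the source, can be modeled in a few lines of Python. This is a toy model under stated assumptions: the `Tablet` class, the `scan` function, and the sample rows are hypothetical and do not reflect the connector's actual code.

```python
from dataclasses import dataclass


@dataclass
class Tablet:
    first_key: str  # smallest recordkey held by the tablet
    last_key: str   # largest recordkey held by the tablet
    rows: list      # (recordkey, price) pairs


def scan(tablets, key_lo, key_hi, min_price):
    """Return the matching rows and the number of tablets actually read."""
    shipped, tablets_read = [], 0
    for t in tablets:
        # Tablet pruning: skip tablets whose key range cannot overlap the query.
        if t.last_key < key_lo or t.first_key > key_hi:
            continue
        tablets_read += 1
        # Source-side filtering: only rows passing the predicate are shipped.
        shipped.extend(
            r for r in t.rows if key_lo <= r[0] <= key_hi and r[1] >= min_price
        )
    return shipped, tablets_read


tablets = [
    Tablet("apple", "fig", [("apple", 3.0), ("fig", 9.5)]),
    Tablet("grape", "lime", [("grape", 7.0), ("lime", 1.5)]),
    Tablet("pear", "pear", [("pear", 8.0)]),
]
rows, read = scan(tablets, "g", "m", 5.0)
print(rows, read)
```

In this model only one of the three tablets is read, and only one row crosses the wire, which is the effect the slide describes.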

Error tables for logging badly formatted data without aborting the query
Specify the desired error threshold. After the operation, query the error table to see the rejected data and the related errors.
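The threshold behavior can be sketched as follows. This is a simplified Python model, not HAWQ's implementation: the function name and the comma-separated input format are assumptions, and it counts rejects globally, whereas the SEGMENT REJECT LIMIT clause name suggests the real limit is tracked per segment.

```python
def load_with_reject_limit(raw_rows, reject_limit):
    """Parse rows, logging bad ones, and abort once reject_limit rejects are reached."""
    good, errors = [], []
    for line_no, raw in enumerate(raw_rows, start=1):
        try:
            key, price = raw.split(",")       # wrong field count raises ValueError
            good.append((key, float(price)))  # non-numeric price raises ValueError
        except ValueError as exc:
            # Keep the rejected data and the reason, like a row in an error table.
            errors.append((line_no, raw, str(exc)))
            if len(errors) >= reject_limit:
                raise RuntimeError(f"reject limit {reject_limit} reached")
    return good, errors


good, errors = load_with_reject_limit(["a,1.5", "b,oops", "c,2"], reject_limit=2)
print(good)
print(errors)
```

With a limit of 2, the single malformed row is logged and the load completes; with a limit of 1, the same input would abort.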

Lookup table for easy access to non-textual qualifiers
Define a qualifier lookup table that translates between Accumulo-style naming and SQL-style naming.
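Conceptually, a lookup table is a per-table mapping from a SQL-friendly column name to an Accumulo column family and qualifier. The Python sketch below shows only that idea: the `LOOKUP` layout, the alias names, and the binary qualifier are hypothetical, and the slides do not specify how the real lookup table is stored.

```python
# Hypothetical lookup data: table -> SQL column name -> (family, qualifier bytes).
LOOKUP = {
    "sales": {
        "sale_date": ("cf1", b"date"),
        "price": ("cf1", b"price"),
        "flags": ("cf1", b"\x00\x01"),  # a qualifier that is not printable text
    }
}


def to_accumulo_columns(table, sql_columns, lookup=LOOKUP):
    """Translate SQL-style names to (family, qualifier) pairs where a mapping exists."""
    mapping = lookup.get(table, {})
    # Names without a mapping (for example recordkey) pass through unchanged.
    return [mapping.get(col, col) for col in sql_columns]


cols = to_accumulo_columns("sales", ["recordkey", "sale_date", "flags"])
print(cols)
```

The benefit is that a query can refer to `flags` instead of trying to quote raw bytes as a SQL identifier.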

Automatic statistics for better join planning
Run ANALYZE on a PXF-Accumulo table to update HAWQ's optimizer with table-level and attribute-level statistics from the Accumulo table.

Mechanism for storing remote credentials
The mapping between a HAWQ user's credentials and Accumulo user credentials is entered once in HAWQ and automatically transferred to the Accumulo connector at runtime.

Accumulo Connector - Advanced Features

Visibility labels for enhanced security
The Accumulo connector uses Accumulo's built-in cell-level security to ensure that users can view only the information they have been granted access to.
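Cell-level security works by attaching a boolean visibility expression to each cell and checking it against the reader's authorizations. The Python sketch below evaluates such expressions in a simplified way. It is not Accumulo's evaluator: the function name is hypothetical, quoted terms are not handled, malformed input is not validated, and `&` is given precedence over `|`, whereas Accumulo requires parentheses when the two operators are mixed.

```python
import re

# Terms, operators, and parentheses of a simplified visibility expression.
TOKEN = re.compile(r"[A-Za-z0-9_.:/-]+|[&|()]")


def is_visible(expression: str, authorizations: set) -> bool:
    """Return True if the authorizations satisfy the visibility expression."""
    if not expression:
        return True  # an empty label is visible to everyone
    tokens = TOKEN.findall(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse the right side, then combine
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            pos += 1  # skip the closing ")"
            return result
        return tok in authorizations

    return parse_or()


print(is_visible("admin&(finance|audit)", {"admin", "audit"}))
```

A user holding only `audit` would not see the same cell, because the `admin` term is unsatisfied.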

Custom iterators for increased performance
Predicate pushdown is implemented with stackable custom iterators, which speed up the comparison operations (<, <=, >, >=, ==, !=) in a query's WHERE clause.
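"Stackable" means each iterator consumes the output of the one beneath it, so one filter per WHERE-clause comparison can be layered on a scan. Python generators give a compact analogy. This is an illustration of the stacking pattern only: real Accumulo iterators are Java classes running on tablet servers, and the names `comparison_filter` and `stack_filters` are hypothetical.

```python
import operator

# The six comparison operators listed on the slide.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def comparison_filter(source, column, op, constant):
    """Yield only the rows from source whose column satisfies the comparison."""
    compare = OPS[op]
    for row in source:
        if compare(row[column], constant):
            yield row


def stack_filters(source, predicates):
    """Wrap the source in one filter per predicate, each feeding the next."""
    for column, op, constant in predicates:
        source = comparison_filter(source, column, op, constant)
    return source


rows = [
    {"recordkey": "k1", "cf1:price": 3.0},
    {"recordkey": "k2", "cf1:price": 7.0},
    {"recordkey": "k3", "cf1:price": 9.5},
]
stacked = stack_filters(rows, [("cf1:price", ">=", 5.0), ("cf1:price", "!=", 7.0)])
result = list(stacked)
print(result)
```

Because the filters are lazy, a row rejected by the first filter never reaches the second.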

Intelligent range filtering
A comparison on recordkey narrows the Accumulo connector's scan range, minimizing the amount of data scanned and resulting in faster scans.
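One way to picture this is to fold every comparison on recordkey into a single start and end bound for the scan. The Python sketch below does that under stated assumptions: the `key_range` function and its tuple return shape are hypothetical, keys are compared as plain strings, and `!=` is ignored because it cannot narrow a contiguous range.

```python
def key_range(predicates):
    """Fold (op, value) comparisons on recordkey into (lo, lo_inclusive, hi, hi_inclusive).

    None for lo or hi means the range is unbounded on that side.
    """
    lo, lo_inc, hi, hi_inc = None, True, None, True
    for op, value in predicates:
        # Equality is the same as a closed range with identical bounds.
        parts = [(">=", value), ("<=", value)] if op == "==" else [(op, value)]
        for o, v in parts:
            if o in (">", ">="):
                inc = o == ">="
                # Keep the tighter lower bound; exclusive beats inclusive on ties.
                if lo is None or v > lo or (v == lo and not inc):
                    lo, lo_inc = v, inc
            elif o in ("<", "<="):
                inc = o == "<="
                # Keep the tighter upper bound; exclusive beats inclusive on ties.
                if hi is None or v < hi or (v == hi and not inc):
                    hi, hi_inc = v, inc
            # "!=" falls through: it is left for a value filter to handle.
    return lo, lo_inc, hi, hi_inc


print(key_range([(">=", "2014-01"), ("<", "2014-07")]))
```

The resulting bounds would then be handed to the scanner, so rows outside the range are never read at all.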

Automatic type detection
Data types are detected automatically within the iterator, ensuring that the correct comparison operations are used.
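Type detection matters because stored values compared as text can order differently than the same values compared as numbers or dates. The Python sketch below shows the effect. The detection order (integer, float, ISO date, then text) and the function names are assumptions made for illustration; the slides do not describe how the connector actually detects types.

```python
from datetime import date, datetime


def detect(value: str):
    """Convert a stored string to int, float, or date when it parses as one."""
    for parse in (int, float):
        try:
            return parse(value)
        except ValueError:
            pass
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return value  # fall back to plain text


def less_than(a: str, b: str) -> bool:
    """Compare two stored values using their detected types when they agree."""
    x, y = detect(a), detect(b)
    numeric = (int, float)
    if isinstance(x, numeric) and isinstance(y, numeric):
        return x < y
    if type(x) is type(y):
        return x < y
    return a < b  # mixed types: compare the raw text


print(less_than("9", "10"), "9" < "10")
```

As text, "9" sorts after "10"; once both are detected as integers, the comparison gives the numerically correct answer.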

PXF API

• Fragmenter – returns a list of data source fragments and their location

• Accessor – accesses a given list of fragments, reads them, and returns records

• Resolver – deserializes each record according to a given schema or technique

Distributed execution threads

Distributed database servers

PXF API

• AccumuloFragmenter – returns a list of Accumulo tablets and their locations for a given table

• AccumuloAccessor – accesses a given list of fragments, reads them, and returns Accumulo records, using filter pushdown when possible

• AccumuloResolver – converts each qualifier value into a type that HAWQ understands
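The division of labor among the three plugin roles can be modeled end to end in a short Python sketch. This is an analogy, not the PXF API: the class names, method names, the in-memory `STORE`, and the thread pool standing in for distributed execution are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an Accumulo table split into two tablets.
STORE = {
    "sales": {
        "tablet-1@node1": [("k1", {"cf1:price": "9.5"}), ("k2", {"cf1:price": "3.0"})],
        "tablet-2@node2": [("k3", {"cf1:price": "7.25"})],
    }
}


class SketchFragmenter:
    def get_fragments(self, table):
        """List the fragments (tablet plus location) that make up the table."""
        return sorted(STORE[table])


class SketchAccessor:
    def __init__(self, table):
        self.table = table

    def read(self, fragment):
        """Read the raw records of one fragment."""
        return list(STORE[self.table][fragment])


class SketchResolver:
    def __init__(self, schema):
        self.schema = schema  # list of (column name, conversion function)

    def resolve(self, record):
        """Convert one raw record into a typed row: recordkey plus schema columns."""
        key, cells = record
        return (key,) + tuple(cast(cells[col]) for col, cast in self.schema)


def run(table, schema):
    fragments = SketchFragmenter().get_fragments(table)
    accessor, resolver = SketchAccessor(table), SketchResolver(schema)
    # Each fragment is read by its own worker; map keeps fragment order.
    with ThreadPoolExecutor() as pool:
        per_fragment = list(pool.map(accessor.read, fragments))
    return [resolver.resolve(r) for records in per_fragment for r in records]


result = run("sales", [("cf1:price", float)])
print(result)
```

The split keeps each concern replaceable: a different data store needs its own way to enumerate fragments, read them, and decode records, while the surrounding parallel execution stays the same.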

Live Demo

Accumulo Table Contents

User Authorizations

$PHD_ROOT/conf/pxf-profiles.xml

Define Table in HAWQ

Setting Authorizations

Executing a Simple Query

A Query With a Single Pushdown Filter

A Query With Multiple Pushdown Filters

Setting Authorizations

Executing a Query as ‘foo’

Define a Lookup Table in Accumulo

Define a Lookup Table in HAWQ

Performing a Simple Query
