
Apache Phoenix + Apache HBase


Page 1: Apache Phoenix + Apache HBase

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Phoenix + Apache HBase: An Enterprise Grade Data Warehouse

Ankit Singhal, Rajeshbabu, Josh Elser

June 30, 2016

Page 2: Apache Phoenix + Apache HBase


About us

Ankit Singhal
– Committer and member of Apache Phoenix PMC
– MTS at Hortonworks

Rajeshbabu
– Committer and member of Apache Phoenix PMC
– Committer in Apache HBase
– MTS at Hortonworks

Josh Elser
– Committer in Apache Phoenix
– Committer and member of Apache Calcite PMC
– MTS at Hortonworks

Page 3: Apache Phoenix + Apache HBase


Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Phoenix Query server

Q&A

Page 4: Apache Phoenix + Apache HBase


Data Warehouse

An enterprise data warehouse (EDW) organizes and aggregates analytical data from various functional domains and serves as a critical repository for an organization's operations.

(Diagram: files, IoT data, and OLTP systems feed a staging area; ETL loads the data into the Data Warehouse and its Marts, which serve Visualization and BI tools.)

Page 5: Apache Phoenix + Apache HBase


Phoenix Offerings and Interoperability

(Diagram: tool categories — ETL, Data Warehouse, and Visualization & BI.)

Page 6: Apache Phoenix + Apache HBase


HBase & Phoenix
– HBase: a distributed NoSQL store
– Phoenix: provides OLTP and analytics over HBase

(Diagram: the application uses the Phoenix client, which wraps the HBase client; the client looks up regions in ZooKeeper and talks directly to the RegionServers — each hosting table regions and the Phoenix coprocessor — with data persisted on HDFS.)

Page 7: Apache Phoenix + Apache HBase


Open Source Data Warehouse

(Chart: data warehouse options plotted by hardware cost — specialized vs. commodity — and software/licensing cost. SMP and MPP systems sit at the specialized-hardware, licensed end; open source MPP and HBase + Phoenix run on commodity hardware with no licensing cost.)

Page 8: Apache Phoenix + Apache HBase


Phoenix & HBase as a Data Warehouse

Architecture
– Runs on commodity H/W
– True MPP
– O/S and H/W flexibility
– Supports OLTP and ROLAP

Page 9: Apache Phoenix + Apache HBase


Phoenix & HBase as a Data Warehouse

Scalability
– Linear scalability for storage
– Linear scalability for memory
– Open to third-party storage

Page 10: Apache Phoenix + Apache HBase


Phoenix & HBase as a Data Warehouse

Reliability
– Highly available
– Replication for disaster recovery
– Fully ACID for data integrity

Page 11: Apache Phoenix + Apache HBase


Phoenix & HBase as a Data Warehouse

Manageability
– Performance tuning
– Data modeling & schema evolution
– Data pruning
– Online expansion or upgrade
– Data backup and recovery

Page 12: Apache Phoenix + Apache HBase


Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use cases

Page 13: Apache Phoenix + Apache HBase


Who uses Phoenix?

Page 14: Apache Phoenix + Apache HBase


Analytics Use Case (Web Advertising Company)

Functional requirements
– Create a single source of truth
– Cross-dimensional queries on 50+ dimensions and 80+ metrics
– Support fast Top-N queries

Non-functional requirements
– Less than 3-second response time for slice and dice
– 250+ concurrent users
– 100k+ analytics queries/day
– Highly available
– Linear scalability

Page 15: Apache Phoenix + Apache HBase


Data Warehouse Capacity

Data size (ETL input)
– 24 TB/day of raw data system-wide
– 25 billion impressions

HBase input (cube)
– 6 billion rows of aggregated data (100 GB/day)

HBase cluster size
– 65 HBase nodes
– 520 TB of disk
– 4.1 TB of memory

Page 16: Apache Phoenix + Apache HBase


Use Case Architecture

(Diagram: data ingestion, batch processing, and analytics. AdServer and click-tracking events are ingested through Apache Kafka; a real-time path filters and aggregates into an in-memory store, while a batch path lands events on HDFS via Camus, runs ETL, and a data uploader loads the results into HBase views; a Data API serves the Analytics UI.)

Page 17: Apache Phoenix + Apache HBase


Analytics Data Warehouse Architecture

(Diagram: cube generation — ETL on HDFS builds the cubes, which are bulk-loaded into HBase; the Data API converts slice-and-dice requests from the Analytics UI into SQL queries; backup and recovery run against the stored data.)

Page 18: Apache Phoenix + Apache HBase


Time Series Use Case (Apache Ambari)

Ambari Metrics System (AMS)

Functional requirements
– Store all cluster metrics, collected every second (10k to 100k metrics/second)
– Optimize storage and access for time-series data

Non-functional requirements
– Near real-time response time
– Scalable
– Real-time ingestion

Page 19: Apache Phoenix + Apache HBase


AMS architecture

(Diagram: Hadoop sinks and metric monitors on the cluster hosts send metrics to the Metric Collector, which stores them through Phoenix into HBase; the Ambari Server reads from the collector.)

Page 20: Apache Phoenix + Apache HBase


Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Page 21: Apache Phoenix + Apache HBase


Schema Design

Primary key design
– The most important factor driving the overall performance of queries on the table
– The primary key should be composed from the most-used predicate columns in your queries
– In most cases, the leading part of the primary key should let queries be converted into point lookups or range scans in HBase (see the sketch below)
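To make this concrete, here is a small illustrative sketch (the table and column names are made up, not from the slides): the key leads with the columns most queries filter on, so the predicate collapses to a single range scan.

  -- Hypothetical events table; queries usually filter on CUSTOMER_ID plus a date range.
  CREATE TABLE WEB_EVENTS (
      CUSTOMER_ID VARCHAR NOT NULL,
      EVENT_DATE  DATE    NOT NULL,
      EVENT_ID    VARCHAR NOT NULL,
      PAGE        VARCHAR,
      LATENCY_MS  INTEGER,
      CONSTRAINT PK PRIMARY KEY (CUSTOMER_ID, EVENT_DATE, EVENT_ID)
  );

  -- The leading key columns appear in the predicate, so this becomes a range scan
  -- over one customer's key range instead of a full table scan.
  SELECT PAGE, LATENCY_MS
  FROM WEB_EVENTS
  WHERE CUSTOMER_ID = 'c42'
    AND EVENT_DATE >= TO_DATE('2016-06-01', 'yyyy-MM-dd');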

Page 22: Apache Phoenix + Apache HBase


Schema Design

Salting vs. pre-split
– Use salting to alleviate write hot-spotting:
  CREATE TABLE …( … ) SALT_BUCKETS = N
  The number of buckets should be equal to the number of RegionServers.
– Otherwise, pre-split the table if you know the row-key data set:
  CREATE TABLE …( … ) SPLIT ON (…)
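A minimal sketch of both options, using hypothetical table names and split points (note that Phoenix's DDL spells the manual pre-split clause SPLIT ON):

  -- Salted table: Phoenix prepends a salt byte so sequential keys spread across buckets.
  CREATE TABLE EVENTS_SALTED (
      EVENT_TIME DATE    NOT NULL,
      HOST       VARCHAR NOT NULL,
      VAL        DOUBLE,
      CONSTRAINT PK PRIMARY KEY (EVENT_TIME, HOST)
  ) SALT_BUCKETS = 16;  -- illustrative value; the slide suggests matching the RegionServer count

  -- Pre-split table: explicit split points when the row-key distribution is known up front.
  CREATE TABLE EVENTS_BY_HOST (
      HOST       VARCHAR NOT NULL,
      EVENT_TIME DATE    NOT NULL,
      VAL        DOUBLE,
      CONSTRAINT PK PRIMARY KEY (HOST, EVENT_TIME)
  ) SPLIT ON ('host100', 'host200', 'host300');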

Page 23: Apache Phoenix + Apache HBase


Schema Design

Table properties
– Use block encoding and/or compression for better performance:
  CREATE TABLE …( … ) DATA_BLOCK_ENCODING = 'FAST_DIFF', COMPRESSION = 'SNAPPY'
– Use region replication for read high availability:
  CREATE TABLE …( … ) "REGION_REPLICATION" = "2"

Page 24: Apache Phoenix + Apache HBase


Schema Design

Table properties
– Set UPDATE_CACHE_FREQUENCY to a larger value to avoid frequent server round trips for metadata updates:
  CREATE TABLE …( … ) UPDATE_CACHE_FREQUENCY = 300000

A combined sketch of these table properties follows.
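Pulling the properties from the last two slides into one illustrative DDL (the table and columns are hypothetical, and SNAPPY assumes the codec is installed on the cluster):

  CREATE TABLE METRICS_AGG (
      METRIC_NAME VARCHAR NOT NULL,
      SERVER_TIME DATE    NOT NULL,
      VAL         DOUBLE,
      CONSTRAINT PK PRIMARY KEY (METRIC_NAME, SERVER_TIME)
  )
  DATA_BLOCK_ENCODING = 'FAST_DIFF',  -- compact block encoding in the block cache
  COMPRESSION = 'SNAPPY',             -- on-disk compression of HFiles
  "REGION_REPLICATION" = "2",         -- HBase-level property: read replicas for HA
  UPDATE_CACHE_FREQUENCY = 300000;    -- re-resolve table metadata at most every 300 s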

Page 25: Apache Phoenix + Apache HBase


Schema Design

Divide columns into multiple column families if some columns are rarely accessed
– HBase reads only the files of the column families referenced in the query, which reduces I/O

Row key: pk1, pk2
CF1 (frequently accessed columns): Col1, Col2, Col3, Col4
CF2 (rarely accessed columns): Col5, Col6, Col7
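In Phoenix DDL the column family is given as a prefix on the column name; a small hypothetical sketch of the layout above:

  -- Frequently accessed columns in family CF1, rarely accessed ones in CF2.
  CREATE TABLE USER_PROFILE (
      PK1 VARCHAR NOT NULL,
      PK2 VARCHAR NOT NULL,
      CF1.COL1 VARCHAR,
      CF1.COL2 VARCHAR,
      CF1.COL3 VARCHAR,
      CF1.COL4 VARCHAR,
      CF2.COL5 VARCHAR,
      CF2.COL6 VARCHAR,
      CF2.COL7 VARCHAR,
      CONSTRAINT PK PRIMARY KEY (PK1, PK2)
  );

  -- Only CF1's store files are read for this query.
  SELECT COL1, COL2 FROM USER_PROFILE WHERE PK1 = 'u1' AND PK2 = 'p1';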

Page 26: Apache Phoenix + Apache HBase


Secondary Indexes

Global indexes
– Optimized for read-heavy use cases
  CREATE INDEX idx ON table(…)

Local indexes
– Optimized for write-heavy and space-constrained use cases
  CREATE LOCAL INDEX idx ON table(…)

Functional indexes
– Allow you to create indexes on arbitrary expressions
  CREATE INDEX UPPER_NAME_INDEX ON EMP(UPPER(FIRSTNAME || ' ' || LASTNAME))

Page 27: Apache Phoenix + Apache HBase


Secondary Indexes

Use covered indexes to efficiently scan over the index table instead of the primary table:
  CREATE INDEX idx ON table(…) INCLUDE(…)

Pass an index hint to guide the query optimizer to the right index for a query:
  SELECT /*+ INDEX(<table> <index>) */ …
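A concrete, hypothetical example (the ORDERS table and its columns are not from the slides):

  -- Covered index: ORDER_DATE is the indexed column, and CUSTOMER_ID/AMOUNT are
  -- copied into the index rows, so the query below never touches the data table.
  CREATE INDEX IDX_ORDER_DATE ON ORDERS (ORDER_DATE)
      INCLUDE (CUSTOMER_ID, AMOUNT);

  -- Hint the optimizer if it does not choose the index on its own.
  SELECT /*+ INDEX(ORDERS IDX_ORDER_DATE) */ CUSTOMER_ID, AMOUNT
  FROM ORDERS
  WHERE ORDER_DATE > TO_DATE('2016-06-01', 'yyyy-MM-dd');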

Page 28: Apache Phoenix + Apache HBase


Row Timestamp Column

Maps the HBase native row timestamp to a Phoenix column

Leverages HBase optimizations such as setting the minimum and maximum time range on scans, so store files that fall entirely outside that range are skipped

Perfect for time-series use cases

Syntax:
  CREATE TABLE …(
      CREATED_DATE DATE NOT NULL,
      …
      CONSTRAINT PK PRIMARY KEY (CREATED_DATE ROW_TIMESTAMP, …)
  )
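As an illustration, a table whose CREATED_DATE is declared ROW_TIMESTAMP (table name and other columns are hypothetical) lets HBase skip store files whose time range lies outside the queried window:

  CREATE TABLE SERVER_METRICS (
      CREATED_DATE DATE    NOT NULL,
      HOST         VARCHAR NOT NULL,
      METRIC       VARCHAR NOT NULL,
      VAL          DOUBLE,
      CONSTRAINT PK PRIMARY KEY (CREATED_DATE ROW_TIMESTAMP, HOST, METRIC)
  );

  -- The time-range predicate maps onto the HBase scan's min/max timestamp, so
  -- store files entirely outside this window are skipped.
  SELECT HOST, METRIC, VAL
  FROM SERVER_METRICS
  WHERE CREATED_DATE >= TO_DATE('2016-06-29 00:00:00')
    AND CREATED_DATE <  TO_DATE('2016-06-30 00:00:00');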

Page 29: Apache Phoenix + Apache HBase


Use of Statistics

(Diagram: statistics break regions A, F, L, and R into smaller guidepost chunks — A, C, F, I, L, O, R, U — so scan work can be divided evenly across clients instead of one whole region per client.)
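Statistics (guideposts) are gathered automatically during major compactions, but they can also be refreshed by hand; a minimal sketch, assuming a table named WEB_EVENTS:

  -- Recollect guidepost statistics so the client can parallelize scans in
  -- evenly sized chunks rather than one chunk per region.
  UPDATE STATISTICS WEB_EVENTS;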

Page 30: Apache Phoenix + Apache HBase


Skip Scan

Phoenix supports skip scans to jump directly to matching keys when the query's predicate specifies sets of keys:

SELECT * FROM METRIC_RECORD
WHERE METRIC_NAME LIKE 'abc%'
AND HOSTNAME IN ('host1', 'host2');

CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 RANGES OVER METRIC_RECORD
['abc','host1'] - ['abd','host2']

(Diagram: the client's skip scan touches only the matching key ranges inside the regions spread across the RegionServers, instead of scanning regions 1–4 end to end.)
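For context, a sketch of a key layout that would produce the plan above — assuming METRIC_RECORD's primary key leads with METRIC_NAME and HOSTNAME (the actual schema is not shown in the slides):

  CREATE TABLE METRIC_RECORD (
      METRIC_NAME VARCHAR NOT NULL,
      HOSTNAME    VARCHAR NOT NULL,
      SERVER_TIME DATE    NOT NULL,
      VAL         DOUBLE,
      CONSTRAINT PK PRIMARY KEY (METRIC_NAME, HOSTNAME, SERVER_TIME)
  );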

Page 31: Apache Phoenix + Apache HBase


Join optimizations

Hash join
– Outperforms the other join algorithms when one of the relations is small, or when the records matching the predicate fit in memory

Sort-merge join
– Use the sort-merge join algorithm when both relations are very large

NO_STAR_JOIN hint
– For multiple inner-join queries Phoenix applies a star-join optimization by default; use this hint when the overall size of all right-hand-side tables would exceed the memory size limit

NO_CHILD_PARENT_JOIN_OPTIMIZATION hint
– Prevents the use of the child-parent join optimization

(Examples are sketched below.)
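Illustrative queries for the join strategies above; the table names are hypothetical, and the hint name is taken from the Phoenix join documentation:

  -- Hash join (default): the smaller side, DIM_CUSTOMER, is broadcast into an
  -- in-memory hash cache on the RegionServers.
  SELECT f.ORDER_ID, c.NAME
  FROM FACT_ORDERS f
  JOIN DIM_CUSTOMER c ON f.CUSTOMER_ID = c.CUSTOMER_ID;

  -- Both sides too large for the hash cache: request a sort-merge join instead.
  SELECT /*+ USE_SORT_MERGE_JOIN */ f.ORDER_ID, c.NAME
  FROM FACT_ORDERS f
  JOIN DIM_CUSTOMER c ON f.CUSTOMER_ID = c.CUSTOMER_ID;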

Page 32: Apache Phoenix + Apache HBase


Optimize Writes

UPSERT VALUES
– Call it multiple times before committing to batch mutations
– Use a prepared statement when you run the same statement repeatedly

UPSERT SELECT
– Configure phoenix.mutate.batchSize based on row size
– Set auto-commit to true so that scan results are written directly to HBase
– Set auto-commit to true when running an UPSERT SELECT on the same table so that writes happen on the server (see the sketch below)
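A sketch of a server-friendly UPSERT SELECT, with hypothetical source and target tables; with auto-commit enabled, rows are written out as they are produced (in batches governed by phoenix.mutate.batchSize) rather than being buffered on the client:

  -- Roll raw metrics up into a daily summary table.
  UPSERT INTO METRICS_DAILY (HOSTNAME, METRIC_DAY, TOTAL_VAL)
  SELECT HOSTNAME, TRUNC(SERVER_TIME, 'DAY'), SUM(VAL)
  FROM METRIC_RECORD
  GROUP BY HOSTNAME, TRUNC(SERVER_TIME, 'DAY');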

Page 33: Apache Phoenix + Apache HBase


Hints

Some important hints
– SERIAL SCAN, RANGE SCAN
– SERIAL
– SMALL SCAN
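A couple of illustrative hinted queries against the hypothetical METRIC_RECORD table from earlier, using hint names from Phoenix's hint list (SERIAL, SMALL, RANGE_SCAN):

  -- Highly selective lookup: run it serially as a small scan to avoid the
  -- overhead of parallel chunked scanning.
  SELECT /*+ SERIAL SMALL */ VAL
  FROM METRIC_RECORD
  WHERE METRIC_NAME = 'cpu_user' AND HOSTNAME = 'host1';

  -- Prefer a plain range scan over a skip scan for the IN list.
  SELECT /*+ RANGE_SCAN */ VAL
  FROM METRIC_RECORD
  WHERE METRIC_NAME = 'cpu_user' AND HOSTNAME IN ('host1', 'host2');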

Page 34: Apache Phoenix + Apache HBase


Additional References

For more tuning guidance, refer to:
– http://phoenix.apache.org/tuning.html
– https://hbase.apache.org/book.html#performance

Page 35: Apache Phoenix + Apache HBase


Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Phoenix Query Server

Page 36: Apache Phoenix + Apache HBase


Apache Phoenix Query Server

A standalone service that proxies user requests to HBase/Phoenix
– Optional component

Reference client implementation via JDBC
– "Thick" versus "thin" driver

First introduced in Apache Phoenix 4.4.0

Built on Apache Calcite's Avatica
– "A framework for building database drivers"

Page 37: Apache Phoenix + Apache HBase


Traditional Apache Phoenix RPC Model

(Diagram: the application embeds the thick Phoenix client, which wraps the HBase client; the client locates regions via ZooKeeper and talks directly to each RegionServer — each hosting table regions and the Phoenix coprocessor — with data persisted on HDFS.)

Page 38: Apache Phoenix + Apache HBase


Query Server Model

(Diagram: the application talks to the Phoenix Query Server, which hosts the thick Phoenix/HBase client; the Query Server then communicates with ZooKeeper and the RegionServers — each running the Phoenix coprocessor over HDFS — on the application's behalf.)

Page 39: Apache Phoenix + Apache HBase


Query Server Technology

HTTP server and wire API definition

Pluggable serialization
– Google Protocol Buffers

"Thin" JDBC driver (over HTTP)

Other goodies
– Pluggable metrics system
– TCK (technology compatibility kit)
– SPNEGO for Kerberos authentication
– Horizontally scalable with load balancing

Page 40: Apache Phoenix + Apache HBase


Query Server Clients

Client enablement

Go database/sql/driver
– https://github.com/Boostport/avatica

.NET driver
– https://github.com/Azure/hdinsight-phoenix-sharp
– https://www.nuget.org/packages/Microsoft.Phoenix.Client/1.0.0-preview

ODBC
– Built by http://www.simba.com/, also available from Hortonworks

Python DB API v2.0 (not yet "battle tested")
– https://bitbucket.org/lalinsky/python-phoenixdb

Page 41: Apache Phoenix + Apache HBase


Agenda

Phoenix & HBase as an Enterprise Data Warehouse

Use Cases

Optimizations

Phoenix Query Server

Q&A

Page 42: Apache Phoenix + Apache HBase


Phoenix & HBase

We hope to see you migrating to Phoenix & HBase, and we look forward to more questions on the user mailing lists.

Get involved on the mailing lists:
– user@phoenix.apache.org
– user@hbase.apache.org

You can reach us at:
– [email protected]
– [email protected]
– [email protected]

Page 43: Apache Phoenix + Apache HBase


Thank You