Page 1: Hive User Meeting March 2010 - Hive Team

Hive New Features and API

Facebook Hive Team

March 2010

Page 2: Hive User Meeting March 2010 - Hive Team

JDBC/ODBC and CTAS

Page 3: Hive User Meeting March 2010 - Hive Team

Hive ODBC Driver

• Architecture (top to bottom):
• Client/DriverManager
• ↓ local call (dynamic libraries)
• unixODBC (libodbchive.so) + hiveclient (libhiveclient.so) + thriftclient (libthrift.so)
• ↓ network socket
• HiveServer (in Java)
• ↓ local call
• Hive + Hadoop

• unixODBC is not part of Hive open source, so you need to build it yourself.

• 32-bit/64-bit architecture

• Thrift has to be r790732

• Boost libraries

• Linking with 3rd party Driver Manager.

Facebook

Page 4: Hive User Meeting March 2010 - Hive Team

Facebook Use Case

» Hive integration with MicroStrategy 8.1.2 (HIVE-187) and 9.0.1 (HIVE-1101)

• FreeForm SQL (reports generated from user-input queries)

• Reports generated daily.

» All servers (MSTR IS server, HiveServer) are running on Linux.

• ODBC driver needs to be 32 bits.


Page 5: Hive User Meeting March 2010 - Hive Team

Hive JDBC

» Embedded mode:
› jdbc:hive://

» Client/server mode:
› jdbc:hive://host:port/dbname

› host:port is where the hive server is listening.

› Architecture is similar to ODBC.


Page 6: Hive User Meeting March 2010 - Hive Team

Create table as select (CTAS)

• New feature in branch 0.5.

• E.g.,

CREATE TABLE T STORED AS TEXTFILE AS

SELECT a+1 a1, concat(b,c,d) b2

FROM S

WHERE …

Resulting schema:

T (a1 double, b2 string)

• The CREATE clause can take all table properties except EXTERNAL or PARTITIONED (support is on the roadmap).

• Atomicity: T will not be created if the select statement has an error.
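CTAS is standard enough that its semantics can be tried outside Hive. A minimal sketch using Python's built-in sqlite3 (not Hive: sqlite spells concat as the || operator and has no STORED AS clause), showing how T's schema is derived from the select list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S (a INTEGER, b TEXT, c TEXT, d TEXT)")
conn.execute("INSERT INTO S VALUES (1, 'x', 'y', 'z'), (2, 'p', 'q', 'r')")

# CTAS: the schema of T (a1, b2) is derived from the select list.
conn.execute("""
    CREATE TABLE T AS
    SELECT a + 1 AS a1, b || c || d AS b2
    FROM S
    WHERE a > 0
""")
print(conn.execute("SELECT * FROM T ORDER BY a1").fetchall())
# [(2, 'xyz'), (3, 'pqr')]
```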


Page 7: Hive User Meeting March 2010 - Hive Team

Join Strategies

Page 8: Hive User Meeting March 2010 - Hive Team

Left semi join

• Implementing IN/EXISTS subquery semantics:

SELECT A.*
FROM A
WHERE A.KEY IN
(SELECT B.KEY FROM B WHERE B.VALUE > 100);

is equivalent to:

SELECT A.*
FROM A LEFT SEMI JOIN B
ON (A.KEY = B.KEY AND B.VALUE > 100);

• Optimizations:
• map-side group-by to reduce the data flowing to reducers
• early exit on the first match in the join.
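The rewrite above can be sketched in a few lines of Python (an illustrative model, not Hive's actual execution plan); collecting B's filtered keys into a set gives both deduplication and the early-exit behavior, since each A row is emitted at most once:

```python
def left_semi_join(a_rows, b_rows, b_filter):
    """Return the rows of A whose key appears in B (after filtering B)."""
    b_keys = {b["key"] for b in b_rows if b_filter(b)}
    return [a for a in a_rows if a["key"] in b_keys]

A = [{"key": 1, "v": "x"}, {"key": 2, "v": "y"}, {"key": 3, "v": "z"}]
B = [{"key": 1, "value": 150}, {"key": 1, "value": 200}, {"key": 3, "value": 50}]

# Equivalent of: SELECT A.* FROM A LEFT SEMI JOIN B
#                ON (A.key = B.key AND B.value > 100)
print(left_semi_join(A, B, lambda b: b["value"] > 100))
# [{'key': 1, 'v': 'x'}]
```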


Page 9: Hive User Meeting March 2010 - Hive Team

Map Join Implementation

SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;

[Diagram: table b is the big table; tables a and c are small. Mappers 1-3 each read a split of b, and each mapper receives a full copy of all files of the small tables (a1, a2, c1).]

1. Spawn mappers based on the big table.
2. All files of all small tables are replicated onto each mapper.
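The replication step can be sketched in Python (a hypothetical helper, not Hive's Java implementation): the small tables are hashed in memory and the big table is streamed through, which is what each mapper does with its replicated copies:

```python
from collections import defaultdict
from itertools import product

def map_join(big_rows, *small_tables):
    """Hash the small tables in memory; stream the big table through."""
    hashes = []
    for small in small_tables:
        h = defaultdict(list)
        for row in small:
            h[row["key"]].append(row)
        hashes.append(h)
    out = []
    for big in big_rows:  # one pass over the big table, no shuffle/reduce
        matches = [h.get(big["key"], []) for h in hashes]
        for combo in product(*matches):  # an empty match list => no output (inner join)
            out.append((big,) + combo)
    return out

a = [{"key": 1, "a": "a1"}, {"key": 2, "a": "a2"}]                          # small
c = [{"key": 1, "c": "c1"}]                                                 # small
b = [{"key": 1, "b": "b1"}, {"key": 1, "b": "b2"}, {"key": 2, "b": "b3"}]   # big

rows = map_join(b, a, c)
print(len(rows))  # 2: only key=1 exists in all three tables
```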

Page 10: Hive User Meeting March 2010 - Hive Team

Bucket Map Join

set hive.optimize.bucketmapjoin = true;

1. Works together with map join.

2. All join tables are bucketized, and each small table's bucket count must divide the big table's bucket count.

3. Bucket columns == join columns.
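Why condition 2 matters can be seen from the bucket arithmetic. A sketch (illustrative Python, not Hive code; Hive's actual hash function differs from Python's hash()): when the small table's bucket count divides the big table's, every key in one big-table bucket lands in exactly one, predictable small-table bucket:

```python
def bucket_of(key, num_buckets):
    # A row goes to bucket hash(key) mod num_buckets.
    return hash(key) % num_buckets

def matching_small_bucket(big_bucket, small_num_buckets):
    # Valid only when small_num_buckets divides the big table's bucket
    # count; then a mapper on one big-table bucket needs just this one
    # bucket from the small table.
    return big_bucket % small_num_buckets

# Slide 11's example: b (big) has 2 buckets, a has 2, c has 1.
for b_bucket in (0, 1):
    print(b_bucket, matching_small_bucket(b_bucket, 2), matching_small_bucket(b_bucket, 1))
```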

Page 11: Hive User Meeting March 2010 - Hive Team

Bucket Map Join Implementation

SELECT /*+ MAPJOIN(a,c) */ a.*, b.*, c.*
FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;

[Diagram: tables a, b, c are all bucketized by 'key'; a has 2 buckets, b has 2, and c has 1. Mappers 1 and 2 read splits of bucket b1 and receive buckets a1 and c1; Mapper 3 reads bucket b2 and receives a2 and c1. Normally in production there will be thousands of buckets!]

1. Spawn mappers based on the big table.
2. Only the matching buckets of the small tables are replicated onto each mapper.

Page 12: Hive User Meeting March 2010 - Hive Team

Sort Merge Bucket Map Join

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

1. Works together with bucket map join.

2. Bucket columns == join columns == sort columns.

3. If partitioned, only the big table may span multiple partitions; the small tables must be restricted to a single partition by the query.

Page 13: Hive User Meeting March 2010 - Hive Team

Sort Merge Bucket Map Join

[Diagram: tables A, B, and C, each bucketized and sorted by key. Rows such as (1, val_1), (3, val_3), (4, val_4), (5, val_5), (20, val_20), (23, val_23), (25, val_25) are merged bucket by bucket in sorted order.]

• Small tables are read on demand; entire small tables are NOT held in memory.
• Can perform outer joins.
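The merge itself is the classic sorted-merge join. A Python sketch (not Hive's implementation) of why neither side needs an in-memory hash table: both inputs are already sorted by the join key, so the algorithm just advances two cursors:

```python
def sorted_merge_join(left, right):
    """Merge-join two lists of (key, value) rows, each sorted by key.
    Rows are read on demand; no side is fully materialized as a hash."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            k = left[i][0]
            li = i          # collect the run of equal keys on both sides
            while i < len(left) and left[i][0] == k:
                i += 1
            rj = j
            while j < len(right) and right[j][0] == k:
                j += 1
            for l in left[li:i]:
                for r in right[rj:j]:
                    out.append((k, l[1], r[1]))
    return out

A = [(1, "val_1"), (3, "val_3"), (4, "val_4"), (5, "val_5")]
B = [(4, "val_4"), (20, "val_20"), (23, "val_23")]
print(sorted_merge_join(A, B))  # [(4, 'val_4', 'val_4')]
```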

Page 14: Hive User Meeting March 2010 - Hive Team

Skew Join

A join can be bottlenecked on the reducer that gets the skewed key.

set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold;

Page 15: Hive User Meeting March 2010 - Hive Team

Skew Join

[Diagram: Job 1 runs a regular reduce-side join of A and B. Reducers 1 and 2 handle the non-skewed keys (K2, K3); rows with the skewed key K1 are not joined in place but written to HDFS files (a-K1, b-K1). Job 2 then joins a-K1 and b-K1 with a map join, and its output is merged with Job 1's output into the final results.]
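The two-job plan can be modeled in Python (an illustrative sketch only; Hive's actual implementation writes the skewed rows to HDFS between the jobs and detects skew at runtime):

```python
from collections import Counter

def skew_join(a_rows, b_rows, threshold):
    """Two-pass sketch of a skew join over (key, value) rows."""
    counts = Counter(k for k, _ in a_rows)
    skewed = {k for k, n in counts.items() if n > threshold}

    b_hash = {}
    for k, v in b_rows:
        b_hash.setdefault(k, []).append(v)

    # "Job 1": ordinary join on the non-skewed keys.
    out = [(k, va, vb)
           for k, va in a_rows if k not in skewed
           for vb in b_hash.get(k, [])]

    # "Job 2": the skewed keys are joined separately (in Hive, as a
    # map join fanned out across many mappers instead of one reducer).
    out += [(k, va, vb)
            for k, va in a_rows if k in skewed
            for vb in b_hash.get(k, [])]
    return out

A = [("K1", i) for i in range(100)] + [("K2", 0)]
B = [("K1", "b"), ("K2", "b")]
rows = skew_join(A, B, threshold=10)
print(len(rows))  # 101
```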

Page 16: Hive User Meeting March 2010 - Hive Team

Future Work

Skew Join with a Replication Algorithm

Memory Footprint Optimization

Page 17: Hive User Meeting March 2010 - Hive Team

Views, HBase Integration

Page 18: Hive User Meeting March 2010 - Hive Team

CREATE VIEW Syntax

CREATE VIEW [IF NOT EXISTS] view_name

[ (column_name [COMMENT column_comment], … ) ]

[COMMENT view_comment]

AS SELECT …

[ ORDER BY … LIMIT … ]

-- example

CREATE VIEW pokebaz(baz COMMENT 'this column used to be bar')

COMMENT 'views are good for layering on renaming'

AS SELECT bar FROM pokes;


Page 19: Hive User Meeting March 2010 - Hive Team

View Features

» Other commands
› SHOW TABLES: views show up too

› DESCRIBE: see view column descriptions

› DESCRIBE EXTENDED: retrieve view definition

» Enhancements on the way soon
› Dependency management (e.g. CASCADE/RESTRICT)

› Partition awareness

» Enhancements (long term)
› Updatable views

› Materialized views


Page 20: Hive User Meeting March 2010 - Hive Team

HBase Storage Handler

CREATE TABLE users(

userid int, name string, email string, notes string)

STORED BY

'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES (

"hbase.columns.mapping" =

"small:name,small:email,large:notes");


Page 21: Hive User Meeting March 2010 - Hive Team

HBase Storage Handler Features

» Commands supported
› CREATE EXTERNAL TABLE: register existing HTable

› SELECT: join, group by, union, etc., over multiple HBase tables, or mixed with native Hive tables

› INSERT: from any Hive query

» Enhancements needed (feedback on priority welcome)
› More flexible column mapping, ALTER TABLE

› Timestamp read/write/restrict

› Filter pushdown

› Partition support

› Write atomicity


Page 22: Hive User Meeting March 2010 - Hive Team

UDF, UDAF and UDTF

Page 23: Hive User Meeting March 2010 - Hive Team

User-Defined Functions (UDF)

» 1 input to 1 output

» Typically used in SELECT
› SELECT concat(first, ' ', last) AS full_name …

» See the Hive language wiki for the full list of built-in UDFs
› http://wiki.apache.org/hadoop/Hive/LanguageManual

» Noteworthy features
› Sometimes you want to cast
• SELECT CAST(5.0/2.0 AS INT) …

› Conditional functions
• SELECT IF(boolean, if_true, if_false) …


Page 24: Hive User Meeting March 2010 - Hive Team

User Defined Aggregate Functions (UDAF)

» N inputs to 1 output

» Typically used with GROUP BY
› SELECT count(1) FROM … GROUP BY age
› SELECT count(DISTINCT first_name) FROM … GROUP BY last_name
› sum(), avg(), min(), max()

» For skew
› set hive.groupby.skewindata = true;
› set hive.map.aggr.hash.percentmemory = <some lower value>


Page 25: Hive User Meeting March 2010 - Hive Team

User Defined Table-Generating Functions (UDTF)

» 1 input to N outputs

» explode(Array<?> arg)
› Converts an array into multiple rows, one element per row

» Transform-like syntax
› SELECT udtf(col0, col1, …) AS colAlias FROM srcTable

» Lateral view syntax
› … FROM baseTable LATERAL VIEW udtf(col0, col1, …) tableAlias AS colAlias

» Also see: http://bit.ly/hive-udtf
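The explode/lateral-view semantics are easy to model in Python (illustrative only; real Hive UDTFs are Java classes):

```python
def explode(arr):
    """UDTF: one input row's array -> N output rows (one per element)."""
    for element in arr:
        yield element

def lateral_view(rows, udtf, src_col, alias):
    """LATERAL VIEW: join each input row to every row the UDTF
    produces from it, keeping the original columns alongside."""
    for row in rows:
        for value in udtf(row[src_col]):
            yield {**row, alias: value}

src = [{"id": "a", "group_ids": [1, 2, 3]}]
# Equivalent of: SELECT src.*, myTable.*
#                FROM src LATERAL VIEW explode(group_ids) myTable AS group_id
print(list(lateral_view(src, explode, "group_ids", "group_id")))
```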


Page 26: Hive User Meeting March 2010 - Hive Team

UDTF using Transform Syntax

» SELECT explode(group_ids) AS group_id FROM src


[Diagram: input table src with an array column group_ids, and the exploded output with one row per element.]

Page 27: Hive User Meeting March 2010 - Hive Team

UDTF using Lateral View Syntax

» SELECT src.*, myTable.*
FROM src LATERAL VIEW explode(group_ids) myTable AS group_id



Page 28: Hive User Meeting March 2010 - Hive Team

UDTF using Lateral View Syntax


[Diagram: a src row whose group_ids array contains (1, 2, 3) is passed through explode(group_ids) myTable AS group_id, producing group_id values 1, 2, 3; each output row is joined back to its input row to form the result.]

Page 29: Hive User Meeting March 2010 - Hive Team

SerDe – Serialization/Deserialization

Page 30: Hive User Meeting March 2010 - Hive Team

SerDe Examples

» CREATE TABLE mylog (

user_id BIGINT,

page_url STRING,

unix_time INT)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

» CREATE TABLE mylog_rc (

user_id BIGINT,

page_url STRING,

unix_time INT)

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'

STORED AS RCFILE;


Page 31: Hive User Meeting March 2010 - Hive Team

SerDe

» SerDe is short for serialization/deserialization. It controls the format of a row.

» Serialized format:
› Delimited format (tab, comma, ctrl-a, …)

› Thrift Protocols

› ProtocolBuffer*

» Deserialized (in-memory) format:
› Java Integer/String/ArrayList/HashMap

› Hadoop Writable classes

› User-defined Java Classes (Thrift, ProtocolBuffer*)

» * ProtocolBuffer support not available yet.
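The contract a SerDe implements is just a pair of functions between the on-disk format and the in-memory row object. A toy Python model of a delimited SerDe (Hive's real SerDes are Java classes operating on Writables, so this only mirrors the idea):

```python
class DelimitedSerDe:
    """Toy row-format SerDe: tab-delimited text on disk,
    a Python list as the deserialized in-memory format."""

    def __init__(self, field_delim="\t"):
        self.field_delim = field_delim

    def deserialize(self, line):
        # On-disk line -> in-memory row object.
        return line.rstrip("\n").split(self.field_delim)

    def serialize(self, row):
        # In-memory row object -> on-disk line.
        return self.field_delim.join(str(f) for f in row)

serde = DelimitedSerDe()
row = serde.deserialize("123\t/index.html\t1267401600\n")
print(row)  # ['123', '/index.html', '1267401600']
```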


Page 32: Hive User Meeting March 2010 - Hive Team

Where is SerDe?

[Diagram: SerDe sits between the FileFormat / Hadoop serialization layer and the Hive operators, on both the mapper and the reducer side.

Mapper: file on HDFS → Writable → (SerDe deserialize) → hierarchical object → Hive operators / user scripts (via streams) → (SerDe serialize) → Writable → map output file. Reducer: map output file → Writable → (deserialize) → hierarchical object → Hive operators → (serialize) → Writable → file on HDFS.

The Writable may be Text('imp 1.0 3 54'), UTF-8 encoded, or BytesWritable(\x3F\x64\x72\x00). The hierarchical object may be a Java object of a user class, a standard object (ArrayList for struct and array, HashMap for map), or a lazily-deserialized LazyObject; an ObjectInspector describes its structure. Sample rows: imp 1.0 3 54 / imp 0.2 1 33 / clk 2.2 8 212 / imp 0.7 2 22, or thrift_record<…>.]

Page 33: Hive User Meeting March 2010 - Hive Team

Object Inspector

[Diagram: SerDe.deserialize turns a Writable (e.g. BytesWritable(\x3F\x64\x72\x00) or Text('a=av:b=bv 23 1:2=4:5 abcd')) into a hierarchical object, and the SerDe's getObjectInspector returns the ObjectInspector that describes it; a TypeInfo describes the type independently of the in-memory representation.

The example type is a struct of { map<string,string> a, int b, list<struct{int a, int b}> c, string d } (the slide's classes HO and ClassC). The deserialized standard object is List(HashMap("a"→"av", "b"→"bv"), 23, List(List(1, null), List(2, 4), List(5, null)), "abcd").

Navigation chains ObjectInspectors: getStructField (via the struct field's ObjectInspector) reaches the HashMap; getMapValue (via the map value's ObjectInspector) reaches the String object "av".]
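The deserialization shown on this slide can be reproduced in Python (the delimiters are read off the slide's example: space at the top level, ':' between collection items, '=' for map entries and optional struct fields; Python dicts and lists stand in for the standard object):

```python
def deserialize_row(text):
    """Deserialize 'a=av:b=bv 23 1:2=4:5 abcd' into the standard
    in-memory format: field layout (map, int, list<struct>, string)
    is taken from the slide's example classes HO and ClassC."""
    f_map, f_int, f_list, f_str = text.split(" ")
    m = dict(kv.split("=") for kv in f_map.split(":"))
    lst = []
    for item in f_list.split(":"):
        parts = item.split("=")
        a = int(parts[0])
        b = int(parts[1]) if len(parts) > 1 else None  # missing field -> null
        lst.append([a, b])
    return [m, int(f_int), lst, f_str]

row = deserialize_row("a=av:b=bv 23 1:2=4:5 abcd")
print(row)
# [{'a': 'av', 'b': 'bv'}, 23, [[1, None], [2, 4], [5, None]], 'abcd']

# ObjectInspector-style navigation: struct field 0, then map key "a".
print(row[0]["a"])  # av
```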

Page 34: Hive User Meeting March 2010 - Hive Team

When to add a new SerDe

» The user has data in a special serialized format not yet supported by Hive, and does not want to convert the data before loading it into Hive.

» The user has a more efficient way of serializing the data on disk.


Page 35: Hive User Meeting March 2010 - Hive Team

How to add a new SerDe for text data

» Follow the example in contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java

» RegexSerDe uses a user-provided regular expression to deserialize data.

» CREATE TABLE apache_log(host STRING,

identity STRING, user STRING, time STRING, request STRING,

status STRING, size STRING, referer STRING, agent STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",

"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")

STORED AS TEXTFILE;
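Because input.regex is a plain regular expression, it can be tested outside Hive before wiring it into the table definition. A Python check of the same pattern (with the SQL-level backslash doubling removed) against a sample Apache log line:

```python
import re

# Same pattern as input.regex above, in Python raw-string form.
INPUT_REGEX = (r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
               r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?')

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://example.com/start.html" "Mozilla/4.08"')

m = re.match(INPUT_REGEX, line)
# One capture group per table column: host, identity, user, time,
# request, status, size, referer, agent.
host, identity, user, time, request, status, size, referer, agent = m.groups()
print(host, status, size)  # 127.0.0.1 200 2326
```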


Page 36: Hive User Meeting March 2010 - Hive Team

How to add a new SerDe for binary data

» Follow the examples in contrib/src/java/org/apache/hadoop/hive/contrib/serde2/thrift (HIVE-706) and serde/src/java/org/apache/hadoop/hive/serde2/binarysortable

» CREATE TABLE mythrift_table

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.thrift.ThriftSerDe'

WITH SERDEPROPERTIES (

"serialization.class" = "com.facebook.serde.tprofiles.full",

"serialization.format" = "com.facebook.thrift.protocol.TBinaryProtocol");

» NOTE: Column information is provided by the SerDe class.


Page 37: Hive User Meeting March 2010 - Hive Team

Q & A


