Hive User Meeting March 2010 - Hive Team


DESCRIPTION

New features and APIs in Hive


1. Hive New Features and API: Facebook Hive Team, March 2010

2. JDBC/ODBC and CTAS

3. Hive ODBC Driver

  • Architecture:
    • Client/DriverManager, which makes local calls into the dynamic libraries:
    • unixODBC (libodbchive.so) + hiveclient (libhiveclient.so) + thriftclient (libthrift.so), which talk over a network socket to:
    • HiveServer (in Java), which makes local calls into:
    • Hive + Hadoop
  • unixODBC is not part of Hive open source, so you need to build it yourself.
    • 32-bit/64-bit architecture
    • Thrift has to be r790732
    • Boost libraries
    • Linking with a 3rd-party Driver Manager.

4. Facebook Use Case

  • Hive integration with MicroStrategy 8.1.2 (HIVE-187) and 9.0.1 (HIVE-1101).
      • FreeForm SQL (reports generated from user input queries)
      • Reports generated daily.
  • All servers (MSTR IS server, HiveServer) are running on Linux.
      • ODBC driver needs to be 32-bit.

5. Hive JDBC

  • Embedded mode:
    • jdbc:hive://
  • Client/server mode:
    • jdbc:hive://host:port/dbname
    • host:port is where HiveServer is listening.
    • Architecture is similar to ODBC.
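
The two URL forms above can be told apart mechanically. The following is a minimal sketch in Python, not part of the Hive JDBC driver: the function name parse_hive_jdbc_url and the returned dictionary layout are hypothetical, and the host, port, and database used in the usage note are made-up examples.

```python
from urllib.parse import urlsplit


def parse_hive_jdbc_url(url):
    """Classify a Hive JDBC URL as embedded or client/server mode."""
    prefix = "jdbc:"
    if not url.startswith(prefix + "hive://"):
        raise ValueError("not a Hive JDBC URL: " + url)
    # Drop "jdbc:" so the rest parses like an ordinary URL (hive://host:port/dbname).
    parts = urlsplit(url[len(prefix):])
    if not parts.netloc:
        # jdbc:hive:// has no host:port, so Hive runs inside the client process.
        return {"mode": "embedded"}
    return {
        "mode": "client/server",
        "host": parts.hostname,
        "port": parts.port,
        "dbname": parts.path.lstrip("/"),
    }
```

For example, parse_hive_jdbc_url("jdbc:hive://") reports embedded mode, while parse_hive_jdbc_url("jdbc:hive://localhost:10000/default") reports client/server mode with the host, port, and database name split out.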

6. Create table as select (CTAS)

    • New feature in branch 0.5.
    • E.g.,
        CREATE TABLE T STORED AS TEXTFILE AS
        SELECT a+1 a1, concat(b,c,d) b2
        FROM S
        WHERE ...
    • Resulting schema:
        T (a1 double, b2 string)
    • The create-clause can take all table properties except external table or partitioned table (on roadmap).
    • Atomicity: T will not be created if the select statement has an error.
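
The atomicity guarantee can be pictured as "run the select first, register the table last". The following Python sketch only illustrates that ordering and says nothing about how Hive implements it. The catalog dictionary and the function create_table_as_select are hypothetical stand-ins for the metastore and the CTAS statement.

```python
def create_table_as_select(catalog, name, select):
    """Register table `name` only if `select` runs to completion.

    `catalog` is a plain dict standing in for the metastore (an assumption),
    and `select` is any zero-argument callable that yields result rows.
    """
    if name in catalog:
        raise ValueError("table already exists: " + name)
    # Materialize the whole result first. If the select raises, the exception
    # propagates from this line and the catalog is never touched.
    rows = list(select())
    catalog[name] = rows
    return rows


# Usage mirroring the slide: T gets a1 = a + 1 and b2 = concat(b, c, d).
catalog = {"S": [{"a": 1.0, "b": "x", "c": "y", "d": "z"}]}
create_table_as_select(
    catalog,
    "T",
    lambda: ({"a1": r["a"] + 1, "b2": r["b"] + r["c"] + r["d"]} for r in catalog["S"]),
)
```

If the callable fails partway through, for instance by referencing a column that does not exist, no entry for the new table appears in the catalog.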

7. Join Strategies

8. Left semi join

  • Implementing IN/EXISTS subquery semantics:
        SELECT A.*
        FROM A WHERE A.KEY IN
        (SELECT B.KEY FROM B WHERE B.VALUE > 100);
    • Is equivalent to:
        SELECT A.*
        FROM A LEFT SEMI JOIN B
        ON (A.KEY = B.KEY and B.VALUE > 100);
  • Optimizations:
    • map-side group-by to reduce the data flowing to reducers
    • early exit as soon as a match is found in the join.
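
The semantics of the rewrite can be sketched in a few lines of Python. This is an illustration of left-semi-join behavior rather than Hive's operator code: the function name and row layout are assumptions. Collapsing B to a set of distinct keys plays the role of the map-side group-by, and the set membership test stands in for the early exit, since one hit is enough to emit the A row.

```python
def left_semi_join(left_rows, right_rows, key, right_filter=lambda row: True):
    """Return each left row at most once if any right row matches on `key`."""
    # Only the distinct keys of the right side matter, never its other columns.
    matching_keys = {row[key] for row in right_rows if right_filter(row)}
    # A left row is emitted once, no matter how many right rows share its key.
    return [row for row in left_rows if row[key] in matching_keys]


a = [{"key": 1, "v": "a1"}, {"key": 2, "v": "a2"}, {"key": 3, "v": "a3"}]
b = [{"key": 1, "value": 150}, {"key": 1, "value": 200}, {"key": 2, "value": 50}]
result = left_semi_join(a, b, "key", lambda row: row["value"] > 100)
```

Here only the A row with key 1 survives, and it appears once even though two B rows match it, which is the difference from an ordinary inner join.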

9. Map Join Implementation

    SELECT /*+MAPJOIN(a,c)*/ a.*, b.*, c.*
    FROM a join b on a.key = b.key
    join c on a.key = c.key;

[Figure: Table b is the big table, read by Mapper 1, Mapper 2, and Mapper 3; File a1 and File a2 of Table a and File c1 of Table c are copied to every mapper.]

  • Spawn mappers based on the big table
  • All files of all small tables are replicated onto each mapper
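
What a single mapper does in a map join can be sketched as a hash join in Python. This is a simplified model under stated assumptions, not Hive's implementation: rows are dictionaries, every table is joined on the same column, and the names build_hash_table and map_join are hypothetical.

```python
import itertools
from collections import defaultdict


def build_hash_table(rows, key):
    """Index one small table by join key; done once per mapper."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def map_join(big_split, small_tables, key):
    """Join one split of the big table against fully loaded small tables."""
    hashed = [build_hash_table(rows, key) for rows in small_tables]
    for big_row in big_split:
        # Look up the matching rows in every small table for this big row.
        matches = [h.get(big_row[key], []) for h in hashed]
        # product() yields nothing if any small table has no match (inner join).
        for combo in itertools.product(*matches):
            yield (big_row,) + combo


table_a = [{"key": 1, "a": "a1"}, {"key": 2, "a": "a2"}]
table_c = [{"key": 1, "c": "c1"}]
split_of_b = [{"key": 1, "b": "b1"}, {"key": 2, "b": "b2"}]
joined = list(map_join(split_of_b, [table_a, table_c], "key"))
```

Because the small tables are held entirely in memory on every mapper, no reduce phase is needed, but the approach only works while those tables stay small.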

10. Bucket Map Join

    • set hive.optimize.bucketmapjoin = true;
    • Works together with map join
    • All join tables are bucketized, and each small table's bucket number can be divided by the big table's bucket number.
    • Bucket columns == Join columns

11. Bucket Map Join Implementation

    SELECT /*+MAPJOIN(a,c)*/ a.*, b.*, c.*
    FROM a join b on a.key = b.key
    join c on a.key = c.key;

  • Tables a, b, and c are all bucketized by key; a has 2 buckets, b has 2, and c has 1.
  • Normally in production, there will be thousands of buckets!

[Figure: Mapper 1 and Mapper 2 read Bucket b1 of Table b and load Bucket a1 and Bucket c1; Mapper 3 reads Bucket b2 and loads Bucket a2 and Bucket c1.]

  • Spawn mappers based on the big table
  • Only matching buckets of all small tables are replicated onto each mapper
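
Which small-table buckets "match" a given big-table bucket follows from modular arithmetic. The Python sketch below assumes rows are assigned to buckets by hash(key) mod bucket count, with buckets numbered from 0, and that the two bucket counts are multiples of one another in either direction. Both the function name and that either-direction rule are assumptions of the sketch, not statements from the slides.

```python
def matching_small_buckets(big_bucket, big_count, small_count):
    """Return the small-table bucket numbers that can hold keys of `big_bucket`.

    Assumes a row lands in bucket hash(key) % count, buckets numbered from 0.
    """
    if big_count % small_count == 0:
        # Fewer (or equally many) small buckets: exactly one bucket matches.
        return [big_bucket % small_count]
    if small_count % big_count == 0:
        # More small buckets: every bucket congruent to big_bucket matches.
        return [j for j in range(small_count) if j % big_count == big_bucket]
    raise ValueError("bucket counts must be multiples of one another")


# The slide's example: b (big) has 2 buckets, a has 2, c has 1.
for big_bucket in range(2):
    a_buckets = matching_small_buckets(big_bucket, 2, 2)
    c_buckets = matching_small_buckets(big_bucket, 2, 1)
```

With the slide's counts, the first bucket of b pairs with the first bucket of a, the second with the second, and both pair with the single bucket of c, so each mapper loads far less than the whole of a and c.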

12. Sort Merge Bucket Map Join

    • set hive.optimize.bucketmapjoin = true;
    • set hive.optimize.bucketmapjoin.sortedmerge = true;
    • set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    • Works together with bucket map join
    • Bucket columns == Join columns == sort columns
    • If the tables are partitioned, only the big table may span multiple partitions; each small table must be restricted to a single partition by the query.

13. Sort Merge Bucket Map Join

[Figure: TableA, TableB, and TableC with sample rows 1, val_1; 3, val_3; 5, val_5; 4, val_4; 4, val_4; 20, val_20; 23, val_23; 20, val_20; 25, val_25.]

    • Small tables are read on demand
    • Entire small tables are NOT held in memory
    • Can perform outer join
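
The merge step that makes this possible can be sketched in Python for two inputs that are already sorted by key. This is a minimal two-way inner-join sketch, not Hive's operator. For simplicity both inputs are in-memory lists here, but the algorithm only ever looks at the current run of equal keys on each side, which is why a real implementation can read the small table on demand. The sample rows are illustrative.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key too small: advance the left cursor
        elif lk > rk:
            j += 1  # right key too small: advance the right cursor
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs, then skip past both.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


table_a = [(1, "val_1"), (4, "val_4"), (23, "val_23")]
table_b = [(3, "val_3"), (4, "val_4"), (20, "val_20")]
joined = sort_merge_join(table_a, table_b)
```

Each cursor moves forward only, so every input row is visited once and memory use is bounded by the longest run of duplicate keys rather than by table size.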

14. Skew Join

  • The join is bottlenecked on the reducer that receives the skewed key
  • set hive.optimize.skewjoin = true;
  • set hive.skewjoin.key = skew_key_threshold;
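
The idea of handling skewed keys in a separate pass can be sketched in Python as two "jobs". This is a simplified model and not Hive's algorithm: here a key counts as skewed when it appears in more than skew_key_threshold rows of either input, the rows of such keys are set aside in lists instead of HDFS files, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def hash_join(a_rows, b_rows):
    """Plain inner join of (key, value) pairs, used for both jobs below."""
    b_by_key = defaultdict(list)
    for key, value in b_rows:
        b_by_key[key].append(value)
    return [(k, av, bv) for k, av in a_rows for bv in b_by_key.get(k, [])]


def skew_join(a_rows, b_rows, skew_key_threshold):
    """Join non-skewed keys first, then join the skewed keys separately."""
    a_counts = Counter(k for k, _ in a_rows)
    b_counts = Counter(k for k, _ in b_rows)
    skewed = {k for k in set(a_counts) | set(b_counts)
              if a_counts[k] > skew_key_threshold or b_counts[k] > skew_key_threshold}
    # Job 1: the ordinary join, with rows of skewed keys set aside.
    job1 = hash_join([r for r in a_rows if r[0] not in skewed],
                     [r for r in b_rows if r[0] not in skewed])
    # Job 2: a follow-up join over only the set-aside rows.
    job2 = hash_join([r for r in a_rows if r[0] in skewed],
                     [r for r in b_rows if r[0] in skewed])
    return job1 + job2


a = [("K1", "a1"), ("K1", "a2"), ("K1", "a3"), ("K2", "a4"), ("K3", "a5")]
b = [("K1", "b1"), ("K1", "b2"), ("K2", "b3"), ("K3", "b4")]
rows = skew_join(a, b, skew_key_threshold=2)
```

The combined output holds the same rows as a single join of a and b. The benefit in a distributed setting is that the heavy key no longer pins one reducer while the others sit idle.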

15. Skew Join

[Figure: Job 1 runs A join B over TableA and TableB on Reducer 1 and Reducer 2 for keys K1, K2, and K3, and writes the skewed key's rows to HDFS files a-K1 and b-K1; Job 2 map-joins a-K1 with b-K1; both jobs feed the final results.]

16. Future Work

  • Skew Join with a Replication Algorithm
  • Memory Footprint Optimization

17. Views, HBase Integration

18. CREATE VIEW Syntax

    CREATE VIEW [IF NOT EXISTS] view_name
    [ (column_name [COMMENT column_comment], ...) ]
    [COMMENT view_comment]
    AS SELECT ...
    [ ORDER BY ... LIMIT ... ]

    -- example
    CREATE VIEW pokebaz(baz COMMENT 'this column used to be bar')
    COMMENT 'views are good for layering on renaming'
    AS SELECT bar FROM pokes;

19. View Features

  • Other commands
    • SHOW TABLES: views show up too
    • DESCRIBE: see view column descriptions
    • DESCRIBE EXTENDED: retrieve view definition
  • Enhancements on the way soon
    • Dependency management (e.g. CASCADE/RESTRICT)
    • Partition awareness
  • Enhancements (long term)
    • Updatable views
    • Materialized views

20. HBase Storage Handler

    CREATE TABLE users(
      userid int, name string, email string, notes string)
    STORED BY
      'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" =
      "small:name,small:email,large:notes");
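
How the mapping string lines up with the table's columns can be sketched in Python. The sketch assumes, based on the example above having four Hive columns but only three mapping entries, that the first Hive column (userid) is the HBase row key and the remaining columns pair up in order with family:qualifier entries. The function name is hypothetical and this is not the storage handler's own parser.

```python
def parse_columns_mapping(hive_columns, mapping):
    """Pair Hive columns with HBase (family, qualifier) entries.

    Assumption: the first Hive column is the row key and has no mapping entry.
    """
    key_column, *value_columns = hive_columns
    entries = mapping.split(",")
    if len(entries) != len(value_columns):
        raise ValueError("expected one mapping entry per non-key column")
    result = {}
    for column, entry in zip(value_columns, entries):
        family, qualifier = entry.strip().split(":", 1)
        result[column] = (family, qualifier)
    return key_column, result


key, columns = parse_columns_mapping(
    ["userid", "name", "email", "notes"],
    "small:name,small:email,large:notes",
)
```

Read this way, name and email live in the column family small while notes lives in the family large, which keeps bulky values apart from frequently read ones.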

21. HBase Storage Handler Features

  • Commands supported
    • CREATE EXTERNAL TABLE: register existing HTable
    • SELECT: join, group by, union, etc.; over multiple HBase tables, or mixing with native Hive tables
    • INSERT: from any Hive query
  • Enhancements Needed (feedback on priority welcome)
    • More flexible column mapping, ALTER TABLE
    • Timestamp read/write/restrict
    • Filter pushdown
    • Partition support
    • Write atomicity

22. UDF, UDAF and UDTF

23. User-Defined Functions (UDF)

  • 1 input to 1 output
  • Typically used in select
    • SELECT concat(first, ' ', last) AS full_name
  • See Hive language wiki for full list of built-in UDFs
    • http://wiki.apache.org/hadoop/Hive/LanguageManual
  • Noteworthy features
    • Sometimes you want to cast
      • SELECT CAST(5.0/2.0 AS INT)
    • Conditional functions
      • SELECT IF(boolean, if_true, if_not_true)
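
The "1 input to 1 output" shape means a UDF is applied once per row and returns one value for that row. The Python sketch below mimics the three examples above with plain functions. The names concat, hive_if, and select_full_name are stand-ins written for this sketch, joining the name parts with a single space is an assumption, and int() truncating toward zero is assumed to mirror CAST(... AS INT).

```python
def concat(*parts):
    """Row-level string concatenation: one set of arguments in, one string out."""
    return "".join(parts)


def hive_if(condition, if_true, if_not_true):
    """Conditional function: returns one of two values per row."""
    return if_true if condition else if_not_true


def select_full_name(rows):
    """Apply the UDF to every row, yielding exactly one output per input row."""
    return [concat(r["first"], " ", r["last"]) for r in rows]


people = [{"first": "Ada", "last": "Lovelace"}, {"first": "Alan", "last": "Turing"}]
full_names = select_full_name(people)
halved = int(5.0 / 2.0)  # truncates the fractional part
label = hive_if(halved > 1, "big", "small")
```

The output list always has the same length as the input, which is what separates a UDF from the aggregate and table-generating functions on the following slides.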

24. User Defined Aggregate Functions (UDAF)

  • N inputs to 1 output
  • Typically used with GROUP BY
    • SELECT count(1) FROM GROUP BY age
    • SELECT count(DISTINCT first_name) GROUP BY last_name
    • sum(), avg(), min(), max()
  • For skew
    • set hive.groupby.skewindata = true;
    • set hive.map.aggr.hash.percentmemory =
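
The "N inputs to 1 output" shape, and the reason map-side hash aggregation helps, can be sketched in Python for count() with GROUP BY. This is an illustrative model and not Hive's GROUP BY operator: each input split is pre-aggregated into a small hash table of partial counts, and only those partial counts are merged afterwards. The function names and sample data are hypothetical.

```python
from collections import Counter


def partial_counts(split, key):
    """Map side: collapse one split into per-group partial counts."""
    return Counter(row[key] for row in split)


def merge_counts(partials):
    """Reduce side: add up the partial counts for each group."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


splits = [
    [{"age": 30}, {"age": 30}, {"age": 41}],
    [{"age": 30}, {"age": 41}, {"age": 41}, {"age": 52}],
]
counts = merge_counts(partial_counts(split, "age") for split in splits)
```

Seven input rows shrink to five partial entries before the merge. On skewed data the saving is much larger, since a group with millions of rows still contributes a single entry per split.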

25. User Defined Table-Generating Functions (UDTF)

  • 1 input to N outputs
  • explode(Array arg)
    • Converts an array into multiple rows, with one element per row
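
The "1 input to N outputs" shape can be sketched with a Python generator. This is a model of explode's behavior as described above, not Hive's implementation, and the column name tags is a made-up example.

```python
def explode(rows, array_column):
    """Yield one output row per element of each input row's array."""
    for row in rows:
        for element in row[array_column]:
            yield element


rows = [{"tags": ["a", "b"]}, {"tags": []}, {"tags": ["c"]}]
exploded = list(explode(rows, "tags"))
```

Three input rows become three output rows here only by coincidence: the first row contributes two, the second (with an empty array) contributes none, and the third contributes one.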