Talk given at the Percona 2009 conference, held alongside the MySQL Conference in Santa Clara, USA.
Data Warehousing & Analytics on Hadoop
Ashish Thusoo, Prasad Chakka
Facebook Data Team
Why Another Data Warehousing System?
Data, data and more data
– 200GB per day in March 2008
– 2+ TB (compressed) raw data per day today
Let's try Hadoop…
Pros:
– Superior availability/scalability/manageability
– Efficiency not that great, but can throw more hardware at it
– Partial availability/resilience/scale more important than ACID
Cons: programmability and metadata
– Map-Reduce is hard to program (users know SQL/bash/Python)
– Need to publish data in well-known schemas
Solution: HIVE
What is HIVE?
A system for managing and querying structured data built on top of Hadoop:
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files
Key building principles:
– SQL as a familiar data warehousing tool
– Extensibility: types, functions, formats, scripts
– Scalability and performance
Simplifying Hadoop
hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Looks like this:
[Figure: a cluster of nodes, each with its own local disks, connected by 1 Gigabit and 4-8 Gigabit links. Node = DataNode + Map-Reduce.]
Data Warehousing at Facebook Today
[Figure: Web Servers → Scribe Servers → Filers → Hive on Hadoop cluster, alongside Oracle RAC and Federated MySQL.]
Hive/Hadoop Usage @ Facebook
Types of applications:
– Reporting: e.g. daily/weekly aggregations of impression/click counts; complex measures of user engagement
– Ad hoc analysis: e.g. how many group admins, broken down by state/country
– Data mining (assembling training data): e.g. user engagement as a function of user attributes
– Spam detection: anomalous patterns for Site Integrity; application API usage patterns
– Ad optimization
– Too many to count…
Hadoop Usage @ Facebook
Data statistics:
– Total data: ~1.7PB; cluster capacity: ~2.4PB
– Net data added/day: ~15TB
– 6TB of uncompressed source logs
– 4TB of uncompressed dimension data reloaded daily
– Compression factor: ~5x (gzip; more with bzip)
Usage statistics:
– 3200 jobs/day with 800K map-reduce tasks/day
– 55TB of compressed data scanned daily
– 15TB of compressed output data written to HDFS
– 80 MM compute minutes/day
HIVE Internals!!
HIVE: Components
[Figure: Hive CLI and Web UI on top (DDL, queries, browsing); Parser, Planner, Optimizer and Execution inside the engine; a SerDe layer (Thrift, Jute, JSON, …); the MetaStore with its Thrift API and DB; all running over Map-Reduce and HDFS.]
Data Model
– Tables: the MetaStore DB records data location, bucketing info and partitioning columns
– Logical partitioning in HDFS, e.g. /hive/clicks/ds=2008-03-25 under /hive/clicks
– Hash partitioning (buckets), e.g. /hive/clicks/ds=2008-03-25/0, …
Hive Query Language
SQL:
– Subqueries in the FROM clause
– Equi-joins
– Multi-table insert
– Multi-group-by
Sampling
Complex object types
Extensibility:
– Pluggable map-reduce scripts
– Pluggable user-defined functions
– Pluggable user-defined types
– Pluggable data formats
Map Reduce Example

Input:
– Machine 1: <k1, v1> <k2, v2> <k3, v3>
– Machine 2: <k4, v4> <k5, v5> <k6, v6>
Local Map:
– Machine 1: <nk1, nv1> <nk2, nv2> <nk3, nv3>
– Machine 2: <nk2, nv4> <nk2, nv5> <nk1, nv6>
Global Shuffle:
– Machine 1: <nk2, nv4> <nk2, nv5> <nk2, nv2>
– Machine 2: <nk1, nv1> <nk3, nv3> <nk1, nv6>
Local Sort:
– Machine 1: <nk2, nv4> <nk2, nv5> <nk2, nv2>
– Machine 2: <nk1, nv1> <nk1, nv6> <nk3, nv3>
Local Reduce:
– Machine 1: <nk2, 3>
– Machine 2: <nk1, 2> <nk3, 1>
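The four stages above can be sketched in Python (the deck notes users know SQL/bash/Python). The `NEW_KEY` table is a made-up stand-in for the map function, chosen so the records are re-keyed exactly as in the diagram:

```python
from collections import defaultdict

# The <k, v> pairs held by each machine in the diagram.
machine1 = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
machine2 = [("k4", "v4"), ("k5", "v5"), ("k6", "v6")]

# Hypothetical map function: re-keys each record the way the diagram does
# (<k1,v1> -> <nk1,nv1>, <k4,v4> -> <nk2,nv4>, and so on).
NEW_KEY = {"k1": "nk1", "k2": "nk2", "k3": "nk3",
           "k4": "nk2", "k5": "nk2", "k6": "nk1"}

def local_map(pairs):
    return [(NEW_KEY[k], v) for k, v in pairs]

def global_shuffle(partitions, n_reducers=2):
    # Hash-partition so all values for one key land on one reducer.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for part in partitions:
        for nk, nv in part:
            buckets[hash(nk) % n_reducers][nk].append(nv)
    return buckets

def local_sort_and_reduce(bucket):
    # Sort by key, then reduce each group; the reduce here is a count.
    return {nk: len(vals) for nk, vals in sorted(bucket.items())}

result = {}
for bucket in global_shuffle([local_map(machine1), local_map(machine2)]):
    result.update(local_sort_and_reduce(bucket))
print(result)  # counts: nk1=2, nk2=3, nk3=1, as in the diagram
```

The shuffle is the only step that moves data between machines; map, sort and reduce all run locally, which is why the stages are labeled that way in the diagram.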
Hive QL – Join
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
Input tables:

page_view:
pageid | userid | time
1 | 111 | 9:08:01
2 | 111 | 9:08:13
1 | 222 | 9:08:14

user:
userid | age | gender
111 | 25 | female
222 | 32 | male

Map: emit <join key, <table tag, column>>:
– from page_view: 111 → <1, 1>; 111 → <1, 2>; 222 → <1, 1>
– from user: 111 → <2, 25>; 222 → <2, 32>

Shuffle/Sort: group the tagged values by join key:
– key 111: <1, 1> <1, 2> <2, 25>
– key 222: <1, 1> <2, 32>

Reduce: within each key, cross rows tagged 1 with rows tagged 2:
– key 111 → pv_users rows (pageid, age): (1, 25), (2, 25)
– key 222 → (1, 32)
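The reduce-side join walked through above can be sketched in Python; the tuples and table tags mirror the example tables, and this is an illustration of the technique, not Hive's actual implementation:

```python
from collections import defaultdict

# Rows from the two tables in the example.
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user      = [(111, 25, "female"), (222, 32, "male")]

# Map phase: emit <join key, (table tag, needed column)>.
mapped = [(userid, (1, pageid)) for pageid, userid, _ in page_view]
mapped += [(userid, (2, age)) for userid, age, _ in user]

# Shuffle/sort: group all tagged values by join key.
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

# Reduce phase: within each group, cross table-1 rows with table-2 rows.
pv_users = []
for key, tagged in groups.items():
    pageids = [v for tag, v in tagged if tag == 1]
    ages = [v for tag, v in tagged if tag == 2]
    pv_users.extend((p, a) for p in pageids for a in ages)

print(sorted(pv_users))  # [(1, 25), (1, 32), (2, 25)]
```

Tagging each value with its source table is what lets a single reducer tell the two sides of the join apart after the shuffle has mixed them together.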
Hive QL – Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce
Input (pv_users), split across two mappers:
– Mapper 1: (pageid, age) = (1, 25), (2, 25)
– Mapper 2: (1, 32), (2, 25)
Map: emit <(pageid, age), 1>:
– Mapper 1: <1, 25> → 1; <2, 25> → 1
– Mapper 2: <1, 32> → 1; <2, 25> → 1
Shuffle/Sort: group by (pageid, age):
– Reducer 1: <1, 25> → 1; <1, 32> → 1
– Reducer 2: <2, 25> → 1; <2, 25> → 1
Reduce: sum the counts:
– Reducer 1: (pageid, age, count) = (1, 25, 1), (1, 32, 1)
– Reducer 2: (2, 25, 2)
Group by Optimizations
Map-side partial aggregations:
– Hash-based aggregates
– Serialized key/values in hash tables
Optimizations being worked on:
– Exploit pre-sorted data for distinct counts
– Partial aggregations and combiners
– Be smarter about avoiding multiple stages
– Exploit table/column statistics for deciding strategy
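A minimal sketch of what map-side partial aggregation buys, using Python's `Counter` as the per-mapper hash table; the split of rows across the two mappers is invented for illustration:

```python
from collections import Counter

# pv_users rows from the group-by example, split across two mappers.
mapper_inputs = [
    [(1, 25), (2, 25)],   # mapper 1
    [(1, 32), (2, 25)],   # mapper 2
]

# Map-side partial aggregation: each mapper keeps a hash table of
# (pageid, age) -> partial count and emits one record per distinct key
# it saw, instead of one record per input row.
partials = [Counter(rows) for rows in mapper_inputs]

# Reduce: sum the partial counts for each key.
final = Counter()
for p in partials:
    final.update(p)

print(dict(final))  # {(1, 25): 1, (2, 25): 2, (1, 32): 1}
```

On these four rows nothing is saved, but when a mapper sees the same group key thousands of times, emitting one partial count per key instead of one record per row shrinks the shuffle volume dramatically.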
Inserts into Files, Tables and Local Files
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY(pv_users.gender)
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY(pv_users.age)
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY(pv_users.age);
Extensibility - Custom Map/Reduce Scripts
FROM (
  FROM pv_users
  MAP(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY(dt)
) map
INSERT INTO TABLE pv_users_reduced
  REDUCE(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);
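The slide does not show what 'map_script' and 'reduce_script' contain. A hypothetical pair in Python, assuming the usual streaming convention of tab-separated rows on stdin/stdout and counting distinct users per day, might look like:

```python
from itertools import groupby

# Hypothetical stand-ins for 'map_script' and 'reduce_script': Hive streams
# each input row to the script as a tab-separated line and reads
# tab-separated rows back.

def map_script(lines):
    # Input columns: userid, date. Output columns: dt, uid (reordered).
    for line in lines:
        uid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{uid}"

def reduce_script(lines):
    # Rows arrive clustered by dt; count distinct uids per dt.
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda r: r[0]):
        yield f"{dt}\t{len({uid for _, uid in group})}"

# Simulate the pipeline on a few pv_users rows.
raw = ["u1\t2008-03-25\n", "u2\t2008-03-25\n", "u1\t2008-03-26\n"]
mapped = sorted(map_script(raw))    # stands in for CLUSTER BY(dt)
print(list(reduce_script(mapped)))  # ['2008-03-25\t2', '2008-03-26\t1']
```

`CLUSTER BY(dt)` is what guarantees the reduce script sees all rows for one `dt` contiguously, which is why a simple `groupby` over the stream is enough.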
Open Source Community
– 21 contributors and growing (6 contributors within Facebook)
– Contributors from academia, other web companies, etc.
– 7 committers (1 external to Facebook; looking to add more)
Future Work
– Statistics and cost-based optimization
– Integration with BI tools (through JDBC/ODBC)
– Performance improvements
– More SQL constructs & UDFs
– Indexing
– Schema evolution
– Advanced operators: cubes/frequent item sets/window functions
Hive Roadmap– http://wiki.apache.org/hadoop/Hive/Roadmap
Information
Available as a sub-project in Hadoop:
– http://wiki.apache.org/hadoop/Hive (wiki)
– http://hadoop.apache.org/hive (home page)
– http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
– ##hive (IRC)
– Works with hadoop-0.17, 0.18, 0.19
Release 0.3 is coming in the next few weeks
Mailing Lists: – hive-{user,dev,commits}@hadoop.apache.org