Talk given at the Percona 2009 conference, held alongside the MySQL Conference in Santa Clara, USA.
Data Warehousing & Analytics on Hadoop
Ashish Thusoo, Prasad Chakka
Facebook Data Team
Why Another Data Warehousing System?
Data, data and more data
– 200GB per day in March 2008
– 2+ TB (compressed) raw data per day today
Let's try Hadoop…
Pros:
– Superior availability/scalability/manageability
– Efficiency not that great, but can throw more hardware at it
– Partial availability/resilience/scale more important than ACID
Cons: programmability and metadata
– Map-Reduce is hard to program (users know SQL/bash/Python)
– Need to publish data in well-known schemas
Solution: HIVE
What is HIVE?
A system for managing and querying structured data built on top of Hadoop:
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files
Key building principles:
– SQL as a familiar data warehousing tool
– Extensibility: types, functions, formats, scripts
– Scalability and performance
Simplifying Hadoop
hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
Looks like this:
[Figure: a cluster of nodes, each with its own local disks, connected by 1 Gigabit and 4-8 Gigabit links. Node = DataNode + Map-Reduce.]
Data Warehousing at Facebook Today
[Figure: Web Servers → Scribe Servers → Filers → Hive on Hadoop cluster, alongside Oracle RAC and Federated MySQL.]
Hive/Hadoop Usage @ Facebook
Types of applications:
– Reporting: e.g. daily/weekly aggregations of impression/click counts; complex measures of user engagement
– Ad hoc analysis: e.g. how many group admins, broken down by state/country
– Data mining (assembling training data): e.g. user engagement as a function of user attributes
– Spam detection: anomalous patterns for Site Integrity; application API usage patterns
– Ad optimization
– Too many to count…
Hadoop Usage @ Facebook
Data statistics:
– Total data: ~1.7PB; cluster capacity: ~2.4PB
– Net data added/day: ~15TB
– 6TB of uncompressed source logs
– 4TB of uncompressed dimension data reloaded daily
– Compression factor: ~5x (gzip; more with bzip)
Usage statistics:
– 3200 jobs/day with 800K map-reduce tasks/day
– 55TB of compressed data scanned daily
– 15TB of compressed output data written to HDFS
– 80 MM compute minutes/day
HIVE Internals!!
HIVE: Components
[Figure: Hive CLI and Web UI on top (DDL, queries, browsing); Parser, Planner, Optimizer and Execution inside the engine; a SerDe layer (Thrift, Jute, JSON, …); the MetaStore with its Thrift API and DB; all running over Map-Reduce and HDFS.]
Data Model
– Tables: the MetaStore DB records data location, bucketing info and partitioning columns
– Logical partitioning in HDFS, e.g. /hive/clicks/ds=2008-03-25 under /hive/clicks
– Hash partitioning (buckets), e.g. /hive/clicks/ds=2008-03-25/0, …
Hive Query Language
SQL:
– Subqueries in the FROM clause
– Equi-joins
– Multi-table insert
– Multi-group-by
Sampling
Complex object types
Extensibility:
– Pluggable map-reduce scripts
– Pluggable user-defined functions
– Pluggable user-defined types
– Pluggable data formats
Map Reduce Example

Input:
– Machine 1: <k1, v1> <k2, v2> <k3, v3>
– Machine 2: <k4, v4> <k5, v5> <k6, v6>
Local Map:
– Machine 1: <nk1, nv1> <nk2, nv2> <nk3, nv3>
– Machine 2: <nk2, nv4> <nk2, nv5> <nk1, nv6>
Global Shuffle:
– Machine 1: <nk2, nv4> <nk2, nv5> <nk2, nv2>
– Machine 2: <nk1, nv1> <nk3, nv3> <nk1, nv6>
Local Sort:
– Machine 1: <nk2, nv4> <nk2, nv5> <nk2, nv2>
– Machine 2: <nk1, nv1> <nk1, nv6> <nk3, nv3>
Local Reduce:
– Machine 1: <nk2, 3>
– Machine 2: <nk1, 2> <nk3, 1>
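The four stages above can be sketched in Python (the deck notes users know SQL/bash/Python). The `NEW_KEY` table is a made-up stand-in for the map function, chosen so the records are re-keyed exactly as in the diagram:

```python
from collections import defaultdict

# The <k, v> pairs held by each machine in the diagram.
machine1 = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
machine2 = [("k4", "v4"), ("k5", "v5"), ("k6", "v6")]

# Hypothetical map function: re-keys each record the way the diagram does
# (<k1,v1> -> <nk1,nv1>, <k4,v4> -> <nk2,nv4>, and so on).
NEW_KEY = {"k1": "nk1", "k2": "nk2", "k3": "nk3",
           "k4": "nk2", "k5": "nk2", "k6": "nk1"}

def local_map(pairs):
    return [(NEW_KEY[k], v) for k, v in pairs]

def global_shuffle(partitions, n_reducers=2):
    # Hash-partition so all values for one key land on one reducer.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for part in partitions:
        for nk, nv in part:
            buckets[hash(nk) % n_reducers][nk].append(nv)
    return buckets

def local_sort_and_reduce(bucket):
    # Sort by key, then reduce each group; the reduce here is a count.
    return {nk: len(vals) for nk, vals in sorted(bucket.items())}

result = {}
for bucket in global_shuffle([local_map(machine1), local_map(machine2)]):
    result.update(local_sort_and_reduce(bucket))
print(result)  # counts: nk1=2, nk2=3, nk3=1, as in the diagram
```

The shuffle is the only step that moves data between machines; map, sort and reduce all run locally, which is why the stages are labeled that way in the diagram.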
Hive QL – Join
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce
Input tables:

page_view:
pageid | userid | time
1 | 111 | 9:08:01
2 | 111 | 9:08:13
1 | 222 | 9:08:14

user:
userid | age | gender
111 | 25 | female
222 | 32 | male

Map: emit <join key, <table tag, column>>:
– from page_view: 111 → <1, 1>; 111 → <1, 2>; 222 → <1, 1>
– from user: 111 → <2, 25>; 222 → <2, 32>

Shuffle/Sort: group the tagged values by join key:
– key 111: <1, 1> <1, 2> <2, 25>
– key 222: <1, 1> <2, 32>

Reduce: within each key, cross rows tagged 1 with rows tagged 2:
– key 111 → pv_users rows (pageid, age): (1, 25), (2, 25)
– key 222 → (1, 32)
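The reduce-side join walked through above can be sketched in Python; the tuples and table tags mirror the example tables, and this is an illustration of the technique, not Hive's actual implementation:

```python
from collections import defaultdict

# Rows from the two tables in the example.
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user      = [(111, 25, "female"), (222, 32, "male")]

# Map phase: emit <join key, (table tag, needed column)>.
mapped = [(userid, (1, pageid)) for pageid, userid, _ in page_view]
mapped += [(userid, (2, age)) for userid, age, _ in user]

# Shuffle/sort: group all tagged values by join key.
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

# Reduce phase: within each group, cross table-1 rows with table-2 rows.
pv_users = []
for key, tagged in groups.items():
    pageids = [v for tag, v in tagged if tag == 1]
    ages = [v for tag, v in tagged if tag == 2]
    pv_users.extend((p, a) for p in pageids for a in ages)

print(sorted(pv_users))  # [(1, 25), (1, 32), (2, 25)]
```

Tagging each value with its source table is what lets a single reducer tell the two sides of the join apart after the shuffle has mixed them together.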
Hive QL – Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce
Input (pv_users), split across two mappers:
– Mapper 1: (pageid, age) = (1, 25), (2, 25)
– Mapper 2: (1, 32), (2, 25)
Map: emit <(pageid, age), 1>:
– Mapper 1: <1, 25> → 1; <2, 25> → 1
– Mapper 2: <1, 32> → 1; <2, 25> → 1
Shuffle/Sort: group by (pageid, age):
– Reducer 1: <1, 25> → 1; <1, 32> → 1
– Reducer 2: <2, 25> → 1; <2, 25> → 1
Reduce: sum the counts:
– Reducer 1: (pageid, age, count) = (1, 25, 1), (1, 32, 1)
– Reducer 2: (2, 25, 2)
Group by Optimizations
Map-side partial aggregations:
– Hash-based aggregates
– Serialized key/values in hash tables
Optimizations being worked on:
– Exploit pre-sorted data for distinct counts
– Partial aggregations and combiners
– Be smarter about avoiding multiple stages
– Exploit table/column statistics for deciding strategy
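A minimal sketch of what map-side partial aggregation buys, using Python's `Counter` as the per-mapper hash table; the split of rows across the two mappers is invented for illustration:

```python
from collections import Counter

# pv_users rows from the group-by example, split across two mappers.
mapper_inputs = [
    [(1, 25), (2, 25)],   # mapper 1
    [(1, 32), (2, 25)],   # mapper 2
]

# Map-side partial aggregation: each mapper keeps a hash table of
# (pageid, age) -> partial count and emits one record per distinct key
# it saw, instead of one record per input row.
partials = [Counter(rows) for rows in mapper_inputs]

# Reduce: sum the partial counts for each key.
final = Counter()
for p in partials:
    final.update(p)

print(dict(final))  # {(1, 25): 1, (2, 25): 2, (1, 32): 1}
```

On these four rows nothing is saved, but when a mapper sees the same group key thousands of times, emitting one partial count per key instead of one record per row shrinks the shuffle volume dramatically.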
Inserts into Files, Tables and Local Files
FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY(pv_users.gender)
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY(pv_users.age)
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY(pv_users.age);
Extensibility - Custom Map/Reduce Scripts
FROM (
  FROM pv_users
  MAP(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY(dt)
) map
INSERT INTO TABLE pv_users_reduced
  REDUCE(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);
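The slide does not show what 'map_script' and 'reduce_script' contain. A hypothetical pair in Python, assuming the usual streaming convention of tab-separated rows on stdin/stdout and counting distinct users per day, might look like:

```python
from itertools import groupby

# Hypothetical stand-ins for 'map_script' and 'reduce_script': Hive streams
# each input row to the script as a tab-separated line and reads
# tab-separated rows back.

def map_script(lines):
    # Input columns: userid, date. Output columns: dt, uid (reordered).
    for line in lines:
        uid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{uid}"

def reduce_script(lines):
    # Rows arrive clustered by dt; count distinct uids per dt.
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda r: r[0]):
        yield f"{dt}\t{len({uid for _, uid in group})}"

# Simulate the pipeline on a few pv_users rows.
raw = ["u1\t2008-03-25\n", "u2\t2008-03-25\n", "u1\t2008-03-26\n"]
mapped = sorted(map_script(raw))    # stands in for CLUSTER BY(dt)
print(list(reduce_script(mapped)))  # ['2008-03-25\t2', '2008-03-26\t1']
```

`CLUSTER BY(dt)` is what guarantees the reduce script sees all rows for one `dt` contiguously, which is why a simple `groupby` over the stream is enough.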
Open Source Community
– 21 contributors and growing (6 contributors within Facebook)
– Contributors from academia, other web companies, etc.
– 7 committers (1 external to Facebook; looking to add more)
Future Work
– Statistics and cost-based optimization
– Integration with BI tools (through JDBC/ODBC)
– Performance improvements
– More SQL constructs & UDFs
– Indexing
– Schema evolution
– Advanced operators: cubes/frequent item sets/window functions
Hive Roadmap– http://wiki.apache.org/hadoop/Hive/Roadmap
Information
Available as a sub-project in Hadoop:
– http://wiki.apache.org/hadoop/Hive (wiki)
– http://hadoop.apache.org/hive (home page)
– http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
– ##hive (IRC)
– Works with hadoop-0.17, 0.18, 0.19
Release 0.3 is coming in the next few weeks
Mailing Lists: – hive-{user,dev,commits}@hadoop.apache.org