
Big Data BI – Apache Hive

● "open-source data warehouse solution built on top of Hadoop"
● "files are insufficient data abstractions"
● "SQL is highly popular"
● "need for an open data format"
● "as a familiar data warehousing tool"
● Java, extensible, interoperable
● data warehousing tool → no OLTP, no low-latency by default

bigdata bi, 2012.04.27., Sidló Csaba


Basics

● data model: tables ← partitions ← buckets
● relational
● primitive data types, collections: array, map, user-defined types
● HiveQL: SQL-like query language
  ● + DDL, DML
  ● user-defined functions: transformation, aggregation
  ● custom Map-Reduce scripts (any language, streaming interface)
● interfaces: command line, JDBC, ODBC, web interface


Hive history

● Dec 2004: Google GFS paper
● 2008: started at Facebook; after refactoring: Hadoop subproject
● Sep 2008: Hadoop subproject
● May 2009: release 0.3.0
● Aug 2009: Facebook VLDB demo
● Sep 2010: Hive, Pig: top-level Apache projects
● 2011: release 0.8.1, lots of activity, e.g. NYC Hive Meetup

source: https://cwiki.apache.org/confluence/display/Hive/Presentations


Users

● Facebook, 2010:
  ● summarization, ad-hoc analysis, data mining (assembling training data), spam detection, ad optimization, ...
  ● tens of thousands of tables, > 700 TB of data, 3-way replication, 5 TB compressed data / day (compression: 1:7), 80K compute hours / day
  ● 200 users (per month?, analysts!), 7500 jobs / day
  ● at the end of the data flow: Oracle RAC; at the start: Scribe log server
  ● Hadoop production cluster:
    – 4800 cores, 600 machines, 16 GB per machine – April 2009
    – 8000 cores, 1000 machines, 32 GB per machine – July 2009
    – 4 SATA disks of 1 TB each per machine
    – 2-level network hierarchy, 40 machines per rack
    – total cluster size is 2 PB, projected to be 12 PB in Q3 2009

source:
http://borthakur.com/ftp/hadoopworld.pdf
http://research.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt


Users 2.

● CNET, Digg: data mining, log analysis
● Grooveshark: analytics, data cleaning, ML
● last.fm: ad-hoc queries
● Scribd: ML, data mining, ad-hoc queries
● NetFlix: log analysis
  ● 2010: 0.6 TB log/day, 50+ nodes, cloud
● in general:
  ● serving ad-hoc analytical needs rather than building standard reporting interfaces

source: http://research.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt


Big Data platforms and Hive

● IBM InfoSphere:
  – InfoSphere BigInsights, Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
● EMC Greenplum:
  – Greenplum HD (enhanced HDFS), Hive, Pig, Zookeeper, HBase
● Microsoft:
  – Big Data Solution, Hive, Pig
● Oracle:
  – Cloudera's Distribution including Apache Hadoop, Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr

source: http://radar.oreilly.com/2012/01/big-data-ecosystem.html


architecture: Hive vs. RDBMS

source: Ullman's book


architecture: Hive vs. RDBMS

source: Ullman's book

Map-Reduce


architecture: Hive vs. RDBMS

source: VLDB 2009


source: http://borthakur.com/ftp/hadoopworld.pdf


Storage

● HDFS / HBase / Amazon Elastic MapReduce
● HDFS:
  table → HDFS directory,
  partitions → sub-directories,
  buckets → data spread across files by hash value
● Serialization / Deserialization (SerDe)
  ● raw format: CSV, Thrift, Regex, Hive Binary
  ● default: LazySerDe – one record per line, fields separated by ctrl-A
● file formats:
  ● TextFile
  ● SequenceFile
  ● RCFile: block-based columnar
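The table → directory, partition → sub-directory, bucket → file mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the warehouse root `/user/hive/warehouse`, the `bucket_NNNNN` file name, and the plain modulo used as the hash are hypothetical stand-ins, not Hive's actual naming or hash function.

```python
from typing import Dict


def bucket_file_path(warehouse: str, table: str,
                     partition: Dict[str, str],
                     bucket_key: int, num_buckets: int) -> str:
    """Return an illustrative HDFS path for one row of a partitioned, bucketed table."""
    # table -> one directory
    path = f"{warehouse}/{table}"
    # each partition column -> one sub-directory level, written as key=value
    for key, value in partition.items():
        path += f"/{key}={value}"
    # bucket -> one file, chosen by hashing the bucketing column
    # (assumption: plain modulo stands in for Hive's own hash function)
    bucket = bucket_key % num_buckets
    # (assumption: the file name pattern is hypothetical)
    return f"{path}/bucket_{bucket:05d}"


path = bucket_file_path("/user/hive/warehouse", "page_view",
                        {"dt": "2008-06-08", "country": "US"},
                        bucket_key=1234, num_buckets=32)
print(path)
```

Because the partition values are part of the path, a query that filters on a partition column only needs to read the matching sub-directories.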


Metastore

● JDBC: Derby, or MySQL, PostgreSQL, Oracle …:
  ● table schema, SerDe library
  ● table locations
  ● partitioning keys, types, partition-level metadata
  ● … (statistics, schema evolution?)
● Thrift API:
  – PHP (web), Python, Java interfaces


Hive on HBase

● Facebook: "low-latency warehouse"
● first column: row key, the rest: HBase column(-family)
● no control over type mapping, no timestamp
● vs. VoltDB (HStore successor, "NewSQL"!, "high velocity applications"): ACID on Dynamo
● compatibility: ? there are doubts
● Cloudera cdh3 stack:

– hadoop-0.20.2+923.97

– hive-0.7.1+42.4

– hbase-0.90.3+15.3

– zookeeper-3.3.3+12.12

https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration


HQL → Map-Reduce jobs

DAG execution engine


Hive APIs

● standard tools can be connected: Jasper Reports, Microstrategy ...
● JDBC:
  ● jdbc:hive://host:port/dbname
● Python
● PHP

image source: https://cwiki.apache.org/confluence/download/attachments/27362054/Hive_Jdbc.pdf


Experiences

● ~autumn 2010, Balázs: Hive mostly works, but it is slow (around 0.6?)
● now:
  ● simple install, even standalone
  ● simple CLI; web GUI
  ● performance (joins especially): ?
  ● compatibility: Hive 0.8.1 → tested on Hadoop 0.20.x


HQL DDL

● browsing: show tables; show partitions; describe (extended) page_view;

● definition:
  CREATE TABLE page_view(
      viewTime INT,
      userid BIGINT,
      page_url STRING,
      referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User')
  COMMENT 'This is the page view table'
  PARTITIONED BY(dt STRING, country STRING)
  CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
  ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '1'
      COLLECTION ITEMS TERMINATED BY '2'
      MAP KEYS TERMINATED BY '3'
  STORED AS SEQUENCEFILE;

● alter table ...
● views
● external tables: already existing HDFS files
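To make the three delimiter levels in the ROW FORMAT clause concrete, here is a small Python sketch that parses one delimited text row. It assumes that '1', '2' and '3' denote the byte codes 1, 2 and 3 (ctrl-A, ctrl-B, ctrl-C), in line with the ctrl-A default mentioned on the Storage slide. The three-column row (scalar, array, map) and the function name `parse_row` are hypothetical and are not the page_view schema.

```python
FIELD_SEP = "\x01"    # ctrl-A: separates fields (assumed meaning of '1')
ITEM_SEP = "\x02"     # ctrl-B: separates collection items (assumed meaning of '2')
MAP_KEY_SEP = "\x03"  # ctrl-C: separates a map key from its value (assumed meaning of '3')


def parse_row(line: str):
    """Split one text row into a scalar, an array and a map column."""
    scalar, array_raw, map_raw = line.split(FIELD_SEP)
    # array items are separated by the collection-item delimiter
    array = array_raw.split(ITEM_SEP) if array_raw else []
    # map entries are separated like array items; key and value by the map-key delimiter
    mapping = {}
    if map_raw:
        for entry in map_raw.split(ITEM_SEP):
            key, value = entry.split(MAP_KEY_SEP, 1)
            mapping[key] = value
    return scalar, array, mapping


row = "42\x01a\x02b\x01k1\x03v1\x02k2\x03v2"
parsed = parse_row(row)
print(parsed)
```

Control characters are a convenient choice for delimiters because they rarely occur in log text, so no quoting or escaping is needed in the common case.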


HQL DML

● no row-level update, delete
● deletion: drop table, partition; insert overwrite
● multi-table insert, insert from queries, insert into files, load files to tables

LOAD DATA LOCAL INPATH './examples/files/kv2.txt'
  OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

FROM users
INSERT INTO TABLE pv_gender_sum
  SELECT gender, count(DISTINCT userid) GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT age, count(DISTINCT userid) GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;


HQL select: JOIN

● ANSI equi-join
● data skew → different plans:
  ● normal: 1 reducer, gets all records
  ● map-side join:
    – mapper loads a small table + a portion of big table
    – does the join
● optimization: hash-join, pruning, exploit pre-sorted data: map-side merge join
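The map-side join described above can be sketched in Python as follows. This is a minimal illustration of the idea, not Hive's implementation: the small table is held in memory as a hash table, and each mapper streams its own portion of the big table past it. The table contents and the name `map_side_join` are made up for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def map_side_join(small: Dict[int, str],
                  big_split: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, str, str]]:
    """Join one split of the big table against an in-memory copy of the small table."""
    for key, big_value in big_split:
        # hash lookup in the small table: no shuffle and no reducer involved
        if key in small:
            # inner equi-join: rows without a match are dropped
            yield key, big_value, small[key]


# small table: userid -> name (hypothetical data)
users = {1: "alice", 2: "bob"}
# one split of the big table: (userid, page_url) (hypothetical data)
page_views = [(1, "/home"), (3, "/about"), (2, "/faq"), (1, "/faq")]

joined = list(map_side_join(users, page_views))
print(joined)
```

Since every mapper needs its own full copy of the small table, this plan only pays off when that table fits comfortably in memory.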


HQL select: group by

select pageid, age, count(1), count(distinct userid)
from pv_users
group by pageid, age

● since 0.7: HAVING clause is available
● optimization:
  ● hash-based aggregates
  ● serialized key/values in hash tables
  ● exploit pre-sorted data
  ● table / column statistics
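As a rough illustration of a hash-based aggregate, the following Python sketch evaluates the group-by query of this slide over a handful of in-memory rows. It is a single-process simplification, not Hive's operator: the sample rows and the function name are hypothetical, and a real plan would run partial aggregation in the mappers and merge the results in the reducers.

```python
from collections import defaultdict


def group_by_pageid_age(rows):
    """rows: iterable of (pageid, age, userid).

    Returns {(pageid, age): (count(1), count(distinct userid))}.
    """
    counts = defaultdict(int)   # hash table for count(1)
    users = defaultdict(set)    # hash table of seen userids for count(distinct userid)
    for pageid, age, userid in rows:
        key = (pageid, age)
        counts[key] += 1
        users[key].add(userid)
    return {key: (counts[key], len(users[key])) for key in counts}


# hypothetical pv_users rows: (pageid, age, userid)
pv_users = [(1, 25, 100), (1, 25, 100), (1, 25, 101), (2, 32, 100)]
result = group_by_pageid_age(pv_users)
print(result)
```

The set kept per group shows why count(distinct ...) is more expensive than count(1): it needs memory proportional to the number of distinct values in each group.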


extensibility: custom Map-Reduce scripts

source: http://www.royans.net/arch/hive-facebook/


extensibility: UDF / UDAF (+ also possible: types, data formats)

source: http://www.royans.net/arch/hive-facebook/


HQL "Data Mining functions"

● ~ advanced statistics
● n-grams:
  SELECT explode(ngrams(sentences(lower(val)), 2, 10)) AS x FROM kafka;

  {"ngram":["of","the"],"estfrequency":23.0}
  {"ngram":["on","the"],"estfrequency":20.0}
  {"ngram":["in","the"],"estfrequency":18.0}
  …

● histogram_numeric:
  SELECT explode(histogram_numeric(val, 10)) AS x FROM normal;

  {"x":-3.6505464999999995,"y":20.0}
  {"x":-2.7514727901960785,"y":510.0}
  {"x":-1.7956678951954481,"y":8263.0}
  …
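What the n-gram query computes can be approximated by a short Python sketch: lower-case the text, split it into sentences, and count word n-grams within each sentence. The sketch counts exactly, which is a deliberate simplification; the "estfrequency" field in the output above suggests that Hive reports estimates. The sentence and word splitting rules and the sample text are assumptions made for the example.

```python
import re
from collections import Counter


def top_ngrams(text: str, n: int, k: int):
    """Return the k most frequent word n-grams, counted within sentences."""
    counts = Counter()
    # split on ., ! and ? so that no n-gram spans a sentence boundary
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)


# hypothetical input text
text = "The cat sat on the mat. The dog sat on the rug."
top = top_ngrams(text, 2, 3)
print(top)
```

With n = 2 and k = 10 this mirrors the arguments of the ngrams call in the query.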


Hive Future Work

● indexing: Facebook is already developing it (bitmap only?)
● cost-based optimization, smarter plans
● data compression: columnar storage schemes
● ORDER BY, IN, EXISTS, subqueries in WHERE
● advanced operators:
  ● cubes
  ● frequent item sets
  ● window functions
● better data locality

indexing: http://www.facebook.com/notes/facebook-engineering/working-with-students-to-improve-indexing-in-apache-hive/10150168427733920


Hive performance enhancements ~2009

source: http://borthakur.com/ftp/hadoopworld.pdf