Big Data BI – Apache Hive
● "open-source data warehouse solution built on top of Hadoop"
● "files are insufficient data abstractions"
● "SQL is highly popular"
● "need for an open data format"
● "as a familiar data warehousing tool"
● Java, extensible, interoperable
● data warehousing tool → no OLTP, no low latency by default
bigdata bi 2012.04.27. Sidló Csaba
basics
● data model: tables ← partitions ← buckets
● relational
● primitive data types, collections: array, map, user defined types
● HiveQL: SQL-like query language
● + DDL, DML
● user defined functions: transformation, aggregation
● custom Map-Reduce scripts (any language, streaming interface)
● interfaces: command line, JDBC, ODBC, web interface
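The table ← partition ← bucket hierarchy above can be sketched in a few lines of Python; a conceptual model only, with hypothetical column names taken from the `page_view` example later in the deck, and Python's built-in `hash` standing in for Hive's own bucketing hash:

```python
# Conceptual sketch (not Hive code): how a row is addressed in Hive's data
# model -- table -> partition (by column values) -> bucket (by hash of the
# bucketing column). Column names and the hash function are illustrative.
NUM_BUCKETS = 32  # e.g. CLUSTERED BY(userid) ... INTO 32 BUCKETS

def locate(row):
    """Return the (partition, bucket) a page_view-like row belongs to."""
    partition = (row["dt"], row["country"])     # partition columns
    bucket = hash(row["userid"]) % NUM_BUCKETS  # bucketing column
    return partition, bucket

print(locate({"dt": "2008-08-15", "country": "US", "userid": 42}))
```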
Hive history
● Dec 2004: Google GFS paper
● 2008: started at Facebook; after refactoring: Hadoop subproject
● Sep 2008: Hadoop subproject
● May 2009: release 0.3.0
● Aug 2009: Facebook VLDB demo
● Sep 2010: Hive, Pig: top level Apache projects
● 2011: release 0.8.1, lots of activity, e.g. NYC Hive Meetup
source: https://cwiki.apache.org/confluence/display/Hive/Presentations
users
● Facebook, 2010:
● summarization, ad-hoc analysis, data mining (assembling training data), spam detection, ad optimization, ...
● tens of thousands of tables, > 700 TB of data, 3-way replication, 5 TB compressed data / day (compression: 1:7), 80K compute hours/day
● 200 users (per month?, analysts!), 7500 jobs / day
● end of the data flow: Oracle RAC, at the start: Scribe log server
● Hadoop production cluster:
– 4800 cores, 600 machines, 16 GB per machine – April 2009
– 8000 cores, 1000 machines, 32 GB per machine – July 2009
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– total cluster size is 2 PB, projected to be 12 PB in Q3 2009
source: http://borthakur.com/ftp/hadoopworld.pdf http://research.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt
users 2.
● CNET, Digg: data mining, log analysis
● Grooveshark: analytics, data cleaning, ML
● last.fm: ad-hoc queries
● Scribd: ML, data mining, ad-hoc queries
● NetFlix: log analysis
● 2010: 0.6 TB log/day, 50+ nodes, cloud
● in general:
● instead of building standard reporting front-ends: serving ad-hoc analytical needs
source: http://research.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt
Big Data platforms and Hive
● IBM InfoSphere:
– InfoSphere BigInsights, Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
● EMC Greenplum:
– Greenplum HD (enhanced HDFS), Hive, Pig, Zookeeper, HBase
● Microsoft:
– Big Data Solution, Hive, Pig
● Oracle:
– Cloudera's Distribution including Apache Hadoop, Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
source: http://radar.oreilly.com/2012/01/big-data-ecosystem.html
architecture: Hive vs. RDBMS
source: Ullman book
Map-Reduce
architecture: Hive vs. RDBMS
source: VLDB 2009
source: http://borthakur.com/ftp/hadoopworld.pdf
Storage
● HDFS / HBase / Amazon Elastic MapReduce
● HDFS:
table → HDFS directory,
partitions → sub-directories,
buckets → data scattered into files by hash value
● Serialization / Deserialization (SerDe)
● raw formats: CSV, Thrift, Regex, Hive Binary
● default: LazySerDe – records one per line, fields separated by ctrl-A
● file formats:
● TextFile
● SequenceFile
● RCFile: block-based columnar
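The default delimited row format above can be sketched as follows; an illustrative encoder/decoder using the ctrl-A / ctrl-B / ctrl-C separators the slide describes, not the actual LazySerDe implementation:

```python
# Sketch of the default delimited layout Hive's LazySerDe reads: one record
# per line; fields separated by ctrl-A (\x01); collection (array) items by
# ctrl-B (\x02); map keys from values by ctrl-C (\x03).
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"

def encode_row(fields):
    """Serialize one row; strings, lists (arrays) and dicts (maps) allowed."""
    out = []
    for f in fields:
        if isinstance(f, dict):    # map column: k\x03v pairs joined by \x02
            out.append(ITEM.join(k + KEY + v for k, v in f.items()))
        elif isinstance(f, list):  # array column: items joined by \x02
            out.append(ITEM.join(f))
        else:
            out.append(str(f))
    return FIELD.join(out)

def decode_row(line):
    """Split a serialized line back into top-level field strings."""
    return line.split(FIELD)

print(decode_row(encode_row(["42", ["a", "b"], {"k": "v"}])))
```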
Metastore
● JDBC: Derby, or MySQL, PostgreSQL, Oracle …:
● table schema, SerDe library
● table locations
● partitioning keys, types, partition level metadata
● … (statistics, schema evolution?)
● Thrift API:
– PHP (web), Python, Java interfaces
Hive on HBase
● Facebook: "low-latency warehouse"
● first column: row key, the rest: HBase column(-family)
● no control over type mapping, no timestamp
● vs. VoltDB (H-Store successor, "NewSQL"!, "high velocity applications"): ACID on Dynamo
● compatibility: ? there are doubts
● Cloudera cdh3 stack:
– hadoop-0.20.2+923.97
– hive-0.7.1+42.4
– hbase-0.90.3+15.3
– zookeeper-3.3.3+12.12
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
HQL → Map-Reduce job DAG → execution engine
Hive APIs
● standard tools can be hooked up: Jasper Reports, Microstrategy ...
● JDBC: ● jdbc:hive://host:port/dbname
● Python● PHP
image source: https://cwiki.apache.org/confluence/download/attachments/27362054/Hive_Jdbc.pdf
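The JDBC URL format above is just a string template; a trivial sketch that assembles it (the host, port, and database name are placeholders, and 10000 as a HiveServer port is an assumption, not taken from the slide):

```python
# Minimal sketch: assembling the jdbc:hive://host:port/dbname URL shown above.
def hive_jdbc_url(host, port, dbname):
    return f"jdbc:hive://{host}:{port}/{dbname}"

# 10000 is a commonly used HiveServer port (assumption, not from the slide).
print(hive_jdbc_url("localhost", 10000, "default"))
# jdbc:hive://localhost:10000/default
```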
experiences
● ~fall 2010, Balázs: Hive mostly works, but slow (around 0.6?)
● now:
● simple install, even standalone
● simple CLI; web GUI
● performance (joins especially): ?
● compatibility: Hive 0.8.1 tested against Hadoop 0.20.x
HQL DDL
● browsing: show tables; show partitions; describe (extended) page_view;
● definition:
CREATE TABLE page_view(
    viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '1'
    COLLECTION ITEMS TERMINATED BY '2'
    MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
● alter table ...
● views
● external tables: HDFS files that already exist
HQL DML
● no row-level update or delete
● deletion: drop table / partition; insert overwrite
● multi-table insert, insert from queries, insert into files, load files to tables
LOAD DATA LOCAL INPATH './examples/files/kv2.txt'
OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
FROM users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT gender, count(DISTINCT userid) GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
    SELECT age, count(DISTINCT userid) GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
    SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;
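The point of the multi-insert above is that the source table is scanned once while several grouped outputs are produced. A Python analogue of that single-pass pattern (the rows and column names are illustrative, not from the slide):

```python
# Sketch of what a FROM ... INSERT ... INSERT ... multi-insert does:
# one pass over the input, several grouped aggregates as outputs.
from collections import defaultdict

def multi_insert(users):
    by_gender = defaultdict(set)  # gender -> distinct userids
    by_age = defaultdict(set)     # age -> distinct userids
    for u in users:               # single scan over the source
        by_gender[u["gender"]].add(u["userid"])
        by_age[u["age"]].add(u["userid"])
    return ({g: len(s) for g, s in by_gender.items()},
            {a: len(s) for a, s in by_age.items()})

users = [{"userid": 1, "gender": "f", "age": 30},
         {"userid": 2, "gender": "m", "age": 30}]
print(multi_insert(users))
```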
HQL select: JOIN
● ANSI equi-join
● data skew → different plans;
● normal: 1 reducer, gets all records● map-side join:
– mapper loads a small table + a portion of the big table
– does the join
● optimization: hash-join, pruning, exploit pre-sorted data: map-side merge join
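The map-side join described above can be sketched in Python: build a hash table from the small table, then stream the big table through it, which is the hash-join idea the optimization bullet names. Table contents and the join key are illustrative assumptions:

```python
# Sketch of a map-side (hash) join: each mapper loads the small table into
# an in-memory hash table, then streams its split of the big table and
# probes the hash -- no reduce phase needed.
def map_side_join(small, big, key):
    lookup = {}                       # hash table over the small table
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    for row in big:                   # stream the big table
        for match in lookup.get(row[key], []):
            # emit the joined row (big-table columns + small-table columns)
            yield {**row, **{k: v for k, v in match.items() if k != key}}

countries = [{"ip": "1.2.3.4", "country": "US"}]
views = [{"ip": "1.2.3.4", "page": "/home"}, {"ip": "9.9.9.9", "page": "/x"}]
print(list(map_side_join(countries, views, "ip")))
```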
HQL select: group by
select pageid, age, count(1), count(distinct userid)
from pv_users
group by pageid, age
● since 0.7: HAVING clause available
● optimization:
● hash-based aggregates
● serialized key/values in hash tables
● exploit pre-sorted data
● table / column statistics
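The hash-based aggregation bullet above is the core of how the group-by query runs: one hash table keyed by the GROUP BY columns. A minimal sketch, with example rows that are assumptions:

```python
# Sketch of hash-based aggregation for the query above: a hash table keyed
# by (pageid, age), accumulating a row count and a distinct-userid set.
from collections import defaultdict

def group_by(pv_users):
    groups = defaultdict(lambda: [0, set()])  # key -> [count, distinct ids]
    for row in pv_users:
        g = groups[(row["pageid"], row["age"])]
        g[0] += 1                   # count(1)
        g[1].add(row["userid"])     # count(distinct userid)
    return {k: (c, len(u)) for k, (c, u) in groups.items()}

pv_users = [{"pageid": 1, "age": 25, "userid": 7},
            {"pageid": 1, "age": 25, "userid": 7},
            {"pageid": 2, "age": 25, "userid": 8}]
print(group_by(pv_users))
```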
extensibility: custom Map-Reduce scripts
source: http://www.royans.net/arch/hive-facebook/
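A custom script of the kind mentioned here is plugged in with HiveQL's `SELECT TRANSFORM(...) USING 'script'` clause: Hive pipes rows to the script's stdin as tab-separated lines and reads tab-separated lines back from stdout. A minimal sketch of such a script (the two-column transformation itself is an illustrative assumption):

```python
# Sketch of a streaming script usable via Hive's TRANSFORM ... USING clause.
# Hive writes input rows to stdin as tab-separated lines and parses the
# tab-separated lines this script prints to stdout.
import sys

def transform(line):
    """Example per-row transform: lowercase column 1, pass column 2 through."""
    cols = line.rstrip("\n").split("\t")
    return cols[0].lower() + "\t" + cols[1]

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```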
extensibility: UDF / UDAF (also possible: custom types, data formats)
source: http://www.royans.net/arch/hive-facebook/
HQL "Data Mining functions"
● ~ advanced statistics
● n-grams:
SELECT explode(ngrams(sentences(lower(val)), 2, 10)) AS x FROM kafka;
{"ngram":["of","the"],"estfrequency":23.0} {"ngram":["on","the"],"estfrequency":20.0} {"ngram":["in","the"],"estfrequency":18.0} …
● histogram_numeric:
SELECT explode(histogram_numeric(val, 10)) AS x FROM normal;
{"x":-3.6505464999999995,"y":20.0} {"x":-2.7514727901960785,"y":510.0} {"x":-1.7956678951954481,"y":8263.0} …
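What `ngrams(sentences(lower(val)), 2, 10)` returns above can be mimicked exactly for small inputs; a Python sketch that counts bigrams precisely, whereas Hive's UDAF only estimates the frequencies (hence "estfrequency"). The sample text is an assumption:

```python
# Sketch of the ngrams() example above: top-k bigram frequencies over the
# lowercased, whitespace-tokenized text. Here the counts are exact; Hive's
# ngrams UDAF estimates them on large data.
from collections import Counter

def top_bigrams(text, k):
    words = text.lower().split()
    grams = Counter(zip(words, words[1:]))  # adjacent word pairs
    return grams.most_common(k)             # [(bigram, frequency), ...]

print(top_bigrams("the cat and the dog and the bird", 2))
```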
Hive Future Work
● indexing: Facebook is already developing it (bitmap only?)
● cost-based optimization, smarter plans
● data compression: columnar storage schemes
● ORDER BY, IN, EXISTS, subqueries in WHERE
● advanced operators:
● cubes
● frequent item sets
● window functions
● better data locality
indexing: http://www.facebook.com/notes/facebook-engineering/working-with-students-to-improve-indexing-in-apache-hive/10150168427733920
Hive performance enhancements ~2009
source: http://borthakur.com/ftp/hadoopworld.pdf