Big Data BI – Apache Hive
● "open-source data warehouse solution built on top of Hadoop"
● "files are insufficient data abstractions"
● "SQL is highly popular"
● "need for an open data format"
● "as a familiar data warehousing tool"
● Java, extensible, interoperable
● data warehousing tool → no OLTP, no low latency by default
bigdata bi 2012.04.27. Sidló Csaba
basics
● data model: tables ← partitions ← buckets
● relational
● primitive data types, collections: array, map, user defined types
● HiveQL: SQL-like query language
● + DDL, DML
● user defined functions: transformation, aggregation
● custom Map-Reduce scripts (any language, streaming interface)
● interfaces: command line, JDBC, ODBC, web interface
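The table ← partition ← bucket hierarchy above can be sketched in a few lines of Python; a conceptual model only, with hypothetical column names taken from the `page_view` example later in the deck, and Python's built-in `hash` standing in for Hive's own bucketing hash:

```python
# Conceptual sketch (not Hive code): how a row is addressed in Hive's data
# model -- table -> partition (by column values) -> bucket (by hash of the
# bucketing column). Column names and the hash function are illustrative.
NUM_BUCKETS = 32  # e.g. CLUSTERED BY(userid) ... INTO 32 BUCKETS

def locate(row):
    """Return the (partition, bucket) a page_view-like row belongs to."""
    partition = (row["dt"], row["country"])     # partition columns
    bucket = hash(row["userid"]) % NUM_BUCKETS  # bucketing column
    return partition, bucket

print(locate({"dt": "2008-08-15", "country": "US", "userid": 42}))
```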
Hive history
● Dec 2004: Google GFS paper
● 2008: started at Facebook; after refactoring: Hadoop subproject
● Sep 2008: Hadoop subproject
● May 2009: release 0.3.0
● Aug 2009: Facebook VLDB demo
● Sep 2010: Hive, Pig: top level Apache projects
● 2011: release 0.8.1, lots of activity, e.g. NYC Hive Meetup
source: https://cwiki.apache.org/confluence/display/Hive/Presentations
users
● Facebook, 2010:
● summarization, ad-hoc analysis, data mining (assembling training data), spam detection, ad optimization, ...
● tens of thousands of tables, > 700 TB of data, 3-way replication, 5 TB compressed data / day (compression: 1:7), 80K compute hours/day
● 200 users (per month?, analysts!), 7500 jobs / day
● end of the data flow: Oracle RAC, at the start: Scribe log server
● Hadoop production cluster:
– 4800 cores, 600 machines, 16 GB per machine – April 2009
– 8000 cores, 1000 machines, 32 GB per machine – July 2009
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– total cluster size is 2 PB, projected to be 12 PB in Q3 2009
source: http://borthakur.com/ftp/hadoopworld.pdf http://research.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt
users 2.
● CNET, Digg: data mining, log analysis
● Grooveshark: analytics, data cleaning, ML
● last.fm: ad-hoc queries
● Scribd: ML, data mining, ad-hoc queries
● NetFlix: log analysis
● 2010: 0.6 TB log/day, 50+ nodes, cloud
● in general:
● instead of building standard reporting front-ends: serving ad-hoc analytical needs
source: http://research.cs.wisc.edu/condor/CondorWeek2009/condor_presentations/borthakur-hadoop_univ_research.ppt
Big Data platforms and Hive
● IBM InfoSphere:
– InfoSphere BigInsights, Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
● EMC Greenplum:
– Greenplum HD (enhanced HDFS), Hive, Pig, Zookeeper, HBase
● Microsoft:
– Big Data Solution, Hive, Pig
● Oracle:
– Cloudera's Distribution including Apache Hadoop, Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
source: http://radar.oreilly.com/2012/01/big-data-ecosystem.html
architecture: Hive vs. RDBMS
source: Ullman book
Map-Reduce
architecture: Hive vs. RDBMS
source: VLDB 2009
source: http://borthakur.com/ftp/hadoopworld.pdf
Storage
● HDFS / HBase / Amazon Elastic MapReduce
● HDFS:
table → HDFS directory,
partitions → sub-directories,
buckets → data scattered into files by hash value
● Serialization / Deserialization (SerDe)
● raw formats: CSV, Thrift, Regex, Hive Binary
● default: LazySerDe – records one per line, fields separated by ctrl-A
● file formats:
● TextFile
● SequenceFile
● RCFile: block-based columnar
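The default delimited row format above can be sketched as follows; an illustrative encoder/decoder using the ctrl-A / ctrl-B / ctrl-C separators the slide describes, not the actual LazySerDe implementation:

```python
# Sketch of the default delimited layout Hive's LazySerDe reads: one record
# per line; fields separated by ctrl-A (\x01); collection (array) items by
# ctrl-B (\x02); map keys from values by ctrl-C (\x03).
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"

def encode_row(fields):
    """Serialize one row; strings, lists (arrays) and dicts (maps) allowed."""
    out = []
    for f in fields:
        if isinstance(f, dict):    # map column: k\x03v pairs joined by \x02
            out.append(ITEM.join(k + KEY + v for k, v in f.items()))
        elif isinstance(f, list):  # array column: items joined by \x02
            out.append(ITEM.join(f))
        else:
            out.append(str(f))
    return FIELD.join(out)

def decode_row(line):
    """Split a serialized line back into top-level field strings."""
    return line.split(FIELD)

print(decode_row(encode_row(["42", ["a", "b"], {"k": "v"}])))
```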
Metastore
● JDBC: Derby, or MySQL, PostgreSQL, Oracle …:
● table schema, SerDe library
● table locations
● partitioning keys, types, partition level metadata
● … (statistics, schema evolution?)
● Thrift API:
– PHP (web), Python, Java interfaces
Hive on HBase
● Facebook: "low-latency warehouse"
● first column: row key, the rest: HBase column(-family)
● no control over type mapping, no timestamp
● vs. VoltDB (H-Store successor, "NewSQL"!, "high velocity applications"): ACID on Dynamo
● compatibility: ? there are doubts
● Cloudera cdh3 stack:
– hadoop-0.20.2+923.97
– hive-0.7.1+42.4
– hbase-0.90.3+15.3
– zookeeper-3.3.3+12.12
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
HQL → Map-Reduce job DAG → execution engine
Hive APIs
● standard tools can be hooked up: Jasper Reports, Microstrategy ...
● JDBC: ● jdbc:hive://host:port/dbname
● Python● PHP
image source: https://cwiki.apache.org/confluence/download/attachments/27362054/Hive_Jdbc.pdf
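The JDBC URL format above is just a string template; a trivial sketch that assembles it (the host, port, and database name are placeholders, and 10000 as a HiveServer port is an assumption, not taken from the slide):

```python
# Minimal sketch: assembling the jdbc:hive://host:port/dbname URL shown above.
def hive_jdbc_url(host, port, dbname):
    return f"jdbc:hive://{host}:{port}/{dbname}"

# 10000 is a commonly used HiveServer port (assumption, not from the slide).
print(hive_jdbc_url("localhost", 10000, "default"))
# jdbc:hive://localhost:10000/default
```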
experiences
● ~fall 2010, Balázs: Hive mostly works, but slow (around 0.6?)
● now:
● simple install, even standalone
● simple CLI; web GUI
● performance (joins especially): ?
● compatibility: Hive 0.8.1 tested against Hadoop 0.20.x
HQL DDL
● browsing: show tables; show partitions; describe (extended) page_view;
● definition:
CREATE TABLE page_view(
    viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '1'
    COLLECTION ITEMS TERMINATED BY '2'
    MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
● alter table ...
● views
● external tables: HDFS files that already exist
HQL DML
● no row-level update or delete
● deletion: drop table / partition; insert overwrite
● multi-table insert, insert from queries, insert into files, load files to tables
LOAD DATA LOCAL INPATH './examples/files/kv2.txt'
OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
FROM users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT gender, count(DISTINCT userid) GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
    SELECT age, count(DISTINCT userid) GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
    SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;
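The point of the multi-insert above is that the source table is scanned once while several grouped outputs are produced. A Python analogue of that single-pass pattern (the rows and column names are illustrative, not from the slide):

```python
# Sketch of what a FROM ... INSERT ... INSERT ... multi-insert does:
# one pass over the input, several grouped aggregates as outputs.
from collections import defaultdict

def multi_insert(users):
    by_gender = defaultdict(set)  # gender -> distinct userids
    by_age = defaultdict(set)     # age -> distinct userids
    for u in users:               # single scan over the source
        by_gender[u["gender"]].add(u["userid"])
        by_age[u["age"]].add(u["userid"])
    return ({g: len(s) for g, s in by_gender.items()},
            {a: len(s) for a, s in by_age.items()})

users = [{"userid": 1, "gender": "f", "age": 30},
         {"userid": 2, "gender": "m", "age": 30}]
print(multi_insert(users))
```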
HQL select: JOIN
● ANSI equi-join
● data skew → different plans;
● normal: 1 reducer, gets all records● map-side join:
– mapper loads a small table + a portion of the big table
– does the join
● optimization: hash-join, pruning, exploit pre-sorted data: map-side merge join
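The map-side join described above can be sketched in Python: build a hash table from the small table, then stream the big table through it, which is the hash-join idea the optimization bullet names. Table contents and the join key are illustrative assumptions:

```python
# Sketch of a map-side (hash) join: each mapper loads the small table into
# an in-memory hash table, then streams its split of the big table and
# probes the hash -- no reduce phase needed.
def map_side_join(small, big, key):
    lookup = {}                       # hash table over the small table
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    for row in big:                   # stream the big table
        for match in lookup.get(row[key], []):
            # emit the joined row (big-table columns + small-table columns)
            yield {**row, **{k: v for k, v in match.items() if k != key}}

countries = [{"ip": "1.2.3.4", "country": "US"}]
views = [{"ip": "1.2.3.4", "page": "/home"}, {"ip": "9.9.9.9", "page": "/x"}]
print(list(map_side_join(countries, views, "ip")))
```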
HQL select: group by
select pageid, age, count(1), count(distinct userid)
from pv_users
group by pageid, age
● since 0.7: HAVING clause available
● optimization:
● hash-based aggregates
● serialized key/values in hash tables
● exploit pre-sorted data
● table / column statistics
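The hash-based aggregation bullet above is the core of how the group-by query runs: one hash table keyed by the GROUP BY columns. A minimal sketch, with example rows that are assumptions:

```python
# Sketch of hash-based aggregation for the query above: a hash table keyed
# by (pageid, age), accumulating a row count and a distinct-userid set.
from collections import defaultdict

def group_by(pv_users):
    groups = defaultdict(lambda: [0, set()])  # key -> [count, distinct ids]
    for row in pv_users:
        g = groups[(row["pageid"], row["age"])]
        g[0] += 1                   # count(1)
        g[1].add(row["userid"])     # count(distinct userid)
    return {k: (c, len(u)) for k, (c, u) in groups.items()}

pv_users = [{"pageid": 1, "age": 25, "userid": 7},
            {"pageid": 1, "age": 25, "userid": 7},
            {"pageid": 2, "age": 25, "userid": 8}]
print(group_by(pv_users))
```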
extensibility: custom Map-Reduce scripts
source: http://www.royans.net/arch/hive-facebook/
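A custom script of the kind mentioned here is plugged in with HiveQL's `SELECT TRANSFORM(...) USING 'script'` clause: Hive pipes rows to the script's stdin as tab-separated lines and reads tab-separated lines back from stdout. A minimal sketch of such a script (the two-column transformation itself is an illustrative assumption):

```python
# Sketch of a streaming script usable via Hive's TRANSFORM ... USING clause.
# Hive writes input rows to stdin as tab-separated lines and parses the
# tab-separated lines this script prints to stdout.
import sys

def transform(line):
    """Example per-row transform: lowercase column 1, pass column 2 through."""
    cols = line.rstrip("\n").split("\t")
    return cols[0].lower() + "\t" + cols[1]

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```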
extensibility: UDF / UDAF (also possible: custom types, data formats)
source: http://www.royans.net/arch/hive-facebook/
HQL "Data Mining functions"
● ~ advanced statistics
● n-grams:
SELECT explode(ngrams(sentences(lower(val)), 2, 10)) AS x FROM kafka;
{"ngram":["of","the"],"estfrequency":23.0} {"ngram":["on","the"],"estfrequency":20.0} {"ngram":["in","the"],"estfrequency":18.0} …
● histogram_numeric:
SELECT explode(histogram_numeric(val, 10)) AS x FROM normal;
{"x":-3.6505464999999995,"y":20.0} {"x":-2.7514727901960785,"y":510.0} {"x":-1.7956678951954481,"y":8263.0} …
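What `ngrams(sentences(lower(val)), 2, 10)` returns above can be mimicked exactly for small inputs; a Python sketch that counts bigrams precisely, whereas Hive's UDAF only estimates the frequencies (hence "estfrequency"). The sample text is an assumption:

```python
# Sketch of the ngrams() example above: top-k bigram frequencies over the
# lowercased, whitespace-tokenized text. Here the counts are exact; Hive's
# ngrams UDAF estimates them on large data.
from collections import Counter

def top_bigrams(text, k):
    words = text.lower().split()
    grams = Counter(zip(words, words[1:]))  # adjacent word pairs
    return grams.most_common(k)             # [(bigram, frequency), ...]

print(top_bigrams("the cat and the dog and the bird", 2))
```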
Hive Future Work
● indexing: Facebook is already developing it (bitmap only?)
● cost-based optimization, smarter plans
● data compression: columnar storage schemes
● ORDER BY, IN, EXISTS, subqueries in WHERE
● advanced operators:
● cubes
● frequent item sets
● window functions
● better data locality
indexing: http://www.facebook.com/notes/facebook-engineering/working-with-students-to-improve-indexing-in-apache-hive/10150168427733920
Hive performance enhancements ~2009
source: http://borthakur.com/ftp/hadoopworld.pdf