Apache Hive
CMSC 491 Hadoop-Based Distributed Computing
Spring 2016
Adam Shook

Apache Hive - Inspiring Innovationshadam1/491s16/lectures/07-Hive.pdfWhat Hive Is Not • Hive, like Hadoop, is designed for batch processing of large datasets • Not an OLTP or real-



What Is Hive?

• Developed by Facebook and a top-level Apache project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to non-Java programmers via SQL-like queries
• Built on HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs that run on the cluster
• Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data


What Hive Is Not

• Hive, like Hadoop, is designed for batch processing of large datasets
• Not an OLTP or real-time system
• Latency and throughput are both high compared to a traditional RDBMS
  – Even when dealing with relatively small data (<100 MB)


Data Hierarchy

• Hive is organised hierarchically into:
  – Databases: namespaces that separate tables and other objects
  – Tables: homogeneous units of data with the same schema
    • Analogous to tables in an RDBMS
  – Partitions: determine how the data is stored
    • Allow efficient access to subsets of the data
  – Buckets/clusters
    • For subsampling within a partition
    • Join optimization


HiveQL

• HiveQL/HQL provides the basic SQL-like operations:
  – Select columns using SELECT
  – Filter rows using WHERE
  – JOIN between tables
  – Evaluate aggregates using GROUP BY
  – Store query results into another table
  – Download results to a local directory (i.e., export from HDFS)
  – Manage tables and queries with CREATE, DROP, and ALTER
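The last three operations in the list can be sketched as follows. This is only an illustration: it assumes a page_view table that has a country column (as in the partitioned examples later in these slides), and the table name page_view_us and the local path /tmp/page_view_us are hypothetical.

```sql
-- Store query results in another table (page_view_us is a hypothetical name).
CREATE TABLE page_view_us AS
SELECT viewTime, userid, page_url
FROM page_view
WHERE country = 'US';

-- Export query results from HDFS to a directory on the local file system.
-- The path is an assumption; existing contents of the directory are overwritten.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/page_view_us'
SELECT viewTime, userid, page_url
FROM page_view_us;

-- Manage tables with ALTER and DROP.
ALTER TABLE page_view_us RENAME TO page_view_us_old;
DROP TABLE page_view_us_old;
```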


Primitive Data Types

Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1, 2, 4 and 8-byte integers
BOOLEAN                           TRUE/FALSE
FLOAT, DOUBLE                     Single and double precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; ignore these bytes


Complex Data Types

Type      Comments
STRUCT    A collection of elements. If S is of type STRUCT {a INT, b INT}: S.a returns element a
MAP       Key-value tuple. If M is a map from 'group' to GID: M['group'] returns value of GID
ARRAY     Indexed list. If A is an array of elements ['a','b','c']: A[0] returns 'a'


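HiveQL Limitations

• HQL only supports equi-joins, outer joins, left semi-joins
• Because it is only a shell for MapReduce, complex queries can be hard to optimise
• Historically missing large parts of the full SQL specification (later releases have filled some of these gaps):
  – HAVING clause in SELECT (supported since Hive 0.7)
  – Correlated sub-queries
  – Sub-queries outside FROM clauses (IN/EXISTS sub-queries in WHERE are supported since Hive 0.13)
  – Updatable or materialized views
  – Stored procedures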


Hive Metastore

• Stores Hive metadata
• Default metastore database uses Apache Derby
• Various configurations:
  – Embedded (in-process metastore, in-process database)
    • Mainly for unit tests
  – Local (in-process metastore, out-of-process database)
    • Each Hive client connects to the metastore directly
  – Remote (out-of-process metastore, out-of-process database)
    • Each Hive client connects to a metastore server, which connects to the metadata database itself


Hive Warehouse

• Hive tables are stored in the Hive "warehouse"
  – Default HDFS location: /user/hive/warehouse
• Tables are stored as sub-directories in the warehouse directory
• Partitions are subdirectories of tables
• External tables are supported in Hive
• The actual data is stored in flat files


Hive Schemas

• Hive is schema-on-read
  – Schema is only enforced when the data is read (at query time)
  – Allows greater flexibility: same data can be read using multiple schemas
• Contrast with an RDBMS, which is schema-on-write
  – Schema is enforced when the data is loaded
  – Speeds up queries at the expense of load times


Create Table Syntax

CREATE TABLE table_name
  (col1 data_type,
   col2 data_type,
   col3 data_type,
   col4 data_type)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS format_type;


Simple Table

CREATE TABLE page_view
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;


More Complex Table

CREATE TABLE employees
  (name STRING,
   salary FLOAT,
   subordinates ARRAY<STRING>,
   deductions MAP<STRING, FLOAT>,
   address STRUCT<street:STRING,
                  city:STRING,
                  state:STRING,
                  zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
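When complex columns are stored in a delimited text file, the elements inside an ARRAY, MAP, or STRUCT need their own delimiters as well. The sketch below is a variant of the employees table with those delimiters spelled out, followed by a query that reads one element of each complex type. The table name employees_csv, the delimiter choices, and the map key 'Federal' are illustrative assumptions, not part of the original slides.

```sql
-- Variant of the employees table with explicit delimiters for complex types.
CREATE TABLE employees_csv (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'            -- between columns
  COLLECTION ITEMS TERMINATED BY ','   -- between array items, map entries, struct fields
  MAP KEYS TERMINATED BY ':'           -- between a map key and its value
STORED AS TEXTFILE;

-- Access elements of each complex type.
SELECT name,
       subordinates[0]       AS first_subordinate,  -- ARRAY: zero-based index
       deductions['Federal'] AS federal_deduction,  -- MAP: lookup by (hypothetical) key
       address.city          AS city                -- STRUCT: dot notation
FROM employees_csv;
```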


External Table

CREATE EXTERNAL TABLE page_view_stg
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';


More About Tables

• CREATE TABLE
  – LOAD: file moved into Hive's data warehouse directory
  – DROP: both metadata and data deleted
• CREATE EXTERNAL TABLE
  – LOAD: no files moved
  – DROP: only metadata deleted
  – Use this when sharing with other Hadoop applications, or when you want to use multiple schemas on the same data
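Reading the same data with multiple schemas might look like the sketch below, which defines two external tables over one HDFS directory. It assumes the tab-separated files under /user/staging/page_view from the external table example; the table names page_view_raw and page_view_urls are hypothetical.

```sql
-- Raw view: one STRING column. Tab-separated lines contain no default field
-- delimiter (Ctrl-A), so each whole line should land in `line`.
CREATE EXTERNAL TABLE page_view_raw (line STRING)
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';

-- Typed view over the same files, exposing only the first three fields.
-- The schema is applied at query time; no data is moved or rewritten.
CREATE EXTERNAL TABLE page_view_urls
  (viewTime INT,
   userid BIGINT,
   page_url STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';

-- Dropping an external table removes only its metadata;
-- the files stay in place and page_view_urls keeps working.
DROP TABLE page_view_raw;
```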


Partitioning

• Can make some queries faster
• Divide data based on partition column
• Use PARTITIONED BY clause when creating table
• Use PARTITION clause when loading data
• SHOW PARTITIONS will show a table's partitions
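A minimal sketch of these steps follows. In Hive's DDL the clause is spelled PARTITIONED BY, and the partition columns are declared there rather than in the main column list. The table name page_view_part and the input path are hypothetical.

```sql
-- Partition columns (dt, country) are declared separately from data columns.
CREATE TABLE page_view_part
  (viewTime INT,
   userid BIGINT,
   page_url STRING,
   referrer_url STRING,
   ip STRING)
PARTITIONED BY (dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load a file into one partition. Each partition becomes a subdirectory
-- of the table, e.g. .../page_view_part/dt=2008-06-08/country=US
LOAD DATA INPATH '/user/staging/page_view/us.txt'   -- hypothetical path
INTO TABLE page_view_part
PARTITION (dt='2008-06-08', country='US');

SHOW PARTITIONS page_view_part;

-- Filtering on partition columns lets Hive read only the matching directories.
SELECT COUNT(*)
FROM page_view_part
WHERE dt = '2008-06-08' AND country = 'US';
```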


Bucketing

• Can speed up queries that involve sampling the data
  – Sampling works without bucketing, but Hive has to scan the entire dataset
• Use CLUSTERED BY when creating table
  – For sorted buckets, add SORTED BY
• To query a sample of your data, use TABLESAMPLE
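The bucketing clauses and TABLESAMPLE fit together roughly as sketched below. The table name page_view_bucketed and the bucket count of 32 are arbitrary choices for illustration, and the source table page_view is assumed to exist as defined earlier.

```sql
-- Rows are assigned to one of 32 buckets by hashing userid;
-- within each bucket they are kept sorted by viewTime.
CREATE TABLE page_view_bucketed
  (viewTime INT,
   userid BIGINT,
   page_url STRING)
CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS
STORED AS TEXTFILE;

-- Releases before Hive 2.0 need this setting so that inserts honour the
-- bucket definition; later releases always enforce it.
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE page_view_bucketed
SELECT viewTime, userid, page_url
FROM page_view;

-- Sample one bucket out of 32. Because the table is clustered on userid,
-- Hive can read a single bucket instead of scanning everything.
SELECT *
FROM page_view_bucketed TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);

-- Sampling an unbucketed table also works, but requires a full scan.
SELECT *
FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON rand());
```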


Browsing Tables And Partitions

Command                                            Comments
SHOW TABLES;                                       Show all the tables in the database
SHOW TABLES 'page.*';                              Show tables matching the specification (uses regex syntax)
SHOW PARTITIONS page_view;                         Show the partitions of the page_view table
DESCRIBE page_view;                                List columns of the table
DESCRIBE EXTENDED page_view;                       More information on columns (useful only for debugging)
DESCRIBE page_view PARTITION (ds='2008-10-31');    List information about a partition


Loading Data

• Use LOAD DATA to load data from a file or directory
  – Will read from HDFS unless LOCAL keyword is specified
  – Will append data unless OVERWRITE specified
  – PARTITION required if destination table is partitioned

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (dt='2008-06-08', country='US');


Inserting Data

• Use INSERT to load data from a Hive query
  – Will append data unless OVERWRITE specified
  – PARTITION required if destination table is partitioned

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
  PARTITION (dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
  WHERE pvs.country = 'US';


Inserting Data

• Normally only one partition can be inserted into with a single INSERT
• A multi-insert lets you insert into multiple partitions

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
  PARTITION (dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
  WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view
  PARTITION (dt='2008-06-08', country='CA')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
  WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view
  PARTITION (dt='2008-06-08', country='UK')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
  WHERE pvs.country = 'UK';


Inserting Data During Table Creation

• Use AS SELECT in the CREATE TABLE statement to populate a table as it is created

CREATE TABLE page_view AS
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url
FROM page_view_stg pvs
WHERE pvs.country = 'US';


Loading And Inserting Data: Summary

Use this                    For this purpose
LOAD                        Load data from a file or directory
INSERT                      Load data from a query
                            • One partition at a time
                            • Use multiple INSERTs to insert into multiple partitions in the one query
CREATE TABLE AS (CTAS)      Insert data while creating a table
Add/modify external file    Load new data into external table


Sample Select Clauses

• Select from a single table

SELECT *
FROM sales
WHERE amount > 10 AND region = "US";

• Select from a partitioned table

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND
      page_views.date <= '2008-03-31';


Relational Operators

• ALL and DISTINCT
  – Specify whether duplicate rows should be returned
  – ALL is the default (all matching rows are returned)
  – DISTINCT removes duplicate rows from the result set
• WHERE
  – Filters by expression
  – Before Hive 0.13, does not support IN/EXISTS sub-queries in the WHERE clause
• LIMIT
  – Indicates the number of rows to be returned


Relational Operators

• GROUP BY
  – Group data by column values
  – Apart from aggregate expressions, the SELECT list can only include columns that appear in the GROUP BY clause
• ORDER BY / SORT BY
  – ORDER BY performs total ordering
    • Slow, poor performance (all rows pass through a single reducer)
  – SORT BY performs partial ordering
    • Sorts output from each reducer
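The difference between these clauses can be sketched against the sales table (with amount and region columns) used in the earlier SELECT examples. DISTRIBUTE BY is not covered in these slides; it is included here only as an assumption about how one would control which reducer each row reaches.

```sql
-- GROUP BY: every non-aggregated column in the SELECT list is a grouping column.
SELECT region, COUNT(*) AS num_sales, SUM(amount) AS total
FROM sales
GROUP BY region;

-- ORDER BY: one globally sorted result, produced by a single reducer.
SELECT region, amount
FROM sales
ORDER BY amount DESC
LIMIT 10;

-- SORT BY: each reducer sorts only its own output, so the overall result
-- is ordered within each reducer's file but not across them.
-- DISTRIBUTE BY sends all rows of a region to the same reducer.
SELECT region, amount
FROM sales
DISTRIBUTE BY region
SORT BY region, amount DESC;
```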


Advanced Hive Operations

• JOIN
  – If only one column in each table is used in the join, then only one MapReduce job will run
    • This results in 1 MapReduce job:

      SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key = c.key

    • This results in 2 MapReduce jobs:

      SELECT * FROM a JOIN b ON a.key = b.key JOIN c ON b.key2 = c.key

  – If multiple tables are joined, put the biggest table last and the reducer will stream the last table, buffer the others
  – Use left semi-joins to take the place of IN/EXISTS

      SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON a.key = b.key;
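The advice on table order can be illustrated with the sketch below. The tables small_dim, medium_dim, and big_fact are hypothetical, and the STREAMTABLE hint in the second query is a Hive feature that these slides do not cover; it is shown only as one possible alternative to reordering the tables.

```sql
-- All joins use the same key, so a single MapReduce job is expected.
-- big_fact comes last, so the reducers stream it and buffer the other two.
SELECT f.key, s.val, m.val
FROM small_dim s
JOIN medium_dim m ON s.key = m.key
JOIN big_fact f ON m.key = f.key;

-- Alternative: keep any join order and name the table to stream with a hint.
SELECT /*+ STREAMTABLE(f) */ f.key, s.val, m.val
FROM big_fact f
JOIN small_dim s ON f.key = s.key
JOIN medium_dim m ON s.key = m.key;
```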


Advanced Hive Operations

• JOIN
  – Do not specify join conditions in the WHERE clause
    • Hive does not know how to optimise such queries
    • Will compute a full Cartesian product before filtering it
• Join Example

SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b ON a.ymd = b.ymd
WHERE a.symbol = 'AAPL' AND
      b.symbol = 'IBM' AND
      a.ymd > '2010-01-01';


Hive Stinger

• MPP-style execution of Hive queries
• Available since Hive 0.13
• No MapReduce
• We will talk about this more when we get to SQL on Hadoop


References

• http://hive.apache.org