Impala: A Modern, Open Source SQL Engine for Hadoop
Yogesh Chockalingam

Impala - University of Wisconsin–Madisonpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-yogesh... · 2018-10-23 · INSERT / UPDATE / DELETE • The user can add data to a table



Agenda

• Introduction
• Architecture
• Front-End
• Back-End
• Evaluation
• Comparison with Spark SQL


Introduction


Why not use Hive or HBase?

• HBase is a NoSQL database that runs on top of HDFS and provides real-time read/write access.

• Hive is a data warehousing tool built on top of Hadoop. It uses Hive Query Language (HQL) for querying data stored in a Hadoop cluster.

• Hive automatically translates HQL queries into MapReduce jobs.

• Hive doesn't support transactions.


Impala

• General-purpose SQL query engine:
    • Works across analytical and transactional workloads

• High performance:
    • Execution engine written in C++
    • Runs directly within Hadoop
    • Does not use MapReduce

• MPP database support:
    • Multi-user workloads


Creating tables

CREATE TABLE T (...) PARTITIONED BY (day int, month int)
LOCATION '<hdfs-path>' STORED AS PARQUET;

For a partitioned table, data is placed in subdirectories whose paths reflect the partition columns' values. For example, for day 17, month 2 of table T, all data files would be located in

<root>/day=17/month=2/
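
The directory naming rule above can be sketched in a few lines of Python. This is only an illustration of the `column=value` convention shown on the slide; the function name `partition_dir` and the table root `/warehouse/T` are hypothetical, since the slide leaves the root as `<root>`.

```python
from pathlib import PurePosixPath


def partition_dir(root, partition_values):
    """Return the directory that holds one partition's data files.

    `partition_values` is an ordered list of (column, value) pairs; each
    pair contributes one `column=value` component to the path.
    """
    path = PurePosixPath(root)
    for column, value in partition_values:
        path = path / f"{column}={value}"
    return str(path) + "/"


# Hypothetical table root; the partition columns follow the PARTITIONED BY order.
print(partition_dir("/warehouse/T", [("day", 17), ("month", 2)]))
```

Because the partition values are encoded in the path, a query that filters on `day` and `month` only needs to read the matching subdirectory.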


Metadata

• Table metadata, including the table definition, column names, data types, and schema, is stored in HCatalog.


INSERT / UPDATE / DELETE

• The user can add data to a table simply by copying/moving data files into the directory!

• Does NOT support UPDATE and DELETE.
    • Limitation of HDFS, as it does not support in-place updates.
    • Recompute the values and replace the data in the partitions.

• Run COMPUTE STATS <table> after inserts.
    • Those statistics will subsequently be used during query optimization.
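
The "recompute and replace" workaround can be illustrated with a small Python sketch. It uses a local temporary directory as a stand-in for HDFS, and the helper names `read_partition` and `replace_partition`, the CSV file format, and the file name `part-0.csv` are all assumptions made for this example rather than anything Impala prescribes.

```python
import os
import shutil
import tempfile


def read_partition(partition_dir):
    """Read every data file in a partition directory as lists of strings."""
    rows = []
    for name in sorted(os.listdir(partition_dir)):
        with open(os.path.join(partition_dir, name)) as f:
            rows.extend(line.rstrip("\n").split(",") for line in f)
    return rows


def replace_partition(partition_dir, new_rows):
    """Write recomputed rows to a staging directory, then swap it in."""
    parent = os.path.dirname(partition_dir.rstrip("/"))
    staging = tempfile.mkdtemp(dir=parent, prefix=".staging-")
    with open(os.path.join(staging, "part-0.csv"), "w") as f:
        for row in new_rows:
            f.write(",".join(str(v) for v in row) + "\n")
    if os.path.isdir(partition_dir):
        shutil.rmtree(partition_dir)   # drop the old data files
    os.rename(staging, partition_dir)  # move the recomputed files into place


# Set up a toy partition (local directory standing in for HDFS).
root = tempfile.mkdtemp()
part = os.path.join(root, "day=17", "month=2")
os.makedirs(part)
with open(os.path.join(part, "part-0.csv"), "w") as f:
    f.write("1,10\n2,20\n")

# Emulate an UPDATE that doubles the second column: recompute, then replace.
updated = [(key, int(value) * 2) for key, value in read_partition(part)]
replace_partition(part, updated)
print(read_partition(part))
```

No file is modified in place: the old files are discarded and new ones take their place, which is the only kind of change the storage layer needs to support. After a swap like this, the slide's advice to run COMPUTE STATS applies so that the optimizer sees the new data.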


Architecture


I: Impala Daemon

The Impala daemon service is dually responsible for:

1.  Accepting queries from client processes and orchestrating their execution across the cluster. In this role it's called the query coordinator.

2.  Executing individual query fragments on behalf of other Impala daemons.

• The Impala daemons are in constant communication with the statestore, to confirm which nodes are healthy and can accept new work.

• They also receive broadcast messages from the catalog daemon via the statestore, to keep track of metadata changes.

[Diagram: the Catalog and Statestore services connected to multiple Impala daemons]


II: Statestore Daemon

• Handles cluster membership information.

• Periodically sends two kinds of messages to Impala daemons:
    • Topic update: The new changes made since the last topic update message
    • Keepalive: A heartbeat mechanism

• If an Impala daemon goes offline, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.
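
A toy model may help make the two message kinds concrete. The class below is a simplified sketch, not the statestore's actual protocol: the class and method names, the timeout value, and the idea of recording a daemon's keepalive reply with an explicit `now` timestamp are all assumptions made for illustration.

```python
class Statestore:
    """Toy membership tracker: daemons that miss heartbeats are dropped."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical failure-detection timeout
        self.last_seen = {}         # daemon id -> time of last keepalive reply
        self.version = 0            # bumped on every membership change
        self.changes = []           # (version, daemon id, "joined" or "left")

    def _record(self, daemon, event):
        self.version += 1
        self.changes.append((self.version, daemon, event))

    def keepalive_ack(self, daemon, now):
        """Record that a daemon answered a keepalive at time `now`."""
        if daemon not in self.last_seen:
            self._record(daemon, "joined")
        self.last_seen[daemon] = now

    def expire(self, now):
        """Drop daemons whose last reply is older than the timeout."""
        for daemon, seen in list(self.last_seen.items()):
            if now - seen > self.timeout_s:
                del self.last_seen[daemon]
                self._record(daemon, "left")

    def topic_update(self, since_version):
        """Return only the changes made after `since_version`."""
        return [c for c in self.changes if c[0] > since_version]


store = Statestore(timeout_s=5)
store.keepalive_ack("impalad-1", now=0)
store.keepalive_ack("impalad-2", now=0)
store.keepalive_ack("impalad-1", now=4)  # impalad-2 stays silent
store.expire(now=6)

# A daemon that already saw version 2 receives only the delta.
print(store.topic_update(since_version=2))
```

The version counter is what lets a topic update carry only the changes since the previous one, instead of resending the whole membership list each time.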


III: Catalog Daemon

• Impala's catalog service serves catalog metadata to Impala daemons via the statestore broadcast mechanism, and executes DDL operations on behalf of Impala daemons.

• The catalog service pulls information from the Hive Metastore and aggregates that information into an Impala-compatible catalog structure.

• This structure is then passed on to the statestore daemon, which communicates with the Impala daemons.


1.  Request arrives from client via Thrift API.

2.  Planner turns request into collections of plan fragments. Coordinator initiates execution on remote Impala daemons.

3.  Intermediate results are streamed between Impala daemons. Query results are streamed back to client.

[Diagram: a SQL app sends a SQL request over ODBC to one of several Impala daemons. Each daemon contains a Query Planner, Query Coordinator, and Query Executor and sits alongside an HDFS DataNode and HBase. The Hive Metastore, HDFS NameNode, and Statestore are shared services.]
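
The three-step flow (request, fragment execution on several daemons, results returned through the coordinator) can be mimicked in a single process. The sketch below is only a rough analogy: the function names are invented, the round-robin assignment is an assumption (it ignores data locality), and a plain function call stands in for the remote execution request.

```python
from collections import Counter


def execute_fragment(rows):
    """Runs on one daemon: partially aggregate revenue per customer."""
    partial = Counter()
    for custid, revenue in rows:
        partial[custid] += revenue
    return partial


def coordinate(scan_ranges, daemons, limit):
    """Assign scan ranges round-robin, then merge the partial results."""
    assignments = {daemon: [] for daemon in daemons}
    for i, scan_range in enumerate(scan_ranges):
        assignments[daemons[i % len(daemons)]].extend(scan_range)

    merged = Counter()
    for daemon, rows in assignments.items():
        merged.update(execute_fragment(rows))  # stands in for a remote call
    return merged.most_common(limit)           # final result sent to the client


scan_ranges = [
    [("a", 10), ("b", 5)],
    [("a", 1), ("c", 7)],
    [("b", 20)],
]
print(coordinate(scan_ranges, ["impalad-1", "impalad-2"], limit=2))
```

Each daemon only ever sees its own share of the rows; the coordinator is the single place where partial results meet.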


Front-End


Query Plans

• The Impala frontend is responsible for compiling SQL text into query plans executable by the Impala backends.

• The query compilation process proceeds as follows:
    • Query parsing
    • Semantic analysis
    • Query planning/optimization

• Query planning
    1.  Single node planning
    2.  Plan parallelization and fragmentation


Query Planning: Single Node

• In the first phase, the parse tree is translated into a non-executable single-node plan tree.

E.g. a query joining two HDFS tables (t1, t2) and one HBase table (t3), followed by an aggregation and order by with limit (top-n):

SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;

[Figure: single-node plan tree with nodes Agg, HashJoin, HashJoin, Scan: t1, Scan: t2, Scan: t3]
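
One way to picture a single-node plan tree is as a small recursive data structure. The sketch below builds a tree for the example query and prints it with indentation. The `PlanNode` class, the node labels, and the exact shape (including the top-level TopN node and the join order) are assumptions inferred from the query text, not output from Impala's planner.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    label: str
    children: list = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as indented lines, parents before children."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Assumed shape: scans feed two hash joins, then aggregation, then top-n.
plan = PlanNode("TopN: ORDER BY revenue DESC LIMIT 10", [
    PlanNode("Agg: SUM(t2.revenue) GROUP BY t1.custid", [
        PlanNode("HashJoin: t1.id2 = t3.id", [
            PlanNode("HashJoin: t1.id1 = t2.id", [
                PlanNode("Scan: t1"),
                PlanNode("Scan: t2"),
            ]),
            PlanNode("Scan: t3 (category = 'Online')"),
        ]),
    ]),
])
print("\n".join(render(plan)))
```

Nothing in this tree says where each operator runs; that decision belongs to the second, distributed planning phase.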


Query Planning: Distributed Nodes

• The second planning phase takes the single-node plan as input and produces a distributed execution plan. Goal:
    • To minimize data movement
    • Maximize scan locality, as remote reads are considerably slower than local ones.

• Cost-based decision based on column stats/estimated cost of data transfers

• Decide parallel join strategy:
    • Broadcast join: The join is collocated with the left-hand side input; the right-hand side table is broadcast to each node executing the join. Preferred for small right-hand side input.
    • Partitioned join: Both tables are hash-partitioned on the join columns. Preferred for large joins.
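
The choice between a broadcast join and a partitioned join can be illustrated with a deliberately simple cost model that only counts bytes sent over the network. This is a sketch under stated assumptions, not Impala's actual cost function: a broadcast is assumed to send the whole right-hand side to every node, a partitioned join is assumed to move both inputs once, and the sizes below are made up.

```python
def choose_join_strategy(lhs_bytes, rhs_bytes, num_nodes):
    """Pick the strategy with the lower estimated network transfer."""
    # Broadcast: every node executing the join receives a full copy of the RHS.
    broadcast_cost = rhs_bytes * num_nodes
    # Partitioned: both inputs are hash-partitioned and shipped once.
    partitioned_cost = lhs_bytes + rhs_bytes
    if broadcast_cost <= partitioned_cost:
        return "broadcast", broadcast_cost
    return "partitioned", partitioned_cost


# Small right-hand side: copying it to 20 nodes is cheaper than reshuffling.
print(choose_join_strategy(10_000_000_000, 50_000_000, num_nodes=20))

# Two large inputs: broadcasting 8 GB to 20 nodes would dominate.
print(choose_join_strategy(10_000_000_000, 8_000_000_000, num_nodes=20))
```

The input sizes would come from table and column statistics, which is why stale or missing statistics can lead to a poor join strategy.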


Back-End


Executing the Query

• Impala's backend receives query fragments from the front-end and is responsible for their execution.

• High performance:
    • Written in C++ for minimal execution overhead
    • Internal in-memory tuple format puts fixed-width data at fixed offsets
    • Uses intrinsic/special CPU instructions for tasks like text parsing and CRC computation.
    • Runtime code generation for "big loops"
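
The benefit of putting fixed-width data at fixed offsets can be shown with Python's `struct` module. The row layout below (a 4-byte integer, an 8-byte float, and a 1-byte flag) and the field names are hypothetical; the sketch only illustrates the general idea, not Impala's real tuple format.

```python
import struct

# Hypothetical row layout: int32 id, float64 revenue, int8 flag.
# "<" disables padding, so each offset is the running sum of field widths.
LAYOUT = struct.Struct("<idb")
OFFSETS = {"id": 0, "revenue": 4, "flag": 12}


def pack_rows(rows):
    """Store rows back to back in one contiguous buffer."""
    buf = bytearray(LAYOUT.size * len(rows))
    for i, row in enumerate(rows):
        LAYOUT.pack_into(buf, i * LAYOUT.size, *row)
    return buf


def read_revenue(buf, row_index):
    """Jump straight to one field without decoding the rest of the row."""
    position = row_index * LAYOUT.size + OFFSETS["revenue"]
    (value,) = struct.unpack_from("<d", buf, position)
    return value


buf = pack_rows([(1, 9.5, 0), (2, 20.25, 1)])
print(LAYOUT.size, read_revenue(buf, 1))
```

Because every row has the same width, the address of any field is a multiplication and an addition, with no per-row parsing.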


Runtime Code Generation

Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance.

• For example, to convert every record to Impala's in-memory tuple format:
    • Known at query compile time: # of tuples in a batch, tuple layout, column types, etc.
    • Generate at compile time: an unrolled loop that inlines all function calls, eliminates dead code, and minimizes branches.

• Code generated using LLVM
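
The idea of specializing a function once the schema is known can be demonstrated in Python. This sketch swaps in a different technique from the slide: it generates Python source text and compiles it with `exec`, rather than emitting LLVM IR as Impala does. The function names and the comma-separated record format are assumptions for illustration.

```python
CONVERTERS = {"int": int, "float": float, "str": str}


def parse_interpreted(line, column_types):
    """Generic path: loops over columns and looks up a converter per field."""
    fields = line.split(",")
    return tuple(CONVERTERS[t](f) for t, f in zip(column_types, fields))


def generate_parser(column_types):
    """Emit and compile a parser specialised for one known schema."""
    exprs = []
    for i, t in enumerate(column_types):
        if t not in CONVERTERS:
            raise ValueError(f"unsupported column type: {t}")
        # String columns need no conversion, so no call is emitted for them.
        exprs.append(f"f[{i}]" if t == "str" else f"{t}(f[{i}])")
    # The per-column loop is unrolled into one fixed tuple expression.
    source = (
        "def parse(line):\n"
        "    f = line.split(',')\n"
        f"    return ({', '.join(exprs)},)\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["parse"], source


schema = ["int", "str", "float"]
parse, source = generate_parser(schema)
print(source)
print(parse("7,Online,12.5"))
```

The generated function has no loop over columns, no dictionary lookup, and no conversion call for string fields, which mirrors the slide's points about unrolling, inlining, and removing dead code.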


Evaluation


Comparison of query response times on single-user runs.


Comparison of query response times and throughput on multi-user runs.


Comparison of the performance of Impala and a commercial analytic RDBMS.
https://github.com/cloudera/impala-tpcds-kit


Comparison with Spark SQL


• Impala is faster than Spark SQL because it is an engine designed specifically for interactive SQL over HDFS, and it has architectural concepts that help it achieve that.

• For example, the Impala "always-on" daemons are up and waiting for queries 24/7, something that is not part of Spark SQL.


Thank you!