26
1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark SQL + Pig-La.n Combine Query Language and Data Flow Language for Data Science Jeff Zhang (zjff[email protected]) May 16, 2017

Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

1 ©HortonworksInc.2011–2016.AllRightsReserved

SparkSQL+Pig-La.nCombineQueryLanguageandDataFlowLanguageforDataScience

JeffZhang([email protected])May16,2017

Page 2: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

2 ©HortonworksInc.2011–2016.AllRightsReserved

WhoamI

Ã  ASFMember,workinASFforalmost8years

Ã  CommiRerofApacheTez,Pig&Zeppelin

Ã  WorksinHortonworks

Page 3: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

3 ©HortonworksInc.2011–2016.AllRightsReserved

DataScience

DataScience,alsoknownasdata-drivenscience,isaninterdisciplinaryfieldaboutscienYficmethods,processesandsystemstoextractknowledgeorinsightsfromdatainvariousforms,eitherstructuredorunstructured.

Ã  Describewhathappens

Ã  Explainwhathappens

Ã  Predictwhatwouldhappen

Page 4: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

4 ©HortonworksInc.2011–2016.AllRightsReserved

DataScience

CollectData

DataMunging

DataAnalysisInsight

Product

online offline

Page 5: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

5 ©HortonworksInc.2011–2016.AllRightsReserved

DataMunging

§  CollectandTransformServerLogData•  UserAgentNormalizaYon•  RobotDetecYon•  Sessionize

§  MovedatafromDatabasetoHDFS

§  CollectandTransformSocialMediaData

Page 6: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

6 ©HortonworksInc.2011–2016.AllRightsReserved

DataMunging

BeforeDataMunging AcerDataMunging

Page 7: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

7 ©HortonworksInc.2011–2016.AllRightsReserved

DataAnalysis

Ã  CombinedifferentsourcesofdataandapplystaYsYcs,BItoolstogetinsightfromData–  WebTrafficMetrics–  UserSegmentaYonAnalysis–  A/BTest

Page 8: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

8 ©HortonworksInc.2011–2016.AllRightsReserved

DataMungingvsDataAnalysis

DataMunging DataAnalysisDataSource Messy

Structured/UnstructuredUnorganized

Clean,NormalizedStructuredOrganized

Stability Regular,Stable Ad-hoc

Tools Python,Spark,Hadoopandetc.

R,Python,SQLandetc.

Datayouhavetobefullstackbigdataengineertododatascience?

Whatifyouareadataanalystwithoutmuchprogrammingskills?

Page 9: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

9 ©HortonworksInc.2011–2016.AllRightsReserved

DataScienceInfrastructure

Page 10: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

10 ©HortonworksInc.2011–2016.AllRightsReserved

WhatisSpark

ApacheSparkisafast,in-memorydataprocessingenginewithelegantandexpressivedevelopmentAPIstoallowdataworkerstoefficientlyexecutestreaming,machinelearningorSQLworkloads.

Page 11: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

11 ©HortonworksInc.2011–2016.AllRightsReserved

WhatisApachePig

Ã  ApachePigisahigh-levelplajormforcreaYngprogramsthatrunonApacheHadoop.ThelanguageforthisplajormiscalledPigLa.n.PigcanexecuteitsHadoopjobsinMapReduce,ApacheTez,orApacheSpark

•  Easeofprogramming

•  OpYmizaYonopportuniYes

•  Extensibility

Page 12: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

12 ©HortonworksInc.2011–2016.AllRightsReserved

WordCount

Load

ForEach Group ForEach Order

StoreUsingSQL?

Page 13: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

13 ©HortonworksInc.2011–2016.AllRightsReserved

Pig-La.nvsSQL

SQL Pig-La.nLanguageType QueryLanguage

•  defactorstandard

DataFlowLanguage•  lazyevaluaYon•  supportpipelinesplit

DataSource StructuredData Structured/UnstructuredIntegraYon IntegratedwithmostofBITools VeryfewBItoolsintegratedwith

Pig-LaYn

Conclusion•  Pig-La.nforDataMunging•  SQLforDataAnalysis

Page 14: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

14 ©HortonworksInc.2011–2016.AllRightsReserved

IntegrateSparkintoPig

LogicPlan

PhysicalPlan

Execu.onPlan

Execu.onEngine

PigScript

Page 15: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

15 ©HortonworksInc.2011–2016.AllRightsReserved

CombineSparkSQL+Pig-La.n

SparkDataFrameTable

SparkSQL

DataMunging

DataAnalysis

SparkScalaAPI

SparkPythonAPI

SparkRAPI

PigLa.n

Page 16: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

16 ©HortonworksInc.2011–2016.AllRightsReserved

Pig-Lain+SparkSQL

SparkDataFrameTable

SparkSQL

Load Store

DataMunging

DataAnalysis

Page 17: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

17 ©HortonworksInc.2011–2016.AllRightsReserved

SparkTable(bank)

PigLaYn

SQL

Page 18: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

18 ©HortonworksInc.2011–2016.AllRightsReserved

WheretorunPig-La.n&SparkSQL(Zeppelin)

ApacheZeppelinisaweb-basednotebookthatenablesinteracYvedataanalyYcs.YoucanmakebeauYfuldata-driven,interacYveandcollaboraYvedocumentswithSQL,Scalaandmore.

Page 19: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

19 ©HortonworksInc.2011–2016.AllRightsReserved

JVM

ZeppelinServer

PigInterpreterGroup

Pig-LaYn SparkSQL

JVM

JVM

SparkInterpreterGroup

Scala Python R

Pig-LaYn+SparkSQLinZeppelin

Page 20: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

20 ©HortonworksInc.2011–2016.AllRightsReserved

DataScienceInfrastructure(Recap)

Page 21: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

21 ©HortonworksInc.2011–2016.AllRightsReserved

Demo

Page 22: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

22 ©HortonworksInc.2011–2016.AllRightsReserved

Conclusion

Ã  LeveragethepowerofbothQueryLanguageandDataFlowLanguage

Ã  UseSparkasUnifiedExecuYonEngine.

Ã  ShareDatabetweenDataMunging&DataAnalysis

Ã  UseZeppelinasUnifiedDataSciencePlajorm

Page 23: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

23 ©HortonworksInc.2011–2016.AllRightsReserved

Summary

Ã  DataMunging&DataAnalysis

Ã  UsePig-LaYnforDataMunging,UseSQLforDataAnalysis

Ã  RununderSparkEngine

Ã  UseZeppelinasunifiedDataSciencePlajorm

Page 24: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

24 ©HortonworksInc.2011–2016.AllRightsReserved

CurrentStatus&What’sNext

Ã Status–  PIG-5080(Supportstorealiasassparktable)–  ZEPPELIN-2232(SupportSparkSQLforPigInterpreter)

Ã Next–  IntegrateSparkMLlibinPig–  UseDataFrameAPIinsteadofRDDAPItointegrateSparkwithPig–  SupporttoIntegratePigwithotherSparkAPIs,likeR,Python

Page 25: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

25 ©HortonworksInc.2011–2016.AllRightsReserved

Q&A

Page 26: Combine Query Language and Data Flow Language for Data Science · What is Apache Pig à Apache Pig is a high-level plaorm for creang programs that run on Apache Hadoop. The language

26 ©HortonworksInc.2011–2016.AllRightsReserved

ThankYou