Robert Hryniewicz, Data Evangelist, @RobHryniewicz
Hands-on Intro to Spark & Zeppelin Crash Course
© Hortonworks Inc. 2011–2016. All Rights Reserved

The "Big Data" Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over the network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?

Spark Background

Access Rates
At least an order of magnitude difference between memory and hard drive/network speed
Memory: FAST; hard drive: slow; network: slow

What is Spark?
→ Apache open source project, originally developed at AMPLab (University of California, Berkeley)
→ Data processing engine focused on in-memory distributed computing use cases
→ API: Scala, Python, Java, and R

Spark Ecosystem
Spark Core
Spark SQL, Spark Streaming, MLlib, GraphX

Why Spark?
→ Elegant developer APIs
  – Single environment for data munging and Machine Learning (ML)
→ In-memory computation model
  – Fast!
  – Effective for iterative computations and ML
→ Machine Learning
  – Implementation of distributed ML algorithms
  – Pipeline API (Spark ML)

History of Hadoop & Spark

Apache Spark Basics

SparkContext
What is it?
→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code

RDD - Resilient Distributed Dataset
→ Primary abstraction in Spark
  – An immutable collection of objects (or records, or elements) that can be operated on in parallel
→ Distributed
  – Collection of elements partitioned across nodes in a cluster
  – Each RDD is composed of one or more partitions
  – User can control the number of partitions
  – More partitions => more parallelism
→ Resilient
  – Recover from node failures
  – An RDD keeps its lineage information -> it can be recreated from parent RDDs
→ Created by starting with a file in the Hadoop Distributed File System (HDFS) or an existing collection in the driver program
→ May be persisted in memory for efficient reuse across parallel operations (caching)
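
A minimal sketch of the RDD properties listed above, written against the Spark 1.x Scala API and meant for spark-shell or a Zeppelin paragraph. In those environments sc already exists. SparkContext.getOrCreate is used so the snippet either reuses that context or, as an assumption for standalone runs, starts a local one. The collection, the partition count of 4, and all variable names are illustrative and not taken from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context (Zeppelin, spark-shell) or start a local one.
// The local[2] master and the app name are assumptions for a standalone run.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

// Create an RDD from a collection in the driver program, asking for 4 partitions.
val numbers = sc.parallelize(1 to 100, 4)
println(numbers.partitions.length)

// Transformations return new RDDs; the original RDD is never modified.
val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

// Lineage: the chain of parent RDDs Spark can use to recompute lost partitions.
println(evenSquares.toDebugString)

// Mark the RDD for in-memory persistence so later actions can reuse it.
evenSquares.cache()
val total = evenSquares.sum()    // first action computes the partitions and caches them
val count = evenSquares.count()  // second action can read the cached partitions
```

An RDD can also be created from a file, for example with sc.textFile and an HDFS path; the path is omitted here because it depends on your cluster.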

RDD – Resilient Distributed Dataset
[Diagram: two RDDs (RDD 1 and RDD 2), one with three partitions and one with four, spread across cluster nodes]

Spark SQL

Spark SQL Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files)
→ Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API
→ Same execution engine for all three
→ Spark SQL interfaces provide more information about both the structure of the data and the computation being performed than the basic Spark RDD API

DataFrames
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
→ Richer optimizations (often significantly faster than equivalent RDD code)
→ Distributed collection of data organized into named columns
→ Underneath is an RDD

DataFrames: Created from Various Sources
[Diagram: sources (CSV, Avro, Hive, Spark SQL, text) feeding a DataFrame (with an RDD underneath) made of rows and columns Col1, Col2, …, ColN]
→ DataFrames from Hive:
  – Reading and writing Hive tables, including ORC
→ DataFrames from files:
  – Built-in: JSON, JDBC, ORC, Parquet, HDFS
  – External plug-in: CSV, HBase, Avro
→ DataFrames from existing RDDs
  – with the toDF() function
Data is described as a DataFrame with rows, columns, and a schema
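
A short sketch of the last route, turning an existing RDD into a DataFrame with toDF(). It assumes the Spark 1.x API used elsewhere in this deck and runs in spark-shell or Zeppelin. The two rows reuse sample values from the flight examples in this deck; the variable names and the local[2] fallback are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Reuse the running contexts if present; otherwise start local ones (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("todf-sketch").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._  // brings toDF() and the $"column" syntax into scope

// An RDD of tuples: rows without column names.
val flightsRdd = sc.parallelize(Seq(
  ("IAD", "TPA", 8),
  ("IND", "BWI", -4)))

// toDF attaches column names; column types are inferred from the tuple types.
val flightsDf = flightsRdd.toDF("Origin", "Dest", "DepDelay")

flightsDf.printSchema()  // the schema: column names and types
flightsDf.show()         // the rows
```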

SQLContext and HiveContext
SQLContext
→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext
val sqlContext = new SQLContext(sc)
HiveContext
→ Superset of the functionality provided by the basic SQLContext
  – Read data from Hive tables
  – Access to Hive functions → UDFs
val hc = new HiveContext(sc)
Use when your data resides in Hive

Spark SQL Examples

DataFrame Example
Reading Data From Table
val df = sqlContext.table("flightsTbl")
df.select("Origin", "Dest", "DepDelay").show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+

DataFrame Example
Using DataFrame API to Filter Data (show delays of more than 15 min)
df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

SQL Example
Using SQL to Query and Filter Data (again, show delays of more than 15 min)
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

RDD vs. DataFrame

RDDs vs. DataFrames
RDD
→ Lower-level API (more control)
→ Lots of existing code & users
→ Compile-time type safety
DataFrame
→ Higher-level API (faster development)
→ Faster sorting, hashing, and serialization
→ More opportunities for automatic optimization
→ Lower memory pressure

DataFrames are Intuitive
Find average age by department?
dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61
[Code images not preserved: RDD example and equivalent DataFrame example]
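
The slide's own code for this comparison did not survive extraction, so what follows is a reconstruction of the idea and not the original code. It is a sketch against the Spark 1.x Scala API for spark-shell or Zeppelin, using the four rows from the table above. Variable names and the local[2] fallback are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

// Reuse the running contexts if present; otherwise start local ones (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("avg-age-sketch").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// The four rows from the table: (dept, name, age).
val people = Seq(
  ("Bio", "H Smith", 48),
  ("CS", "A Turing", 54),
  ("Bio", "B Jones", 43),
  ("Phys", "E Witten", 61))

// RDD version: carry a (sum, count) pair per department, then divide.
val avgByDeptRdd = sc.parallelize(people)
  .map { case (dept, _, age) => (dept, (age, 1)) }
  .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
  .mapValues { case (sum, n) => sum.toDouble / n }

avgByDeptRdd.collect().foreach(println)

// DataFrame version: say what is wanted and let Spark SQL plan the work.
val peopleDf = sc.parallelize(people).toDF("dept", "name", "age")
peopleDf.groupBy("dept").agg(avg("age")).show()
```

The RDD version spells out how the average is computed; the DataFrame version only names the grouping column and the aggregate.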

Spark SQL Optimizations
→ Spark SQL uses an underlying optimization engine (Catalyst)
  – Catalyst can perform intelligent optimization since it understands the schema
→ Spark SQL does not materialize all the columns (as with an RDD), only what's needed

Catalyst: Spark SQL optimizer
→ Query or DataFrame operations modeled as a tree
→ Logical plan created and optimized
→ Various physical plans created; best plan chosen
→ Code generation and execution
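
One way to watch these stages is to ask a DataFrame for its plans. The sketch below assumes the Spark 1.x Scala API in spark-shell or Zeppelin; the three sample rows reuse values from the flight examples, and the variable names and local[2] fallback are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Reuse the running contexts if present; otherwise start local ones (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("catalyst-sketch").setMaster("local[2]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

val flights = sc.parallelize(Seq(
  ("IAD", "TPA", 8),
  ("IAD", "TPA", 19),
  ("IND", "BWI", -4))).toDF("Origin", "Dest", "DepDelay")

// Nothing runs yet: this only builds up the tree of operations.
val delayed = flights.filter($"DepDelay" > 15).select("Origin", "Dest")

// explain(true) prints the parsed, analyzed, and optimized logical plans,
// followed by the physical plan selected for execution.
delayed.explain(true)
```

The exact plan text varies between Spark versions, so no sample output is shown here.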

Spark Streaming

Spark Streaming
Overview
→ Extension of Spark Core API
→ Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant

Spark Streaming
Window Operations
→ Apply transformations over a sliding window of data, e.g. rolling average
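
To make the sliding-window idea concrete, here is a conceptual sketch in plain Scala collections. It is not the Spark Streaming API; in Spark Streaming the same role is played by window operations on a DStream, which take a window length and a slide interval. The readings and the window settings are made-up values for illustration.

```scala
// One reading per batch interval (illustrative values).
val readings = Seq(10.0, 12.0, 14.0, 16.0, 18.0, 20.0)

// A window covering 3 batches that slides forward 1 batch at a time.
val windowLength = 3
val slideInterval = 1

// For each window position, average the readings that fall inside it.
val rollingAverage = readings
  .sliding(windowLength, slideInterval)
  .map(window => window.sum / window.size)
  .toList

println(rollingAverage)  // List(12.0, 14.0, 16.0, 18.0)
```

Each reading contributes to several overlapping windows, which is what distinguishes a sliding window from processing each batch on its own.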

Apache Zeppelin & HDP Sandbox

Apache Zeppelin – A Modern Web-based Data Science Studio
→ Data exploration and discovery
→ Visualization
→ Deeply integrated with Spark and Hadoop
→ Pluggable interpreters
→ Multiple languages in one notebook: R, Python, Scala

What's not included with Spark?
[Diagram labels: Resource Management; Storage; Applications; Spark Core Engine; Scala, Java, and Python libraries; MLlib (machine learning); Spark SQL*; Spark Streaming*]

HDP Sandbox
What's included in the Sandbox?
→ Zeppelin
→ Latest Hortonworks Data Platform (HDP)
  – Spark
  – YARN → Resource Management
  – HDFS → Distributed Storage Layer
  – And many more components...
[Diagram: APIs (Scala, Java, Python, R) over Spark SQL, Spark Streaming, MLlib, and GraphX, on the Spark Core Engine, running on YARN over HDFS nodes 1 … N]

Access patterns enabled by YARN
[Diagram: Batch, Interactive, and Real-Time applications on YARN: Data Operating System, over HDFS (Hadoop Distributed File System) nodes 1 … N]
Batch: needs to happen, but with no timeframe limitations
Interactive: needs to happen at human time
Real-Time: needs to happen at machine execution time

Why Spark on YARN?
→ Utilize existing HDP cluster infrastructure
→ Resource management
  – Share Spark workloads with other workloads like Pig, Hive, etc.
→ Scheduling and queues
[Diagram: Client with Spark Driver; Spark Application Master in a YARN container; three Spark Executors, each in its own YARN container and each running two tasks]

Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
[Diagram: a logical file split into blocks 1–4, with three copies of each block spread across the cluster]
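
The block-and-replica idea can be sketched in a few lines of plain Scala. This is a toy model of the slide's simplified description, not HDFS code: the 128 MB block size, the five node names, the 500 MB file, and the random seed are assumptions for illustration, and real HDFS placement also takes rack layout into account.

```scala
import scala.util.Random

// Assumed values for illustration only.
val blockSizeMb = 128
val replication = 3
val nodes = Seq("node1", "node2", "node3", "node4", "node5")  // hypothetical cluster

// Number of blocks for a file: round up so a partial last block still counts.
def blockCount(fileSizeMb: Int): Int =
  (fileSizeMb + blockSizeMb - 1) / blockSizeMb

// Place every block on `replication` distinct nodes chosen at random.
def placeBlocks(fileSizeMb: Int, rng: Random): Map[Int, Seq[String]] =
  (1 to blockCount(fileSizeMb)).map { block =>
    block -> rng.shuffle(nodes).take(replication)
  }.toMap

// A 500 MB file needs 4 blocks, each stored on 3 different nodes.
val placement = placeBlocks(500, new Random(42))
placement.toSeq.sortBy(_._1).foreach { case (block, replicas) =>
  println(s"block $block -> ${replicas.mkString(", ")}")
}
```

Because every block lives on three different nodes, losing one node never removes the only copy, and a task can be scheduled on any node that already holds the block it needs.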

There's more to HDP
Hortonworks Data Platform 2.4.x
[Diagram: HDP components by area]
GOVERNANCE & INTEGRATION
  – Data Lifecycle & Governance: Falcon, Atlas
  – Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS
  – Batch: MapReduce
  – Script: Pig
  – Search: Solr
  – SQL: Hive
  – NoSQL: HBase, Accumulo, Phoenix
  – Stream: Storm
  – In-memory
  – Others: ISV Engines
  – Tez, Slider
SECURITY
  – Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
  – Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, ZooKeeper
  – Scheduling: Oozie
DATA MANAGEMENT
  – YARN: Data Operating System
  – HDFS (Hadoop Distributed File System), nodes 1 … N
Deployment Choice: Linux, Windows, On-Premise, Cloud

HDP 2.5 TP

View User Sessions

Hortonworks Community Connection

Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like Stack Overflow)
• Knowledge Base Articles
• Code Samples and Repositories

Community Engagement
Participate now at: community.hortonworks.com
7,500+ Registered Users
15,000+ Answers
20,000+ Technical Assets
One Website!

Lab Preview

Link to Tutorial with Lab Instructions
http://tinyurl.com/hwx-intro-to-spark
Robert Hryniewicz, @RobHryniewicz
Thanks!