Apache Spark 101 June 2016 Abdullah Cetin CAVDAR @accavdar #AnkaraSparkDay



Page 1: Apache Spark 101

Apache Spark 101

June 2016

Abdullah Cetin CAVDAR

@accavdar

#AnkaraSparkDay

Page 2: Apache Spark 101

Apache Spark's Goal

Page 3: Apache Spark 101

Apache Spark is a fast and general engine for large-scale data processing

Page 4: Apache Spark 101

Most Active Project in Big Data

Spark Survey 2015

Page 5: Apache Spark 101

Top 10 Industries Using Spark

Page 6: Apache Spark 101

Many Types of Product

Page 7: Apache Spark 101

Spark Engine

Unified engine across diverse workloads & environments

Page 8: Apache Spark 101

Programming Languages

Page 9: Apache Spark 101

Open Source Spark Ecosystem

Page 10: Apache Spark 101

Most Important Aspects

Page 11: Apache Spark 101

Spark Programming Model

Page 12: Apache Spark 101

Challenge?

Fast data sharing across parallel jobs

Page 13: Apache Spark 101

Data Sharing in MapReduce

Page 14: Apache Spark 101

Data Sharing in Apache Spark

Page 15: Apache Spark 101

Components

Page 16: Apache Spark 101

Cluster Managers

Page 17: Apache Spark 101

Initializing Apache Spark

SparkConf and SparkContext

Page 18: Apache Spark 101

Apache Spark Shell

Python and Scala

Page 19: Apache Spark 101

RDD (Resilient Distributed Dataset)

An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost

Page 20: Apache Spark 101

RDD

Read-Only = Immutable

Parallelism
Caching

Page 21: Apache Spark 101

RDD

Partitioned = Distributed

More partitions = More parallelism

Page 22: Apache Spark 101

RDD

Rebuilt = Resilient

Recover lost data partitions
By replaying data lineage
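
A minimal sketch of recovery by lineage replay, written in plain Python rather than Spark's API. The names `source`, `lineage`, and `compute_partition` are hypothetical and exist only for this illustration: a lost partition is rebuilt by re-running the recorded steps on that partition's source data, while the other partitions are left alone.

```python
# Toy model of lineage-based recovery; plain Python, not Spark's API.
source = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # three source partitions

# Lineage: the ordered per-element steps that produced the dataset.
lineage = [
    lambda x: x * 10,
    lambda x: x + 1,
]

def compute_partition(index):
    """Rebuild one partition by replaying the lineage on its source data."""
    rows = source[index]
    for step in lineage:
        rows = [step(x) for x in rows]
    return rows

# Materialize every partition, then simulate losing partition 1.
partitions = {i: compute_partition(i) for i in range(len(source))}
before_loss = list(partitions[1])
del partitions[1]

# Recovery replays the lineage for the lost partition only.
partitions[1] = compute_partition(1)
recovered = partitions[1]
```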

Page 23: Apache Spark 101

RDD Operations

Page 24: Apache Spark 101

RDD Operations

Page 25: Apache Spark 101

Partitions

Logical division of data / basic unit of parallelism
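
To make "basic unit of parallelism" concrete, here is a small plain-Python sketch (not Spark's API). It assumes a hypothetical helper `split_into_partitions` and uses a thread pool as a stand-in for executor cores: one task runs per partition, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split data into num_partitions contiguous slices of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

data = list(range(1, 11))                      # the numbers 1..10
partitions = split_into_partitions(data, 4)

# One task per partition; the pool stands in for executor cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)
```

With four partitions, up to four tasks can run at once; with a single partition, the same work would run as one task.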

Page 26: Apache Spark 101

RDD Lineage

Lazy Evaluation

Page 27: Apache Spark 101

DAG (Directed Acyclic Graph)

Page 28: Apache Spark 101

Transformation & Action
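
The split between lazy transformations and eager actions can be sketched in plain Python. `LazyDataset` below is a hypothetical toy class, not Spark's RDD: `map` and `filter` only record a step, and nothing executes until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)   # recorded lineage of (kind, function)

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._data, self._steps + (("filter", fn),))

    # Action: replays the recorded steps and returns the result.
    def collect(self):
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

calls = []

def square(x):
    calls.append(x)   # side effect used only to observe when work happens
    return x * x

squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has executed yet
result = squares.collect()         # the action triggers the computation
```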

Page 29: Apache Spark 101

RDD Creation

Parallelizing a collection
  Loaded into the driver application's memory
  Only for prototyping and testing

Loading an external data set
  file://, hdfs://, s3n://
  sc.textFile()
  sc.hadoopFile(), sc.newAPIHadoopFile()
  sqlContext.read()

Page 30: Apache Spark 101

Word Count :)
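
The slide's own code is not preserved in this transcript. As a stand-in, here is a plain-Python sketch of the usual word count pipeline, with each step labeled by the Spark RDD operation it mirrors (`flatMap`, `map`, `reduceByKey`). The input lines are made up for the example.

```python
from collections import defaultdict

lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into words and flatten the result.
words = [word for line in lines for word in line.split()]

# map: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts that share the same word.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

word_counts = dict(counts)
```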

Page 31: Apache Spark 101

Driver & Workers

The main program is executed on the driver
Transformations are executed on the workers
Actions transfer results from the workers to the driver
The driver cannot get data from executors except through actions and accumulators

Page 32: Apache Spark 101

RDD Dependencies

Minimize shuffle / wide dependencies
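
A rough plain-Python illustration of why wide dependencies cost more (not Spark's API; `shuffle_by_key` is a hypothetical helper). A narrow step works on each partition in isolation, whereas grouping by key has to move records between partitions.

```python
# Two input partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]

# Narrow dependency: each output partition reads exactly one input partition.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide dependency: grouping by key needs a shuffle, because records for one
# key may sit in any input partition.
def shuffle_by_key(parts, num_outputs):
    outputs = [[] for _ in range(num_outputs)]
    for part in parts:
        for key, value in part:
            # ord() keeps the example deterministic; it stands in for a hash.
            outputs[ord(key) % num_outputs].append((key, value))
    return outputs

shuffled = shuffle_by_key(doubled, 2)
```

After the shuffle, all records for a given key sit in the same output partition, and each output partition has read from every input partition.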

Page 33: Apache Spark 101

RDD Persistence / Caching

persist() or cache()

Without caching, each action recomputes from the first RDD
Cached partitions are evicted in LRU (Least Recently Used) order
Default Storage Level: MEMORY_ONLY
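
The effect of caching can be sketched with a toy class that counts recomputations. `ToyDataset` is hypothetical and written in plain Python; it only mimics the idea that an uncached dataset is recomputed by every action, while a cached one is computed once and reused.

```python
class ToyDataset:
    """Counts how often the source is recomputed; not Spark's API."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.computations = 0

    def cache(self):
        self._use_cache = True   # only marks the dataset; nothing runs yet
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows

    # Two actions that both need the full dataset.
    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())

def expensive():
    return [x * x for x in range(5)]

uncached = ToyDataset(expensive)
uncached.count()
uncached.total()

cached = ToyDataset(expensive).cache()
cached.count()
cached_total = cached.total()
```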

Page 34: Apache Spark 101

Storage Levels

Page 35: Apache Spark 101

Shared Variables

Accumulators and Broadcast Variables

Page 36: Apache Spark 101

Accumulators

Used to implement counters or sums

Page 37: Apache Spark 101

Broadcast Variables

Keep a read-only variable cached on each machine
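
Both kinds of shared variable can be modeled in a few lines of plain Python. This is a sketch of the idea, not Spark's API: `ToyAccumulator`, `country_names`, and `task` are hypothetical names, threads stand in for worker tasks, and a read-only mapping stands in for a broadcast value.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType

class ToyAccumulator:
    """Add-only from tasks; the value is read back on the driver side."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value

# "Broadcast": one read-only lookup table shared by every task.
country_names = MappingProxyType({"TR": "Turkey", "DE": "Germany"})
unknown_codes = ToyAccumulator()

def task(partition):
    resolved = []
    for code in partition:
        if code in country_names:
            resolved.append(country_names[code])
        else:
            unknown_codes.add(1)   # counter updated from inside the task
    return resolved

partitions = [["TR", "XX"], ["DE", "TR", "YY"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(task, partitions))
```

The tasks only read the lookup table and only add to the counter; the final count is read once, after all tasks finish.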

Page 38: Apache Spark 101

Spark UI

Default port 4040

Page 39: Apache Spark 101

Deploying to a Cluster

Use spark-submit

Page 40: Apache Spark 101

Data Frames & Performance

Distributed collection of rows organized into named columns

Page 41: Apache Spark 101
Page 42: Apache Spark 101

Tips

Avoid groupByKey and wide dependencies
Use a sufficient number of partitions
Use coalesce to avoid writing too many small files
Be cautious with serialization/deserialization
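
The first tip can be illustrated by counting how many records would cross the network under each strategy. This is a plain-Python sketch with made-up data and a hypothetical helper `combine_locally`, not Spark's API: a groupByKey-style aggregation shuffles every record, while a reduceByKey-style aggregation combines within each partition before shuffling.

```python
from collections import Counter

partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey-style: every record is shuffled before being summed.
shuffled_without_combine = sum(len(part) for part in partitions)

# reduceByKey-style: sum within each partition first, then shuffle one
# record per key per partition.
def combine_locally(part):
    local = Counter()
    for key, value in part:
        local[key] += value
    return list(local.items())

combined = [combine_locally(part) for part in partitions]
shuffled_with_combine = sum(len(part) for part in combined)

# Both strategies produce the same final totals.
totals = Counter()
for part in combined:
    for key, value in part:
        totals[key] += value
```

In this toy input the local combine halves the shuffled records; the saving grows with the number of repeated keys per partition.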

Page 43: Apache Spark 101

Major Features in 2.0

Page 44: Apache Spark 101

Thank you

Page 45: Apache Spark 101

#AnkaraSparkDay