Apache Spark 101

Apache Spark 101, June 2016

Abdullah Cetin CAVDAR

@accavdar

#AnkaraSparkDay

Apache Spark's Goal

Apache Spark is a fast and general engine for large-scale data processing

Most Active Project in Big Data

Spark Survey 2015

Top 10 Industries Using Spark

Many Types of Product

Spark Engine

A unified engine across diverse workloads and environments

Programming Languages

Open Source Spark Ecosystem

Most Important Aspects

Spark Programming Model

Challenge? Fast data sharing across parallel jobs

Data Sharing in MapReduce

Data Sharing in Apache Spark

Components

Cluster Managers

Initializing Apache Spark

SparkConf and SparkContext
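A minimal PySpark sketch of the initialization step. The application name and the `local[2]` master are placeholder assumptions for a local run, not values from the talk.

```python
def build_context(app_name="spark-101-demo", master="local[2]"):
    """Create a SparkContext from an explicit SparkConf.

    The default app name and master are illustrative placeholders.
    """
    # Imported inside the function so the module loads without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(master)
    return SparkContext(conf=conf)


def demo():
    """Start a local context, run one small job, and shut down cleanly."""
    sc = build_context()
    try:
        print(sc.version)
        print(sc.parallelize(range(10)).sum())
    finally:
        sc.stop()  # frees the context so another one can be created later
```

Calling `demo()` requires a working PySpark installation. In the Spark shell this step is unnecessary because a context is already available as `sc`.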

Apache Spark Shell

Python and Scala

RDD (Resilient Distributed Dataset)

An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost

RDD

Read-Only = Immutable

Parallelism
Caching

RDD

Partitioned = Distributed

More partitions = More parallelism

RDD

Rebuilt = Resilient

Recover lost data partitions by replaying data lineage

RDD Operations

Partitions

Logical division of data / basic unit of parallelism

RDD Lineage

Lazy Evaluation

DAG (Directed Acyclic Graph)

Transformation & Action
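To make lineage, lazy evaluation, and the transformation/action split concrete, here is a toy single-machine model in plain Python. `ToyRDD` is a hypothetical class written for illustration only; it is not Spark's implementation and ignores partitioning and distribution.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (illustration only)."""

    def __init__(self, source, lineage):
        self._source = source          # callable that returns a fresh iterator
        self.lineage = tuple(lineage)  # names of the steps that produce this dataset

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data), ["from_list"])

    # Transformations: record a new step and return a new ToyRDD; compute nothing.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()),
                      self.lineage + ("map",))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)),
                      self.lineage + ("filter",))

    # Action: replay the whole recorded chain and hand the result to the caller.
    def collect(self):
        return list(self._source())


calls = []  # records every element the map function actually touches


def traced_square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD.from_list([1, 2, 3, 4])
even_squares = numbers.map(traced_square).filter(lambda x: x % 2 == 0)

calls_before_action = list(calls)  # still empty: nothing has run yet
result = even_squares.collect()    # the action triggers the computation
```

Because every `collect()` replays the chain from the source, calling it a second time runs `traced_square` again on every element. That replay is also what lets a lost partition be rebuilt from its lineage.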

RDD Creation

Parallelizing a collection
Loaded into driver application memory
For prototyping and testing only

Loading an external data set
file://, hdfs://, s3n://
sc.textFile()
sc.hadoopFile(), sc.newAPIHadoopFile()
sqlContext.read
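Both creation paths can be sketched in PySpark as follows. The temporary file, its contents, the partition count, and the helper `to_file_uri` are assumptions made for this example.

```python
import os
import tempfile
from pathlib import Path


def to_file_uri(path):
    """Turn a local path into a file:// URI (hypothetical helper)."""
    return Path(path).resolve().as_uri()


def demo():
    # Imported here so to_file_uri can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "rdd-creation-demo")
    try:
        # 1) Parallelize a driver-side collection into two partitions.
        in_memory = sc.parallelize([1, 2, 3, 4, 5], 2)
        print(in_memory.getNumPartitions())

        # 2) Load an external data set, here a small local text file.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "lines.txt")
            with open(path, "w") as handle:
                handle.write("first line\nsecond line\n")
            from_file = sc.textFile(to_file_uri(path))
            print(from_file.count())  # action runs while the file still exists
    finally:
        sc.stop()
```

The same `sc.textFile()` call accepts `hdfs://` and `s3n://` URIs when the cluster is configured for those file systems.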

Word Count :)
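The slide's own listing is not in this text, so the following is a generic PySpark word count rather than the presenter's version. The tokenizer regex and the sample sentences are assumptions.

```python
import re


def tokenize(line):
    """Split a line into lowercase words (simple regex tokenizer)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(lines):
    """Take an RDD of strings and return an RDD of (word, count) pairs."""
    return (lines.flatMap(tokenize)                   # one record per word
                 .map(lambda word: (word, 1))         # pair each word with 1
                 .reduceByKey(lambda a, b: a + b))    # sum the 1s per word


def demo():
    # Imported here so tokenize can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "word-count-demo")
    try:
        lines = sc.parallelize(["to be or not to be", "that is the question"])
        for word, count in word_count(lines).collect():
            print(word, count)
    finally:
        sc.stop()
```

`flatMap`, `map`, and `reduceByKey` are transformations; nothing runs until `collect()` is called in `demo()`.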

Driver & Workers

The main program is executed on the Driver
Transformations are executed on Workers
Actions transfer results from Workers to the Driver
The Driver cannot get data from executors except through actions and accumulators

RDD Dependencies

Minimize shuffles / wide dependencies

RDD Persistence / Caching

persist() or cache()

Without cache, it will restart from the first RDD
LRU (Least Recently Used) eviction
Default Storage Level: MEMORY_ONLY
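A small sketch of why caching matters when several actions reuse the same RDD. The function name `sum_and_count` is hypothetical; it only calls the RDD methods `cache()`, `persist()`, `sum()`, `count()`, and `unpersist()`.

```python
def sum_and_count(rdd, storage_level=None):
    """Run two actions on one RDD, keeping its partitions cached in between.

    With storage_level=None this uses cache(), i.e. the default storage level.
    Pass a pyspark.StorageLevel constant to choose another level.
    """
    if storage_level is None:
        rdd.cache()
    else:
        rdd.persist(storage_level)
    try:
        total = rdd.sum()    # first action: computes the RDD and stores it
        count = rdd.count()  # second action: can reuse the stored partitions
        return total, count
    finally:
        rdd.unpersist()      # release the cached partitions when done
```

With PySpark available, a call could look like `sum_and_count(sc.parallelize(range(100)).map(lambda x: x * x), StorageLevel.MEMORY_AND_DISK)`, after `from pyspark import StorageLevel`. Without the cache step, the `map` would be recomputed for the second action.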

Storage Levels

Shared Variables

Accumulators and Broadcast Variables

Accumulators

Used to implement counters or sums
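As a counter example, the sketch below counts unparseable records while summing the valid ones. `parse_int` and `sum_valid` are hypothetical helpers; `sc` is assumed to be an existing SparkContext.

```python
def parse_int(text):
    """Return int(text), or None when the text is not a valid integer."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def sum_valid(sc, records):
    """Sum the parseable records and count the rejected ones."""
    bad_records = sc.accumulator(0)  # workers may only add; the driver reads .value

    def keep_valid(text):
        value = parse_int(text)
        if value is None:
            bad_records.add(1)
            return []        # drop the bad record
        return [value]

    # sum() is the action that actually runs keep_valid on the workers.
    total = sc.parallelize(records).flatMap(keep_valid).sum()
    return total, bad_records.value
```

One caveat: updates made inside a transformation, as here, may be applied more than once if a task is re-executed, so such counters are best treated as diagnostics.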

Broadcast Variables

Keep a read-only variable cached on each machine
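A typical use is a small lookup table that every task needs. The table contents and the helper names below are made up for illustration; `sc` is assumed to be an existing SparkContext.

```python
# Hypothetical lookup table, small enough to ship to every machine.
COUNTRY_NAMES = {"TR": "Turkey", "DE": "Germany", "US": "United States"}


def resolve(code, table):
    """Look up a country code, falling back to 'unknown'."""
    return table.get(code, "unknown")


def label_codes(sc, codes):
    """Pair each code with its name using a broadcast copy of the table."""
    names = sc.broadcast(COUNTRY_NAMES)  # read-only on the workers
    return (sc.parallelize(codes)
              .map(lambda code: (code, resolve(code, names.value)))
              .collect())
```

Tasks read the table through `names.value` instead of capturing the dictionary in each closure, which avoids re-sending it with every task.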

Spark UI

Default port 4040

Deploying to a Cluster

Use spark-submit

DataFrames & Performance

Distributed collection of rows organized into named columns

Tips

Avoid groupByKey and wide dependencies
Use enough partitions
Use coalesce to avoid producing too many small files
Be cautious about serialization/deserialization
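The first tip can be illustrated with a plain-Python toy model that counts how many records would cross the shuffle. This is a simplification written for this note, not Spark code: partitions are plain lists, and both function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def group_by_key_then_reduce(partitions, fn):
    """groupByKey style: every (key, value) record crosses the shuffle."""
    shuffled = sum(len(part) for part in partitions)
    groups = defaultdict(list)
    for part in partitions:
        for key, value in part:
            groups[key].append(value)
    result = {key: reduce(fn, values) for key, values in groups.items()}
    return result, shuffled


def reduce_by_key(partitions, fn):
    """reduceByKey style: combine inside each partition before the shuffle."""
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = fn(local[key], value) if key in local else value
        combined.append(local)
    # Only one record per key per partition is left to shuffle.
    shuffled = sum(len(local) for local in combined)
    result = {}
    for local in combined:
        for key, value in local.items():
            result[key] = fn(result[key], value) if key in result else value
    return result, shuffled


# Two toy partitions of (word, 1) pairs.
PARTITIONS = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
]

grouped_result, grouped_shuffled = group_by_key_then_reduce(PARTITIONS, add)
reduced_result, reduced_shuffled = reduce_by_key(PARTITIONS, add)
```

Both approaches give the same totals, but in this toy data the grouped version moves 7 records and the combined version moves 4. The gap grows with the number of repeated keys per partition.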

Major Features in 2.0

Thank you

#AnkaraSparkDay
