Apache Spark 101

June 2016

Abdullah Cetin CAVDAR

@accavdar

#AnkaraSparkDay

Apache Spark's Goal

Apache Spark is a fast and general engine for large-scale data processing

Most Active Project in Big Data

Spark Survey 2015

Top 10 Industries Using Spark

Many Types of Product

Spark Engine

Unified engine across diverse workloads & environments

Programming Languages

Open Source Spark Ecosystem

Most Important Aspects

Spark Programming

Challenge?

Fast data sharing across parallel

Data Sharing in MapReduce

Data Sharing in Apache Spark

Components

Cluster Managers

Initializing Apache Spark

SparkConf and SparkContext

Apache Spark Shell

Python and Scala

RDD (Resilient Distributed Dataset)

An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost

RDD

Read-Only = Immutable

Parallelism
Caching

RDD

Partitioned = Distributed

More partitions = More parallelism

RDD

Rebuilt = Resilient

Recover lost data partitions by replaying data lineage
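To make "replaying data lineage" concrete, here is a minimal toy model in plain Python. It is not the Spark API: `ToyRDD` and its methods are hypothetical names used only for illustration, and unlike Spark it computes partitions eagerly. The point it sketches is that a derived dataset only needs to remember its parent and the function applied to it in order to rebuild a lost partition.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus lineage (hypothetical, not Spark)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent      # lineage: the dataset this one was derived from
        self.fn = fn              # lineage: per-element function applied to the parent
        self.partitions = partitions if partitions is not None else [
            [fn(x) for x in part] for part in parent.partitions
        ]

    def map(self, fn):
        # Datasets are read-only: a transformation returns a new ToyRDD
        return ToyRDD(parent=self, fn=fn)

    def lose_partition(self, index):
        # Simulate a machine failure that wipes one partition
        self.partitions[index] = None

    def get_partition(self, index):
        if self.partitions[index] is None:
            # Replay lineage: rebuild only the lost partition from the parent
            parent_part = self.parent.get_partition(index)
            self.partitions[index] = [self.fn(x) for x in parent_part]
        return self.partitions[index]


source = ToyRDD(partitions=[[1, 2], [3, 4], [5, 6]])
squared = source.map(lambda x: x * x)

squared.lose_partition(1)
recovered = squared.get_partition(1)
print(recovered)  # [9, 16]
```

Only the lost partition is recomputed; the other two partitions of `squared` are left untouched.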

RDD Operations

Partitions

Logical division of data / basic unit of parallelism
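As a rough illustration of partitions as the unit of parallelism, the sketch below splits a list into partitions and runs one task per partition using only the Python standard library. The helper `split_into_partitions` is a hypothetical name, and the thread pool merely stands in for a cluster of workers; in PySpark the partition count would typically be passed as the second argument of `sc.parallelize`.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions strided slices (hypothetical helper)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


data = list(range(1, 11))
partitions = split_into_partitions(data, 4)

# One task per partition: the partition count bounds how many tasks can run at once
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, partitions))

total = sum(partial_sums)
print(len(partitions), total)  # 4 55
```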

RDD Lineage

Lazy Evaluation

DAG (Directed Acyclic Graph)

Transformation & Action

RDD Creation

Parallelizing a collection
Loaded into driver application memory
Only for prototyping and testing

Loading an external data set
file://, hdfs://, s3n://
sc.textFile()
sc.hadoopFile(), sc.newAPIHadoopFile()
sqlContext.read()

Word Count :)
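The slide's own code did not survive extraction, so the following is only a sketch of the word-count logic in plain Python, with made-up input lines. Each step is labeled with the RDD operation it mirrors; in PySpark the same pipeline is usually written as `flatMap`, then `map`, then `reduceByKey` on an RDD created with `sc.textFile()`.

```python
from collections import defaultdict

lines = ["to be or not to be", "to do is to be"]  # made-up sample input

# flatMap: split every line into words
words = [word for line in lines for word in line.split()]

# map: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts that share the same key
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

word_counts = dict(counts)
print(word_counts["to"], word_counts["be"])  # 4 3
```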

Driver & Workers

Main Program is executed on the Driver
Transformations are executed on Workers
Actions transfer results from Workers to the Driver
The Driver can get data from executors only through actions and accumulators

RDD Dependencies

Minimize shuffle / Wide Dependencies

RDD Persistence / Caching

persist() or cache()

Without cache, it will restart from the first RDD
LRU (Least Recently Used)
Default Storage Level: MEMORY_ONLY
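The effect of "restart from the first RDD" can be sketched with a toy lazy dataset in plain Python. `LazyDataset` is a hypothetical class, not Spark's implementation, and it does not model LRU eviction or storage levels; it only counts how often the source is rebuilt with and without caching.

```python
class LazyDataset:
    """Toy lazy dataset (hypothetical, not Spark): computes only when an action runs."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the full list of elements
        self._cached = None
        self._use_cache = False

    def map(self, fn):
        # Transformation: records the work to do, runs nothing yet
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Mark the dataset so the first action keeps its result in memory
        self._use_cache = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        # Action: triggers the computation
        return len(self._materialize())

    def sum(self):
        # Action: triggers the computation
        return sum(self._materialize())


loads = {"n": 0}


def load_source():
    loads["n"] += 1               # count how often the first dataset is rebuilt
    return [1, 2, 3, 4]


uncached = LazyDataset(load_source).map(lambda x: x * 10)
uncached.count()
uncached.sum()
loads_without_cache = loads["n"]  # each action restarted from the source

loads["n"] = 0
cached = LazyDataset(load_source).map(lambda x: x * 10).cache()
cached.count()
cached.sum()
loads_with_cache = loads["n"]     # the second action reused the cached result

print(loads_without_cache, loads_with_cache)  # 2 1
```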

Storage Levels

Shared Variables

Accumulators and Broadcast Variables

Accumulators

Used to implement counters or sums

Broadcast Variables

Keep a read-only variable cached on each machine
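The semantics of both shared-variable types can be sketched in plain Python, with made-up data and a hypothetical `run_task` function standing in for work done on a worker. The lookup table plays the role of a broadcast variable (read-only, available to every task) and the counter plays the role of an accumulator (tasks only add to it, the driver reads the merged total). In PySpark the real counterparts are created with `sc.broadcast(...)` and `sc.accumulator(...)`.

```python
from types import MappingProxyType

# "Broadcast": a read-only lookup table every task can consult (made-up data)
country_names = MappingProxyType({"TR": "Turkey", "DE": "Germany"})

partitions = [["TR", "DE", "XX"], ["TR", "??"]]  # made-up partitioned input


def run_task(partition, lookup):
    """Work for one partition: returns resolved names and a local counter update."""
    resolved, unknown = [], 0
    for code in partition:
        if code in lookup:
            resolved.append(lookup[code])
        else:
            unknown += 1          # a task only adds to the counter
    return resolved, unknown


unknown_codes = 0                 # "accumulator": merged and read on the driver side
results = []
for part in partitions:
    resolved, unknown = run_task(part, country_names)
    results.extend(resolved)
    unknown_codes += unknown

print(results)        # ['Turkey', 'Germany', 'Turkey']
print(unknown_codes)  # 2
```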

Spark UI

Default port 4040

Deploying to a Cluster

Use spark-submit

Data Frames & Performance

Distributed collection of rows organized into named columns

Tips

Avoid groupByKey and wide dependencies
Use a sufficient number of partitions
Use coalesce to avoid producing too many small files
Be cautious about serialization/deserialization
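Why groupByKey is usually the costlier choice for aggregations can be sketched by counting the records that would cross the network in each style. This is a plain-Python model with made-up data, not a measurement of Spark itself: the groupByKey style ships every record before summing, while the reduceByKey style combines values inside each partition first and ships only partial sums.

```python
from collections import defaultdict

# Made-up input: two partitions of (key, value) pairs
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey style: every record is shuffled, then values are summed per key
shuffled_group = [pair for part in partitions for pair in part]
grouped = defaultdict(list)
for key, value in shuffled_group:
    grouped[key].append(value)
totals_group = {key: sum(values) for key, values in grouped.items()}

# reduceByKey style: combine within each partition, then shuffle the partial sums
shuffled_reduce = []
for part in partitions:
    local = defaultdict(int)
    for key, value in part:
        local[key] += value
    shuffled_reduce.extend(local.items())
totals_reduce = defaultdict(int)
for key, value in shuffled_reduce:
    totals_reduce[key] += value

print(len(shuffled_group), len(shuffled_reduce))  # 7 4
```

Both styles produce the same totals; only the number of shuffled records differs.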

Major Features in 2.0

Thank you

#AnkaraSparkDay
