
Intro to Apache Spark

EECS 4415

Big Data Systems

Tilemachos Pechlivanoglou

tipech@eecs.yorku.ca

Apache Spark


■ Spark is a cluster computing engine.

■ Provides high-level APIs in Scala, Java, Python, and R.

■ Provides high-level tools:
– Spark SQL.
– MLlib.
– GraphX.
– Spark Streaming.

RDDs


■ The basic abstraction in Spark is the RDD.

■ Stands for: Resilient Distributed Dataset.

■ A collection of items, with sources such as:
– Hadoop (HDFS).
– JDBC.
– ElasticSearch.
– others…

RDD concepts


Main concepts regarding RDDs:

■ Partitions.

■ Dependencies.

■ Lazy computation.

RDD partitions


■ An RDD is partitioned.

■ A partition is usually computed in a separate process
– (usually on a different machine).

■ This is how the distributed part of the RDD is implemented.
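A minimal PySpark sketch of inspecting partitions (the app name and the choice of 8 partitions are illustrative assumptions, not from the slides):

from pyspark import SparkContext

sc = SparkContext(appName="partition-demo")  # hypothetical app name

# Ask Spark to split the data into 8 partitions explicitly
rdd = sc.parallelize(range(100), numSlices=8)

print(rdd.getNumPartitions())   # 8
# glom() groups the elements of each partition into a list,
# so we can peek at how the data was split
print(rdd.glom().collect()[0])  # contents of the first partition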

RDD dependency


■ RDDs can depend on other RDDs.

■ RDD calculations are lazy:
– a map operation on an RDD gives a new RDD which depends on the original
– the new RDD only contains metadata (i.e., the computing function)

■ The flow is only computed on a specific command
– i.e., when we compute something final (e.g., reduce)
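A small PySpark illustration of this laziness (a sketch; the data and app name are made up):

from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")  # hypothetical app name

nums = sc.parallelize(range(10))

# Nothing is computed yet: each transformation just records a new RDD
# (metadata only) that depends on the previous one.
squares = nums.map(lambda x: x * x)
evens   = squares.filter(lambda x: x % 2 == 0)

# The whole chain is only evaluated when a final action is called.
print(evens.reduce(lambda a, b: a + b))   # 120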


Spark structure


■ Driver:
– Executes the main program
– Creates the RDDs
– Collects the results

■ Executors:
– Execute the RDD operations
– Participate in the shuffle

Taken from the Spark wiki - https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Spark flow

■ Normal process (a sketch follows below):
– Data ingestion: turn any source of data into RDDs
– Transformations: modify the RDDs in some way
– Final actions: evaluate the RDDs and return some result
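A word-count style sketch of this three-step flow in PySpark (the input path and app name are assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="flow-demo")  # hypothetical app name

# 1. Data ingestion: turn a source of data into an RDD (path is an assumption)
lines = sc.textFile("hdfs:///data/lyrics.txt")

# 2. Transformations: lazily derive new RDDs from the original
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# 3. Final action: evaluate the RDDs and return a result to the driver
for word, n in counts.take(10):
    print(word, n)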

RDD creation


■ Spark supports reading files, directories, streams, etc.

■ Some out-of-the-box methods (a sketch follows below):
– textFile – retrieves an RDD[String]
– sequenceFile – Hadoop sequence files, RDD[(K,V)]
– socketTextStream – text stream, RDD[String]
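A sketch of the batch-oriented creation methods in PySpark (the paths and app name are assumptions; socketTextStream belongs to a StreamingContext and appears in the streaming demo later):

from pyspark import SparkContext

sc = SparkContext(appName="creation-demo")  # hypothetical app name

# textFile: a text file (or directory of files) -> RDD of lines
lines = sc.textFile("hdfs:///data/books/*.txt")  # hypothetical path

# sequenceFile: a Hadoop sequence file -> RDD of (key, value) pairs
pairs = sc.sequenceFile("hdfs:///data/events.seq")  # hypothetical path

# An in-memory collection can also be turned into an RDD
nums = sc.parallelize([1, 2, 3, 4, 5])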

RDD transformation


■ Transformations are divided into two main types:
– Those that shuffle
– Those that don’t

■ Remember these are lazy operations!

RDD transformations, no shuffle


■ map(func):
– returns a new RDD by passing each element through func

■ filter(func):
– returns a new RDD by selecting the elements on which func returns true

■ flatMap(func):
– similar to map, but each input item is mapped to 0 or more output items
– (so func should return a Seq rather than a single item)
– e.g. (lyrics1, lyrics2) -> flatMap -> (word1, word2, word3, word4), where word1 and word2 come from lyrics1 and word3 and word4 come from lyrics2
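A small PySpark sketch of the three transformations (the input strings and app name are made up):

from pyspark import SparkContext

sc = SparkContext(appName="transform-demo")  # hypothetical app name

lyrics = sc.parallelize(["hello world hello", "big data systems"])

# map: exactly one output element per input element
upper = lyrics.map(lambda line: line.upper())

# filter: keep only elements for which the function returns True
data_lines = lyrics.filter(lambda line: "data" in line)

# flatMap: each input element maps to 0 or more output elements
words = lyrics.flatMap(lambda line: line.split(" "))

print(words.collect())
# ['hello', 'world', 'hello', 'big', 'data', 'systems']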

RDD transformations, shuffle


■ Shuffle operations repartition the data across the network.
■ They can be very expensive operations in Spark.
■ You must be aware of where and why a shuffle happens.
■ Order is not guaranteed inside a partition.

■ Popular operations that cause a shuffle:
– groupBy*, reduceBy*, sort*, aggregateBy*, and join/intersect on RDDs
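A sketch of one shuffling transformation in PySpark; reduceByKey must bring equal keys together, which repartitions data across the network (the words and app name are made up):

from pyspark import SparkContext

sc = SparkContext(appName="shuffle-demo")  # hypothetical app name

words = sc.parallelize(["spark", "rdd", "spark", "shuffle", "rdd", "spark"])
pairs = words.map(lambda w: (w, 1))

# reduceByKey moves records so that equal keys end up in the same partition:
# this is a shuffle, and it can be expensive on large data
counts = pairs.reduceByKey(lambda a, b: a + b)

# sortByKey is another shuffling transformation
print(counts.sortByKey().collect())
# [('rdd', 2), ('shuffle', 1), ('spark', 3)]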

Final actions (1)


■ The following (selected) methods evaluate the RDD (not lazy):
– collect() – returns a list containing all the elements of the RDD; the main RDD evaluation method
– count() – returns the number of elements in the RDD
– first() – returns the first element of the RDD
– foreach(f) – performs a function on each element of the RDD
– isEmpty()
– max()/min()
– reduce((T,T) => T) – parallel reduction
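A sketch of these actions on a small RDD (the numbers and app name are made up):

from pyspark import SparkContext

sc = SparkContext(appName="actions-demo")  # hypothetical app name

nums = sc.parallelize([3, 1, 4, 1, 5, 9])

print(nums.collect())                    # [3, 1, 4, 1, 5, 9]
print(nums.count())                      # 6
print(nums.first())                      # 3
print(nums.isEmpty())                    # False
print(nums.max(), nums.min())            # 9 1
print(nums.reduce(lambda a, b: a + b))   # 23 (parallel reduction)

# foreach runs a function on each element, on the executors
nums.foreach(lambda x: None)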

Final actions (2)


■ More evaluating methods:
– take(n) – returns the first n elements
– takeSample()
– takeOrdered(n) – returns the first (smallest) n elements
– top(n) – returns the first (largest) n elements
– countByKey() – for pair RDDs
– save*File
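A sketch of these methods (same made-up data as above; the output path is an assumption):

from pyspark import SparkContext

sc = SparkContext(appName="more-actions-demo")  # hypothetical app name

nums = sc.parallelize([3, 1, 4, 1, 5, 9])

print(nums.take(3))         # [3, 1, 4]  - first 3 elements
print(nums.takeOrdered(3))  # [1, 1, 3]  - 3 smallest
print(nums.top(3))          # [9, 5, 4]  - 3 largest

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.countByKey())   # counts per key: {'a': 2, 'b': 1} (dict-like)

# save*File actions write the RDD out instead of returning it
nums.saveAsTextFile("out/nums")  # hypothetical output path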

An example workflow

Demo streaming Twitter app

Running the demo Twitter app


■ The demo is executed in two different Docker containers:
– one responsible for connecting to the Twitter stream and forwarding it locally
– one responsible for getting the local stream and processing it in Spark
– we make them talk to each other by “linking” them

■ Running twitter_app.py:
– docker run -it -v $PWD:/app --name twitter -w /app python bash
– pip install -U git+https://github.com/tweepy/tweepy.git
  (installs the latest tweepy version; the previous one has a bug)
– python twitter_app.py

■ Running spark_app.py (a sketch of the Spark side follows below):
– docker run -it -v $PWD:/app --link twitter:twitter eecsyorku/eecs4415
– spark-submit spark_app.py
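The actual spark_app.py is not reproduced in these slides; below is a minimal sketch of what the Spark side could look like, assuming the linked “twitter” container forwards tweets as text on port 9009 (the hostname, port, batch interval, and hashtag logic are assumptions):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="twitter-hashtags")  # hypothetical app name
ssc = StreamingContext(sc, 10)                 # 10-second batches (assumption)

# Read the text stream forwarded by the linked "twitter" container
lines = ssc.socketTextStream("twitter", 9009)  # host/port are assumptions

# Count hashtags in each batch
hashtags = (lines.flatMap(lambda line: line.split(" "))
                 .filter(lambda word: word.startswith("#"))
                 .map(lambda word: (word.lower(), 1))
                 .reduceByKey(lambda a, b: a + b))

hashtags.pprint()

ssc.start()
ssc.awaitTermination()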

Twitter app credentials


■ Twitter requires an app developer account for access to the stream.
– Normally requires applying for one
– This is the best option

■ If that isn’t possible, you can use the credentials below:
– May cause rate-limiting issues if too many people run them at the same time

ACCESS_TOKEN = '2591998746-Mx8ZHsXJHzIxAaD2IxYfmzYuL3pYNVnvWoHZgR5'

ACCESS_SECRET = 'LJDvEa0jL7QJXxql0NVrULTAniLobe2TAAlnBdXRfm1xF'

CONSUMER_KEY = 'ZAPfZLcBhYEBCeRSAK5PqkTT7'

CONSUMER_SECRET = 'M81KvgaicyJIaQegdgXcdKDeZrSsJz4AVrGv3yoFwuItQQPMay'

Thank you!


Based on:
http://trainologic.com/wp-content/uploads/2017/06/SparkForDataScienceMeetup1.pptx
https://www.toptal.com/apache/apache-spark-streaming-twitter
