Spark tutorial


Learning PySpark: A Tutorial

By: Maria Mestre (@mariarmestre), Sahan Bulathwela (@in4maniac), Erik Pazos (@zerophewl)

This tutorial


● Some key Spark concepts (2 minute crash course)

● First part: Spark core
  ○ Notebook: basic operations
  ○ Spark execution model

● Second part: DataFrames and Spark SQL
  ○ Notebook: using DataFrames and Spark SQL
  ○ DataFrames execution model

● Final notes on Spark configuration and where to go from here

How to set up the tutorial


● Directions and resources for setting up the tutorial in your local environment can be found in the blog post below:

https://in4maniac.wordpress.com/2016/10/09/spark-tutorial/

The datasets

● Data extracted from the Amazon dataset
  ○ Image-based recommendations on styles and substitutes, J. McAuley, C. Targett, J. Shi, A. van den Hengel, SIGIR, 2015
  ○ Inferring networks of substitutable and complementary products, J. McAuley, R. Pandey, J. Leskovec, Knowledge Discovery and Data Mining, 2015

● Sample of Amazon product reviews (see the loading sketch below)
  ○ fashion.json, electronics.json, sports.json
  ○ fields: ASIN, review text, reviewer name, …

● Sample of product metadata
  ○ sample_metadata.json
  ○ fields: ASIN, price, category, ...
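A minimal sketch of loading one of the sample files into an RDD, assuming the SparkContext sc provided by the notebook and fashion.json in the working directory (each line of the sample files is one JSON record):

import json

reviews = sc.textFile('fashion.json').map(json.loads)   # RDD of dicts, one per review
reviews.first()['asin']                                 # the product ID of the first review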

Some Spark definitions (1)

● An RDD is a distributed dataset
● The dataset is divided into partitions
● It is possible to cache data in memory

Some Spark definitions (2)

● A cluster = a master node and slave nodes
● Transformations through the Spark context
● Only the master node has access to the Spark context
● Actions and transformations (see the sketch below)
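A small sketch of these ideas, assuming the SparkContext sc provided by the notebook: transformations are lazy and only build up the RDD graph, while actions trigger the actual computation on the cluster.

nums = sc.parallelize(range(100), 4)        # an RDD split into 4 partitions
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
evens.cache()                               # keep the result in memory once computed
evens.count()                               # action: the job actually executes, returns 50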


Notebook - Spark core parts 1-3

Why understand Spark internals?

● Essential for understanding failures and improving performance

This section is a condensed version of: https://spark-summit.org/2014/talk/a-deeper-understanding-of-spark-internals


From code to computations

rd = sc.textFile('product_reviews.txt')

rd.map(lambda x: (x['asin'], x['overall'])) \
  .groupByKey() \
  .filter(lambda x: len(x[1]) > 1) \
  .count()

From code to computations


1. You write code using RDDs

2. Spark creates a graph of RDDs

rd = sc.textFile('product_reviews.txt')

rd.map(lambda x: (x['asin'], x['overall'])) \
  .groupByKey() \
  .filter(lambda x: len(x[1]) > 1) \
  .count()

Execution model

3. Spark figures out a logical execution plan for each computation, splitting the work into stages (Stage 1, Stage 2, …)

Execution model


4. Spark schedules and executes individual tasks

If your shuffle fails...

● Shuffles are usually the bottleneck:
  ○ if tasks are very large ⇒ memory pressure
  ○ if there are too many tasks ⇒ network overhead
  ○ if there are too few tasks ⇒ suboptimal cluster utilisation

● Best practices:
  ○ always tune the number of partitions (see the sketch below)!
  ○ between 100 and 10,000 partitions
  ○ lower bound: at least ~2x the number of cores
  ○ upper bound: each task should take at least 100 ms

● https://spark.apache.org/docs/latest/tuning.html
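A sketch of tuning partitions on a shuffle, assuming the reviews RDD of parsed JSON records from the loading sketch earlier (the value 200 is purely illustrative):

pairs = reviews.map(lambda x: (x['asin'], x['overall']))

pairs.groupByKey(numPartitions=200) \
     .filter(lambda x: len(x[1]) > 1) \
     .count()

pairs_200 = pairs.repartition(200)   # repartition()/coalesce() change the partitioning of an existing RDD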

Other things failing...

● I'm trying to save a file but it keeps failing...
  ○ Turn speculation off!

● I get an error “no space left on device”!
  ○ Make sure SPARK_LOCAL_DIRS uses the right disk partition on the slaves

● I keep losing my executors
  ○ could be a memory problem: increase executor memory, or reduce the number of cores (see the configuration sketch below)
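A configuration sketch for the fixes above; the values are illustrative only, and since the notebooks already provide sc, this pattern applies to standalone scripts. SPARK_LOCAL_DIRS itself is set in the environment (e.g. spark-env.sh) on the slaves, not here.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set('spark.speculation', 'false')    # turn speculation off if saves keep failing
        .set('spark.executor.memory', '8g')   # more memory per executor
        .set('spark.executor.cores', '2'))    # fewer concurrent tasks per executor
sc = SparkContext(conf=conf)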


Notebook - Spark core part 4


Apache Spark


DataFrames API


DataFrames and Spark SQL

A DataFrame is a distributed collection of data organized into named columns.

● API very similar to Pandas/R DataFrames

Spark SQL lets you query DataFrames using SQL-like syntax (see the sketch below).

● Catalyst SQL engine

● HiveContext opens up most of HiveQL functionality with DataFrames
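A minimal sketch, assuming the notebook provides a SQLContext/HiveContext named sqlContext and the sample fashion.json file:

reviews_df = sqlContext.read.json('fashion.json')

# DataFrame API: Pandas/R-like column operations
reviews_df.select('asin', 'overall').groupBy('asin').count().show(5)

# Spark SQL: register the DataFrame as a table and query it with SQL
reviews_df.registerTempTable('reviews')
sqlContext.sql('SELECT asin, AVG(overall) AS avg_rating FROM reviews GROUP BY asin').show(5)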

RDDs and DataFrames

RDD
● Data is stored as independent objects in partitions
● Process optimisation is done at the RDD level
● More focus on “HOW” to obtain the required data

DataFrame
● Data has higher-level column information in addition to partitioning
● Optimisations are done on the schematic structure
● More focus on “WHAT” data is required

RDDs and DataFrames are transformable into each other (see the sketch below).
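A small sketch of converting between the two, assuming sc and sqlContext from the notebooks (the column names are just for illustration):

rdd = sc.parallelize([('B0001', 5.0), ('B0002', 3.0)])

df = sqlContext.createDataFrame(rdd, ['asin', 'overall'])   # RDD -> DataFrame (adds a schema)
back_to_rdd = df.rdd                                        # DataFrame -> RDD of Row objects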


Notebook - Spark DataFrames

How do DataFrames work?

● Why DataFrames?
● Overview

This section is inspired by: http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science


Main Considerations


Chart extracted from: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Fundamentals

[Diagram: the Catalyst query planning pipeline. A DataFrame program or a Spark SQL query (e.g. SELECT cols FROM tables WHERE cond) is parsed into an unresolved logical plan, resolved against the catalog into a logical plan, rewritten into an optimized logical plan, expanded into candidate physical plans, and the most efficient physical plan is selected and executed as RDD operations.]
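You can inspect the plans Catalyst produces for a given query with explain(); a minimal sketch, again assuming sqlContext and the sample data:

df = sqlContext.read.json('fashion.json')
query = df.filter(df.overall > 3).select('asin', 'overall')

query.explain(True)   # True also prints the logical and optimized plans, not just the physical one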

Notebook - Spark SQL

New stuff: Data Source APIs

● Schema Evolution
  ○ In Parquet, you can start from a basic schema and keep adding new fields.

● Run SQL directly on the file
  ○ For Parquet files you can run SQL on the file itself, since Parquet carries its own structure (see the sketch below).
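A sketch of both points, using a hypothetical reviews_parquet path; the mergeSchema option asks Spark to reconcile Parquet files written with different (evolving) schemas:

df = sqlContext.read.option('mergeSchema', 'true').parquet('reviews_parquet')

# Parquet files carry their own schema, so SQL can be run on the path directly,
# without registering a table first
sqlContext.sql("SELECT asin, overall FROM parquet.`reviews_parquet` LIMIT 10").show()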

Data Source APIs

● Partition Discovery
  ○ Table partitioning is used in systems like Hive
  ○ Data is normally stored in different directories (see the sketch below)
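A sketch of how partition discovery looks in practice, with a hypothetical directory layout:

# reviews_parquet/year=2014/month=01/part-00000.parquet
# reviews_parquet/year=2014/month=02/part-00000.parquet
df = sqlContext.read.parquet('reviews_parquet')
df.printSchema()   # year and month show up as columns discovered from the directory names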

spark-sklearn

● Parameter tuning is the problem:
  ○ the dataset is small
  ○ the grid search is BIG (see the sketch below)

More info: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html
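A minimal sketch along the lines of the blog post above: spark_sklearn provides a drop-in replacement for scikit-learn's GridSearchCV that takes the SparkContext as its first argument, so each parameter combination is fitted as a separate Spark task.

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV   # pip install spark-sklearn

digits = load_digits()
param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [10, 50, 100]}

gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)   # small data, big grid
gs.fit(digits.data, digits.target)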

New stuff: Dataset API

● Spark: complex analyses with minimal programming effort

● Run Spark applications faster
  ○ Closely knit to the Catalyst engine and the Tungsten engine

● Extension of the DataFrame API: a type-safe, object-oriented programming interface

More info: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html

Spark 2.0

● API changes
● A lot of work on the Tungsten execution engine
● Support for the Dataset API
● Unification of the DataFrame & Dataset APIs

More info: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Important Links

● Amazon dataset: https://snap.stanford.edu/data/web-Amazon.html
● Spark DataFrames: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
● More resources about Apache Spark:
  ○ http://www.slideshare.net/databricks
  ○ https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA
● Spark SQL programming guide for 1.6.1: https://spark.apache.org/docs/latest/sql-programming-guide.html
● Using Apache Spark in real-world applications: http://files.meetup.com/13722842/Spark%20Meetup.pdf
● Tungsten: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
● Further questions:
  ○ Maria: @mariarmestre
  ○ Erik: @zerophewl
  ○ Sahan: @in4maniac

Skimlinks is hiring Data Scientists and Senior Software Engineers !!

● Machine Learning
● Apache Spark and Big Data

Get in touch with:
● Sahan: sahan@skimlinks.com
● Erik: erik@skimlinks.com
