The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine
Solutions Architect
@Miklos_C

About Me
Miklos Christine
Solutions Architect @ Databricks

Systems Engineer @ Cloudera
Supported a few of the largest clusters in the world

Software Engineer @ Cisco
UC Berkeley Graduate

We are Databricks, the company behind Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014



Created Databricks on top of Spark to make big data simple.

Apache Spark Engine

Spark Core

Spark Streaming, Spark SQL, MLlib, GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

[Figure: Spark timeline. Milestones: ... & donated to ASF; Spark 1.0 & libraries (SQL, ML, GraphX); ML Pipelines; Spark 2.0]

Spark Community Growth
• Spark Survey 2015

Highlights
• End of Year Spark Highlights

2015: A Great Year for Spark

Most active open source project in (big) data
• 1000+ code contributors

New language: R

Widespread industry support & adoption

Spark User Highlights

[Survey figures: share of respondents on a public cloud; share who identify themselves as Data Engineers; share who identify themselves as Data Scientists]

Source: Slide 5 of Spark Community Update

Large-Scale Usage

Largest cluster: 8000 Nodes (Tencent)

Largest single job: 1 PB (Alibaba, Databricks)

Top Streaming Intake: 1 TB/hour (HHMI Janelia Farm)

2014 On-Disk Sort Record
Fastest Open Source Engine for sorting a PB

Spark API Performance

History of Spark APIs

RDD
Distributed collection of JVM objects
Functional operators (map, filter, etc.)

DataFrame
Distributed collection of Row objects
Expression-based operations and UDFs
Logical plans and optimizer
Fast/efficient internal representations

Dataset
Internally rows, externally JVM objects
Almost the "best of both worlds": type safe + fast
But slower than DataFrames
Not as good for interactive analysis, especially in Python

Benefit of Logical Plan: Performance Parity Across Languages



ETL with Spark

ETL: Extract, Transform, Load

● Key factor for big data platforms

● Provides Speed Improvements in All Workloads

● Typically Executed by Data Engineers

File Formats

● Text File Formats
○ CSV
○ JSON

● Avro Row Format

● Parquet Columnar Format

File Formats + Compression

● File Formats
○ JSON


○ Avro

○ Parquet

● Compression Codecs
○ No compression

○ Snappy

○ Gzip


Spark Parquet Properties

● Industry Standard File Format: Parquet

○ Write to Parquet:
df.write.parquet("/path/to/files/")

○ For compression:
spark.sql.parquet.compression.codec = (gzip, snappy)
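A minimal sketch of applying the compression property above from Python. The helper name `parquet_codec_conf` is hypothetical, and the tuple of accepted codec names is an assumption based on the values documented for Spark 1.x; `sqlContext`, `df`, and the output path in the usage comment are placeholders.

```python
# Codec names assumed to be accepted by spark.sql.parquet.compression.codec
# in Spark 1.x (the slide itself lists only gzip and snappy).
VALID_PARQUET_CODECS = ("uncompressed", "snappy", "gzip", "lzo")


def parquet_codec_conf(codec):
    """Return the (key, value) pair to hand to sqlContext.setConf()."""
    name = codec.lower()
    if name not in VALID_PARQUET_CODECS:
        raise ValueError("unsupported Parquet codec: %r" % codec)
    return ("spark.sql.parquet.compression.codec", name)


key, value = parquet_codec_conf("Snappy")

# Usage inside a PySpark session (sqlContext and df are assumed to exist,
# and the path is a placeholder):
#   sqlContext.setConf(key, value)
#   df.write.parquet("/path/to/output/")
```

Validating the codec name up front keeps a typo from surfacing only when the write job runs.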

Small Files Problem

● Small files problem still exists

● Metadata loading

● APIs:
df.coalesce(N)
df.repartition(N)
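One way to choose N for the two calls above is to divide the expected output size by a target file size. The sketch below assumes a 128 MB target, the lower end of the Parquet file size range this deck recommends under Common ETL Problems; the helper name `target_partitions` is hypothetical, and the size estimate has to come from your own data.

```python
import math

MB = 1024 * 1024


def target_partitions(total_bytes, target_file_bytes=128 * MB):
    """Number of output partitions so each file is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, int(math.ceil(total_bytes / float(target_file_bytes))))


# Example: about 10 GB of output split into ~128 MB files
n = target_partitions(10 * 1024 * MB)

# In a PySpark session (df is assumed to exist), n would then be used as:
#   df.repartition(n)   # full shuffle; can increase or decrease partitions
#   df.coalesce(n)      # avoids a full shuffle; can only decrease partitions
```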



All About Partitions

● RDD / DataFrame Partitions
df.rdd.getNumPartitions()

● SparkSQL Shuffle Partitions
spark.sql.shuffle.partitions

● Table Level Partitions
df.write.partitionBy("year").save("data.parquet")

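PySpark ETL APIs - Text Formats

# CSV (via the spark-csv package; the path is a placeholder)
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferSchema='true')
      .load('/path/to/files/'))

# JSON
df = sqlContext.read.json("/tmp/test.json")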

PySpark ETL APIs - Container Formats

# Binary Container Formats

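# Avro (assuming the spark-avro package; the path is a placeholder)

df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/files/")
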

# Parquet

df = sqlContext.read.parquet("/path/to/files/")


PySpark ETL APIs

● Manage Number of Files
○ APIs manage the number of files per directory

Common ETL Problems

● Malformed JSON Records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")

● Mismatched DataFrame Schema
○ Null Representation vs Schema DataType

● Many Small Files / No Partition Strategy
○ Parquet Files: ~128MB - 256MB Compressed

Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
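The query above surfaces lines that Spark's JSON reader could not parse. As a plain-Python analogue for checking a small sample of a file before loading it, the sketch below separates parseable lines from malformed ones. It is not Spark's parser and may disagree with it on edge cases; the function name `split_json_lines` and the sample records are made up for illustration.

```python
import json


def split_json_lines(lines):
    """Separate parseable JSON lines from malformed ones."""
    good, corrupt = [], []
    for line in lines:
        try:
            good.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            corrupt.append(line)
    return good, corrupt


# Hypothetical sample: one valid record, one truncated record, one non-JSON line
sample = ['{"id": 1, "body": "ok"}', '{"id": 2, "body": ', 'not json']
good, corrupt = split_json_lines(sample)
```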

Debugging Spark

Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed 4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal): java.nio.channels.ClosedChannelException

Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.
java.text.ParseException: Unparseable number: "\N"
    at java.text.NumberFormat.parse(NumberFormat.java:385)
    at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)
    at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
    at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)
    at scala.util.Try.getOrElse(Try.scala:77)
    at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)
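The executor trace shows a numeric cast failing on the literal `\N`, a null marker that some database export tools write into CSV files. A minimal sketch of the usual remedy, mapping known null markers to None before casting, is shown below in plain Python. The function name `parse_double` and the set of markers are assumptions for illustration; in Spark the same idea could be applied by reading the column as a string and converting it afterwards.

```python
# Null markers to recognise; this particular set is an assumption and should
# be adapted to whatever the upstream export actually emits.
NULL_TOKENS = frozenset(["\\N", "", "NULL", "null"])


def parse_double(raw, null_tokens=NULL_TOKENS):
    """Return float(raw), or None when raw is a recognised null marker."""
    if raw is None or raw.strip() in null_tokens:
        return None
    return float(raw)


# A CSV row whose middle field holds the literal backslash-N marker
row = ["42", "\\N", "3.5"]
parsed = [parse_double(v) for v in row]
```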

SQL with Spark

SparkSQL Best Practices

● DataFrames and SparkSQL are two interfaces to the same engine
● Use built-in functions instead of custom UDFs

○ import pyspark.sql.functions

● Examples:
○ to_date()
○ get_json_object()
○ regexp_extract()
○ hour() / minute()



SparkSQL Best Practices

● Large Table Joins

○ Largest Table on LHS

○ Increase Spark Shuffle Partitions

○ Leverage "cluster by" API included in Spark 1.6
sqlCtx.sql("select * from large_table_1 cluster by num1") \
    .registerTempTable("sorted_large_table_1")

sqlCtx.sql("cache table sorted_large_table_1")

PySpark API Best Practices
● User Defined Functions (UDFs)

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

add_n = F.udf(lambda x, y: x + y, IntegerType())

# Define a UDF that adds two integers, then use it to add an id_offset column; the id column is first cast to an Integer type.

df = df.withColumn('id_offset',
                   add_n(F.lit(1000), df.id.cast(IntegerType())))

PySpark API Best Practices

● Built-in Functions

corpus_df = df.select(
    F.lower(F.col('body')).alias('corpus'))

corpus_df = df.select(
    F.date_format(
        F.from_utc_timestamp(F.from_unixtime(F.col('created_utc')), "PST"),
        'EEEE').alias('dayofweek'))

Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

PySpark API Best Practices

● User Defined Functions (UDFs)

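from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def squared(s):
    return s * s

sqlContext.udf.register("squaredWithPython", squared, LongType())  # callable from SQL
squared_udf = udf(squared, LongType())  # callable from the DataFrame API

display(df.select("id", squared_udf("id").alias("id_squared")))  # display() is a Databricks notebook helper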

ML with Spark

Why Spark ML

Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)

Advantages of MLlib's Design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

High-level functionality in MLlib

Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets

Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation

Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra

Machine Learning: What and Why?

ML uses data to identify patterns and make decisions.
Core value of ML is automated decision making

• Especially important when dealing with TB or PB of data

Many Use Cases including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations

Algorithm coverage in MLlib

Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multil

Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods

Recommendation
• Alternating Least Squares

Frequent itemsets
• FP-growth
• Prefix span

Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering

Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation

Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
  • Block-partitioned matrix
  • Row matrix
  • Indexed row matrix
  • Coordinate matrix
• Matrix decompositions

Model import/export

Pipelines

Feature extraction & selection
• Binarizer
• Bucketizer
• Chi-Squared selection
• CountVectorizer
• Discrete cosine transform
• ElementwiseProduct
• Hashing term frequency
• Inverse document frequency
• MinMaxScaler
• Ngram
• Normalizer
• One-Hot Encoder
• PCA
• PolynomialExpansion
• RFormula
• SQLTransformer
• Standard scaler
• StopWordsRemover
• StringIndexer
• Tokenizer
• VectorAssembler
• VectorIndexer
• VectorSlicer
• Word2Vec

List based on Spark 1.5

Spark ML Best Practices

● Spark MLlib vs SparkML
○ Understand the differences

● Don’t Pipeline Too Many Stages
○ Check Results Between Stages

PySpark ML API Best Practices

● DataFrame to RDD Mapping

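# Assumed setup (not shown on the slide): NLTK tokenizer, stop words, and Porter stemmer
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

# Map each Row to (id, tokens) on the underlying RDD
rdd = wordsDataFrame.rdd.map(lambda x: (x['id'], tokenize(x['corpus'])))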

Learning more about MLlib

Guides & examples
• Example workflow using ML Pipelines (Python)
• These examples are part of the Databricks Guide, which contains many more examples and references.

References
• Apache Spark MLlib User Guide
  • The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)


Spark Demo

Sign Up For Databricks Community Edition! http://go.databricks.com/databricks-community-edition-beta-waitlist