The Nitty Gritty of Advanced Analytics Using Apache Spark in Python


The Nitty Gritty of Advanced Analytics

Using Apache Spark in Python

Miklos Christine, Solutions Architect (mwc@databricks.com, @Miklos_C)

About Me

Miklos Christine, Solutions Architect @ Databricks
- mwc@databricks.com
- @Miklos_C on Twitter

Systems Engineer @ Cloudera: supported a few of the largest clusters in the world

Software Engineer @ Cisco
UC Berkeley graduate

We are Databricks, the company behind Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%


Created Databricks on top of Spark to make big data simple.

Apache Spark Engine

Spark Core

Spark Streaming, Spark SQL, MLlib, GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

Spark timeline:
- 2010: started @ Berkeley
- 2012: research paper
- 2013: Databricks started & donated to ASF
- 2014: Spark 1.0 & libraries (SQL, ML, GraphX)
- 2015: DataFrames, Tungsten, ML Pipelines
- 2016: Spark 2.0

Spark Community Growth

● Spark Survey 2015 Highlights
● End of Year Spark Highlights

2015: A Great Year for Spark

● Most active open source project in (big) data
● 1000+ code contributors
● New language: R
● Widespread industry support & adoption

How respondents are running Spark: 51% on a public cloud

Top roles using Spark:
● 41% of respondents identify themselves as Data Engineers
● 22% of respondents identify themselves as Data Scientists

Spark User Highlights

NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO

Source: Slide 5 of Spark Community Update

Large-Scale Usage

● Largest cluster: 8000 nodes (Tencent)
● Largest single job: 1 PB (Alibaba, Databricks)
● Top streaming intake: 1 TB/hour (HHMI Janelia Farm)
● 2014 on-disk sort record: fastest open source engine for sorting a PB

Spark API Performance

History of Spark APIs

RDD (2011)
● Distributed collection of JVM objects
● Functional operators (map, filter, etc.)

DataFrame (2013)
● Distributed collection of Row objects
● Expression-based operations and UDFs
● Logical plans and optimizer
● Fast/efficient internal representations

Dataset (2015)
● Internally rows, externally JVM objects
● Almost the "best of both worlds": type safe + fast
● But slower than DataFrames; not as good for interactive analysis, especially in Python

Benefit of the Logical Plan: Performance Parity Across Languages

[Benchmark chart: DataFrame vs RDD performance by language]
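To make the RDD-versus-DataFrame distinction concrete, here is a minimal hedged sketch (Spark 1.6-era APIs; toy data, illustrative names; assumes an existing sc / sqlContext as in a Databricks notebook) of the same filter in both APIs:

# Hedged sketch: same computation with RDD operators vs DataFrame expressions.
pairs = [("alice", 34), ("bob", 19), ("carol", 17)]

# RDD: a distributed collection of Python objects, driven by functional operators.
rdd = sc.parallelize(pairs)
adult_names_rdd = rdd.filter(lambda kv: kv[1] >= 18).map(lambda kv: kv[0])

# DataFrame: a distributed collection of Row objects; expressions go through the
# Catalyst optimizer, which is why Python keeps pace with Scala here.
df = sqlContext.createDataFrame(pairs, ["name", "age"])
adult_names_df = df.filter(df.age >= 18).select("name")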

ETL with Spark

ETL: Extract, Transform, Load

● A key factor for big data platforms

● Done well, it provides speed improvements for all downstream workloads

● Typically executed by data engineers

File Formats

● Text file formats
  ○ CSV
  ○ JSON

● Avro (row format)

● Parquet (columnar format)

File Formats + Compression

● File formats
  ○ JSON
  ○ CSV
  ○ Avro
  ○ Parquet

● Compression codecs
  ○ No compression
  ○ Snappy
  ○ Gzip
  ○ LZO

● Industry Standard File Format: Parquet

○ Write to Parquet:

df.write.format("parquet").save("namesAndAges.parquet")

df.write.format("parquet").saveAsTable("myTestTable")

○ For compression:

spark.sql.parquet.compression.codec = (gzip, snappy)
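A minimal hedged sketch of setting that codec with the 1.6-era SQLContext API before writing (path and table name are the placeholders from above):

# Hedged sketch: set the Parquet codec once on the SQLContext; later writes pick it up.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
df.write.format("parquet").save("namesAndAges.parquet")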

Spark Parquet Properties

Small Files Problem

● Small files problem still exists

● Metadata loading

● APIs: df.coalesce(N), df.repartition(N)

Ref:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

All About Partitions

● RDD / DataFrame partitions: df.rdd.getNumPartitions()

● SparkSQL shuffle partitions: spark.sql.shuffle.partitions

● Table-level partitions: df.write.partitionBy("year").save("data.parquet")
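A short hedged sketch tying these partition knobs together (the values and paths are illustrative):

# Hedged sketch: inspect and adjust partitioning.
print(df.rdd.getNumPartitions())                          # partitions of the DataFrame itself
sqlContext.setConf("spark.sql.shuffle.partitions", "200")  # partitions created by shuffles/joins
df.write.partitionBy("year").\
    parquet("/tmp/data.parquet")                           # one output directory per year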

PySpark ETL APIs - Text Formats

# CSV
df = sqlContext.read.\
    format('com.databricks.spark.csv').\
    options(header='true', inferSchema='true').\
    load('/path/to/data')

# JSON
df = sqlContext.read.json("/tmp/test.json")
df.write.json("/tmp/test_output.json")

PySpark ETL APIs - Container Formats

# Binary Container Formats

# Avro
df = sqlContext.read.\
    format("com.databricks.spark.avro").\
    load("/path/to/files/")

# Parquet
df = sqlContext.read.parquet("/path/to/files/")
df.write.parquet("/path/to/files/")

PySpark ETL APIs

● Manage the number of files
  ○ These APIs control the number of files written per directory:

df.repartition(80).\
    write.\
    parquet("/path/to/parquet/")

df.repartition(80).\
    write.\
    partitionBy("year").\
    parquet("/path/to/parquet/")

Common ETL Problems

● Malformed JSON records

sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")

● Mismatched DataFrame schema
  ○ Null representation vs schema DataType

● Many small files / no partition strategy
  ○ Parquet files: ~128 MB - 256 MB compressed

Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
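A hedged sketch of how the _corrupt_record query above fits into a PySpark session (path and table name are placeholders):

# Hedged sketch: unparseable JSON lines land in _corrupt_record rather than failing the read.
raw = sqlContext.read.json("/tmp/test.json")
raw.registerTempTable("jsonTable")
bad = sqlContext.sql(
    "SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
bad.show(20, False)   # inspect a sample of the malformed records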

Debugging Spark

Spark Driver Error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed 4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal): java.nio.channels.ClosedChannelException

Spark Executor Error:

16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.

java.text.ParseException: Unparseable number: "\N"

at java.text.NumberFormat.parse(NumberFormat.java:385)

at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)

at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)

at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)

at scala.util.Try.getOrElse(Try.scala:77)

at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)
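The executor error above comes from the literal string "\N" (a common Hive/MySQL null marker) failing a numeric cast in the spark-csv package. A hedged sketch of one fix, assuming the package's nullValue option:

# Hedged sketch: tell spark-csv to treat "\N" as null rather than a value to cast.
df = sqlContext.read.\
    format('com.databricks.spark.csv').\
    options(header='true', inferSchema='true', nullValue='\\N').\
    load('/path/to/data')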


SQL with Spark

SparkSQL Best Practices

● DataFrames and SparkSQL are effectively synonyms: both run through the same optimizer and execution engine
● Use built-in functions instead of custom UDFs

○ import pyspark.sql.functions

● Examples:
  ○ to_date()
  ○ get_json_object()
  ○ regexp_extract()
  ○ hour() / minute()

Ref:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
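A minimal hedged sketch exercising these built-in functions on a hypothetical events DataFrame with ts (timestamp) and payload (JSON string) columns:

from pyspark.sql import functions as F

# Hedged sketch: built-in functions instead of Python UDFs (events, ts, payload are illustrative).
out = events.select(
    F.to_date(F.col('ts')).alias('event_date'),
    F.hour(F.col('ts')).alias('event_hour'),
    F.minute(F.col('ts')).alias('event_minute'),
    F.get_json_object(F.col('payload'), '$.user.id').alias('user_id'))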

SparkSQL Best Practices

● Large Table Joins

○ Largest Table on LHS

○ Increase Spark Shuffle Partitions

○ Leverage the "cluster by" clause (Spark 1.6):

sqlCtx.sql("select * from large_table_1 cluster by num1")\
    .registerTempTable("sorted_large_table_1")
sqlCtx.sql("cache table sorted_large_table_1")
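For the shuffle-partitions point above, a hedged one-liner (the value is illustrative):

# Hedged sketch: raise the number of post-shuffle partitions before a large join.
sqlCtx.setConf("spark.sql.shuffle.partitions", "400")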

PySpark API Best Practices

● User Defined Functions (UDFs)

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

add_n = F.udf(lambda x, y: x + y, IntegerType())

# Add a derived column: cast the id column to an integer and offset it by 1000.
df = df.withColumn('id_offset',
                   add_n(F.lit(1000), df.id.cast(IntegerType())))

PySpark API Best Practices

● Built-in Functions

# Example 1: lower-case a text column and attach a generated id.
corpus_df = df.select(
    F.lower(F.col('body')).alias('corpus'),
    F.monotonicallyIncreasingId().alias('id'))

# Example 2: derive the day of week from a unix timestamp (UTC shifted to PST).
corpus_df = df.select(
    F.date_format(
        F.from_utc_timestamp(F.from_unixtime(F.col('created_utc')), "PST"),
        'EEEE').alias('dayofweek'))

Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

PySpark API Best Practices

● User Defined Functions (UDFs)

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def squared(s):
    return s * s

# Register for SQL queries and wrap for the DataFrame API.
sqlContext.udf.register("squaredWithPython", squared, LongType())
squared_udf = udf(squared, LongType())

display(df.select("id", squared_udf("id").alias("id_squared")))

ML with Spark

Why Spark ML

Provide general-purpose ML algorithms on top of Spark
● Let Spark handle the distribution of data and queries; scalability
● Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)

Advantages of MLlib's design:
● Simplicity
● Scalability
● Streamlined end-to-end
● Compatibility

High-level functionality in MLlib

● Learning tasks
  ○ Classification
  ○ Regression
  ○ Recommendation
  ○ Clustering
  ○ Frequent itemsets

● Workflow utilities
  ○ Model import/export
  ○ Pipelines
  ○ DataFrames
  ○ Cross validation

● Data utilities
  ○ Feature extraction & selection
  ○ Statistics
  ○ Linear algebra

Machine Learning: What and Why?

ML uses data to identify patterns and make decisions. The core value of ML is automated decision making.
● Especially important when dealing with TB or PB of data

Many use cases, including:
● Marketing and advertising optimization
● Security monitoring / fraud detection
● Operational optimizations

Algorithm coverage in MLlib

● Classification
  ○ Logistic regression w/ elastic net
  ○ Naive Bayes
  ○ Streaming logistic regression
  ○ Linear SVMs
  ○ Decision trees
  ○ Random forests
  ○ Gradient-boosted trees
  ○ Multilayer perceptron
  ○ One-vs-rest

● Regression
  ○ Least squares w/ elastic net
  ○ Isotonic regression
  ○ Decision trees
  ○ Random forests
  ○ Gradient-boosted trees
  ○ Streaming linear methods

● Recommendation
  ○ Alternating Least Squares

● Frequent itemsets
  ○ FP-growth
  ○ PrefixSpan

● Clustering
  ○ Gaussian mixture models
  ○ K-Means
  ○ Streaming K-Means
  ○ Latent Dirichlet Allocation
  ○ Power Iteration Clustering

● Statistics
  ○ Pearson correlation
  ○ Spearman correlation
  ○ Online summarization
  ○ Chi-squared test
  ○ Kernel density estimation

● Linear algebra
  ○ Local dense & sparse vectors & matrices
  ○ Distributed matrices (block-partitioned, row, indexed row, coordinate)
  ○ Matrix decompositions

● Model import/export
● Pipelines
● Feature extraction & selection
  ○ Binarizer, Bucketizer, Chi-Squared selection, CountVectorizer, Discrete cosine transform, ElementwiseProduct, Hashing term frequency, Inverse document frequency, MinMaxScaler, Ngram, Normalizer, One-Hot Encoder, PCA, PolynomialExpansion, RFormula, SQLTransformer, Standard scaler, StopWordsRemover, StringIndexer, Tokenizer, VectorAssembler, VectorIndexer, VectorSlicer, Word2Vec

(List based on Spark 1.5)
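A minimal hedged sketch using two of the transformers listed above (StringIndexer and VectorAssembler); train_df and its columns are hypothetical:

from pyspark.ml.feature import StringIndexer, VectorAssembler

# Hedged sketch: index a categorical column, then assemble features into a vector.
indexer = StringIndexer(inputCol="country", outputCol="country_idx")
indexed = indexer.fit(train_df).transform(train_df)

assembler = VectorAssembler(inputCols=["country_idx", "age", "clicks"],
                            outputCol="features")
features_df = assembler.transform(indexed)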

Spark ML Best Practices

● Spark MLlib vs Spark ML
  ○ Understand the differences (RDD-based mllib API vs DataFrame-based ml Pipelines API)

● Don't pipeline too many stages
  ○ Check results between stages (see the sketch below)
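A hedged sketch of a short pipeline with an intermediate check; labeled_df and its text/label columns are hypothetical:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Hedged sketch: a small pipeline, checked stage by stage before fitting end to end.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Inspect intermediate results before committing to the full pipeline.
tokenized = tokenizer.transform(labeled_df)
tokenized.select("words").show(5)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(labeled_df)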


PySpark ML API Best Practices

● DataFrame to RDD Mapping

# Assumes NLTK and its 'punkt' / 'stopwords' data are available on the workers.
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

rdd = wordsDataFrame.map(lambda x: (x['id'], tokenize(x['corpus'])))
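If the tokenized RDD needs to go back to a DataFrame (e.g. to feed spark.ml stages), a hedged one-liner:

# Hedged sketch: turn the (id, tokens) RDD back into a DataFrame.
tokens_df = sqlContext.createDataFrame(rdd, ["id", "tokens"])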


Learning more about MLlib

Guides & examples
● Example workflow using ML Pipelines (Python)
● The above links are part of the Databricks Guide, which contains many more examples and references.

References
● Apache Spark MLlib User Guide: contains code snippets for almost all algorithms, as well as links to API documentation.
● Meng et al. "MLlib: Machine Learning in Apache Spark." 2015. http://arxiv.org/abs/1505.06807 (academic paper)


Spark Demo

Thanks!

Sign Up For Databricks Community Edition! http://go.databricks.com/databricks-community-edition-beta-waitlist
