The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine, Solutions Architect
[email protected], @Miklos_C


Page 1: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

The Nitty Gritty of Advanced Analytics

Using Apache Spark in Python

Miklos Christine, Solutions Architect
[email protected], @Miklos_C

Page 2: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

About Me

Miklos Christine, Solutions Architect @ Databricks
- [email protected]
- @Miklos_C on Twitter

Systems Engineer @ Cloudera: supported a few of the largest clusters in the world

Software Engineer @ Cisco

UC Berkeley graduate

Page 3: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

We are Databricks, the company behind Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%

Created Databricks on top of Spark to make big data simple.

Page 4: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Apache Spark Engine

Spark Core

Spark SQL | Spark Streaming | MLlib | GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

Page 5: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Page 6: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

2010: started @ Berkeley

2012: research paper

2013: Databricks started & Spark donated to ASF

2014: Spark 1.0 & libraries (SQL, ML, GraphX)

2015: DataFrames, Tungsten, ML Pipelines

2016: Spark 2.0

Page 7: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Spark Community Growth

• Spark Survey 2015 Highlights

• End of Year Spark Highlights

Page 8: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

2015: A Great Year for Spark

• Most active open source project in (big) data

• 1000+ code contributors

• New language: R

• Widespread industry support & adoption

Page 9: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Page 10: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Page 11: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Page 12: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Page 13: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

HOW RESPONDENTS ARE RUNNING SPARK

• 51% on a public cloud

TOP ROLES USING SPARK

• 41% of respondents identify themselves as Data Engineers

• 22% of respondents identify themselves as Data Scientists

Page 14: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Spark User Highlights

Page 15: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO

Source: Slide 5 of Spark Community Update

Page 16: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Large-Scale Usage

Largest cluster: 8,000 nodes (Tencent)

Largest single job: 1 PB (Alibaba, Databricks)

Top streaming intake: 1 TB/hour (HHMI Janelia Farm)

2014 on-disk sort record: fastest open source engine for sorting a PB

Page 17: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Spark API Performance

Page 18: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

History of Spark APIs

RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)

DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations

Dataset (2015)
• Internally rows, externally JVM objects
• Almost the "best of both worlds": type safe + fast
• But slower than DataFrames and not as good for interactive analysis, especially from Python
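To make the RDD vs. DataFrame contrast concrete, here is a minimal PySpark sketch; the people.txt / people.json paths and the name/age fields are hypothetical, and a SparkContext (sc) and SQLContext (sqlContext) are assumed.

# RDD: functional operators over plain Python objects
adults_rdd = sc.textFile("/tmp/people.txt") \
    .map(lambda line: line.split(",")) \
    .filter(lambda fields: int(fields[1]) > 21)

# DataFrame: expression-based operations planned by the Catalyst optimizer
people_df = sqlContext.read.json("/tmp/people.json")
adults_df = people_df.filter(people_df.age > 21).select("name", "age")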

Page 19: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Benefit of Logical Plan: Performance Parity Across Languages

[Benchmark chart comparing RDD and DataFrame performance across languages]

Page 20: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

ETL with Spark

Page 21: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

ETL: Extract, Transform, Load

● Key factor for big data platforms

● Well-designed ETL provides speed improvements for all downstream workloads

● Typically Executed by Data Engineers

Page 22: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

File Formats

● Text File Formats
  ○ CSV
  ○ JSON

● Avro Row Format

● Parquet Columnar Format

Page 23: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

File Formats + Compression

● File Formats
  ○ JSON

○ CSV

○ Avro

○ Parquet

● Compression Codecs
  ○ No compression

○ Snappy

○ Gzip

○ LZO

Page 24: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Spark Parquet Properties

● Industry Standard File Format: Parquet

○ Write to Parquet:

df.write.format("parquet").save("namesAndAges.parquet")

df.write.format("parquet").saveAsTable("myTestTable")

○ For compression, set spark.sql.parquet.compression.codec (e.g. gzip, snappy)
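A minimal sketch of setting the codec from PySpark before writing; this assumes a Spark 1.x-style sqlContext, and the path is a placeholder.

# Pick the Parquet compression codec for subsequent writes
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

df.write.format("parquet").save("/tmp/namesAndAges.parquet")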

Page 25: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Small Files Problem

● Small files problem still exists

● Metadata loading

● APIs:
df.coalesce(N)
df.repartition(N)

Ref:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame
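A minimal sketch of both APIs; the target of 16 output files and the paths are arbitrary placeholders.

# coalesce() reduces the number of partitions without a full shuffle;
# repartition() shuffles, but balances data evenly across partitions.
df.coalesce(16).write.parquet("/tmp/compacted/")
df.repartition(16).write.parquet("/tmp/balanced/")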

Page 26: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

All About Partitions

● RDD / DataFrame partitions:
df.rdd.getNumPartitions()

● SparkSQL shuffle partitions:
spark.sql.shuffle.partitions

● Table-level partitions:
df.write.partitionBy("year").\
    save("data.parquet")
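A minimal sketch of how the shuffle setting shows up in practice (Spark 1.x-style API; the column name, partition count, and path are placeholders).

sqlContext.setConf("spark.sql.shuffle.partitions", "64")

# The aggregation shuffles, so its result is backed by 64 partitions;
# writing it can therefore produce up to 64 files per output directory.
agg_df = df.groupBy("year").count()
print(agg_df.rdd.getNumPartitions())
agg_df.write.partitionBy("year").parquet("/tmp/counts.parquet")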

Page 27: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark ETL APIs - Text Formats

# CSV
df = sqlContext.read.\
    format('com.databricks.spark.csv').\
    options(header='true', inferSchema='true').\
    load('/path/to/data')

# JSON
df = sqlContext.read.json("/tmp/test.json")
df.write.json("/tmp/test_output.json")

Page 28: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark ETL APIs - Container Formats

# Binary Container Formats

# Avro

df = sqlContext.read.\

format("com.databricks.spark.avro").\

load("/path/to/files/")

# Parquet

df = sqlContext.read.parquet("/path/to/files/")

df.write.parquet("/path/to/files/")

Page 29: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark ETL APIs

● Manage Number of Files
  ○ These APIs control the number of files written per directory

df.repartition(80).\
    write.\
    parquet("/path/to/parquet/")

df.repartition(80).\
    write.\
    partitionBy("year").\
    parquet("/path/to/parquet/")

Page 30: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Common ETL Problems

● Malformed JSON Records
sqlContext.sql("SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")

● Mismatched DataFrame Schema○ Null Representation vs Schema DataType

● Many Small Files / No Partition Strategy○ Parquet Files: ~128MB - 256MB Compressed

Ref: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dealing_with_bad_data.html
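A minimal end-to-end sketch of the corrupt-record check above; the input path and temp table name are placeholders.

# Rows that fail to parse keep their raw text in the _corrupt_record column
df = sqlContext.read.json("/tmp/test.json")
df.registerTempTable("jsonTable")

bad_rows = sqlContext.sql(
    "SELECT _corrupt_record FROM jsonTable WHERE _corrupt_record IS NOT NULL")
bad_rows.show()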

Page 31: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Debugging Spark

Spark Driver Error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 362.0 failed 4 times, most recent failure: Lost task 1.3 in stage 362.0 (TID 275202, ip-10-111-225-98.ec2.internal): java.nio.channels.ClosedChannelException

Spark Executor Error:
16/04/13 20:02:16 ERROR DefaultWriterContainer: Aborting task.

java.text.ParseException: Unparseable number: "\N"

at java.text.NumberFormat.parse(NumberFormat.java:385)

at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply$mcD$sp(TypeCast.scala:58)

at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)

at com.databricks.spark.csv.util.TypeCast$$anonfun$castTo$4.apply(TypeCast.scala:58)

at scala.util.Try.getOrElse(Try.scala:77)

at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:58)

Page 32: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Debugging Spark

Page 33: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

SQL with Spark

Page 34: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

SparkSQL Best Practices

● DataFrames and SparkSQL are synonyms: both go through the same Catalyst optimizer
● Use built-in functions instead of custom UDFs
  ○ import pyspark.sql.functions

● Examples:
  ○ to_date()
  ○ get_json_object()
  ○ regexp_extract()
  ○ hour() / minute()

Ref:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
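A minimal sketch using these built-ins instead of a Python UDF; the column names (event_ts, payload, url) are hypothetical.

from pyspark.sql import functions as F

clean_df = df.select(
    F.to_date(F.col("event_ts")).alias("event_date"),
    F.get_json_object(F.col("payload"), "$.user.id").alias("user_id"),
    F.regexp_extract(F.col("url"), "https?://([^/]+)/", 1).alias("host"),
    F.hour(F.col("event_ts")).alias("event_hour"))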

Page 35: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

SparkSQL Best Practices

● Large Table Joins

○ Largest Table on LHS

○ Increase Spark Shuffle Partitions

○ Leverage the CLUSTER BY clause included in Spark 1.6:

sqlCtx.sql("select * from large_table_1 cluster by num1")\
    .registerTempTable("sorted_large_table_1")

sqlCtx.sql("cache table sorted_large_table_1")

Page 36: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark API Best Practices
● User Defined Functions (UDFs)

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

add_n = udf(lambda x, y: x + y, IntegerType())

# Register a UDF that adds two values, then use it to add a column to the
# DataFrame, casting the id column to an integer type.
df = df.withColumn('id_offset',
                   add_n(F.lit(1000), df.id.cast(IntegerType())))

Page 37: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark API Best Practices

● Built-in Functions

from pyspark.sql import functions as F

corpus_df = df.select(
    F.lower(F.col('body')).alias('corpus'),
    F.monotonicallyIncreasingId().alias('id'))

# Derive the day of week from a unix timestamp column, shifted to PST
corpus_df = df.select(
    F.date_format(
        F.from_utc_timestamp(F.from_unixtime(F.col('created_utc')), "PST"),
        'EEEE').alias('dayofweek'))

Ref: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

Page 38: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark API Best Practices

● User Defined Functions (UDFs)

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def squared(s):
    return s * s

# Register the function for use from SQL queries
sqlContext.udf.register("squaredWithPython", squared, LongType())

# Wrap it as a DataFrame UDF for use with the DataFrame API
squared_udf = udf(squared, LongType())

display(df.select("id", squared_udf("id").alias("id_squared")))

Page 39: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

ML with Spark

Page 41: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Why Spark ML

Provide general purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)

Advantages of MLlib's design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

Page 42: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

High-level functionality in MLlib

Learning tasks
• Classification
• Regression
• Recommendation
• Clustering
• Frequent itemsets

Workflow utilities
• Model import/export
• Pipelines
• DataFrames
• Cross validation

Data utilities
• Feature extraction & selection
• Statistics
• Linear algebra

Page 43: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Machine Learning: What and Why?

ML uses data to identify patterns and make decisions.
The core value of ML is automated decision making
• Especially important when dealing with TB or PB of data

Many use cases, including:
• Marketing and advertising optimization
• Security monitoring / fraud detection
• Operational optimizations

Page 44: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Algorithm coverage in MLlib

Classification
• Logistic regression w/ elastic net
• Naive Bayes
• Streaming logistic regression
• Linear SVMs
• Decision trees
• Random forests
• Gradient-boosted trees
• Multilayer perceptron
• One-vs-rest

Regression
• Least squares w/ elastic net
• Isotonic regression
• Decision trees
• Random forests
• Gradient-boosted trees
• Streaming linear methods

Recommendation
• Alternating Least Squares

Frequent itemsets
• FP-growth
• Prefix span

Clustering
• Gaussian mixture models
• K-Means
• Streaming K-Means
• Latent Dirichlet Allocation
• Power Iteration Clustering

Statistics
• Pearson correlation
• Spearman correlation
• Online summarization
• Chi-squared test
• Kernel density estimation

Linear algebra
• Local dense & sparse vectors & matrices
• Distributed matrices
  ○ Block-partitioned matrix
  ○ Row matrix
  ○ Indexed row matrix
  ○ Coordinate matrix
• Matrix decompositions

Model import/export
Pipelines

Feature extraction & selection
• Binarizer, Bucketizer, Chi-Squared selection, CountVectorizer, Discrete cosine transform, ElementwiseProduct, Hashing term frequency, Inverse document frequency, MinMaxScaler, Ngram, Normalizer, One-Hot Encoder, PCA, PolynomialExpansion, RFormula, SQLTransformer, Standard scaler, StopWordsRemover, StringIndexer, Tokenizer, VectorAssembler, VectorIndexer, VectorSlicer, Word2Vec

List based on Spark 1.5

Page 45: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Spark ML Best Practices

● Spark MLlib vs. Spark ML
  ○ Understand the differences (RDD-based MLlib vs. DataFrame-based spark.ml pipelines)

● Don't pipeline too many stages
  ○ Check results between stages (see the sketch below)
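A minimal spark.ml sketch with only a few stages, checking an intermediate result before fitting; the text and label columns are placeholders.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Inspect an intermediate stage before chaining everything together
tokenizer.transform(df).select("words").show(5)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(df)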

Page 46: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark ML API Best Practices

Page 47: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark ML API Best Practices

Page 48: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

PySpark ML API Best Practices

● DataFrame to RDD Mapping

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Assumed helper definitions: standard NLTK punctuation set, stopword list, and Porter stemmer
PUNCTUATION = set(string.punctuation)
STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def tokenize(text):
    tokens = word_tokenize(text)
    lowercased = [t.lower() for t in tokens]
    no_punctuation = []
    for word in lowercased:
        punct_removed = ''.join([letter for letter in word if letter not in PUNCTUATION])
        no_punctuation.append(punct_removed)
    no_stopwords = [w for w in no_punctuation if w not in STOPWORDS]
    stemmed = [STEMMER.stem(w) for w in no_stopwords]
    return [w for w in stemmed if w]

# Map each Row of the DataFrame to an (id, tokens) tuple (Spark 1.x DataFrame.map)
rdd = wordsDataFrame.map(lambda x: (x['id'], tokenize(x['corpus'])))
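A possible follow-up, assuming the tokenized RDD is converted back into a DataFrame for later pipeline stages; the column names are placeholders.

tokens_df = rdd.toDF(["id", "tokens"])
tokens_df.show(5)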

Page 49: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Learning more about MLlib

Guides & examples
• Example workflow using ML Pipelines (Python)
• The above links are part of the Databricks Guide, which contains many more examples and references.

References
• Apache Spark MLlib User Guide
• The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. "MLlib: Machine Learning in Apache Spark." 2015. http://arxiv.org/abs/1505.06807 (academic paper)

Page 50: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Spark Demo

Page 51: The Nitty Gritty of Advanced Analytics Using Apache Spark in Python

Thanks!

Sign Up For Databricks Community Edition! http://go.databricks.com/databricks-community-edition-beta-waitlist