Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...


After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations

Barcelona Spark Meetup

Oct 20th, 2015

Chris Fregly Principal Data Solutions Engineer

IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Who Am I?

Streaming Data Engineer Netflix Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer, IBM Spark Technology Center

Meetup Organizer, Advanced Apache Spark Meetup

Book Author Advanced (2016)

Advanced Apache Spark Meetup
Total Spark Experts: ~1350+ in 3 mos!
#4 most active Spark Meetup in the world!
Main Goals
• Dig deep into the Spark & extended-Spark codebase
• Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
• Surface and share the patterns and idioms of these well-designed, distributed, big data components


What is Spark?
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph analytics
• BlinkDB: approx queries

Spark Deployments In Production

Tools of the Talk
• Redis
• Docker
• Cassandra
• MLlib, GraphX
• Parquet, JSON
• Apache Zeppelin
• Spark Streaming, Kafka
• Spark SQL, DataFrames
• Spark JDBC/ODBC Hive ThriftServer
• ElasticSearch, Logstash, Kibana (ELK)

and…

SMACK Stack!

• Spark (Data Processing)
• Mesos (Cluster Manager)
• Akka (Actors)
• Cassandra (NoSQL)
• Kafka (Streaming)

Themes of this Talk
• Parallelism
• Performance
• Streaming
• Approximations
• Similarity Measures
• Recommendations

and…

Goals of Spark After Dark
• Generate high-quality recommendations
• Demonstrate Spark high-level libraries
  Spark Streaming -> Kafka, Approximations
  Spark SQL -> DataFrames, Cassandra
  GraphX -> PageRank, Shortest Path
  MLlib -> Matrix Factorization, Word2Vec

Popular Dating Sites

Parallelism

My First Experience With Parallelism
Brady Bunch, circa 1980
Season 5, Episode 18: “Two Petes in a Pod”

Parallel Algorithm: O(log n)


Non-Parallel Algorithm: O(n)
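To make the O(log n) versus O(n) contrast concrete, here is a minimal sketch in plain Python (not Spark code). It assumes the parallel algorithm is a pairwise tree reduction, which the slide does not state; the function names are illustrative only. It counts sequential steps against parallel rounds.

```python
def sequential_sum(values):
    """Non-parallel: one addition per step, so n - 1 steps for n values."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise (tree) reduction: each pair in a round could run on its own worker."""
    level, rounds = list(values), 0
    while len(level) > 1:
        # Combine neighbours; an odd element is carried into the next round.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


data = list(range(1, 17))  # 16 values
seq_total, seq_steps = sequential_sum(data)
par_total, par_rounds = tree_sum(data)
print(seq_total, seq_steps, par_total, par_rounds)
```

With 16 values, the sequential version needs 15 steps while the tree version needs 4 rounds, assuming enough workers to run each round's pairs at the same time.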

Spark is Parallel!

Performance

Spark Beats Hadoop @ 100 TB GraySort
• On-disk only
• 28,000 partitions
• No in-memory caching


Improved Shuffle and Network Layer
• “Sort-based shuffle”
• Minimize OS resources
• Switched to async Netty
• Keep CPUs hot
• Reuse byte buffers to minimize GC
• Use epoll for I/O to stay in kernel space

Project Tungsten: CPU and Memory
• More JVM bytecode generation, JIT optimization
• CPU-cache-aware data structures and algorithms
• Custom memory management: Serializers, Performance, New HashMap

DataFrames and Catalyst Optimizer

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Please Use DataFrames!


JVM bytecode generation

Columnar Storage Format

Skip whole chunks using the min-max statistics stored in each chunk (sorted data only)
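A minimal sketch of min-max chunk skipping, in plain Python rather than any real columnar reader. The chunk layout (a dict with `min`, `max`, `rows`) and the function names are assumptions made for illustration.

```python
def build_chunks(sorted_values, chunk_size):
    """Split one column into chunks and record min/max statistics per chunk."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        rows = sorted_values[i:i + chunk_size]
        chunks.append({"min": min(rows), "max": max(rows), "rows": rows})
    return chunks


def scan_equal(chunks, target):
    """Return matching values and how many chunks actually had to be read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        # The statistics alone prove this chunk cannot contain the target.
        if target < chunk["min"] or target > chunk["max"]:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk["rows"] if v == target)
    return matches, chunks_read


chunks = build_chunks(list(range(100)), chunk_size=10)
matches, chunks_read = scan_equal(chunks, 42)
print(matches, chunks_read, len(chunks))
```

Because the column is sorted, the min-max ranges do not overlap and only 1 of the 10 chunks is read. On unsorted data the ranges overlap heavily and far fewer chunks can be skipped.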

Parquet File Format
• Based on Google Dremel
• Implemented by Twitter and Cloudera
• Columnar storage format
• Optimized for fast columnar aggregations
• Tight compression
• Supports pushdowns
• Nested, self-describing, evolving schema

Types of Compression
• Run Length Encoding: Repeated data
• Dictionary Encoding: Fixed set of values
• Delta, Prefix Encoding: Sorted data
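The encodings listed above can be sketched in a few lines of plain Python. This illustrates the ideas only and is not Parquet's actual on-disk format; prefix encoding is left out for brevity, and the function names are illustrative.

```python
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store each run as (value, run_length)."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store a small dictionary plus integer codes."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def delta_encode(sorted_values):
    """Sorted data: store the first value, then the small gaps between neighbours."""
    if not sorted_values:
        return []
    gaps = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + gaps


rle = run_length_encode(["a", "a", "a", "b", "b", "a"])
dictionary, codes = dictionary_encode(["ES", "US", "ES", "FR"])
deltas = delta_encode([100, 101, 103, 110])
print(rle, dictionary, codes, deltas)
```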

Types of Query Optimizations
• Column, Partition Pruning
• Row, Predicate Pushdown

SELECT b FROM table WHERE a IN (a2, a3)
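A rough sketch of what column pruning and predicate pushdown mean for the query above, using a toy column-wise table in plain Python. The table contents and the function name are made up for illustration; a real engine does this inside the file reader.

```python
# Toy columnar table: each column is stored separately.
columns = {
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2", "c3", "c4"],
}


def select_b_where_a_in(table, wanted):
    # Column pruning: only "a" and "b" are touched; "c" is never read.
    columns_read = ["a", "b"]
    # Predicate pushdown: the filter runs against column "a" at the storage
    # layer, so only qualifying row positions are materialized from "b".
    row_ids = [i for i, a in enumerate(table["a"]) if a in wanted]
    return [table["b"][i] for i in row_ids], columns_read


result, columns_read = select_b_where_a_in(columns, {"a2", "a3"})
print(result, columns_read)
```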

Streaming

Direct Kafka Streaming – KafkaRDD
• No single Receiver, no Write Ahead Log (WAL)
• Workers pull from Kafka in parallel
• Each KafkaRDD partition stores relevant offsets
• Upon Worker Node failure, rebuild from offsets
• Optimizes happy path by avoiding the WAL

At-least-once delivery guarantee
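The rebuild-from-offsets idea can be sketched without Kafka or Spark at all. In this toy model a Python list stands in for one topic partition and a `(start, end)` tuple stands in for the offsets a KafkaRDD partition would carry; all names are hypothetical. It also shows why the guarantee is at-least-once: a replay can emit records twice.

```python
log = ["m0", "m1", "m2", "m3", "m4", "m5"]  # stand-in for one Kafka topic partition


def process(offset_range, sink, fail_at=None):
    """Read offsets [start, end) from the log and write each record to the sink."""
    start, end = offset_range
    for offset in range(start, end):
        if offset == fail_at:
            raise RuntimeError("worker lost")
        sink.append(log[offset])


sink = []
batch = (2, 5)  # the batch is fully described by its offset range
try:
    process(batch, sink, fail_at=4)  # worker dies part-way through the batch
except RuntimeError:
    process(batch, sink)             # rebuild the same batch from its offsets

print(sink)
```

Nothing is lost, but `m2` and `m3` reach the sink twice. Downstream writes therefore need to be idempotent or transactional if duplicates matter.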

Approximations

Count Min Sketch
• Approximate counters
• Better than HashMap
• Low, fixed memory
• Known error bounds
• Large num of counters
• From Twitter’s Algebird
• Streaming example in Spark codebase
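A minimal Count-Min Sketch in plain Python, to show where the fixed memory and the one-sided error come from. This is not Algebird's implementation; the width, depth, and SHA-256-based hashing are assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        # Memory is fixed at depth * width counters, however many items arrive.
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, derived by salting with the row id.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess
        # and it never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for word in ["spark"] * 5 + ["kafka"] * 2 + ["mesos"]:
    sketch.add(word)
print(sketch.estimate("spark"), sketch.estimate("kafka"))
```

Wider tables lower the chance of collisions, and more rows lower the chance that every row collides at once.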

HyperLogLog
• Approximate cardinality (approx count distinct)
• From Twitter’s Algebird
• Low memory: 1.5KB @ 2% error, 10^9 elements
• Streaming example in Spark codebase
• RDD: countApproxDistinctByKey()
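A compact HyperLogLog sketch in plain Python, following the standard algorithm (register index from the top hash bits, rank of the first 1-bit in the rest, harmonic-mean estimate with the small-range correction). It is not Algebird's or Spark's code. The choice of 2^11 registers and SHA-256 hashing are assumptions, and the bias constant used is the usual one for 128 or more registers.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p               # 2**11 = 2048 small registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        index = x >> (64 - self.p)                # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


hll = HyperLogLog()
for i in range(20000):
    hll.add(f"user-{i}")
first = hll.estimate()
for i in range(20000):       # re-adding the same items changes nothing
    hll.add(f"user-{i}")
second = hll.estimate()
print(round(first), round(second))
```

The expected relative error is roughly 1.04 / sqrt(2048), a little over 2%, and duplicates never move the estimate because each register only keeps a maximum.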

Monte Carlo Simulations
• From Manhattan Project (A-bomb)
• Simulate movement of neutrons
• Law of Large Numbers (LLN)
• Average of results of many trials
• Converge on expected value
• SparkPi example in Spark codebase
• Pi ~ (# red dots / # total dots * 4)
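The Pi estimate can be sketched in plain Python, independent of the SparkPi example. It assumes the red dots are the random points that land inside the quarter circle of radius 1; the seed and trial count are arbitrary.

```python
import random


def estimate_pi(trials, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()   # random dot in the unit square
        if x * x + y * y <= 1.0:            # dot falls inside the quarter circle
            inside += 1
    # (# dots inside / # total dots) approximates pi / 4.
    return 4.0 * inside / trials


pi_estimate = estimate_pi(200_000)
print(pi_estimate)
```

Each trial is independent, which is why this kind of simulation parallelizes well: workers count their own dots and the counts are summed at the end.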

Recommendations

Interactive Demo!

Audience Participation Needed!
• Navigate to sparkafterdark.com
• Click 3 actresses and 3 actors

-> You are here

Types of Recommendations
• Non-personalized
  Cold Start: no preference or behavior data for user, yet
• Personalized
  User-Item Similarity: items that others with similar prefs have liked
  Item-Item Similarity: items similar to your previously-liked items

Non-Personalized Recommendations

Summary Statistics and Aggregations
• Top Users by Like Count

“I might like users with the highest sum aggregation of likes overall.”

SparkSQL + DataFrame = Aggregations
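The aggregation itself is a group-by-and-count. A minimal sketch in plain Python follows; the talk does this with Spark SQL and DataFrames, and the `(from_user, to_user)` pairs here are made-up sample data.

```python
from collections import Counter

# Hypothetical like events: (from_user, to_user)
likes = [
    ("u1", "u3"), ("u2", "u3"), ("u4", "u3"),
    ("u1", "u2"), ("u3", "u2"),
    ("u2", "u1"),
]

# Equivalent in spirit to: SELECT to_user, COUNT(*) ... GROUP BY to_user ORDER BY 2 DESC
like_counts = Counter(to_user for _, to_user in likes)
top_users = like_counts.most_common(2)
print(top_users)
```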

Graph Analytics
• Top Influencers by Like Graph

“I might like users who have the highest probability of me liking them randomly while walking the like graph.”

GraphX: PageRank
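The random-walk intuition can be sketched as a plain-Python power iteration. This is not the GraphX API; the damping factor of 0.85, the iteration count, and the like graph are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by users who like nobody is spread evenly over everyone.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # A random walker at src follows one of its likes uniformly.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical like graph: (from_user, to_user)
edges = [("u1", "u3"), ("u2", "u3"), ("u4", "u3"), ("u3", "u2"), ("u2", "u1")]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
print(top, ranks)
```

Unlike a plain like count, a like from a highly ranked user is worth more than a like from a user nobody likes.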

Demo!

Spark SQL/DataFrames + GraphX/PageRank

Similarities

Types of Similarity
• Euclidean: linear measure
  Magnitude bias
• Cosine: angle measure
  Adjust for magnitude bias
• Jaccard: (intersection / union)
  Popularity bias
• Log Likelihood
  Adjust for popularity bias
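A small plain-Python sketch of the first three measures, with two hypothetical users whose tastes point the same way but differ in activity level. Log likelihood is omitted for brevity.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


light_user = [1, 2, 0]     # few interactions
heavy_user = [10, 20, 0]   # same proportions, ten times the activity

distance = euclidean_distance(light_user, heavy_user)
cosine = cosine_similarity(light_user, heavy_user)
jaccard = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})
print(distance, cosine, jaccard)
```

The Euclidean distance is large purely because of magnitude, while the cosine similarity is 1 (up to rounding) because the two vectors point in the same direction.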

Like matrix (1 = like). Columns: Ali, Matei, Reynold, Patrick, Andy. Rows (number of likes): Kimberly (4), Leslie (2), Meredith (3), Lisa (3), Holden (5).

All-Pairs Similarity Comparison
• Compare everything to everything
• aka “pair-wise similarity” or “similarity join”
• Naïve shuffle: O(m*n^2); m=rows, n=cols
• Minimize shuffle through approximations!
• Reduce m (rows)
  Sampling and bucketing
• Reduce n (cols)
  Remove most frequent value (e.g. 0)
  Principal Component Analysis

Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square Using MR”
• Remove rows with low similarity probability
• MLlib: RowMatrix.columnSimilarities(…)
• Twitter: 40% efficiency gain over Cosine Similarity

Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
• Split m into b buckets
• Use similarity hash algorithm
• Requires pre-processing of data
• Compare bucket contents in parallel
• Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
  e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash
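A quick arithmetic check of the 500k x 500k example. It assumes the slide's expression m*n/b*b^2 is evaluated left to right, which is the reading that reproduces the quoted figures.

```python
m = n = 500_000   # rows, cols
b = 50            # buckets

naive_cost = m * n ** 2               # O(m*n^2)
bucketed_cost = m * n // b * b ** 2   # O(m*n/b*b^2), evaluated left to right

print(f"{naive_cost:.3e} -> {bucketed_cost:.3e}")
print("reduction factor:", naive_cost // bucketed_cost)
```

Under this reading, the bucketed comparison is 10,000 times cheaper than the naïve shuffle for b=50.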

Reduce n: Remove Most Frequent Value
• Eliminate most-frequent value
• Represent other values with (index,value) pairs
• Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
• Note: Choose most frequent value (may not be 0)

(index,value)
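A minimal sketch of the (index,value) representation in plain Python. The helper names are illustrative, and the dot product shown is only valid when the eliminated value is 0.

```python
def to_sparse(dense, most_frequent=0):
    """Keep only the entries that differ from the most frequent value."""
    return [(i, v) for i, v in enumerate(dense) if v != most_frequent]


def sparse_dot(a, b):
    """Dot product over (index, value) pairs; assumes the dropped value is 0."""
    b_lookup = dict(b)
    return sum(v * b_lookup[i] for i, v in a if i in b_lookup)


dense_a = [0, 0, 3, 0, 1, 0, 0, 2]
dense_b = [1, 0, 2, 0, 0, 0, 0, 4]

sparse_a = to_sparse(dense_a)
sparse_b = to_sparse(dense_b)
dot = sparse_dot(sparse_a, sparse_b)
print(sparse_a, sparse_b, dot)
```

The work now scales with the number of stored pairs (nnz) rather than with the full column count n.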

Personalized Recommendations

Recommendation Terminology
• User: user seeking recommendations
• Item: item that has been liked or rated
• Feedback
  Explicit: like, rating
  Implicit: search, click, hover, view, scroll
• Feature Engineering
  Dimension reduction

Collaborative Filtering Personalized Recs
• Like behavior of similar users

“I like the same people that you like. What other people did you like that I haven’t seen?”

MLlib: Matrix Factorization, User-Item Similarity
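The quoted intuition can be sketched as a simple neighbourhood-based recommender in plain Python. This is a swapped-in, simpler technique for illustration, not MLlib's matrix factorization; the users, items, and overlap-count scoring are all hypothetical.

```python
from collections import Counter

# Hypothetical likes: user -> set of liked profiles
likes = {
    "me": {"a", "b"},
    "you": {"a", "b", "c"},
    "them": {"b", "d"},
    "other": {"e"},
}


def recommend(user, likes):
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)   # how much taste we share
        if overlap == 0:
            continue                   # no shared likes, no evidence
        for item in theirs - mine:     # things they liked that I have not seen
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable order.
    return [item for item, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


recommendations = recommend("me", likes)
print(recommendations)
```

Matrix factorization pursues the same goal at scale by learning low-dimensional user and item vectors instead of comparing like sets directly.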

Demo!

Spark SQL/DataFrames + MLlib/Alternating Least Squares

Text-based Personalized Recs (1/3)
• Similar profiles to me

“Our profiles have similar, unique k-skip n-grams. We might like each other.”

MLlib: Word2Vec, TF/IDF, Doc Similarity
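A small sketch of comparing two profiles by their k-skip n-grams, in plain Python. It covers only the n=2 case (skip-bigrams) and scores the overlap with Jaccard similarity; the profile texts are made up, and this is not how MLlib's Word2Vec or TF/IDF work.

```python
def skip_bigrams(tokens, k):
    """All ordered token pairs with at most k tokens skipped between them."""
    return {
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + k + 2, len(tokens)))
    }


def jaccard(a, b):
    return len(a & b) / len(a | b)


profile_a = "i love hiking and spark".split()
profile_b = "i love spark and hiking".split()

grams_a = skip_bigrams(profile_a, k=1)
grams_b = skip_bigrams(profile_b, k=1)
similarity = jaccard(grams_a, grams_b)
print(sorted(grams_a & grams_b), similarity)
```

Allowing skips lets phrases match even when a word sits between the two tokens, which plain adjacent bigrams would miss.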

Text-based Personalized Recs (2/3)
• Similar profiles from my past likes

“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”

MLlib: Word2Vec, TF/IDF, Doc Similarity

Text-based Personalized Recs (3/3)
• Relevant, High-Value Emails

“Your initial email has similar named entities to my profile. I might like you just for making the effort.”

MLlib: Word2Vec, TF/IDF, Entity Recognition

^ Her Email < My Profile

The Future of Recommendations!

Facial Recognition
• Eigenfaces

“Your face looks similar to others that I’ve liked. I might like you.”

MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Natural Language Processing: Convo Bot
• NLP and DecisionTrees

“If your responses to my trite opening lines are positive, I may read your profile.”

MLlib: TF/IDF, DecisionTree, Sentiment Analysis

Positive Negative

Maintaining the Spark!

Recommendations for Couples
• Pathways of Similarity

“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”

MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
(Figure: similar plots -> <- similar actors)
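The "something in between" idea can be sketched as a shortest path over an item-similarity graph, here with a plain-Python breadth-first search rather than GraphX. The two intermediate titles and every edge in the graph are hypothetical placeholders.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: the first path that reaches the goal has the fewest hops."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two items are not connected


# Hypothetical "is similar to" edges (by plot or by actors)
similar = {
    "Mad Max": ["Action Film X"],
    "Action Film X": ["Mad Max", "Drama Film Y"],
    "Drama Film Y": ["Action Film X", "Message In a Bottle"],
    "Message In a Bottle": ["Drama Film Y"],
}

path = shortest_path(similar, "Mad Max", "Message In a Bottle")
in_between = path[1:-1]
print(path, in_between)
```

The items in the middle of the path are the candidates for a compromise pick.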

Final Recommendation!

Get Off the Computer & Meet People!

Thank you!!

Chris Fregly @cfregly IBM Spark Tech Center San Francisco, CA, USA

Relevant Links
• advancedspark.com: Signup for the book and meetup!
• github.com/fluxcapacitor/pipeline: Clone all code used today!
• hub.docker.com/r/fluxcapacitor/pipeline: Run all demos presented today!

Image courtesy of http://www.duchess-france.org/
