Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...

  • View
    1.353

  • Download
    5

  • Category

    Software

Preview:

Citation preview

Click to edit Master text styles

Click to edit Master text styles

After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations

Barcelona Spark Meetup

Oct 20th, 2015

Chris Fregly Principal Data Solutions Engineer

IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Who Am I?

2

Streaming Data Engineer Netflix Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Meetup Organizer Advanced Apache Meetup

Book Author Advanced (2016)

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Advanced Apache Spark Meetup Total Spark Experts: ~1350+ in 3 mos! #4 most active Spark Meetup in the world! Main Goals Dig deep into the Spark & extended-Spark codebase Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc

Surface and share the patterns and idioms of these well-designed, distributed, big data components

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 4

Core

Spark Streaming

real-time Spark SQL structured data

MLlib machine learning

GraphX graph

analytics

BlinkDB approx queries

What is Spark?

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark Deployments In Production

5

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Tools of the Talk

6

  Redis   Docker   Cassandra   MLlib, GraphX   Parquet, JSON   Apache Zeppelin   Spark Streaming, Kafka   Spark SQL, DataFrames   Spark JDBC/ODBC Hive ThriftServer   ElasticSearch, Logstash, Kibana (ELK)

and…

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

SMACK Stack!

7

S park (Data Processing) M esos (Cluster Manager) A kka (Actors) C assandra (NoSQL) K afka (Streaming)

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Themes of this Talk

 Parallelism  Performance  Streaming  Approximations  Similarity Measures  Recommendations

8

and…

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Goals of Spark After Dark   Generate high-quality recommendations

  Demonstrate Spark high-level libraries Spark Streaming -> Kafka, Approximates

Spark SQL -> DataFrames, Cassandra

  GraphX -> PageRank, Shortest Path

  MLlib -> Matrix Factor, Word2Vec

9

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Popular Dating Sites

10

Click to edit Master text styles

Click to edit Master text styles Parallelism

11

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

My First Experience With Parallelism Brady Bunch circa 1980 Season 5, Episode 18: “Two Pete’s in a Pod”

12

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Parallel Algorithm: O(log n)

13

O(log n)

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Non-Parallel Algorithm: O(n)

14

O(n)

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark is Parallel!

15

Click to edit Master text styles

Click to edit Master text styles Performance

16

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Spark Beats Hadoop @ 100 TB GraySort

17

  On-disk only   28,000 partitions   No in-memory caching

(2014) (2013) (2014)

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Improved Shuffle and Network Layer   “Sort-based shuffle”

  Minimize OS resources

  Switched to async Netty

  Keep CPUs hot

  Reuse byte buffers to minimize GC

  Use epoll for I/O to stay in kernel space 18

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Project Tungsten: CPU and Memory   More JVM bytecode generation, JIT optimize

  CPU-cache-aware data structs and algos -->

  Custom memory management Serializers Performance New HashMap

19

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

DataFrames and Catalyst Optimizer

20

20

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Please Use DataFrames!

--> -->

JVM bytecode generation

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Columnar Storage Format

21

Skip whole chunks with min-max heuristicsstored in each chunk (sorted data only)

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Parquet File Format  Based on Google Dremel

 Implemented by Twitter and Cloudera

 Columnar storage format

 Optimized for fast columnar aggregations

 Tight compression

 Supports pushdowns

 Nested, self-describing, evolving schema 22

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Compression   Run Length Encoding: Repeated data   Dictionary Encoding: Fixed set of values

  Delta, Prefix Encoding: Sorted data

23

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Query Optimizations   Column, Partition Pruning   Row, Predicate Pushdown

SELECT b FROM table WHERE a in [a2,a3]

24

Click to edit Master text styles

Click to edit Master text styles Streaming

25

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Direct Kafka Streaming – KafkaRDD   No single Receiver, no Write Ahead Log (WAL)   Workers pull from Kafka in parallel   Each KafkaRDD partition stores relevant offsets   Upon Worker Node failure, rebuild from offsets   Optimizes happy path by avoiding the WAL

26

At least once delivery guarantee <--

Click to edit Master text styles

Click to edit Master text styles Approximations

27

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Count Min Sketch   Approximate counters

  Better than HashMap

  Low, fixed memory   Known error bounds   Large num of counters   From Twitter’s Algebird   Streaming example in Spark codebase

28

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

HyperLogLog   Approximate cardinality Approx count distinct !  From Twitter’s Algebird!

  Low memory 1.5KB @ 2% error, 10^9 elements !

  Streaming example in Spark codebase

  RDD: countApproxDistinctByKey() 29

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Monte Carlo Simulations From Manhattan Project (A-bomb) Simulate movement of neutrons Law of Large Numbers (LLN) Average of results of many trials Converge on expected value SparkPi example in Spark codebase Pi ~ (# red dots / # total dots * 4)

30

Click to edit Master text styles

Click to edit Master text styles Recommendations

31

Click to edit Master text styles

Click to edit Master text styles Interactive Demo!

32

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Audience Participation Needed!

33

  Navigate to sparkafterdark.com

  Click 3 actresses and 3 actors

-> You are here

->

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Recommendations Non-personalized Cold Start No preference or behavior data for user, yet Personalized User-Item Similarity Items that others with similar prefs have liked

Item-Item Similarity Items similar to your previously-liked items

34

Click to edit Master text styles

Click to edit Master text styles Non-Personalized Recommendations

35

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Summary Statistics and Aggregations   Top Users by Like Count

“I might like users with the highest sum aggregation of likes overall.”

SparkSQL + DataFrame = Aggregations

36

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Graph Analytics   Top Influencers by Like Graph

“I might like users who have the highest probability of me liking them randomly while walking the like graph.”

GraphX: PageRank

37

Click to edit Master text styles

Click to edit Master text styles Demo!

Spark SQL/DataFrames + GraphX/PageRank

38

Click to edit Master text styles

Click to edit Master text styles Similarities

39

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Types of Similarity Euclidean: linear measure Magnitude bias Cosine: angle measure Adjust for magnitude bias Jaccard: (intersection / union) Popularity bias Log Likelihood Adjust for popularity bias 40

Ali Matei Reynold Patrick AndyKimberly 1 1 1 1Leslie 1 1!Meredith 1 1 1Lisa 1 1 1Holden 1 1 1 1 1

z!

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

All-Pairs Similarity Comparison Compare everything to everything aka. “pair-wise similarity” or “similarity join” Naïve shuffle: O(m*n^2); m=rows, n=cols Minimize shuffle through approximations! Reduce m (rows) Sampling and bucketing Reduce n (cols) Remove most frequent value (ie.0) Principle Component Analysis

41

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce m: DIMSUM Sampling “Dimension Independent Matrix Square Using MR” Remove rows with low similarity probability MLlib: RowMatrix.columnSimilarities(…) Twitter: 40% efficiency gain over Cosine Similarity 42

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce m: LSH Bucketing “Locality Sensitive Hashing” Split m into b buckets Use similarity hash algorithm Requires pre-processing of data Compare bucket contents in parallel Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets

ie. 500k x 500k matrix O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash 43

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Reduce n: Remove Most Frequent Value Eliminate most-frequent value Represent other values with (index,value) pairs Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n Note: Choose most frequent value (may not be 0)

44

(index,value)

(index,value)

Click to edit Master text styles

Click to edit Master text styles Personalized Recommendations

45

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Recommendation Terminology User User seeking recommendations Item

Item that has been liked or rated Feedback Explicit: like, rating Implicit: search, click, hover, view, scroll Feature Engineering

Dimension reduction

46

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Collaborative Filtering Personalized Recs   Like behavior of similar users

“I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity

47

Click to edit Master text styles

Click to edit Master text styles Demo!

Spark SQL/DataFrames + MLlib/Alternating Least Squares

48

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Text-based Personalized Recs (1/3)   Similar profiles to me“Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity

49

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Text Based Personalized Recs (2/3)

50

 Similar profiles from my past likes“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Text-based Personalized Recs (3/3)   Relevant, High-Value Emails “Your initial email has similar named entities to my profile.

I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition

51

^ Her Email < My Profile

Click to edit Master text styles

Click to edit Master text styles The Future of Recommendations!

52

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Facial Recognition   Eigenfaces

“Your face looks similar to others that I’ve liked. I might like you.”

MLlib: RowMatrix, PCA, Item-Item Similarity

53 Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Natural Language Processing: Convo Bot   NLP and DecisionTrees

“If your responses to my trite opening lines are positive, I may read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis

54

Positive Negative

Click to edit Master text styles

Click to edit Master text styles

55

Maintaining the Spark!

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Recommendations for Couples   Pathways of Similarity

“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”

MLlib: RowMatrix, Item-Item Similarity GraphX: Nearest Neighbors, Shortest Path similar similar •  plots -> <- actors

56

Click to edit Master text styles

Click to edit Master text styles Final Recommendation!

57

Click to edit Master text styles

Click to edit Master text styles

spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

 Get Off the Computer & Meet People! Thank you!!

Chris Fregly @cfregly IBM Spark Tech Center San Francisco, CA, USA

Relevant Links advancedspark.com

Signup for the book and meetup! github.com/fluxcapacitor/pipeline

Clone all code used today! hub.docker.com/r/fluxcapacitor/pipeline

Run all demos presented today!

58

Image courtesy of http://www.duchess-france.org/

Click to edit Master text styles

Click to edit Master text styles

Power of data. Simplicity of design. Speed of innovation.

IBM Spark

Recommended