NLP on a Billion Documents: Scalable Machine Learning with Apache Spark


Who am I

User of Spark since 2012

Organiser of the London Spark Meetup

Run the Data Science team at Skimlinks

Apache Spark


The RDD


RDD.map

>>> thisrdd = sc.parallelize(range(12), 4)

>>> thisrdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

>>> otherrdd = thisrdd.map(lambda x:x%3)

>>> otherrdd.collect()

[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]


RDD.reduceByKey

>>> otherrdd.zip(thisrdd).collect()

[(0, 0), (1, 1), (2, 2), (0, 3), (1, 4), (2, 5), (0, 6), (1, 7), (2, 8), (0, 9), (1, 10), (2, 11)]

>>> otherrdd.zip(thisrdd).reduceByKey(lambda x,y: x+y).collect()

[(0, 18), (1, 22), (2, 26)]

How to not crash your spark job

Set the number of reducers sensibly (see the sketch after this list)

Configure your pyspark cluster properly

Don't shuffle (unless you have to)

Don't groupBy

Repartition your data if necessary
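A minimal sketch of the first and last tips (the RDD name and partition count here are illustrative, not from the deck): reduceByKey accepts an explicit numPartitions, and repartition rebalances a skewed RDD before shuffle-heavy work.

# assume 'pairs' is a large (key, value) RDD; 200 is an illustrative reducer count
counts = pairs.reduceByKey(lambda x, y: x + y, numPartitions=200)

# rebalance an unevenly partitioned RDD before further shuffles
pairs = pairs.repartition(200)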

Lots of people will say 'use scala'


Don't listen to those people.


Naive Bayes - recap
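The content of the recap slide did not survive extraction; as a plain reminder, the multinomial Naive Bayes rule that the following slides implement (the +1 pseudocounts added later are the Laplace smoothing term) is:

P(c \mid d) \propto P(c) \prod_{t \in d} P(t \mid c),
\qquad
\hat{P}(t \mid c) = \frac{\mathrm{count}(t, c) + 1}{\sum_{t'} \left( \mathrm{count}(t', c) + 1 \right)}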

from operator import add, truediv  # elementwise counting and division below

# get (class label, word) tuples
label_token = gettokens(docs)
# [(False, u'https'), (True, u'fashionblog'), (True, u'dress'), (False, u'com'),...]

tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
# [(u'https', [0, 1]), (u'fashionblog', [1, 0]), (u'dress', [1, 0]), (u'com', [0, 1]),...]

# get the word count for each class
termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]),...]
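gettokens itself isn't shown in the deck; a hypothetical version, assuming docs is an RDD of (label, text) pairs, could be as simple as:

def gettokens(docs):
    # hypothetical tokenizer: one (label, token) pair per word occurrence
    return docs.flatMap(lambda lt: [(lt[0], token) for token in lt[1].lower().split()])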


Naive Bayes in Spark

# add pseudocounts (Laplace smoothing)
termcounts_plus_pseudo = termcounts.map(lambda (term, counts): (term, map(add, counts, (1, 1))))
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]),...]
# => [(u'https', [101, 113]), (u'fashionblog', [1, 101]), (u'dress', [6, 16]),...]

# get the total number of words in each class
values = termcounts_plus_pseudo.map(lambda (term, (truecounts, falsecounts)): (truecounts, falsecounts))
totals = values.reduce(lambda x, y: map(add, x, y))
# [1321, 2345]

# per-class term probabilities
P_t = termcounts_plus_pseudo.map(lambda (term, counts): (term, map(truediv, counts, totals)))
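The deck stops at the per-class term probabilities. A hedged sketch of how they might be used to score a new document (assumptions: P_t is small enough to collect to the driver, and priors holds the two class priors estimated elsewhere):

import math

P_t_local = dict(P_t.collect())   # {term: [P(term|True), P(term|False)]}

def classify(tokens, priors):
    # log-space Naive Bayes score for each class
    scores = [math.log(p) for p in priors]
    for token in tokens:
        if token in P_t_local:
            scores = [s + math.log(p) for s, p in zip(scores, P_t_local[token])]
    return scores[0] > scores[1]   # True if the True class wins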


Naive Bayes in Spark

reduceByKey(combineByKey)

[Diagram: reduceByKey is implemented with combineByKey. Each partition first folds its (k, v) pairs into a local map (combineLocally); the per-partition maps are then shuffled and merged (_mergeCombiners) into the final {key: combined value} result. reduceByKey(numPartitions) sets the number of partitions used for that merge.]

RDD.aggregate(zeroValue, seqOp, combOp)

Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value".
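A tiny self-contained example of the seqOp/combOp contract (not from the deck): computing a mean, where the zero value is a (sum, count) pair.

rdd = sc.parallelize(range(12), 4)

# seqOp folds one element into a partition's (sum, count) accumulator;
# combOp merges the per-partition accumulators
total, n = rdd.aggregate((0, 0),
                         lambda acc, x: (acc[0] + x, acc[1] + 1),
                         lambda a, b: (a[0] + b[0], a[1] + b[1]))
mean = float(total) / n   # 5.5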


Naive Bayes in Spark

from operator import add  # elementwise tuple addition via map(add, ...)

class WordFrequencyAgreggator(object):

    def __init__(self):
        self.S = {}

    def add(self, (token, count)):
        # fold a single (token, per-class count) pair into the running totals
        if token not in self.S:
            self.S[token] = (0, 0)
        self.S[token] = map(add, self.S[token], count)
        return self

    def merge(self, other):
        # merge another aggregator (from a different partition) into this one
        for term, count in other.S.iteritems():
            if term not in self.S:
                self.S[term] = (0, 0)
            self.S[term] = map(add, self.S[term], count)
        return self


Naive Bayes in Spark: Aggregation

With reduce:

termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]),...]
# => [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]),...]

With aggregate:

aggregates = tokencounter.aggregate(WordFrequencyAgreggator(), lambda x, y: x.add(y), lambda x, y: x.merge(y))

RDD.aggregate(zeroValue, seqOp, combOp)
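Because aggregate is an action, the merged aggregator comes back to the driver, and its S attribute is an ordinary Python dict of per-class counts (the values below are just the running example):

aggregates.S[u'dress']    # e.g. [5, 15]
len(aggregates.S)         # vocabulary size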


Naive Bayes in Spark: treeAggregation

RDD.treeAggregate(zeroValue, seqOp, combOp, depth=2)

Aggregates the elements of this RDD in a multi-level tree pattern.

With reduce:

termcounts = tokencounter.reduceByKey(lambda x, y: map(add, x, y))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]),...]
# ===>
# [(u'https', [100, 112]), (u'fashionblog', [0, 100]), (u'dress', [5, 15]), (u'com', [95, 100]),...]

With treeAggregate:

aggregates = tokencounter.treeAggregate(WordFrequencyAgreggator(), lambda x, y: x.add(y), lambda x, y: x.merge(y), depth=4)


Naive Bayes in Spark: treeAggregate

On 1B short documents:

RDD.reduceByKey: 18 min

RDD.treeAggregate: 10 min

https://gist.github.com/martingoodson/aad5d06e81f23930127b


treeAggregate performance


Word2Vec


Training Word2Vec in Spark

from pyspark.mllib.feature import Word2Vec

inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))

word2vec = Word2Vec()

model = word2vec.fit(inp)
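Once fitted, the MLlib model can be queried directly (a brief usage sketch; the query word is just an example):

# nearest neighbours of a word in the learned embedding space
synonyms = model.findSynonyms('week', 5)    # [(word, cosine similarity), ...]

# the learned vector for a single word
vec = model.transform('week')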

How to use word2vec vectors for classification problems

Averaging (sketched below)

Clustering

Convolutional Neural Network
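A hedged sketch of the averaging option (everything here is assumed, not from the deck: docs is an RDD of token lists and word_vectors_b a broadcast dict of word -> numpy vector, e.g. built from the fitted word2vec model):

import numpy as np

def doc_vector(tokens, word_vectors, dim=300):
    # average the vectors of the tokens we have an embedding for
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

doc_vectors = docs.map(lambda tokens: doc_vector(tokens, word_vectors_b.value))
# doc_vectors can now be fed to any MLlib classifier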

K-Means in Spark

from numpy import array
from pyspark.mllib.clustering import KMeans, KMeansModel

word = sc.textFile('GoogleNews-vectors-negative300.txt')

vectors = word.map(lambda line: array([float(x) for x in line.split('\t')[1:]]))

clusters = KMeans.train(vectors, 50000, maxIterations=10, runs=10, initializationMode="random")

clusters_b = sc.broadcast(clusters)

labels = vectors.map(lambda x: clusters_b.value.predict(x))
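For the clustering option, one hedged way to turn the 50,000 word clusters into document features (assumptions: word_to_cluster_b is a broadcast dict mapping each vocabulary word to its cluster id, and docs is an RDD of token lists):

from collections import Counter

def cluster_counts(tokens, word_to_cluster):
    # sparse bag-of-clusters representation: {cluster_id: count}
    return Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)

features = docs.map(lambda tokens: cluster_counts(tokens, word_to_cluster_b.value))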


Semi-Supervised Naive Bayes

● Build an initial naive Bayes classifier, ŵ, from the labeled documents, X, only
● Loop while the classifier parameters improve:
  ○ (E-step) Use the current classifier, ŵ, to estimate the component membership of each unlabeled document, i.e. the probability that each class generated each document
  ○ (M-step) Re-estimate the classifier, ŵ, given the estimated class membership of each document

Kamal Nigam, Andrew McCallum and Tom Mitchell. Semi-supervised Text Classification Using EM. In Chapelle, O., Zien, A., and Scholkopf, B. (Eds.), Semi-Supervised Learning. MIT Press: Boston. 2006.

Instead of labels:

tokencounter = label_token.map(lambda (label, token): (token, (label, not label)))
# [(u'https', [0, 1]), (u'fashionblog', [0, 1]), (u'dress', [0, 1]), (u'com', [0, 1]),...]

use probabilities:

# [(u'https', [0.1, 0.3]), (u'fashionblog', [0.01, 0.11]), (u'dress', [0.02, 0.02]), (u'com', [0.13, 0.05]),...]
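A hedged skeleton of that loop, reusing the pieces above (train_naive_bayes and predict_probabilities are hypothetical helpers standing in for the term-counting and scoring code already shown; labelled_docs and unlabelled_docs are assumed RDDs):

# hypothetical helpers (not in the deck):
#   train_naive_bayes(rdd of (class_probabilities, tokens)) -> a driver-side model:
#       per-class term probabilities plus class priors, small enough to broadcast
#   predict_probabilities(model, tokens) -> [p_class0, p_class1]
hard_labels = labelled_docs.map(lambda ld: ([1.0, 0.0] if ld[0] else [0.0, 1.0], ld[1]))
model = train_naive_bayes(hard_labels)           # initial classifier from labels only

for iteration in range(10):
    model_b = sc.broadcast(model)
    # E-step: soft class membership for each unlabelled document
    soft_labels = unlabelled_docs.map(
        lambda tokens: (predict_probabilities(model_b.value, tokens), tokens))
    # M-step: re-estimate the classifier from labelled + soft-labelled documents
    model = train_naive_bayes(hard_labels.union(soft_labels))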


Naive Bayes in Spark: EM

500K labelled examples: Precision 0.27, Recall 0.15, F1 0.099

Add 10M unlabelled examples, 10 EM iterations: Precision 0.26, Recall 0.31, F1 0.14


Naive Bayes in Spark: EM

240M training examples: Precision 0.31, Recall 0.19, F1 0.12

Add 250M unlabelled examples, 10 EM iterations: Precision 0.26, Recall 0.22, F1 0.12


PySpark Memory: worked example


PySpark Configuration: Worked Example

10 x r3.4xlarge (122GB, 16 cores each)

Give half of each node's memory to the executor: 60GB

Use 12 cores per node: 120 cores in total

OS and overhead: ~12GB per node

Each Python worker: ~4GB, so 12 x 4GB = 48GB per node

Cache: 60% x 60GB x 10 executors = 360GB

Each Java task thread: 40% x 60GB / 12 = ~2GB

more here: http://files.meetup.com/13722842/Spark%20Meetup.pdf
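One hedged way these numbers might translate into Spark 1.x-era settings (the property names are the standard ones; the values just follow the arithmetic above):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.instances", "10")
        .set("spark.executor.cores", "12")
        .set("spark.executor.memory", "60g")
        .set("spark.storage.memoryFraction", "0.6")   # cache = 60% of the executor heap
        .set("spark.python.worker.memory", "4g"))     # spill threshold per Python worker
sc = SparkContext(conf=conf)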

We are hiring! martin@skimlinks.com

@martingoodson
