
Corpus methods in linguistics and NLP

Lecture 7: Programming for large-scale data processing

UNIVERSITY OF

GOTHENBURG

Richard Johansson

December 1, 2015


today's lecture

- as you've seen, processing large corpora can take time!
  - for instance, building the frequency tables in the word sketch assignment
- in this lecture, we'll think of how we can process large volumes of data by parallelizing our programs
- some basic ideas, some techniques, and pointers to software
- we'll just dip our toes, but there will be pointers for further reading


overview

basics of parallel programming

parallel programming in Python

architectures for large-scale parallel processing: MapReduce, Hadoop, Spark


speeding up by parallelizing

- can we buy a machine that runs our code 10 times faster?
  - I have a 2 GHz CPU: can I get a 20 GHz CPU instead?
- it's probably easier to buy 10 machines, or a machine with 10 CPUs, and then try to make the program parallel


Moore's law

- Moore's law was formulated by Gordon Moore, later a co-founder of Intel, in 1965 and revised in 1975
- "overall processing power for computers doubles every 2 years" (the original observation concerns the number of transistors on a chip)
- until about 2000, this used to mean that processors got faster
- Moore's law still holds, but its effect nowadays is increased parallelization
  - increased number of CPUs in computers
  - and each CPU can run more than one process at a time


Moore's law (Wikipedia)


computer clusters

- computations may be distributed over large collections of machines: clusters
- for instance, Sweden has the SNIC infrastructure that connects clusters at different universities

http://allabttech.com/2014/09/08/worlds-largest-data-centers

picture borrowed from Peter Exner's slides



parallelizing an algorithm

- making an algorithm work in a parallel fashion may involve significant changes
- a parallel algorithm is efficient if

  $$T_{\text{parallel}} \approx \frac{T_{\text{sequential}}}{\text{number of processors}}$$

- for instance, if I can compute a frequency table in a corpus 10 times as fast by using 10 machines
- often, this isn't exactly the case because there is some "administrative" overhead when we parallelize
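The efficiency criterion can be checked numerically once we have measured both running times. A minimal sketch: the function names `speedup` and `efficiency` and the timings are made up for illustration.

```python
def speedup(t_sequential, t_parallel):
    """How many times faster the parallel run is."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_processors):
    """1.0 is the ideal case T_parallel = T_sequential / n_processors."""
    return speedup(t_sequential, t_parallel) / n_processors


# hypothetical timings in seconds: 100 s sequentially, 12.5 s on 10 machines
print(speedup(100.0, 12.5))         # prints 8.0
print(efficiency(100.0, 12.5, 10))  # prints 0.8
```

An efficiency below 1.0, as here, is what the "administrative" overhead looks like in practice.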


processing "embarrassingly parallel" tasks

- an embarrassingly parallel (or trivially parallel) job
  - can be split into separate pieces with little or no effort,
  - the pieces can be processed independently: when processing one piece, we don't need to care about what other pieces contain
  - and it's easy to collect the result in the end
- how can I process such a task if I have 10 machines (or CPUs)?
  - split the data into 10 pieces (of roughly equal size)
  - assign a piece to each machine
  - run the 10 machines in parallel
  - concatenate the 10 results
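The four steps (split, assign, run, concatenate) can be sketched in plain Python. This is only an illustration: `split_into_pieces` and `process_piece` are hypothetical names, the data is made up, and the pieces are processed one after another here rather than on separate machines.

```python
def split_into_pieces(items, n_pieces):
    """Split a list into n_pieces consecutive pieces of roughly equal size."""
    size, remainder = divmod(len(items), n_pieces)
    pieces = []
    start = 0
    for i in range(n_pieces):
        # the first `remainder` pieces get one extra item
        end = start + size + (1 if i < remainder else 0)
        pieces.append(items[start:end])
        start = end
    return pieces


def process_piece(piece):
    """Hypothetical per-piece job: lowercase every line."""
    return [line.lower() for line in piece]


lines = ['Line number {0}'.format(i) for i in range(25)]
pieces = split_into_pieces(lines, 10)

# each call is independent of the others, so the ten calls could run
# on ten different machines; here they simply run one after another
results = [process_piece(piece) for piece in pieces]

# concatenate the partial results in the original order
combined = [line for result in results for line in result]
print(len(pieces), len(combined))  # prints 10 25
```

Because no piece needs to know what the others contain, the list comprehension over `pieces` is the only part that would have to change in order to run in parallel.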


what kinds of tasks are embarrassingly parallel?

- lowercasing the 1000 text files in a directory?
- building a frequency table of the words in a corpus?
- PoS tagging? parsing?
- machine learning tasks:
  - Naive Bayes training?
  - perceptron training?


embarrassing parallelization on Unix-like systems

- on Unix-like systems (e.g. Mac or Linux), commands such as split can be handy
- for instance, we split a file bigfile into 10 parts

    split -n 10 bigfile smallfile_

- then we will get smallfile_aa, smallfile_ab, etc
  - note that -n 10 cuts by size, so a line may be cut in half at a boundary; with GNU split, -n l/10 cuts only at line boundaries
- if we have more than one CPU on the machine, we can start multiple processes at once:

    python3 do_something.py smallfile_aa &
    python3 do_something.py smallfile_ab &
    ...

- if we have many machines, we may need to copy files; on a computer cluster with many machines, the file system is usually shared between the machines
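Starting one process per piece can also be done from Python instead of typing one shell command per file. A minimal sketch using the standard `subprocess` and `glob` modules; the helper name `run_on_all_pieces` is made up, and `do_something.py` and the `smallfile_` prefix are the names used in the shell example.

```python
import glob
import subprocess
import sys


def run_on_all_pieces(script, pattern):
    """Start one Python process per matching file, then wait for all of them."""
    filenames = sorted(glob.glob(pattern))
    # Popen returns immediately, so all processes run at the same time
    processes = [subprocess.Popen([sys.executable, script, name])
                 for name in filenames]
    # wait() blocks until a process has finished and returns its exit code
    return [process.wait() for process in processes]


if __name__ == '__main__':
    exit_codes = run_on_all_pieces('do_something.py', 'smallfile_*')
    print(exit_codes)
```

This starts as many processes as there are pieces, so it only makes sense when the number of pieces is close to the number of CPUs.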


when parallelization is not trivial

- typically, algorithms that work in an incremental fashion are hard to parallelize
  - when the result in the current step depends on what has happened before
- a good example is the perceptron learning algorithm
  - what we do in this step depends on all the errors we made before
- parallelized versions of the perceptron (and related algorithms such as SVM, LR) use mini-batches rather than single instances
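To see why mini-batches help, here is one possible mini-batch variant of the perceptron, written as a sketch rather than as the definitive algorithm. All instances in a batch are scored with the same fixed weights, so they no longer depend on each other; the function names and the toy data are made up for illustration.

```python
def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else -1


def batch_update(weights, batch):
    """Sum the perceptron corrections for one mini-batch.

    Every instance is scored with the same fixed weights, so the
    instances in a batch could be handled by different workers.
    """
    update = [0.0] * len(weights)
    for features, label in batch:
        if predict(weights, features) != label:
            for i, x in enumerate(features):
                update[i] += label * x
    return update


def train(data, n_features, batch_size, n_epochs):
    weights = [0.0] * n_features
    for _ in range(n_epochs):
        for start in range(0, len(data), batch_size):
            update = batch_update(weights, data[start:start + batch_size])
            # the weights change only between batches
            weights = [w + u for w, u in zip(weights, update)]
    return weights


# toy data: (features, label), where the first feature is a constant bias
data = [([1.0, 2.0, 1.0], 1), ([1.0, -1.5, 0.5], -1),
        ([1.0, 1.0, 2.0], 1), ([1.0, -2.0, -1.0], -1)]
weights = train(data, n_features=3, batch_size=2, n_epochs=10)
print([predict(weights, x) for x, _ in data])  # prints [1, -1, 1, -1]
```

With a batch size of 1 this is the ordinary perceptron again, where every step has to wait for the previous one.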


simple parallelization in Python

- in programming, we distinguish between two types of parallel activities:
  - threads are parallel activities that share memory (variables, data structures, etc)
  - processes run with separate memory, so they need to communicate over the network or through files
- in Python, using threads is in general less efficient than using separate processes for computation-heavy work: because of the global interpreter lock (see the link below), only one thread runs Python code at a time
- but threading can be useful for many other purposes, for instance to process events in a server application
- if you're interested, take a look at the threading library

https://wiki.python.org/moin/GlobalInterpreterLock
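A small sketch of what "threads share memory" means in practice, using the standard `threading` library. The example texts are made up; all three threads write into the same dictionary, and a lock keeps them from updating it at the same time.

```python
import threading

counts = {}              # shared by all threads
lock = threading.Lock()  # protects the shared dictionary


def count_words(text):
    for word in text.split():
        # without the lock, two threads could update the same entry at once
        with lock:
            counts[word] = counts.get(word, 0) + 1


texts = ['a b a', 'b c', 'a c c']
threads = [threading.Thread(target=count_words, args=(text,))
           for text in texts]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()        # wait until every thread has finished

print(sorted(counts.items()))  # prints [('a', 3), ('b', 2), ('c', 3)]
```

Separate processes could not share `counts` like this: each process would get its own copy, and the results would have to be sent back explicitly.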


the multiprocessing library

- the multiprocessing library (included in Python's standard library) contains some functions for managing processes:
  - creating a process
  - waiting for a process to end, or stopping it "violently"
  - communicating between processes
  - synchronization: making sure that processes don't mess things up for each other
  - managing a group of "slave" processes: the Pool


simple multiprocessing example

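    import multiprocessing as mp
    import random
    import time

    def do_something(job_nbr):
        # each worker says hello forever, sleeping up to 1 second in between
        while True:
            print('Process {0} says hello!'.format(job_nbr))
            time.sleep(random.random())

    if __name__ == '__main__':
        nbr_workers = 5
        for i in range(nbr_workers):
            # create one process per worker and start it
            worker = mp.Process(target=do_something, args=(i,))
            worker.start()
        # the workers never return: stop the program with Ctrl-C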


master-slave architecture and the Pool

- in many cases, we have a master process (the main program) that creates tasks for a number of slave processes that work in parallel to do the hard work
- with the Pool class from the multiprocessing library, we can simplify the management of slaves:
  - the master process submits tasks to the Pool, which distributes the tasks to the slaves
  - the slaves process the tasks in parallel
  - the master collects the results


Pool example

    import multiprocessing as mp

    ### THIS PART IS EXECUTED IN THE SLAVE PROCESS ###

    def compute_square(number):
        return number*number

    ### THIS PART IS EXECUTED IN THE MASTER PROCESS ###

    square_list = []

    def add_square(square):
        square_list.append(square)

    if __name__ == '__main__':
        pool = mp.Pool(processes=4)  # or mp.cpu_count()
        for i in range(10):
            # submit a job
            pool.apply_async(compute_square, args=[i],
                             callback=add_square)
        pool.close()  # tell the pool that we're done
        pool.join()   # wait for all jobs to finish
        # the squares arrive in the order the jobs finish,
        # so the list is not necessarily sorted
        print(square_list)


word counting example: not parallelized

- now we'll do something more useful: computing frequencies
- we'll start from this non-parallelized example:

    from collections import Counter

    filename = ... something ...

    freqs = Counter()
    with open(filename) as f:
        for l in f:
            freqs.update(l.split())

    print(freqs.most_common(5))

- now, let's divide this into master and slave


parallelized word counting example: slave part

    from collections import Counter
    import multiprocessing as mp

    def compute_frequencies(lines):
        # make a frequency table for these lines
        freqs = Counter()
        for l in lines:
            freqs.update(l.split())
        # send the frequency table back to the master
        return freqs


parallelized word count example: master part (1)

    # read_chunk, merge and total_result are defined in part (2)
    if __name__ == '__main__':
        filename = ... something ...

        pool = mp.Pool(processes=mp.cpu_count())

        with open(filename) as f:
            chunk = read_chunk(f, 100000)
            while chunk:
                # submit a job
                pool.apply_async(compute_frequencies,
                                 args=[chunk],
                                 callback=merge)
                chunk = read_chunk(f, 100000)

        pool.close()  # tell the pool we're done
        pool.join()   # wait for all jobs to finish

        print(total_result.most_common(5))


parallelized word count example: master part (2)

    # this is the callback function, called every time we
    # get a partial frequency table from a slave
    total_result = Counter()

    def merge(partial_result):
        total_result.update(partial_result)

    # helper function to read a number of lines that should
    # be sent to a slave
    def read_chunk(f, chunk_size):
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                break
        return chunk


word counting example: how much improvement?

[figure: running time in seconds (y-axis, 0 to 20) as a function of the number of processes (x-axis, 1 to 10)]


why not half the time with twice the number of processes?

- splitting:
  - reading the file, dividing into chunks
- communication overhead:
  - processes don't share memory, so they need to send and receive data
  - inputs (the chunks) and outputs (the partial tables) are pickled and unpickled
  - this becomes even more critical if the processes run on separate machines, because then the data is sent over a network
- assembling the end result:
  - for instance, merging the partial tables
- process administration:
  - starting processes, communicating inputs and outputs
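The pickling step can be made visible with the standard `pickle` module. A rough sketch with a made-up chunk of 1000 lines: it shows what has to be turned into bytes on the way to a worker and on the way back, not how multiprocessing does this internally.

```python
import pickle
from collections import Counter

# a made-up chunk of 1000 lines, standing in for real corpus text
chunk = ['line {0}: the cat sat on the mat'.format(i) for i in range(1000)]

# what the master sends: the pickled chunk
pickled_chunk = pickle.dumps(chunk)

# what a worker sends back: the pickled partial frequency table
partial = Counter()
for line in chunk:
    partial.update(line.split())
pickled_partial = pickle.dumps(partial)

# the receiving side has to unpickle the bytes again
received_chunk = pickle.loads(pickled_chunk)
received_partial = pickle.loads(pickled_partial)

# number of bytes that travel in each direction for this one chunk
print(len(pickled_chunk), len(pickled_partial))
```

Both conversions cost time on top of the actual counting, which is one reason why doubling the number of processes does not halve the running time.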


architectures for large-scale processing

- on a single machine, multiprocessing solutions such as Python's Pool can be useful, although a bit low-level
- we'll have a look at frameworks that can help us program for larger systems that may be distributed on many machines


picture borrowed from Peter Exner's slides



connections to functional programming

- some architectures for large-scale processing borrow a few concepts from functional programming
- FP has the following characteristics:
  - data structures are immutable (not modifiable): instead of modifying, they are transformed into new structures
  - many standard operations on data structures (transforming, collecting, filtering, etc) are implemented as higher-order functions: functions that take other functions as input
    - (in Python, list comprehension plays much of the same role)
  - uses small "on-the-fly" functions a lot: lambda in Python
- FP is attractive for this purpose because it separates the what from the how
  - we want to transform a list, but we don't want to worry about how its parts are distributed to different machines or in which order the parts are processed
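These characteristics can be illustrated in a few lines of ordinary Python. A small sketch with a made-up tuple of words: the original data is never modified, and the transformations are expressed with the built-in higher-order functions `map` and `filter`, or equivalently with comprehension-style expressions.

```python
words = ('The', 'quick', 'brown', 'fox')  # a tuple is immutable

# transforming: a new tuple is built, the original is left untouched
lowercased = tuple(map(str.lower, words))

# filtering with a higher-order function and an on-the-fly lambda
long_words = tuple(filter(lambda w: len(w) > 3, words))

# the same two operations written in comprehension style
lowercased_2 = tuple(w.lower() for w in words)
long_words_2 = tuple(w for w in words if len(w) > 3)

print(lowercased)  # prints ('the', 'quick', 'brown', 'fox')
print(long_words)  # prints ('quick', 'brown')
print(words)       # prints ('The', 'quick', 'brown', 'fox')
```

Nothing in these expressions says in which order the elements are visited, which is what leaves a framework free to distribute the work.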


a higher-order function in Python: map

- the function map applies a function to all elements in a collection

    def add1(x):
        return x + 1

    print(list(map(add1, [1, 2, 3, 4, 5])))
    # prints [2, 3, 4, 5, 6]

    print(list(map(lambda x: x + 1, [1, 2, 3, 4, 5])))
    # prints [2, 3, 4, 5, 6]

    print(list(map(len, ['a', 'few', 'strings'])))
    # prints [1, 3, 7]


another higher-order function: reduce

- the function reduce applies a function to "accumulate" the elements in a collection
- typical example: summing or multiplying all elements
- reduce lives in the functools library in Python

    from functools import reduce

    def add(x, y):
        return x + y

    print(reduce(add, [1, 2, 3, 4, 5]))
    # prints 15

    print(reduce(lambda x, y: x + y, [1, 2, 3, 4, 5]))
    # prints 15


contrived example using map and reduce

- sum the lengths of some words:

    from functools import reduce

    words = ['a', 'few', 'strings']

    print(reduce(lambda x, y: x + y, map(len, words)))
    # prints 11


less contrived example

- the parallelized word counting program we wrote before can be thought of as mapping and reducing:
  - map: for each "chunk", compute a partial frequency table
  - reduce: combine all partial tables into a complete table
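Written with Python's own map and reduce, the same idea looks as follows. This is a sequential sketch, without any parallelism, that reuses the compute_frequencies function from the word counting example; the chunks are made up for illustration.

```python
from collections import Counter
from functools import reduce


def compute_frequencies(lines):
    freqs = Counter()
    for l in lines:
        freqs.update(l.split())
    return freqs


# made-up chunks of lines, standing in for the chunks read from a file
chunks = [['a b a', 'c'], ['b b'], ['a c a']]

# map: one partial frequency table per chunk
partial_tables = map(compute_frequencies, chunks)

# reduce: add the partial tables together, two at a time
total = reduce(lambda x, y: x + y, partial_tables, Counter())

print(total.most_common(1))  # prints [('a', 4)]
```

In the parallel program, the Pool plays the role of map and the merge callback plays the role of reduce.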


MapReduce

- MapReduce [Dean and Ghemawat, 2004] is an architecture developed by Google that models large-scale computation tasks in terms of mapping and reducing
  - see also the paper linked below for a popular-science introduction
- the user defines the map and reduce tasks to be carried out
- MapReduce was designed to take care of many of the complexities in distributed processing:
  - large files can be distributed across several machines
  - to minimize network traffic, tasks are carried out locally as much as possible: a machine handles the "piece" it stores
  - sometimes computers break down, so the system may need to reprocess tasks that have disappeared

http://people.csail.mit.edu/matei/courses/2015/6.S897/readings/mapreduce-cacm.pdf
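The division of labour between the user and the system can be mimicked in a single Python process. This is a toy sketch, not the API of Google's MapReduce or of Hadoop: `map_task` and `reduce_task` stand for the two functions the user defines, and the hypothetical `run_mapreduce` stands for everything the system does around them (here without any distribution or fault tolerance).

```python
from collections import defaultdict


def map_task(line):
    """User-defined map: emit a (word, 1) pair for every token."""
    return [(word, 1) for word in line.split()]


def reduce_task(word, counts):
    """User-defined reduce: combine all values emitted for one key."""
    return (word, sum(counts))


def run_mapreduce(lines):
    # map phase: in a real system, each machine handles its local lines
    emitted = []
    for line in lines:
        emitted.extend(map_task(line))
    # shuffle phase: bring together all values that share a key
    grouped = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    # reduce phase: one reduce call per key
    return dict(reduce_task(key, values) for key, values in grouped.items())


print(run_mapreduce(['to be or not', 'to be']))
# prints {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The user code never mentions machines or files, which is what lets the system decide where each task runs and rerun it if a machine fails.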


Hadoop

- Hadoop is an open-source implementation of an architecture similar to Google's ideas
  - https://hadoop.apache.org/
- its central parts are
  - processing part: Hadoop MapReduce
  - file system: HDFS (Hadoop Distributed File System)
- ... but it also has many other components


Spark

- Spark [Zaharia et al., 2012] is a more recent framework that addresses some of the drawbacks of Hadoop
  - most importantly, it tries to keep data in memory, rather than in files, which can lead to significant speedups for some tasks
- Spark can be installed not only on a cluster but also on a single machine (standalone mode)
- see http://spark.apache.org/


word counting example in Spark

- the Spark engine is implemented in the Scala language
  - a fairly new functional programming language that runs on the Java virtual machine
- however, we can write Spark programs not only in Scala or Java but also other languages including Python
- here's a Python example from the Spark web page:

[code listing: word counting with textFile, flatMap, map and reduceByKey]


intuition of the word counting program


Spark's fundamental data structure: the RDD

- Spark works by processing RDDs: Resilient Distributed Datasets
  - Resilient: it recomputes data in case of loss
  - Distributed: may be spread out over different machines
- conceptually, an RDD is similar to a Python list (or more precisely, a generator)
- word counting example:
  1. RDD with lines
  2. RDD with tokens
  3. RDD with (token, 1) pairs
  4. RDD with (token, count) pairs


transformations of RDDs

- Spark includes many transformations of RDDs
  - many of the transformations are well-known higher-order functions in FP
  - not just map and reduce!
- check the overview here: http://spark.apache.org/docs/latest/programming-guide.html
- see a complete list of transformations here: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
- let's walk through the steps in the word counting program


step 1: reading a text file as lines

- spark.textFile reads a text file and returns an RDD containing the lines (textFile is a method of the SparkContext, here assumed to be stored in the variable spark)

    text_file = spark.textFile(NAME_OF_FILE)


step 2: flatMap; splitting the lines into tokens

- flatMap is a transformation that applies some function to all elements in an RDD
- ... and then flattens the result: the lists returned by the function are merged, so the new RDD contains their elements rather than the lists
- we use this to convert the lines into a new RDD with tokens

    step1 = text_file.flatMap(lambda line: line.split())
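The difference between mapping and flat-mapping can be tried out on ordinary Python lists, without Spark. In this sketch, `flat_map` is a hypothetical helper written for illustration, not a Spark or Python library function.

```python
from itertools import chain

lines = ['to be or', 'not to be']

# map keeps one result per line: a list of lists
mapped = list(map(lambda line: line.split(), lines))


def flat_map(function, items):
    """Apply function to every item and chain the resulting lists."""
    return list(chain.from_iterable(map(function, items)))


flattened = flat_map(lambda line: line.split(), lines)

print(mapped)     # prints [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flattened)  # prints ['to', 'be', 'or', 'not', 'to', 'be']
```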


step 3: map

- map is a transformation that applies some function to all elements in an RDD
- this is simpler than flatMap: no flattening involved
- in our case, we make a new RDD consisting of word-count pairs (but all counts are 1 so far)

    step2 = step1.map(lambda word: (word, 1))


step 4: reduceByKey

- reduceByKey is similar to the reduce that we explained before, but operates on key-value pairs
- and the aggregation operation is applied to the values, separately for each key
- in our case, we sum all the 1s for each word separately

    step3 = step2.reduceByKey(lambda a, b: a+b)
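What reduceByKey computes can be imitated on a plain Python list of pairs. A sequential sketch: `reduce_by_key` is a hypothetical helper that mirrors the idea, not Spark's implementation, and unlike Spark it returns the keys in a fixed order (first appearance).

```python
def reduce_by_key(function, pairs):
    """Combine the values of each key, two at a time, with function."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = function(result[key], value)
        else:
            result[key] = value
    return list(result.items())


# stand-in for the (word, 1) pairs produced in step 3
pairs = [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]

print(reduce_by_key(lambda a, b: a + b, pairs))
# prints [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```

Because the function only ever combines two values for the same key, partial sums can be computed on different machines and combined afterwards.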


the final VG assignment: using Spark

- do a few small word counting exercises using Spark
- we have installed Spark on the lab machines
  - ... or you may install it on your own
- we don't have a real cluster: you'll have to make believe!


references

- Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI'04.
- Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI.
