
Say What You Mean

Braxton McKee, CEO & Founder

Scaling up machine learning algorithms directly from source code

Q: Why should I have to rewrite my program as my dataset gets larger?

Example: Nearest Neighbors

def sq_distance(p1, p2):
    return sum((p1[i] - p2[i])**2 for i in range(len(p1)))

def index_of_nearest(p, points):
    return min((sq_distance(p, points[i]), i) for i in range(len(points)))[1]

def nearest_center(points, centers):
    return [index_of_nearest(p, centers) for p in points]
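
A quick sanity check with hand-picked data (these values are illustrative, not from the talk):

points = [[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]]
centers = [[1.0, 1.0], [8.0, 2.0]]

# Each point maps to the index of its nearest center.
print(nearest_center(points, centers))  # [0, 1, 1]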

Unfortunately, this is not fast.

Q: Why should I have to rewrite my program as my dataset gets larger?

A: You shouldn’t have to!

Pyfora

Automatically scalable Python

for large-scale machine learning and data science

100% Open Source

http://github.com/ufora/ufora

http://docs.pyfora.com/

Goals of Pyfora

• Provide identical semantics to regular Python
• Easily use hundreds of CPUs / GPUs and TBs of RAM
• Scale by analyzing source code, not by calling libraries

No more complex frameworks or APIs.


Approaches to Scaling

APIs and Frameworks
• Library of functions for specific patterns of parallelism
• Programmer (re)writes the program to fit the pattern.

Programming Language
• Semantics of the calculation are entirely defined by the source code
• Compiler and runtime are responsible for efficient execution.

Approaches to Scaling

APIs and Frameworks
• MPI
• Hadoop
• Spark

Programming Languages
• CUDA
• CILK
• SQL
• Python with Pyfora

API vs. Language

API
Pros:
• More control over performance
• Easy to integrate lots of different systems.
Cons:
• More code
• Program meaning obscured by implementation details
• Hard to debug when something goes wrong

Language
Pros:
• Simpler code
• Much more expressive
• Programs are easier to understand.
• Cleaner failure modes
• Much deeper optimizations are possible.
Cons:
• Very hard to implement

With a strong implementation, “language approach” should win

• Any pattern that can be implemented in an API can be recognized in a language.

• Language-based systems have the entire source code, so they have more to work with than API based systems.

• Can measure behavior at runtime and use this to optimize.

Example: Nearest Neighbors

def sq_distance(p1, p2):
    return sum((p1[i] - p2[i])**2 for i in range(len(p1)))

def index_of_nearest(p, points):
    return min((sq_distance(p, points[i]), i) for i in range(len(points)))[1]

def nearest_center(points, centers):
    return [index_of_nearest(p, centers) for p in points]

How can we make this fast?

• JIT compile to make single-threaded code fast

• Parallelize to use multiple CPUs

• Distribute data to use multiple machines
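
For contrast, here is roughly what the “parallelize” step looks like when you hand-roll it with Python’s multiprocessing module (a sketch; the worker count and chunking are exactly the tuning decisions Pyfora aims to make for you):

from functools import partial
from multiprocessing import Pool

def nearest_center_parallel(points, centers, processes=4):
    # Manually chosen worker count; Pool also picks a chunk size,
    # which may or may not match the shape of the data.
    with Pool(processes) as pool:
        return pool.map(partial(index_of_nearest, points=centers), points)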

Why is this tricky?

Optimal behavior depends on the sizes and shapes of data.


If both sets are small, don’t bother to distribute.

Why is this tricky?


If “points” is tall and thin, it’s natural to split it across many machines and replicate “centers”.

Why is this tricky?


If “points” and “centers” are really wide (say, they’re images), it would be better to split them horizontally, compute distances between all pairs in slices, and merge them.

Why is this tricky?

You will end up writing totally different code for each of these different situations.

The source code contains the necessary structure.

The key is to defer decisions to runtime, when the system can actually see how big the datasets are.
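
Schematically, the runtime choice might look like the following (the thresholds are invented for illustration; Pyfora makes this decision internally from measured sizes):

def choose_strategy(n_points, n_centers, dim, max_local=100_000, wide=1_000):
    # Pick a data layout for nearest-neighbors based on the shapes involved.
    if (n_points + n_centers) * dim < max_local:
        return "local"         # both sets are small: don't distribute
    if dim < wide:
        return "split_points"  # tall and thin: shard points, replicate centers
    return "split_columns"     # very wide rows: slice columns, merge partial sums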

Getting it right is valuable

• Much less work for the programmer
• Code is actually readable
• Code becomes more reusable.
• Use the language the way it was intended:

For instance, in Python, the “row” objects can be anything that looks like a list.
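
For example, sq_distance above only needs indexing and len(), so any list-like row works, and representations can even be mixed (a small illustration):

import array

row_a = (1.0, 2.0, 3.0)                    # a tuple
row_b = array.array('d', [4.0, 6.0, 3.0])  # a typed array
print(sq_distance(row_a, row_b))           # 25.0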

What are some other common implementation problems we can solve this way?

Problem: Wrong-sized chunking

• API-based frameworks require you to explicitly partition your data into chunks.
• If you are running a complex task, the runtime may be really long for a small subset of chunks. You’ll end up waiting a long time for that last mapper.
• If your tasks allocate memory, you can run out of RAM and crash.

Solution: Dynamically rebalance

[Diagram: a long-running task on CORE #1 is repeatedly split so that CORE #2, #3, and #4 can take pieces — adaptive parallelism]

Solution: Dynamically rebalance

• This requires you to be able to interrupt running tasks as they’re executing.
• Adding support for this to an API makes it much more complicated to use.
• This is much easier to do with compiler support (a toy sketch of the splitting idea follows).
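
A toy picture of the idea (not Pyfora’s actual machinery): represent a task as an index range that peels off its second half into a shared queue while it is still large, so idle cores can steal work.

def run_adaptive(lo, hi, work, queue, grain=1000):
    # While this task is large, requeue its second half for another core,
    # then process the remaining slice locally.
    while hi - lo > grain:
        mid = (lo + hi) // 2
        queue.append((mid, hi))
        hi = mid
    for i in range(lo, hi):
        work(i)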

Problem: Nested parallelism

Example:
• You have an iterative model
• There is lots of parallelism in each iteration
• But you also want to search over many hyperparameters

With API-based approaches, you have to manage this yourself, either by constructing a graph of subtasks, or figuring out how to flatten your workload into something that can be map-reduced.

def fit_model(learning_rate, model, params):
    while not model.finished(params):
        params = model.update_params(learning_rate, params)
    return params

# Both list comprehensions below are sources of parallelism.
fits = [[fit_model(rate, model, params) for rate in learning_rates]
        for model in models]
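
For contrast, the hand-flattened version that an API-based approach pushes you toward might look like this (a sketch; it assumes fit_model, the models, and params can be pickled for the worker processes):

from concurrent.futures import ProcessPoolExecutor

def _fit_one(args):
    model, rate, params = args
    return fit_model(rate, model, params)

def fit_all(models, learning_rates, params):
    # Flatten both loops into one task list, run it on a pool,
    # then reshape the flat results back into the nested structure.
    tasks = [(m, r, params) for m in models for r in learning_rates]
    with ProcessPoolExecutor() as pool:
        flat = list(pool.map(_fit_one, tasks))
    n = len(learning_rates)
    return [flat[i * n:(i + 1) * n] for i in range(len(models))]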

Solution: infer parallelism from source

Problem: Common data is too big

Example:
• You have a bunch of datasets (say, for a bunch of products, the customers who bought that product)
• You want to compute something on all pairs of sets (say, some average on common customers for both)
• The whole set-of-sets is too big for memory

[[some_function(s1,s2) for s1 in sets] for s2 in sets]

Problem: Common data is too big

This creates problems because:
• If you just do map-reduce on the outer loop, you still need to get to the data for all the other sets.
• If you try to actually produce all pairs of sets, you’ll end up with something many, many times larger than the original dataset.

[[some_function(s1,s2) for s1 in sets] for s2 in sets]

Solution: infer cache locality

• Think of each call to “f” as a separate task.
• Break tasks into smaller tasks until each one’s active working set is a reasonable size.
• Schedule tasks that use the same data on the same machine to minimize data movement.

[[some_function(s1,s2) for s1 in sets] for s2 in sets]
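
One way to picture the scheduling (a simplified sketch, not Pyfora’s algorithm): tile the grid of pair tasks so each tile touches only a handful of sets, then keep each tile on one machine.

def tiles(n_sets, block=3):
    # Each block x block tile covers block**2 tasks but touches at most
    # 2 * block distinct sets, so one machine can run a whole tile locally.
    for bi in range(0, n_sets, block):
        for bj in range(0, n_sets, block):
            yield [(i, j)
                   for i in range(bi, min(bi + block, n_sets))
                   for j in range(bj, min(bj + block, n_sets))]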

Solution: infer cache locality

[Diagram: the full grid of pairwise tasks, f(s0, s0) through f(s8, s6), with blocks of neighboring cells grouped so that tasks sharing the same sets run on the same machine]

So how does Pyfora work?

• Operate on a subset of Python that restricts mutability.
• Built a JIT compiler that can “pop” code back into the interpreter.
• Can move sets of stackframes from one machine to another.
• Can rewrite selected stackframes to use futures if there is parallelism to exploit.
• Carefully track what data a thread is using.
• Dynamically schedule threads and data on machines to optimize for cache locality.

import pyfora

executor = pyfora.connect("http://...")
data = executor.importS3Dataset("myBucket", "myData.csv")

def calibrate(dataframe, params):
    # some complex model with loops and parallelism
    ...

with executor.remotely:
    dataframe = parse_csv(data)
    models = [calibrate(dataframe, p) for p in params]

print(models.toLocal().result())

What are we working on?

• More libraries!
• Better predictions on how long functions will take and what data they consume. This helps to make better scheduling decisions.
• Compiler optimizations (immutable Python is a rich source of these)
• Automatic compilation and scheduling of data and compute on GPU

Thanks!

• Check out the repo: github.com/ufora/ufora

• Follow me on Twitter and Medium: @braxtonmckee

• Subscribe to “This Week in Data” (see top of ufora.com)

• Email me: [email protected]