Map Reduce

Remainder of the Course

1. Why bother with HPC
2. What is MPI
3. Point-to-point communication
4. User-defined datatypes / writing parallel code / how to use a supercomputer
5. Collective communication
6. Communicators
7. Process topologies
8. File I/O
9. Hadoop/Spark (today)
10. Classic parallelism
11. The largest computations in history / general interest / exam prep

Lecture Outline

• MPI Recap (brief)

• Introduction to MapReduce

• Introduction to Hadoop/Spark

• Introduction to Assignment Two

Page 4: Introduction to MPI · •MPI is a message passing framework for high-performance computing •When you want the very most from the best hardware MPI has been the model of choice

MPI Recap

• MPI is a message passing framework for high-performance computing

• When you want the very most from the best hardware, MPI has been the model of choice for decades

• You should have a grip on
  • Sending and receiving
  • Blocking vs. non-blocking communication
  • MPI datatypes and derived datatypes
  • Collective communication
    • Collective data transfer
    • Collective computation (reduction)
  • Process topologies (Cartesian)
  • MPI file I/O

These components plus practice will give you access to the majority of the performance available.
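As a quick refresher, here is a minimal blocking point-to-point exchange in C (a sketch only; the payload value and tag are arbitrary, and it assumes at least two ranks, e.g. mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;
        /* Blocking send: returns once the buffer is safe to reuse. */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        /* Blocking receive: returns once the message has arrived. */
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}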

Some Problems

• Clusters are expensive
  • Exotic machines
  • Exotic interconnects
  • People to build them
  • People to fix them
  • Power, etc.

• You need a very good reason to use one, and an extremely good reason to own one

An Observation

• An increasing portion of problems are bottlenecked by data requirements

• Perhaps using lots and lots of inexpensive hardware is a good solution
  • More machines → more problems

A Motivating Example

• Processing web data
  • 20+ billion pages at ~10s of KB each = 400+ TB
  • One machine can read ~100 MB/sec → roughly four months to read
  • Hundreds of hard drives just to store it
  • What if we wanted to do something with that data?

• How about 1,000 machines?
  • A few hours
  • But programming this beast?
    • Communication, machine failure, reporting, debugging, data locality
  • What if we wanted to do something similar later?

Enter: MapReduce

• A different programming model for processing and generating enormous datasets

• Utilises a functional programming style for automatic parallelization

• Is Turing complete (you can achieve any computation you need)
  • Just because you can, doesn’t mean you should

• Designed to solve this very problem
  • Invented by Google (proprietary implementation)
  • Open-source Hadoop (main contributors: Yahoo and Facebook)

• Google's paper

Aside: What should you take away

• What the MapReduce programming model is

• What Hadoop is

• How this is different to MPI

• Have a few example problems to write map/reduce functions for

Enter: MapReduce

• ‘Simple’ concept: users specify two functions
  • Map → takes a <key, value> to another <key2, value2>
  • Reduce → aggregates over key2

• Elegant, but surprisingly expressive given the observation that
  • Most data can be mapped to <key, value> pairs somehow

• Your keys and values can be of any type
  • Strings, integers, <key, value> pairs themselves, etc.

MapReduce: Typical Problem

• Read data
  • Split into small chunks over many machines

• Map from each record to intermediate values
  • Executed as a set of tasks

• Shuffle and sort
  • Brings intermediate values with the same key together

• Reduce to aggregate values

• Write the results

• Core idea: a boss node allocates map, shuffle and reduce tasks to various nodes

• Built on top of a distributed file system

• Requires the user to supply
  • Data
  • A map function
  • A reduce function
  • A configuration file (fault tolerance, machine specs, etc.)

Example: Counting Occurrences of Words

• The ‘Hello world’ of MapReduce programming

• Data: An enormous number of text files <string title, string text>

• Map function: Output <word, 1> for each word in the text

• Reduce function: Sum over common words

• Some pseudocode

map(String key, String value) {
    // key: document name
    // value: document text
    for each word w in value:
        Emit(w, "1")
}

reduce(String key, Iterator values) {
    // key: a word
    // values: a list of counts
    int result = 0
    for each v in values:
        result += ParseInt(v)
    Emit(AsString(result))
}
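To make the pseudocode concrete, here is a sketch of the same job as two standalone C programs in the style of Hadoop Streaming, which lets any executable that reads lines on stdin and writes tab-separated <key, value> lines on stdout act as a map or reduce task (the file names and whitespace tokenization are assumptions, not part of the lecture):

/* wordcount_map.c -- streaming-style mapper: emit "<word>\t1"
 * for every whitespace-separated token on stdin. */
#include <stdio.h>
#include <string.h>

int main(void) {
    char line[4096];
    while (fgets(line, sizeof line, stdin))
        for (char *w = strtok(line, " \t\n"); w; w = strtok(NULL, " \t\n"))
            printf("%s\t1\n", w);
    return 0;
}

/* wordcount_reduce.c -- streaming-style reducer: sum the counts of
 * consecutive identical keys. Assumes input sorted by key, which is
 * exactly what the shuffle phase guarantees. */
#include <stdio.h>
#include <string.h>

int main(void) {
    char key[1024], prev[1024] = "";
    long count, total = 0;
    while (scanf("%1023s %ld", key, &count) == 2) {
        if (prev[0] && strcmp(key, prev) != 0) {
            printf("%s\t%ld\n", prev, total);
            total = 0;
        }
        strcpy(prev, key);
        total += count;
    }
    if (prev[0])
        printf("%s\t%ld\n", prev, total);
    return 0;
}

You can emulate the whole pipeline on one machine with ./map < input.txt | sort | ./reduce; the distributed version simply runs many copies of each stage, with the sort replaced by the shuffle.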

Tracing the example end to end:

Input: ‘you spin me right round like a record right round round round’

Splitting:
  ‘you spin me right’
  ‘round like a record right’
  ‘round round round’

Mapping:
  <you, 1> <spin, 1> <me, 1> <right, 1>
  <round, 1> <like, 1> <a, 1> <record, 1> <right, 1>
  <round, 1> <round, 1> <round, 1>

Shuffling (grouped by key):
  <a, 1>
  <like, 1>
  <me, 1>
  <record, 1>
  <right, 1> <right, 1>
  <round, 1> <round, 1> <round, 1> <round, 1>
  <spin, 1>
  <you, 1>

Reducing:
  <a, 1> <like, 1> <me, 1> <record, 1> <right, 2> <round, 4> <spin, 1> <you, 1>

MapReduce: Parallelism

• Map functions can be run in parallel
  • Requires a logical input split

• Reduce tasks can also be run in parallel
  • Each key is reduced independently of the others

MapReduce: Data Locality

• A boss program creates ‘tasks’ (both map and reduce), each acting on a chunk of data

• By default, input data is split into 64 MB blocks, or at a file boundary

MapReduce: Fault Tolerance

• The boss tracks the progress of each task and node
  • Node failure: re-allocate the task to another node
  • Task failure: restart it
  • Input key/value error: the boss blacklists the value

• Tolerates small failures, allowing the job to run on a best-effort basis
  • If we are processing billions of elements, we don’t want a few bad ones to halt the entire process
  • The user can set the tolerance level

MapReduce: Optimizations

• Reduction starts only after mapping has finished

• A ‘combiner’ is an extra mini-reducer that performs local reductions, speeding up communication

• Speculative execution: a newer feature that runs multiple attempts of a task in parallel, using only the one that finishes first

MapReduce: Other Examples

• Distributed grep (sketched after this list)
  • Map emits a line if it matches a supplied pattern
  • Reduce is an ‘identity’ function that copies the intermediate data to the output

• Count of URL access frequency
  • Map turns logs of URL accesses into <URL, 1> pairs
  • Reduce counts these (very similar to word counting)

• Reverse web-link graph
  • Map goes from log files to <target, source> pairs
  • Reduce concatenates all sources per target: <target, list(source)>
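In streaming style, the distributed-grep mapper is only a few lines of C; because the reduce step is the identity, matched lines pass straight through unchanged (the hard-coded PATTERN is an assumption for brevity; a real job would take the pattern from its configuration):

/* grep_map.c -- sketch of a distributed-grep mapper: emit each
 * input line that contains PATTERN. The reduce is the identity,
 * so this output is already the final result. */
#include <stdio.h>
#include <string.h>

#define PATTERN "round"   /* assumption: pattern baked in */

int main(void) {
    char line[4096];
    while (fgets(line, sizeof line, stdin))
        if (strstr(line, PATTERN))
            fputs(line, stdout);
    return 0;
}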

MapReduce: Limitations

• MapReduce tasks must be stateless and acyclic → inherently batch-processed
  • Cannot change data live
  • A restriction for machine learning

• Small files are problematic

• Generality reduces performance

• No caching → no saving of intermediate results

• MPI is to MapReduce what F1 cars are to buses
  • Both have their uses, but it is important to know when

Introduction to Hadoop

What is Hadoop?

• Open-source implementations of Google’s distributed computing products

• Supported by Apache

Google’s Offering         | Hadoop Equivalent                      | Description
MapReduce                 | MapReduce                              | Java implementation of the MapReduce programming model
Google File System (GFS)  | Hadoop Distributed File System (HDFS)  | Distributed file system
BigTable                  | HBase                                  | Distributed database
Chubby                    | ZooKeeper                              | Distributed coordination service

Hadoop MapReduce

• Three largest components
  • HDFS
    • Distributed, self-healing file system designed for commodity hardware
  • MapReduce
    • Java-based implementation of the MapReduce model
    • Built on top of the other Hadoop tools
  • YARN (Yet Another Resource Negotiator)

HDFS

• Highly fault-tolerant distributed file system

• Built to work with MapReduce

• Assumes
  • Large data sets
  • Hardware failures abound
  • Write once, read many

• Key features
  • Fault tolerance
  • Data replication
  • Load balancing
  • Scalability

• Uses a boss/worker architecture

• Name Node (boss)
  • Manages the file system
  • Executes file system operations (opening, closing, renaming, etc.)
  • Determines the mapping of data to nodes
  • Monitors executing tasks

• Data Node (worker)
  • Manages attached storage
  • Serves read and write requests
  • Performs block creation, deletion and replication upon instruction

Hadoop Components

• Job Tracker (boss)
  • Receives jobs from clients
  • Communicates with Name Nodes
  • Splits and assigns tasks to workers
  • Monitors Task Trackers

• Task Tracker (worker)
  • Executes map and reduce tasks
  • Notifies the Job Tracker upon success or failure

YARN

• Separates resource management from data processing

• Provides resources for Hadoop

• Schedules Hadoop jobs

• Resides in a service node

• Uses ZooKeeper for
  • Cluster management
  • Lock management
  • Synchronization of nodes

Apache Spark

Hadoop
• System for storing and processing big data
• Distributed over clusters of computers
• Scales from a single machine to thousands

Spark
• Cluster computing package for fast computation
• Extends the MapReduce model to more types of computation
  • Interactive queries
  • Stream processing
• Focussed on reducing the number of read/write operations
• Offers Java, Python and Scala APIs

Hadoop
• Requires job scheduling
• Very secure
• Requires more storage

Spark
• In-memory processing means no scheduling is needed
• Less secure
• Requires more computers

Comparison to MPI

MPI
• Hundreds to a few thousand machines
• Extremely generalised framework
• Exploits very expensive hardware
• Code can be difficult to scale
  • But extremely fast when accomplished

‘We give the primitives, you give the thinking’

MapReduce
• One to many thousands of machines
• Built around commodity hardware
• Code scales effortlessly
  • Can be rather slow at times

‘You give the goal, we’ll do the rest’

Summary

• MapReduce is a functional-style programming model used to process enormous datasets
  • Comprised of map and reduce tasks
  • The idea is to harness an enormous number of inexpensive machines

• An implementation requires many different components, mainly:
  • A distributed file system
  • An implementation of MapReduce itself
  • Resource management

• Presented here to introduce a different style of parallelism

Assignment 2: All-Pairs Shortest Paths

Assignment 1 vs. Assignment 2

• Marks are generally looking good
  • Lots of people put in lots of effort

• Assignment 2 is a little different
  • MPI based, not OpenMP
  • Can be undertaken in pairs
  • Input format is simplified to a single type
  • Output format and testing are unspecified; that’s up to you now

Assignment 2: Briefly

• You are tasked with investigating how to parallelize the all-pairs shortest paths problem

• Use either Dijkstra’s or the Floyd-Warshall algorithm

• ‘Investigate’ means
  • Demonstrate correctness
  • Experiment with scalability
  • Explain your code

Graphs

• Graphs are extremely general structures

• Based on vertices and edges

• In our case
  • Directed
  • Weighted (positively)

• Represented as an adjacency matrix (see the sketch below)
  • 0 if unconnected
  • Weight from i to j otherwise
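For example, a four-vertex directed graph stored this way might look like the following sketch (the weights are invented for illustration):

#define N 4

/* adj[i][j] = weight of the edge i -> j, or 0 if unconnected. */
int adj[N][N] = {
    {0, 5, 0, 9},
    {0, 0, 2, 0},
    {3, 0, 0, 4},
    {0, 0, 7, 0},
};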

Dijkstra’s Algorithm

• Starting from a single source, iteratively find the shortest path to all other vertices by investigating the edge closest to any vertex we have already seen

• One of the most widely known algorithms

• Extracting the ‘closest’ edge may be tricky in C

• Hint for parallelism: consider that Dijkstra’s by itself is a single-source algorithm (a sequential sketch follows)
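A minimal sequential sketch in C, using a linear scan in place of a priority queue (so O(V²) per source); for all pairs you would run it once per source vertex, and those independent runs are the natural unit of work to spread across MPI ranks:

#include <limits.h>
#include <stdbool.h>

#define N 4   /* assumption: vertex count known at compile time */

/* Single-source shortest paths from src over an adjacency matrix
 * in the assignment's format (0 = unconnected). Fills dist[]. */
void dijkstra(const int adj[N][N], int src, int dist[N]) {
    bool done[N] = { false };
    for (int v = 0; v < N; v++)
        dist[v] = INT_MAX;
    dist[src] = 0;

    for (int iter = 0; iter < N; iter++) {
        /* Pick the unfinished vertex with the smallest tentative
         * distance (the linear-scan stand-in for a priority queue). */
        int u = -1;
        for (int v = 0; v < N; v++)
            if (!done[v] && (u == -1 || dist[v] < dist[u]))
                u = v;
        if (u == -1 || dist[u] == INT_MAX)
            break;   /* remaining vertices are unreachable */
        done[u] = true;

        /* Relax every edge leaving u. */
        for (int v = 0; v < N; v++)
            if (adj[u][v] != 0 && dist[u] + adj[u][v] < dist[v])
                dist[v] = dist[u] + adj[u][v];
    }
}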

Floyd-Warshall

• Consider all paths of increasing length: if routing through another vertex is shorter than a path we already had, update it. It boils down to iterating through a matrix

• It is naturally an all-pairs shortest paths algorithm

• Hint for parallelisation: consider breaking the adjacency matrix into a grid (a sequential sketch follows)
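For reference, the sequential triple loop (a sketch; INF stands in for ‘no path’, so the 0-means-unconnected input matrix would first need converting, with the diagonal set to 0):

#define N 4
#define INF (1 << 29)   /* large enough to beat any real path, small
                           enough that INF + weight cannot overflow */

/* dist[][] starts as the adjacency matrix with 0 ("no edge")
 * replaced by INF and dist[i][i] = 0. */
void floyd_warshall(int dist[N][N]) {
    for (int k = 0; k < N; k++)          /* allow k as an intermediate */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}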

Hints

• Start sequential, and understand why both algorithms work
  • This will give you insight into how to parallelize

• Formulate your parallel approach before coding
  • Convince yourself that it will work
  • Write it down for your report

• Do not wait until the last night to do your testing
  • There is a queue for the cluster
  • If everyone wants to run their code at the same time, it will take a while