6.824 Project Report:

Optimizing Big Data Frameworks for Multi-core Systems Yunming Zhang, Mo Zhou

Introduction

Most big data systems use cores separately for different tasks, causing inefficient utilization of CPU, memory, and I/O resources on each multicore node. In this project, we explored the benefits of using multiple cores together to execute a single task. We show that a modified version of KMeans reduces average heap memory utilization on a 6-core node by 2.5x compared to the unmodified KMeans in MLlib (a machine learning library built on top of Spark). A modified version of PageRank is 15% faster than the PageRank implementation in GraphX (a graph library built on top of Spark) running on a 4-core node. We expect the speedup to grow in a cluster setting with PageRank working on larger graphs. We believe the same optimizations can be applied to other platforms such as Hadoop MapReduce.

Background

Current big data systems, such as Hadoop MapReduce and Spark, treat every core as an independent machine. The runtime uses multicore systems by decomposing a job into smaller tasks. For example, in Hadoop MapReduce, a job is decomposed into a series of Map and Reduce tasks, where each task operates on an “input split”. In Spark, RDDs are split into a number of partitions and a separate Spark task is created to process each “partition”. Each task performs operations on its own input sequentially without communicating with other tasks. The system schedules multiple tasks on each multicore node to exploit its CPU resources. For example, Hadoop and Spark would assign 8 or more tasks to a node with 8 cores to fully utilize the node.

This model of parallelism hurts the memory efficiency of many popular data analytics applications, including KMeans and K Nearest Neighbors, that retain a large in-memory accumulator data structure for partial results. These accumulators reduce network delay by allowing the system to send far fewer messages. For example, KMeans stores newly computed cluster centroid data in memory during the execution of the tasks.

In Hadoop MapReduce, the memory inefficiency is made worse by the fact that individual tasks run inside separate JVMs, which prevents sharing of large in-memory read-only data structures. Spark alleviates the problem by running multiple tasks in the same executor JVM and by providing built-in broadcast variables that reduce memory usage for read-only data.

Additionally, running a large number of tasks on each multicore node generates a large number of partial results, degrading performance for communication-heavy applications such as PageRank. For example, if we run 8 tasks on an 8-core machine for PageRank, the reduce stage has to merge 8 partial results. If we run only a single task, the reduce stage is much faster, but the current model of parallelism can then use only 1 of the 8 cores on the node.

Motivating Applications

KMeans

KMeans is a clustering algorithm used in many applications. The algorithm partitions a set of n sample objects into k clusters. For example, a popular application of KMeans is finding topics in news articles: it takes n news articles as sample objects and clusters them into k topics.

The algorithm first chooses k objects randomly as the centroids. It then assigns every sample object to the cluster whose centroid is closest to it. After assigning all objects to clusters, KMeans recalculates the location of the centroid of each cluster. This process repeats until the centroid locations stabilize or a fixed iteration limit is reached.

Since the sample objects are independent of one another, they can be processed in parallel. A common technique in implementing KMeans is therefore to split the sample objects into subgroups (called “slices” in our code) and process the slices in parallel.

Given this description, it is natural to implement each iteration of KMeans in a MapReduce fashion. As shown in Figure 1, the map phase takes a slice as input, computes the similarity between each sample object (represented as a vector) and each centroid, and assigns the object to the closest cluster. It then yields, for each cluster, a partial sum of the sample vectors along with the number of vectors assigned. Finding the closest cluster is the most computationally intensive part of the algorithm, so the map phase dominates the running time of each iteration. The reduce phase adds up the partial sums for each cluster, divides the total by the number of vectors in that cluster, and generates the new locations of the centroids.
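The map and reduce phases described above can be sketched on plain Scala collections. This is a minimal illustration and not the MLlib code: it assumes dense vectors and squared Euclidean distance, and the names KMeansMapReduceSketch, mapSlice, and reduce are hypothetical.

```scala
object KMeansMapReduceSketch {
  type Vec = Array[Double]

  def squaredDistance(a: Vec, b: Vec): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum
  }

  // Index of the centroid closest to `point` (the compute-intensive step).
  def closest(centroids: Array[Vec], point: Vec): Int =
    centroids.indices.minBy(j => squaredDistance(centroids(j), point))

  // Map phase for one slice: per-cluster partial sums and counts.
  def mapSlice(slice: Seq[Vec], centroids: Array[Vec]): Map[Int, (Vec, Long)] = {
    val dim = centroids.head.length
    val sums = Array.fill(centroids.length)(new Array[Double](dim))
    val counts = new Array[Long](centroids.length)
    for (p <- slice) {
      val j = closest(centroids, p)
      var i = 0
      while (i < dim) { sums(j)(i) += p(i); i += 1 }
      counts(j) += 1
    }
    centroids.indices.filter(j => counts(j) > 0)
      .map(j => (j, (sums(j), counts(j)))).toMap
  }

  // Reduce phase: merge the partial sums per cluster and divide by the counts.
  def reduce(partials: Seq[Map[Int, (Vec, Long)]], centroids: Array[Vec]): Array[Vec] = {
    val merged = partials.flatten.groupBy(_._1).map { case (j, entries) =>
      val total = entries.map(_._2._2).sum
      val sum = entries.map(_._2._1)
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      (j, sum.map(_ / total))
    }
    // A cluster that received no points keeps its previous centroid.
    centroids.indices.map(j => merged.getOrElse(j, centroids(j))).toArray
  }
}
```

Each call to mapSlice allocates its own sums and counts arrays, one per cluster, which is the per-task accumulator whose memory cost is discussed next.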

KMeans is a memory-intensive application, because each map task needs to store a large data structure containing information about the cluster centroids. The total memory held in these structures is proportional to both the number of clusters and the number of concurrent map tasks. To reduce the memory footprint, we aimed to reduce the number of map tasks while making each task use multiple cores so that overall performance does not degrade (Figure 2).
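As a rough illustration of this proportionality, the accumulator footprint can be estimated under the assumption, made purely for this example, that every concurrent map task holds one dense copy of k centroids stored as 8-byte doubles. JVM object overhead and any sparse representation are ignored, and the names below are hypothetical.

```scala
object CentroidMemorySketch {
  // Bytes for one dense copy of k centroids with `dims` double-precision features.
  def bytesPerCopy(k: Long, dims: Long): Long = k * dims * 8L

  // Total bytes when every concurrent map task holds its own copy.
  def totalBytes(k: Long, dims: Long, concurrentTasks: Int): Long =
    bytesPerCopy(k, dims) * concurrentTasks
}
```

For 4000 clusters and 10000 features, this estimate gives 320 MB per copy, so about 2.56 GB for 8 concurrent tasks versus 0.64 GB for 2. Measured heap usage also includes the input data and shuffle structures, so it need not follow this ratio exactly.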

Figure 1: KMeans running on Spark with single-threaded tasks

Figure 2: KMeans running on Spark with multithreaded tasks

PageRank

PageRank is an algorithm that was first used by Google to compute the relative

importance of web pages. It takes as input a graph that consists of Vertices and Edges

(G = (V, E)) and outputs rank information for each vertex in the graph. PageRank is an

iterative improvement algorithm that improves the rank measures of each vertex during

each iteration.

There are two variants of the algorithm. The first runs for a fixed number of iterations and stops (PageRank.run(graph, numIter) in GraphX). The second runs until all rank updates are smaller than a tolerance (PageRank.runUntilConvergence(graph, tol) in GraphX). We focused on the latter because it is more useful: it is hard to determine beforehand how many iterations the algorithm should run, and the amount of computation drops significantly in later iterations, when most edges are no longer active.

The algorithm first partitions the graph into a number of edge partitions based on a random vertex partitioning. In each iteration, it updates the active vertices by joining the vertices with the rank updates that are above the tolerance. It then starts a MapReduce job that performs two steps. The first is a per-task scan of the edges that calculates the rank updates and pre-aggregates them within the task. The second, the reduce step, aggregates all the partial rank updates to compute the final rank updates. High-level pseudocode is shown below.

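while (activeMessages > 0) {
  val newVerts = g.vertices.innerJoin(messages)       // vertices that received rank updates
  g = g.outerJoinVertices(newVerts)                   // apply the updates to the graph
  messages = g.mapReduceTriplets(sendMsg, mergeMsg)   // scan edges, then merge partial updates
  activeMessages = messages.count()
}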
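To make the loop above concrete, here is a minimal single-machine sketch of a tolerance-based PageRank on plain Scala collections. It assumes a delta formulation in which each vertex forwards only its latest rank change, split evenly over its out-edges. The object name, the resetProb default of 0.15, and the initial message value are assumptions of this sketch, not code taken from GraphX.

```scala
object DeltaPageRankSketch {
  // edges are (src, dst) pairs over vertex ids 0 until numVertices.
  def run(edges: Seq[(Int, Int)], numVertices: Int,
          tol: Double, resetProb: Double = 0.15): Array[Double] = {
    val outDegree = new Array[Int](numVertices)
    edges.foreach { case (src, _) => outDegree(src) += 1 }

    val rank = new Array[Double](numVertices)
    // Initial message chosen so that every vertex starts with rank = resetProb.
    var messages: Map[Int, Double] =
      (0 until numVertices).map(v => (v, resetProb / (1.0 - resetProb))).toMap

    while (messages.nonEmpty) {
      // Join step: apply the merged rank update to each vertex that received one.
      val delta = new Array[Double](numVertices)
      messages.foreach { case (v, msgSum) =>
        delta(v) = (1.0 - resetProb) * msgSum
        rank(v) += delta(v)
      }
      // Map step: only sources whose update exceeds tol send along their edges.
      // Reduce step: sum the updates arriving at each destination.
      messages = edges
        .filter { case (src, _) => delta(src) > tol }
        .map { case (src, dst) => (dst, delta(src) / outDegree(src)) }
        .groupBy(_._1)
        .map { case (dst, msgs) => (dst, msgs.map(_._2).sum) }
    }
    rank
  }
}
```

The loop ends once no vertex has an update above tol, so tol must be positive. In later iterations the filter discards most edges, which matches the observation that the amount of computation drops as vertices become inactive.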

Design and Implementation

KMeans

We started by modifying the KMeans implementation provided by MLlib, the machine learning library in Spark. It follows the MapReduce pattern described above. In particular, the implementation makes use of the mapPartitions() API, a coarse-grained RDD operation that applies a user-defined function to each partition in parallel. Each partition represents a slice of sample objects that needs processing. For each partition, we specify a map task that computes the similarities and assigns each point to a centroid.

Note that the original loop that iterates over the points is single-threaded. Since calculating the similarity for one point is independent of calculating it for another and the order does not matter, we modified this loop to make use of multiple cores. We changed the original point array to a Scala parallel collection and created a ForkJoinPool of a configurable size to house the concurrent worker threads. ForkJoinPool is a Java concurrency utility that implements work stealing, in which idle worker threads reduce idling by executing subtasks spawned by other workers.

To ensure correctness, we used locks to synchronize accesses and updates to the partial sum and the count of the points in each cluster. We think that using locks is a better approach here than using per-thread accumulators: because KMeans is compute intensive, contention is low and the locked code path accounts for a very small percentage of the execution time. We compared the computed final cost against that of the single-threaded version and obtained the same result across multiple runs.

The following code snippet outlines the structure of the modified code. Note that the outer foreach() call is executed in parallel within a ForkJoinPool. The findClosest() call is compute intensive and takes up most of the execution time.

pointsArray.foreach { point =>  // pointsArray is a parallel collection: runs on multiple cores
  (0 until runs).foreach { i =>
    // compute-intensive: find the closest center for this run
    val (bestCenter, cost) = KMeans.findClosest(thisActiveCenters(i), point)
    writeLock.lock()
    try {
      // update the partial sum and count for bestCenter
    } finally {
      writeLock.unlock()
    }
  }
}
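The same pattern can be shown as a self-contained sketch. To stay independent of the Scala version, it submits chunks of points to a java.util.concurrent.ForkJoinPool directly rather than using a parallel collection, and it uses one-dimensional points to keep the distance computation short. LockedAccumulationSketch and accumulate are hypothetical names, and this is an illustration of the locking scheme rather than the modified MLlib code.

```scala
import java.util.concurrent.ForkJoinPool
import java.util.concurrent.locks.ReentrantLock

object LockedAccumulationSketch {
  // Assigns each 1-D point to its nearest center and accumulates
  // per-cluster sums and counts from `threads` worker threads.
  def accumulate(points: Array[Double], centers: Array[Double],
                 threads: Int): (Array[Double], Array[Long]) = {
    val sums = new Array[Double](centers.length)
    val counts = new Array[Long](centers.length)
    val writeLock = new ReentrantLock()
    val pool = new ForkJoinPool(threads)
    try {
      val chunkSize = math.max(1, (points.length + threads - 1) / threads)
      val tasks = points.grouped(chunkSize).map { chunk =>
        pool.submit(new Runnable {
          def run(): Unit = chunk.foreach { p =>
            // The search for the nearest center runs without holding the lock.
            val best = centers.indices.minBy(j => math.abs(centers(j) - p))
            writeLock.lock()
            try { sums(best) += p; counts(best) += 1 }
            finally writeLock.unlock()
          }
        })
      }.toList
      tasks.foreach(_.get()) // wait for all workers to finish
    } finally pool.shutdown()
    (sums, counts)
  }
}
```

Only the two array updates sit inside the critical section, so the lock is held briefly compared with the distance search. All workers share a single pair of accumulators, which is what keeps the memory footprint independent of the thread count.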

PageRank

(1) Design

Currently GraphX utilizes multicore systems by partitioning the graph into multiple partitions and processing each partition in parallel, as shown in the code below from GraphImpl.scala.

val preAgg = view.edges.partitionsRDD.mapPartitions(aggregateMessagesForEachPartition)

The aggregateMessagesForEachPartition function itself runs sequentially; parallelism comes only from processing multiple graph partitions at once. PageRank is not as compute intensive as KMeans, so the bottleneck of the application is communication. Using fewer partitions can therefore improve performance significantly. The tradeoff is that the aggregation would then be executed sequentially.

We want to parallelize the aggregateMessagesForEachPartition part of the program. With this modification, multiple cores can work on a single partition of the graph. The benefit of this approach is a small number of partial results together with good CPU utilization.

(2) Implementation

We implemented our optimizations directly in the GraphX application code without modifying the underlying Spark RDD implementation. We focused on the algorithm implemented in PageRank.runUntilConvergence(graph, tol).

The algorithm switches between a vertex based aggregation that iterates through

vertices and an edge based approach that loops through all the edges and updates the

destination vertex. The difference is that when the graph has few active vertices,

looping through the vertices would be much faster.

The parallelization strategy for aggregating through vertices is to subdivide the vertex array into a few chunks. For each chunk, a separate worker thread performs the aggregation, so each worker thread has its own partial rank accumulator. Once the workers have finished, a local reduction phase that we added merges the accumulators from the different worker threads. Parallelization for aggregation through edges is similar, except that we divide the edge array instead of the vertex array.
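The edge-based variant of this scheme can be sketched as follows. The sketch assumes a dense Array[Double] accumulator per worker and uses Scala Futures on the global execution context as the worker threads. The names ChunkedAggregationSketch, aggregate, and contribution are hypothetical, and the GraphX data structures are replaced by plain arrays.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ChunkedAggregationSketch {
  // edges: (src, dst) pairs; contribution(src) is the rank update that
  // src sends along each of its out-edges.
  def aggregate(edges: Array[(Int, Int)], contribution: Array[Double],
                numVertices: Int, workers: Int): Array[Double] = {
    val chunkSize = math.max(1, (edges.length + workers - 1) / workers)
    // Each worker scans one chunk and fills its own private accumulator (no locks).
    val partials: List[Future[Array[Double]]] =
      edges.grouped(chunkSize).toList.map { chunk =>
        Future {
          val acc = new Array[Double](numVertices)
          chunk.foreach { case (src, dst) => acc(dst) += contribution(src) }
          acc
        }
      }
    val accumulators = Await.result(Future.sequence(partials), Duration.Inf)
    // Local reduction phase: merge the per-worker accumulators.
    val merged = new Array[Double](numVertices)
    for (acc <- accumulators; v <- 0 until numVertices) merged(v) += acc(v)
    merged
  }
}
```

Because no accumulator is shared while the chunks are being scanned, the workers never contend on writes. The cost is one extra accumulator per worker plus the sequential merge at the end.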

We chose not to use a locking scheme as we did in KMeans because PageRank is less compute intensive and the contention on writes is relatively high. Additionally, acquiring a lock is expensive relative to the small amount of work done per vertex. An experiment showed that using locks actually resulted in a slowdown with multiple threads. Furthermore, the memory footprint of PageRank is largely dominated by the graph representation, so the additional rank accumulators did not seem to have a big impact on the memory footprint of the application.

Evaluation

KMeans

Setup

We ran KMeans on a desktop machine with 6 hyperthreaded Intel i7 cores and 16 GB of RAM, running Ubuntu Linux and Oracle Java 8. We used Spark’s local execution mode, which simulates running the application on a cluster but makes the driver itself a task executor. We allocated 8 GB for the Spark driver memory.

Dataset

We used the 20 newsgroup dataset (80 MB), a popular dataset widely used in machine learning applications. We ran the algorithm for 10 iterations. To make the data work with the sparse vector representation in MLlib, we converted the dataset into a separate index file and value file. Currently the dataset uses 10000 features, i.e. 10000 unique words. A more realistic dataset could have as many as 30000 features, incurring even more memory pressure.

To monitor running time, we wrote scripts to parse the log output from Spark. To obtain

an accurate record of heap usage, we relied on the jstat tool included in the Java 8 JDK.

We wrote a script that invokes jstat to print the space usage of each Java garbage

collection generation every second and sums the numbers up to generate the total heap

usage. We also used the jvisualvm tool for a good visualization of both CPU and heap

usage.
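The summation step of such a script might look like the sketch below. It assumes the output of `jstat -gc <pid> 1000`, which prints a header line followed by one row per second on Java 8, and it assumes that "total heap usage" means the sum of the used survivor, eden, and old-generation columns (S0U, S1U, EU, OU, in KB). Which columns the original script summed is not stated, so this choice and the names JstatHeapSketch and usedHeapKb are assumptions.

```scala
object JstatHeapSketch {
  // `jstat -gc` columns reporting used space (KB) in the two survivor spaces,
  // eden, and the old generation.
  val usedColumns: Seq[String] = Seq("S0U", "S1U", "EU", "OU")

  // Sums the used-space columns of one data row, locating them by header name.
  def usedHeapKb(header: String, row: String): Double = {
    val names = header.trim.split("\\s+")
    val values = row.trim.split("\\s+")
    usedColumns.map(c => values(names.indexOf(c)).toDouble).sum
  }
}
```

Looking columns up by header name rather than by position keeps the parser tolerant of column-order differences between JDK versions. Averaging the per-second sums over a run then gives an average heap utilization figure.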

We tested 5 configurations of parallelism on 6 cluster counts (values of k). Note that the number of tasks is the number of executor threads we ask Spark to spawn, while the number of threads per task is the degree of parallelism of the ForkJoinPool in KMeans (the number of threads used within each task). The charts below (Figures 3 and 4) show the running time and average heap utilization for all configurations and cluster counts.

Figure 3: KMeans running time

Figure 4: KMeans average heap utilization

With 5000 clusters, the single-threaded configuration with 10 tasks exceeded the Java GC overhead limit and failed. With 6000 clusters, both single-threaded configurations failed. This shows that the system does not have enough heap memory to process 6000 clusters with single-threaded tasks.

We observe that, for a given configuration, the running time scales linearly with the number of clusters, as shown in Figure 3: for all configurations, the running time (y axis) increases with the number of clusters (x axis). This is expected because, for n sample objects and k clusters, the algorithm needs to perform O(n * k) similarity computations in each iteration.

We observe that average heap usage scales linearly with the number of concurrent tasks, as shown in Figure 4. The 8-task (blue) and 10-task (orange) configurations use significantly more memory than the 2-task (yellow) and 3-task (green and brown) configurations. This is expected, because the number of data structures holding the information about centroids is proportional to the number of map tasks.

For all cluster counts, our multithreaded version performed better than the single-threaded versions. Since the computer we used had 6 hyperthreaded cores, using 6 threads per task generally performed better than using 4. Spawning 3 tasks with 6 threads per task yielded the best performance. To reach comparable performance when processing 4000 clusters, the single-threaded versions had to use about 2.5 times more heap space than the multithreaded version with 2 tasks, as shown in Figure 4.

We should point out that while the map phase dominates the running time, the reduce phase is the most memory-intensive part of the current implementation. Spark’s shuffle operations, such as the reduceByKey() used in KMeans, build a hash table to perform the grouping, which can become too large if the working set is large. As shown in the jvisualvm monitor graph (Figure 5), the reduce phases generate spikes in heap usage. The spikes increased average heap usage and reduced the maximum number of clusters we could support. We will investigate shrinking these spikes, but doing so will likely involve more changes to the Spark core.

Figure 5: Spikes of heap usage in the reduce phase of KMeans

As a side note, we found that the JVM heap size parameter (Spark driver memory in this particular case) has a rather significant impact on performance. Running KMeans with a larger heap enables the single-threaded versions to complete with 5000 or 6000 clusters. However, specifying a large driver memory caused performance to degrade in all configurations. We found it surprising that our fastest configuration (3 tasks with 6 threads per task) took 20 more seconds (about 9% of the running time) to run with 16 GB of driver memory, even though no apparent swapping was going on. We suspect that the larger heap made full GC much slower, but more investigation would be necessary.

PageRank

Setup

We tested on a local computer with a 4-core 2.3 GHz Intel Core i7, 16 GB of memory, and 12 GB of heap. Due to other programs running on the same laptop, there is usually 11 GB of memory available for the PageRank application.

We used Spark’s local execution mode, local[n], where the number of threads n equals the number of partitions. For each multithreaded task, we used 4 threads per task.

Dataset

We tested on two datasets: the web-Google dataset and the LiveJournal dataset.

The Google dataset has 5 million edges and is 70 MB in size. It was downloaded from https://snap.stanford.edu/data/web-Google.html. The problem is that this dataset is too small: the aggregateIndex and aggregateEdge phases take only 300 ms (the overall map phase seems to take about 1 s, implying additional time-consuming components beyond aggregateIndex and aggregateEdge). As a result, there is not much that can be done to speed it up; the overhead of spawning new workers would outweigh the benefit of doing the work in parallel. Additionally, even with one partition, the reduce phase takes 12 seconds, making any speedup insignificant.

Next, we tried the LiveJournal social network data, with 68 million edges and 1 GB in size. It was downloaded from https://snap.stanford.edu/data/soc-LiveJournal1.html. Unmodified, this dataset is a bit too large: it uses up all available memory and causes some data to be swapped. To make matters worse, when some RDDs do not fit in memory, tasks start to recompute a lot of the data, resulting in a significant slowdown and additional reads (of the recomputed data) in the aggregate phase.

To deal with this problem, we decided to manually reduce the size of the graph. Since the data mainly consists of an edge list with one edge per line, we simply took the first 54 million of the 68 million lines (edges) to form a smaller dataset of around 800 MB. This dataset mostly fits into memory, with minimal swapping and no recomputation.

Next we present the results for the 54-million-edge dataset.

Configuration                          Running time, overall (s)   Reduction time per iteration (s)   Number of iterations
1 partition, multithreaded task        521                         4.4                                66
1 partition, single-threaded task      609                         4.4                                66
2 partitions, single-threaded tasks    601                         5.6                                66
4 partitions, single-threaded tasks    779                         10                                 66

Overall, we see a speedup of 15% for our multithreaded version. We ran each configuration multiple times, and the difference from run to run is about 1%.

We also investigated where the speedup comes from. The local aggregation part takes about 7 seconds in the early iterations and gradually decreases. With two partitions, the aggregation is sped up about 2 times, to around 3.8 seconds. Using the multithreaded scheme, we were able to bring it down to about 2.5 seconds. This gives us about a 1 second speedup in the earlier iterations. In later stages, almost no time is spent in the aggregation phase, since most of the vertices are no longer active (their rank updates are below the threshold). There, the speedup over the two-partition configuration comes from a faster reduction phase, since we are processing only one message whereas the two-partition configuration is processing two.

As the number of partitions increases further (4 partitions), the overall running time becomes much slower. Despite a slight speedup in the map phase, the time spent in the reduce phase increases significantly compared to 1 partition.

We expect the speedup to be more evident in a cluster setting, where communication across tasks on different machines is much slower than communication within a local computer. Furthermore, with a larger dataset the local aggregation phase takes longer, so multiple threads should yield a better speedup of the map phase.

Conclusion

● We evaluated the CPU and memory utilization of two popular data analytics applications: KMeans and PageRank

● We showed the potential benefits of using multiple cores to work together on individual tasks, on real optimized implementations in GraphX and MLlib

● We reduced the memory usage of KMeans by 2.5x

● We sped up PageRank by 15% for a real dataset

Future Directions

● Run applications in an EC2 cluster

● More applications:

○ Look into ALS and LDA applications, which are supposed to expose the same performance bottlenecks as PageRank

● Better profiling of the applications

○ We weren’t able to speed up the map phase in PageRank linearly because aggregation takes only half of the time of the map phase.

○ The reduce phase for PageRank is too long for the amount of data it is shuffling.

● Involve parallel IO operations within each task

Lessons Learned

● JVisualVM is a great tool for profiling heap usage and CPU utilization and for sampling the time each function takes in a JVM-based application

● jstat is good for measuring the garbage-collected heap size of the applications

● The reduce phase takes a lot more memory than the map phase in KMeans due to the hash-table-based implementation of reduceByKey in Spark

The code of our modified Spark repository can be found at:

https://github.com/Dominator008/spark
