
Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis

Khaled Z. Ibrahim, Samuel Williams, Leonid Oliker
Lawrence Berkeley National Laboratory

One Cyclotron Road, Berkeley, CA 94720, USA
{kzibrahim, swwilliams, loliker}@lbl.gov

Abstract—The end of Dennard scaling signaled a shift in HPC supercomputer architectures from systems built from single-core processor architectures to systems built from multicore and eventually manycore architectures. This transition substantially complicated performance optimization and analysis as new programming models were created, new scaling methodologies deployed, and on-chip contention became a bottleneck to performance. Existing distributed memory performance models like LogP and LogGP were unable to capture this contention. The Roofline model was created to address this contention and its interplay with locality. However, to date, the Roofline model has focused on full-node concurrency. In this paper, we extend the Roofline model to capture the effects of concurrency on data locality and on-chip contention. We demonstrate the value of this new technique by evaluating the NAS parallel benchmarks on both multicore and manycore architectures under both strong- and weak-scaling regimes. In order to quantify the interplay between programming model and locality, we evaluate scaling under both the OpenMP and flat MPI programming models.

Index Terms—Roofline Model, Performance Analysis, Parallel Scaling, OpenMP.

I. INTRODUCTION

Dennard scaling enabled energy-efficient scaling from one CMOS process generation to the next [1]. This scaling trend was exhausted by 2005, at which point computer architects were forced to reevaluate architectural paradigms and prioritize those techniques that maximized energy efficiency rather than single-thread performance. This revolution in computer architecture gave birth to multicore and eventually manycore processors where dozens of cores are integrated on a single chip. Whereas multicore architectures tended to be based on large superscalar cores with large shared last-level caches, manycore architectures were further energy-optimized to maximize performance and concurrency through the use of simple cores, wide vector units, and a sea of private caches that motivated data parallelism and coarse-grained partitioning of computation.

The transition from supercomputers built from single-core nodes to multi- and manycore nodes significantly complicated performance optimization and analysis as new programming models like OpenMP were deployed, scaling regimes could be applied at the core, node, or system level, and on-chip contention emerged as a potential bottleneck to performance.

Many programmers were incentivized to experiment with OpenMP only to find that performance paled when compared to flat MPI on a chip (one process per core). Without a performance analysis methodology capable of understanding the interplay between programming model, locality, concurrency, and architecture, they were generally unable to effectively remedy this performance gap.

In 2008, the Roofline model was created to understand the interplay between locality, bandwidth, and computation across a wide variety of architectures [2], [3]. However, prior Roofline research was typically used to understand the ultimate performance potential of a machine and not the interplay between concurrency and locality, a key metric in the application optimization and architecture co-design processes.

In this work, we introduce the roofline scaling trajectory technique to analyze performance on multi- and many-core architectures. We show that these trajectories help in assessing how well an application leverages the cache hierarchy and how the architecture is helping or impeding the scaling of an application. We present the application of the proposed method on the NAS Parallel Benchmarks [4]. The trajectory analysis enables applying the roofline method to workloads with integer or arbitrary user-defined operations.

The rest of this paper is organized as follows: We discuss the motivation for this work in § II. We introduce the roofline scaling trajectory method in § III. In § IV, we present the experimental setup and the workloads. In § V, we analyze the application of the proposed method on the NPB suite. We conclude in § VII after discussing related work in § VI.

II. MOTIVATION

Performance analysis is notoriously difficult, especially in the many-core era. It is becoming critical to driving system efficiency because architectural efficiency gains from transistor miniaturization and lithographic advancement in CMOS technology are slowing down.

Many-core architectures have multiple features that could influence application performance. Notable among them are simple core designs, reliance on distributed caches, and the adoption of memory technologies that are throughput-optimized. Simpler core designs typically necessitate reliance on SIMD/vectorization as a means of improving instruction-level parallelism. The use of distributed caches/local storage enables servicing a larger number of memory requests concurrently. On the other hand, maintaining coherence between these distributed caches could result in reduced effective cache capacity, due to the replicated state of frequently read data, or longer latency to service memory write requests, due to the multi-hop transactions in the directory-based cache coherence protocols that are typically used for distributed caches. Another challenge is feeding data to many-core architectures fast enough to keep them busy. Some memory technologies are more optimized for throughput than latency. These memory technologies make it possible to handle many concurrent requests but do not improve the latency to service a request, and may increase performance variability as well.

Understanding performance limits intuitively is a challenge with these architectural developments, especially if we treat each code independently. In HPC, relatively few computational patterns dominate scientific computing. The computational dwarfs report from Berkeley [5] provides an excellent survey of these patterns. We can focus the analysis on a few computational patterns and gain insight into a wide range of usage patterns.

III. EVALUATION METHOD

In this paper, we introduce the roofline scaling trajectory technique for analyzing the performance and scalability of parallel applications. We also extend the use of roofline plots from being solely used for floating-point operations to any user-defined operation. Such an extension allows analysis of a broader set of irregular integer-based applications.

In the context of this work, our primary focus is assessing the change in application behavior while migrating from multicore to many-core architectures (lightweight cores, distributed private caches, etc.). Rather than relying on theoretical limits and models for application and machine characterization, we rely on empirical measurements, including the use of LIKWID to measure the DRAM performance counters [6] and the Intel SDE for instruction analysis [7].

A. Roofline Scaling Trajectories

Weak-scaling and strong-scaling are typically used to assess the efficiency of utilizing hardware resources as we increase the level of concurrency. When weak-scaling, one aims for constant run time (linearly increasing flop rates) for problem sizes proportional to concurrency. Conversely, when strong-scaling, one hopes for run times inversely proportional to concurrency for fixed problem sizes. Unfortunately, it is often difficult to ascertain the root cause of any sub-linear scaling behavior in either regime. Often one employs a variety of tools to identify which components of the system or application are either not scaling or have become the bottleneck. These include tools to analyze the memory hierarchy or the impact of communication on total run time. The roofline scaling trajectory focuses on visualizing the scaling behavior and identifying the effects of locality and limited cache capacity on the observed application performance. Additionally, these plots could identify the potential advantage of leveraging a particular cache hierarchy for a given computational pattern and could also steer optimization efforts towards those with the highest possible gains.

Fig. 1: Roofline Scaling Trajectory. As we double the computational resources we expect a corresponding doubling of performance (∆y), but the observed performance is typically influenced by the change in computational intensity (∆x). The figure annotates the four directions: ∆x > 0 (distributed cache: cache filtering due to cache capacity; shared cache: common working set, cross-thread prefetching), ∆x < 0 (cache conflicts, false sharing, and coherence traffic), ∆y > 0 (increased compute power, functional units, vectorization; limited by the roofline), and ∆y < 0 (runtime overhead: synchronization, scheduling, message processing). The trajectory proceeds from single-threaded, to scattered multi-threaded (shared/distributed LLC), to hyper-threaded (shared L1).
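The strong- and weak-scaling expectations described in this subsection can be made concrete with a small efficiency calculation. The following minimal Python sketch is our own illustration using the standard textbook definitions, not anything taken from the paper; the timings are hypothetical:

```python
def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: ideal run time on p cores is t1 / p."""
    return t1 / (p * tp)

def weak_scaling_efficiency(t1, tp):
    """Problem size grows with p: ideal run time stays equal to t1."""
    return t1 / tp

# Hypothetical timings: 100 s serial, 7 s on 16 cores (strong scaling),
# 118 s on 16 cores with a 16x larger problem (weak scaling).
print(strong_scaling_efficiency(100.0, 7.0, 16))  # ~0.89
print(weak_scaling_efficiency(100.0, 118.0))      # ~0.85
```

Neither number, by itself, explains where the lost efficiency went; the trajectory analysis below is meant to expose that.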

We leverage roofline plots [2], which are used to analyze the observed performance of an application against machine limits. The x-axis is the arithmetic (or computational¹) intensity, computed as the ratio of floating-point operations to bytes transferred from the memory system. The y-axis is the observed performance. Machine limits, such as the flop rate or memory bandwidth, define the maximum performance attainable by an application, which is a function of the application's arithmetic intensity.
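As a concrete illustration of these machine limits, the classical Roofline bound is attainable GFLOP/s = min(peak GFLOP/s, arithmetic intensity × memory bandwidth). The minimal Python sketch below is ours; the peak and bandwidth values are the VFMA and DRAM ceilings annotated in the paper's figures and are used here only for illustration:

```python
def roofline_ceiling(ai, peak_gflops, dram_bw_gbs):
    """Attainable GFLOP/s at arithmetic intensity ai (flops per DRAM byte)."""
    return min(peak_gflops, ai * dram_bw_gbs)

# Illustrative ceilings from the figure annotations:
# Haswell node: 1229 GFLOP/s VFMA, ~128 GB/s DRAM;
# KNL node:     2458 GFLOP/s VFMA, ~375 GB/s MCDRAM.
for ai in (0.1, 1.0, 10.0, 100.0):
    print(ai,
          roofline_ceiling(ai, 1229.0, 128.0),
          roofline_ceiling(ai, 2458.0, 375.0))
```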

The roofline scaling trajectory tracks the scaling of an application. Scaling can be decomposed into transitions of ∆x and ∆y. As we double the level of concurrency, we ideally expect ∆x to be 0 (no change in cache locality) and ∆y = 2× (a linear increase in performance). This idealized scaling behavior is also assumed to be unconstrained by the overheads of the language, synchronization, scheduling, message setup, etc. In reality, the nature of distributed vs. centralized cache hierarchies may constrain locality, and the limitations of finite memory and cache bandwidths may constrain the performance gain to less than 2×.
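The trajectory itself can be assembled directly from per-concurrency measurements of floating-point operations, DRAM traffic, and run time. The Python sketch below is our own illustration of that bookkeeping, not the authors' tooling, and the input measurements are hypothetical:

```python
# (concurrency, total flops, DRAM bytes moved, run time in seconds) -- hypothetical
measurements = [
    (1, 1.0e12, 2.0e11, 70.0),
    (2, 1.0e12, 1.9e11, 36.0),
    (4, 1.0e12, 1.6e11, 19.0),
    (8, 1.0e12, 1.5e11, 11.0),
]

points = []
for c, flops, dram_bytes, seconds in measurements:
    ai = flops / dram_bytes          # x: arithmetic intensity (flops/byte)
    gflops = flops / seconds / 1e9   # y: observed performance
    points.append((c, ai, gflops))

# Report the dx / dy transition between successive concurrency levels.
for (c0, x0, y0), (c1, x1, y1) in zip(points, points[1:]):
    print(f"c{c0} -> c{c1}: dx = {x1 - x0:+.2f} flops/byte, "
          f"dy = {y1 / y0:.2f}x speedup")
```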

Although one could look at raw values extracted from performance counters to identify the causes of limited scalability, the process could be cumbersome and error-prone. Visualization allows efficient and potentially more intuitive analysis. The process requires building intuition about the possible moves in scaling behavior and their interpretation.

¹We extend the arithmetic intensity used in the Roofline Model with the general term computational intensity, where a computation could be any application-defined operation.

We describe the architectural and scaling behaviors for each of the four possible directions an application may move with increased concurrency.

∆x > 0: When strong scaling, this transition is indicative of increased temporal locality (the number of floating-point operations has remained constant, so data movement must have decreased). This behavior can occur with distributed cache hierarchies as the aggregate cache capacity increases with increased concurrency. Eventually, a working set can fit in cache, and DRAM data movement is reduced. The key to maintaining such behavior is to avoid thrashing the working set. If the cache hierarchy involves some sharing, each processing element should limit its working set in the last-level cache to its fair share.

∆x < 0: This transition is an indication of a loss of temporal locality during the scaling experiments, which could happen for various reasons. First, when strong-scaling on shared cache hierarchies, if the per-thread working set is not sufficiently small, the aggregate working set of all threads can eventually exceed the last-level cache capacity, resulting in superfluous data movement. A similar effect can occur when weak scaling. Although the application's strategy in dealing with the memory hierarchy is very influential on performance, the memory organization, and whether it has a distributed or shared hierarchy, could have a significant impact on the ability to leverage locality and, as such, on the scaling behavior. For instance, a distributed cache is useful in reducing the interference or cache-thrashing effect because it provides some locality isolation. On the other hand, distributed caches are challenged when attempting to capture a large shared working set, as the same data may be replicated in multiple private caches. Effectively, the cache capacity could look much smaller than the aggregate capacity of the individual caches.

∆y > 0: The increase in the y direction reflects the utilization of the excess compute resources. The transition could exceed the added compute resources if it carries an accompanying ∆x > 0. In the Roofline Model, the ∆y transitions are typically bound by one of the system limits (memory or compute), depending on the computational intensity.

∆y < 0: A decrease in the y direction indicates some runtime or architectural overhead that impedes increased performance. Such effects can arise from the programming model (e.g., excessive synchronization overheads, substantial communication setup time) or from contention when accessing architectural resources (e.g., the L1/L2 caches or the reorder buffer in hyper-threaded architectures).

Although each concurrency scaling transition could see independent effects on locality and performance (positive or negative), there is also a potential coupling between ∆x and ∆y transitions. If the application performance turns over at high concurrency (∆y < 0), our roofline trajectory could identify whether a ∆x < 0 effect is causing the observed ∆y < 0 effect on performance. This becomes more apparent when working close to the roofline at full concurrency.

For this paper, we will focus on the scaling trajectory considering only the DRAM arithmetic intensity (locality effects captured by the last level of cache). We designate pure ∆y scaling (∆x = 0) as an ideal behavior that would require some performance isolation between concurrent activities in the system. We identify three sources of this performance isolation:

• Algorithms: This performance isolation is the hardest to achieve except in special cases, where the application developer decomposes the problem into smaller tasks that maximize data reuse within the cache hierarchy. Lack of algorithmic isolation could cause severe degradation in computational intensity as we scale computation.

• Programming models: Distributed programming models such as MPI are known to force the programmer into the mindset of creating isolated tasks that communicate only at specific times, explicitly using communication calls. This decomposition proves beneficial not only in a distributed computing environment but also in a shared memory environment where NUMA locality exists.

• Architectures: The cache hierarchy and memory could provide performance isolation through the use of distributed caches. As such, each core has its own private space and no cache thrashing by other cores is possible.

B. Non-Floating-Point Roofline Trajectories

Roofline analysis typically targets floating-point computations. This implies the use of floating-point operations as both the performance metric and the numerator of the arithmetic intensity ratio. However, many applications do not rely on measuring or performing floating-point operations, and thus one is motivated to use non-floating-point metrics (e.g., MIPS) or application-specific metrics. For instance, whereas one often uses traversed edges per second (TEPS) as a performance metric in graph analytics, one might use random numbers generated per second in other fields. For such applications, it is often difficult to define the roofline for operations based on the raw performance characterization of the underlying architecture, and thus we may define performance and intensity relative to these metrics using application-specific internal performance metrics. Nevertheless, we still find applying the scaling trajectory analysis to roofline plots to be an insightful tool for performance analysis, and will show that the scaling behavior of applications with non-floating-point metrics can exhibit the same pitfalls and trends observed in floating-point intensive applications.
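As an illustration of such a user-defined metric, the sketch below (ours, with hypothetical numbers) builds a trajectory point for a sort-like kernel in the spirit of IS, where the counted "operation" is a ranked key rather than a flop; only the memory ceiling remains architecturally defined:

```python
def user_defined_point(ops, dram_bytes, seconds):
    """Return (computational intensity in ops/byte, performance in GOp/s)."""
    return ops / dram_bytes, ops / seconds / 1e9

# Hypothetical integer-sort run: 2^27 keys ranked, 6 GB of DRAM traffic, 0.9 s.
ci, gops = user_defined_point(ops=2**27, dram_bytes=6.0e9, seconds=0.9)
dram_bw_gbs = 128.0                  # the memory ceiling still applies (GB/s)
ceiling_gops = ci * dram_bw_gbs      # no architectural compute ceiling here
print(f"CI = {ci:.3f} ops/byte, observed = {gops:.3f} GOp/s, "
      f"memory-bound ceiling = {ceiling_gops:.3f} GOp/s")
```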

IV. EXPERIMENTAL SETUP

A. Benchmarking Suite

In this paper, we used the NAS Parallel Benchmarks (NPB) [4] for evaluating scaling effects among architectural variants. These benchmarks represent a broad set of computational patterns: FT represents spectral methods, CG sparse linear algebra, LU the solution of a regular-sparse lower and upper triangular system, and MG a multi-grid PDE solver using a hierarchy of meshes, in addition to multiple structured grid methods in the mini-apps SP and BT. NPB also includes an unstructured adaptive mesh benchmark, called UA; similar to CG, its memory access pattern is irregular. In addition to the floating-point based scientific applications, NPB includes an integer sorting benchmark, IS, which is used in many scientific applications. For performance measurement, NPB reports the rate of mega operations per second.

TABLE I: Systems used in this study.

                        Cori I            Cori II
  Processor             Intel Haswell     Intel KNL
  Clock (GHz)           2.3               1.4
  NUMA x Cores          2 x 16            1 x 68
  DP GFlop/s            1178              2611
  D$/core (KB)          64+256            64+1024 (per tile, 2 cores)
  LL$/chip (MB)         30                -
  Memory (GB)           128 DDR4          16 MCDRAM + 96 DDR4
  System and System Software
  Nodes                 2,388             9,688
  Interconnect          Dragonfly
  Compiler              Intel 18.0.1

B. System Setup

We use NERSC's Cori supercomputer as our evaluation platform. Cori is a Cray XC40 system that employs both multicore Intel Haswell and manycore Intel Xeon Phi "Knights Landing" processors. The system leverages the Cray Aries network to connect 2,388 Haswell nodes and 9,688 KNL nodes. Table I gives a brief description of the architectural features of the Cori nodes.

V. TRAJECTORY ANALYSIS FOR COMPUTATIONAL KERNELS

One of the most critical factors for performance is the efficiency of utilizing the cache (or memory) hierarchy. On roofline plots, cache efficiency manifests as the computational intensity: the higher the efficiency of utilizing the cache, the higher the computational intensity. Obviously, the intensity is influenced by the kind of computation at hand and whether it is feasible to tile the computation to better leverage temporal locality. For a kernel that efficiently exploits the cache, we expect our Roofline trajectory to move vertically. Moreover, changing the problem size should have minimal impact on the computational intensity, unless the problem fits entirely within the last-level cache.

Reduction in the arithmetic intensity (∆x < 0 transition) is not a desirable behavior, whether we do strong or weak scaling. A ∆x > 0 transition while scaling could result from the increase of cache capacity. It could also result from better problem decomposition while scaling. While such a transition has favorable performance advantages, these advantages may be difficult to maintain asymptotically.

A. Strong-scaling Trajectories

Strong scaling involves changing the computational resources while fixing the problem size per node. A real test of an algorithm's temporal optimization is whether it can manifest a vertical trajectory on an architecture with a shared last-level cache.

A couple of kernels show such behavior in their OpenMP implementations: MG and FT. For both, we notice that the trajectories are dominated by pure ∆y transitions. Only when we start using hyper-threading do we observe operational intensity degradation due to the increase in data movement. In Figure 2, we present the benchmarks that exhibit relatively regular behavior. For MG, the KNL and Haswell arithmetic intensities are close to each other due to the exploitation of tiling. The scaling is dominated by vertical transitions that continue until hitting the memory roofline.

For FT, we observe that the arithmetic intensity increases with the amount of cached data. Given that the computation involves a transpose operation, we observe that the shared cache, on Haswell, helps in performing it efficiently. With the distributed cache on KNL, we observe a ∆x transition, i.e., a change in arithmetic intensity, with the increase in the data set size. The scaling for CG, with irregular access patterns, benefits from the shared cache if the data fit in cache. Concurrency in accessing the shared cache causes a diagonal scaling behavior (∆x ≠ 0) if a significant portion of the active dataset fits in cache, i.e., Class A. With the distributed cache, we observe ∆x > 0 while strong scaling, and the performance improves as we increase the level of concurrency because we are increasing cache capacity while not suffering from cache thrashing. Eventually, the scaling in both architectures converges to the same computational intensity for sufficiently large problems. Larger datasets cause vertical scaling transitions that are bounded by the memory latency.

In Figure 3, we show a different set of benchmarks with a stronger influence of the cache hierarchy. For mini-applications such as SP and BT, the scaling trajectory starts with an initial positive ∆x due to the increase in cache capacity². Increasing concurrency causes a ∆x < 0 scaling transition associated with ∆y > 0. The challenge in these mini-applications is the difficulty in managing locality beyond a single kernel. Moving from one computational phase to the other results in evicting the working set from the cache. For these kernels, the KNL distributed cache provides a lower initial computational intensity that keeps improving as we increase the concurrency level. Interference in the L1 cache reverses this scaling trend once we start using hyper-threading. LU exhibits the most significant ∆x transitions during strong scaling on the KNL due to the performance isolation in the distributed cache. LU uses scratch memory to condense scattered data from a global data structure. Some of the data structures, with 5 elements in their innermost dimension, exhibit false sharing among threads, creating either a prefetching or an interference effect, depending on the cache hierarchy.

²The initial ∆x > 0 is due to scattering threads across sockets. With dual-socket systems, the full cache capacity is fully utilized only if we have at least two threads.

Fig. 2: Scaling trajectories for OpenMP-based kernels MG, FT, and CG on KNL vs. Haswell (each panel plots problem classes A, B, and C at concurrencies c1 through c64 on Haswell and c1 through c256 on KNL against the VFMA, ADD, and DRAM ceilings). MG and FT exhibit regular scaling behavior. CG benefits from the shared cache if the data fit within the LLC.

In Figure 4, we show the same set of applications implemented using MPI. We notice an improvement in the peak performance, especially with the shared cache of Haswell. In this case, the programming-model isolation, i.e., the use of MPI, which eliminates the implicit communication between computing processors, partially helps in addressing the scaling problem, but the change of computational intensity with the concurrency level is not eliminated. MPI can help in partitioning the problem, which reduces the dataset interference by each compute processor, but it cannot eliminate the loss of temporal locality due to successive kernel invocations.

The cache hierarchy plays a major role in the ∆x transitions in the scaling trajectory, which influence the ∆y transitions for this set of applications. Distributed caches provide the performance isolation that facilitates reducing the ∆x transitions.

In Figure 5, we show the application of our roofline trajectory methodology to applications with user-defined operations. In this case, we can determine only the memory rooflines, while the computational capabilities would need to be user-defined. Both IS and UA generate irregular accesses. Following the other floating-point benchmarks, UA experiences scaling that involves ∆x < 0 as we increase the concurrency on Haswell. Although sorting algorithms, such as IS, typically generate irregular accesses, we do not notice much ∆x change while scaling due to the use of bucket sorting. The cache filters most of the irregular accesses, and the traffic to memory remains stable while strong scaling. We also notice severe performance degradation once we start exploiting hyper-threading; a significant performance drop is observed on the KNL architecture.

B. Weak-scaling Trajectories

Weak scaling involves changing the problem size as we increase the computational resources. NPB increases the problem size by 4× from one class to the next, up to class C. Weak-scaling plots reduce the impact of runtime overheads on performance.
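Concretely, in the weak-scaling trajectories shown later (e.g., Figure 6), each 4× larger class is paired with 4× the concurrency (A.c4, B.c16, C.c64). The Python sketch below is our own illustration of how such (class, concurrency) trajectory points would be assembled; the measurements are hypothetical:

```python
# Weak-scaling pairing as in Figure 6: each class is ~4x larger than the
# previous one and is run at 4x the concurrency. Measurements are hypothetical.
weak_runs = [
    ("A", 4,  1.0e11, 2.5e10, 9.0),   # (class, cores, flops, DRAM bytes, seconds)
    ("B", 16, 4.0e11, 1.1e11, 9.5),
    ("C", 64, 1.6e12, 4.9e11, 10.5),
]

for cls, cores, flops, dram_bytes, seconds in weak_runs:
    ai = flops / dram_bytes
    gflops = flops / seconds / 1e9
    print(f"{cls}.c{cores}: AI = {ai:.2f} flops/byte, {gflops:.1f} GFlop/s")
```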

Using our roofline trajectory methodology to assess weak scaling could provide insights into the benchmark's dependence on temporal locality. As shown in Figure 6, LU has ∆x > 0 transitions during weak-scaling on the KNL cache hierarchy. Haswell's trajectory follows the opposite trend, ∆x < 0. This behavior indicates that LU has a common working set between the compute threads. This attribute is difficult to capture using code inspection because the common dataset stems from false sharing. Interference could reduce the effectiveness of keeping this working set in the cache. Only a few applications weak-scale without a change in ∆x, e.g., IS.

The weak-scaling analysis on KNL in particular could provide insights into what kind of cache interference is happening: whether the interference is between the threads involved in the computation, or the cache eviction happens due to changing phases. On KNL, a ∆x < 0 when threads are scattered indicates self-interference, i.e., cache evictions between different phases of computation. We observed such behavior with FT. For instance, SP self-evicts part of its working set in the weak-scaling experiments, while BT does not experience such behavior. A ∆x > 0 indicates a shared working set between the threads, where one thread effectively prefetches data for other threads.

Fig. 3: Scaling trajectories for the OpenMP-based LU kernel and the mini-applications BT and SP on KNL vs. Haswell. These benchmarks show sensitivity to the cache hierarchy and exhibit significant ∆x transitions.

C. Limits of Vertical Transitions

We have observed that most applications do not exceed the computational roofline of scalar operations. While this may not be a limitation if the application is memory bound, some applications are not memory bound on either the Haswell or KNL systems. We also notice that the performance relative to peak scalar performance is lower on KNL compared with Haswell. Not leveraging vectorization is a severe limitation, especially on the KNL architecture where a 16× factor of the full computational power is at stake.

Another critical factor is how effective the compiler is at vectorizing the application code, or how amenable the application is to vectorization. In Figure 7, we report the scalar floating-point operations, which were not vectorized, as a percentage of the total floating-point operations. The reported values are based on Intel SDE profiling. The impact of having a large percentage of scalar operations is significant on architectures where the vector processing units are responsible for executing the scalar operations. On KNL, only one of the two VPUs can execute scalar and legacy vector operations [8], such as SSE. We notice that well-vectorized benchmarks, such as MG and FT, are memory bound. As such, they did not benefit much from their efficient vectorization. Benchmarks that are compute bound, such as LU and BT, have a large percentage of scalar flops. Given that the generated code may not fully utilize the functional units in performing application floating-point operations, we need to consider the efficiency of utilizing the vectorization units. We define vectorization efficiency as

    vec_eff = (total flops − scalar flops) / (vector µops × max flops per µop)

The numerator is a measure of the application's vectorized instructions. The denominator is proportional to the vector micro-ops (µops), involving various vectorization widths, which flow through the VPU units. These µops include, in addition to floating-point computations, integer, shuffle, broadcast, convert, etc. For the studied applications, non-floating-point instructions are generated automatically by the compiler, but their occupancy of the VPU units reduces the effective throughput observed by the application. We used Intel SDE [7] for measuring the application-level scalar and vector instructions, while we used performance counter measurements reported by LIKWID to measure the µops.
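A minimal Python sketch of this metric follows; it is our own illustration, with hypothetical counts standing in for the SDE- and LIKWID-measured values and using the paper's estimate of 8 double-precision flops per vector instruction:

```python
def vectorization_efficiency(total_flops, scalar_flops, vector_uops,
                             max_flops_per_uop=8):
    """vec_eff = (total flops - scalar flops) / (vector uops * max flops per uop).

    total_flops / scalar_flops would come from Intel SDE instruction counts,
    vector_uops from LIKWID performance counters; values here are hypothetical.
    """
    return (total_flops - scalar_flops) / (vector_uops * max_flops_per_uop)

# Hypothetical kernel: 1e12 flops, 25% of them scalar, 3.5e11 vector uops issued.
print(f"{vectorization_efficiency(1.0e12, 2.5e11, 3.5e11):.0%}")  # ~27%
```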

This vectorization efficiency metric averages the efficiency of exploiting SSE, AVX2, and AVX512 (with and without masking) across different regions of the code. The effectiveness of vectorization depends on many factors including data alignment, loop iteration count, the complexity of the control-flow instructions, etc. Given that most of the explored codes are written in Fortran with compile-time known loop bounds, the compiler's ability to vectorize these codes is usually high. In fact, we measured a higher vectorization success rate on these codes compared with a variant of the code written in C. The flops per instruction are 8 (or 16 for fused instructions³) double-precision flops per vector instruction. The results, shown in Figure 8, use an estimate of 8 flops per instruction. We observed the highest efficiency for FT and MG, approaching up to 54%. For LU, BT, and SP the vectorization efficiency is 22-25%. In general, the vector floating-point µops are less than 50% of the total µops. For CG, we measured the lowest floating-point µop fraction, roughly 19%. The remaining µops are used to gather and align the data for the efficient vector fused-multiply-add computations. Such low average efficiency in vectorization correlates with the hovering below the rooflines for multiple applications.

Fig. 4: Scaling trajectories for MPI-based kernels LU, BT, and SP on KNL vs. Haswell, running on a single node. We observe an improvement in performance isolation on the shared cache, but we generally notice scaling trends similar to those with OpenMP.

Another performance-limiting factor on KNL is the runtime overheads. To quantify such overheads, we used the CORAL CLOMP benchmark [9] for assessing OpenMP performance. As shown in Figure 9, the barrier synchronization latency using 64 threads is 4 µs. This latency roughly doubles at 256 threads. Generally, we observe the runtime latency to be proportional to the logarithm of the thread count on KNL. The runtime performance is also generally better on Haswell compared with KNL for the same level of concurrency, which is expected because Haswell has faster cores. On Haswell, the rate of latency increase with concurrency is lower than the log of the thread concurrency. This behavior suggests that scaling is probably benefiting from the cache hierarchy on Haswell better than on KNL.

³An AVX512 instruction could perform up to 16 double-precision floating-point operations using fused multiply-add instructions.

D. Architectural Exploration for Performance

The presented trends show the challenge in the transition from one architecture to the other, even if they share more or less the same ISA. The quest for performance could follow multiple fronts. First, some applications should put more effort into refactoring the code to better leverage the cache hierarchy at hand. For applications composed of multiple computational phases, this could be a difficult task to achieve.

The second front would be to explore the architectural impact on the various application computational patterns. In this work, we explored two design points, one characterized by sharing the LLC and the other with minimal sharing of the LLC. While we understand that distributed caches are more scalable for many-core architectures, the level of LLC sharing (i.e., the number of cores per tile) could be changed. Intel Haswell could be viewed as a 2-tile (socket) architecture with 16 cores per tile (socket), while KNL has a 36-tile architecture with 2 cores per tile. Between these two design points, there are other under-explored design variants.

Moreover, the level of sharing for the L1 cache is 4 hardware threads in KNL, while it is 2 for Haswell. Studying the impact of varying the cache hierarchy with the visualization of roofline scaling trajectories could help in developing better designs. Applications that benefit from performance isolation would require architectural features that allow fair resource management of caches. Exploration of such techniques is beyond the scope of this paper.

Fig. 5: Scaling trajectories for OpenMP-based user-defined-op kernels IS and UA on KNL vs. Haswell. UA's irregular access pattern benefits from Haswell's shared cache.

Fig. 6: LU OpenMP weak scaling. KNL provides better weak scaling than Haswell, but starts from a lower computational intensity.

Fig. 7: Scalar operation percentage for the NAS NPB. Some applications, such as BT and SP, have a significant fraction of their floating-point operations not vectorized.

Fig. 8: Average vectorization efficiency on the KNL architecture as a percentage of fully utilized AVX512 instructions (assuming a max of 8 DP flops per instruction).

VI. RELATED WORK

In the distributed memory regime, LogP and LogGP were commonly used to model the latency, overheads, and bandwidths of the network communication layer that constrain inter-node performance and scalability [10], [11]. Although applicable to understanding MPI-dominated flat MPI execution models, we have chosen to focus on codes dominated by on-node computation where inter-process communication is kept to a minimum.

Fig. 9: Barrier latency on Haswell and KNL. KNL generally has higher latency for synchronization at the same level of concurrency, in addition to supporting higher concurrency.

The Roofline methodology has been applied to multiple levels of the cache hierarchy in order to estimate the average memory bandwidth [2], [12]. However, those experiments were conducted at fixed concurrency, while our experiments are designed to visualize the interplay between concurrency, application parallelization strategy, and architecture on performance. The Cache-Aware Roofline Model (CARM) is a similar technique used to infer locality by comparing average memory bandwidth to the Roofline ceilings [12]. However, there are two major differences with our work. First, CARM is nominally run at full concurrency, whereas we observe the performance and locality trends as we scale concurrency. Second, when calculating arithmetic intensity, CARM only uses the number of loads and stores presented to the L1 cache, whereas we use the number of bytes to DRAM after filtering by all cache levels. As a result, CARM cannot observe any contention in the cache hierarchy that gives rise to capacity or conflict misses and superfluous DRAM data movement.

VII. CONCLUSIONS

In this paper, we present the Roofline Scaling Trajectory technique for performance analysis. We also extend the use of the Roofline Model to include user-defined operations. The presented method helps in identifying the performance dependency of an application on temporal architectural structures such as the cache hierarchy. As such, the presented method could serve both performance analysis for applications as well as architectural evaluations of cache hierarchies.

Applying our analysis to the NAS Parallel Benchmarks, we showed the correlation of performance limitations with their scaling trajectories on the Roofline. For instance, applications whose computational intensity changes while scaling, such as BT, SP, and LU, are likely to face difficulty in achieving roofline limits. They are the most sensitive to the migration between architectures using centralized and distributed cache hierarchies.

Studying the same application on multiple cache hierarchies could help distinguish loss of temporal locality due to concurrency-induced contention from self-interference between phases of computation for the same thread. This analysis technique is likely to be instrumental as we move to future architectures with a wide variety of memory hierarchies. Although we presented the technique applied to a single level of the memory hierarchy, extending it to other levels, including between memory and I/O, should be straightforward. We believe our technique provides a simple, yet powerful, visualization analysis of temporal behavior during scaling.

ACKNOWLEDGMENT

This research used resources in Lawrence Berkeley National Laboratory and the National Energy Research Scientific Computing Center, which are supported by the U.S. Department of Energy Office of Science's Advanced Scientific Computing Research program under contract number DE-AC02-05CH11231.

REFERENCES

[1] R. Dennard, F. Gaensslen, H. Yu, V. Rideout, E. Bassous, and A. LeBlanc, "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256–268, 1974.

[2] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for floating-point programs and multicore architectures," Communications of the ACM, April 2009.

[3] S. Williams, "Auto-tuning performance on multicore computers," Ph.D. dissertation, EECS Department, University of California, Berkeley, Dec 2008.

[4] D. Bailey, T. Harris, W. Saphir, R. Van Der Wijngaart, A. Woo, and M. Yarrow, "The NAS Parallel Benchmarks 2.0," Technical Report NAS-95-010, NASA Ames Research Center, 1995.

[5] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2006-183, Dec 2006.

[6] LIKWID, https://github.com/RRZE-HPC/likwid.

[7] Intel Software Development Emulator, https://software.intel.com/en-us/articles/intel-software-development-emulator.

[8] A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu, "Knights Landing: Second-generation Intel Xeon Phi product," IEEE Micro, vol. 36, no. 2, pp. 34–46, Mar.-Apr. 2016.

[9] G. Bronevetsky, J. Gyllenhaal, and B. R. de Supinski, "CLOMP: Accurately characterizing OpenMP application overheads," International Journal of Parallel Programming, vol. 37, no. 3, pp. 250–265, Jun 2009.

[10] D. E. Culler, R. M. Karp, D. Patterson, A. Sahay, E. E. Santos, K. E. Schauser, R. Subramonian, and T. von Eicken, "LogP: A practical model of parallel computation," Commun. ACM, vol. 39, no. 11, pp. 78–85, Nov. 1996.

[11] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, "LogGP: Incorporating long messages into the LogP model - one step closer towards a realistic model for parallel computation," in Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, ser. SPAA '95. New York, NY, USA: ACM, 1995, pp. 95–105.

[12] A. Ilic, F. Pratas, and L. Sousa, "Cache-aware Roofline model: Upgrading the loft," IEEE Comput. Archit. Lett., vol. 13, no. 1, pp. 21–24, Jan. 2014. [Online]. Available: http://dx.doi.org/10.1109/L-CA.2013.6