A Performance Tuning Methodology: From the System Down to the Hardware – Diving Deeper

Jackson Marusarz

Intel Corporation

ATPESC 2014

Software & Services Group, Developer Products Division

Copyright © 2014, Intel Corporation. All rights reserved.

*Other brands and names are the property of their respective owners.

Optimization: A Top-down Approach

System

• H/W tuning:

– BIOS (TB, HT)

– Memory

– Network I/O

– Disk I/O

• OS tuning:

– Page size

– Swap file

– RAM Disk

– Power settings

– Network protocols

Application

• Better application design:

– Parallelization

– Fast algorithms / data bases

– Programming language and RT libs

– Performance libraries

– Driver tuning

Processor

• Tuning for Microarchitecture:

– Compiler settings/Vectorization

– Memory/Cache usage

– CPU pitfalls

Side labels on the diagram: OS, System Expertise; SW/uArch

Performance Tuning – Diving Deeper

• Perform System and Algorithm tuning first

• This presentation uses screenshots from Intel® VTune™ Amplifier XE

• The concepts are widely applicable

Algorithm Tuning: A Few Words

• There is no one-size-fits-all solution to algorithm tuning

• Algorithm changes are often incorporated into the fixes for common issues

• Some considerations:

– Prefer parallelizable, scalable algorithms over the fastest serial implementation

– Compute a little more to save memory and communication

– Data locality -> vectorization

Compiler Performance Considerations

Feature                                          Flag
Optimization levels                              -O0, -O1, -O2, -O3
Vectorization                                    -xHost, -xavx, etc.
Multi-file inter-procedural optimization         -ipo
Profile guided optimization (multi-step build)   -prof-gen, -prof-use
Optimize for speed across the entire program     -fast (same as: -ipo -O3 -no-prec-div -static -xHost)
Automatic parallelization                        -parallel

**Warning: the definition of -fast changes over time

This is from the Intel compiler reference, but others are similar

• Compilers can provide considerable performance gains when used intelligently

• Consider compiling hot libraries and routines with more optimizations

• Always check documentation for accuracy effects

• This could be a day-long talk on its own

MPI Tuning

Is your application MPI-bound?

Is your application CPU-bound?

Resource usage

Largest MPI consumers

Next Steps

Intel® Trace Analyzer and Collector: http://intel.ly/traceanalyzer-collector

• Find the MPI/OpenMP sweet spot

• Determine how much memory your ranks/threads share

• Communication and synchronization overhead

Common Scaling Barriers

• Static Thread Scheduling

• Load Imbalance

• Lock Contention

You paid for the nodes, so use them!

Static Thread Scheduling

• Statically determining thread counts does not scale

• Core counts are trending higher

• Designs must consider future hardware

• Commonly found in legacy applications

    const int NUM_THREADS = 4;  // fixed at compile time: this is the scaling barrier
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    int chunk = limit / NUM_THREADS;
    for (t = 0; t < NUM_THREADS; t++) {
        range *r = new range();
        r->begin = t * chunk;
        // inclusive end; the last thread also takes the remainder of limit / NUM_THREADS
        r->end = (t == NUM_THREADS - 1) ? limit - 1 : t * chunk + chunk - 1;
        rc = pthread_create(&threads[t], NULL, FindPrimes, (void *)r);
    }

Create threads dynamically: NUM_THREADS = get_num_procs();
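A minimal sketch of the same partitioning with the thread count taken from the machine at run time. It swaps the slide's pthreads for C++11 `std::thread`, and `std::thread::hardware_concurrency()` stands in for the slide's `get_num_procs()`, which reads as a placeholder rather than a real API. `Range`, `is_prime`, `make_ranges` and `count_primes` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

struct Range { long begin; long end; };  // half-open interval [begin, end)

// Trial-division primality test; a stand-in for the slide's FindPrimes work.
bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Split [0, limit) into num_threads contiguous ranges. The first
// (limit % num_threads) ranges get one extra element so nothing is dropped.
std::vector<Range> make_ranges(long limit, unsigned num_threads) {
    assert(num_threads > 0);
    const long nt = static_cast<long>(num_threads);
    const long chunk = limit / nt;
    const long extra = limit % nt;
    std::vector<Range> ranges;
    long begin = 0;
    for (long t = 0; t < nt; ++t) {
        const long size = chunk + (t < extra ? 1 : 0);
        ranges.push_back({begin, begin + size});
        begin += size;
    }
    return ranges;
}

// Count primes below limit with one thread per hardware thread.
long count_primes(long limit) {
    // hardware_concurrency() may return 0 when the count is unknown.
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    const std::vector<Range> ranges = make_ranges(limit, num_threads);
    std::vector<long> partial(num_threads, 0);  // one result slot per thread
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back([&, t] {
            long local = 0;  // accumulate privately, publish once
            for (long n = ranges[t].begin; n < ranges[t].end; ++n)
                if (is_prime(n)) ++local;
            partial[t] = local;
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

Calling `count_primes(limit)` gives the same answer on any core count; only the number of ranges changes. The split is still static per thread, so it does not address uneven work per range.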

Load Imbalance

• Dynamically determining thread count helps… but isn’t a silver bullet

• Workload distribution must be intelligent

• Threads should be kept busy

• Maximize hardware utilization

Ideally all threads would complete their work at the same time

The key to balancing loads is to use a threading model that supports tasking and work stealing

Some examples:

• OpenMP* dynamic scheduling

• Intel® Threading Building Blocks

• Intel® Cilk™ Plus
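A minimal sketch of dynamic scheduling using only the C++ standard library: threads repeatedly claim small chunks from a shared atomic counter, so a thread that finishes early simply takes more work. This is similar in spirit to OpenMP dynamic scheduling; it is not work stealing as provided by the libraries listed above. `count_primes_dynamic`, `chunk_size` and `is_prime` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Trial-division primality test; cost grows with n, so equal-sized
// static ranges would give the later threads more work.
bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Count primes in [0, limit). Work is handed out chunk by chunk on demand.
long count_primes_dynamic(long limit, long chunk_size, unsigned num_threads) {
    assert(chunk_size > 0 && num_threads > 0);
    std::atomic<long> next(0);   // start of the next unclaimed chunk
    std::atomic<long> total(0);  // combined result
    auto worker = [&] {
        long local = 0;
        for (;;) {
            const long begin = next.fetch_add(chunk_size);  // claim a chunk
            if (begin >= limit) break;                      // no work left
            const long end = std::min(begin + chunk_size, limit);
            for (long n = begin; n < end; ++n)
                if (is_prime(n)) ++local;
        }
        total += local;  // one atomic update per thread, not per prime
    };
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return total.load();
}
```

The chunk size is the tuning knob: smaller chunks balance better but touch the shared counter more often.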

Lock Contention

• A well-balanced application can still suffer from shared-resource competition

• Synchronization is a necessary component

• Excessive overhead can destroy performance gains

• Numerous choices for where and how to synchronize

Some solutions to consider:

• Lock granularity

• Access overhead vs. wait time

• Using lock-free or thread-safe data structures

    tbb::atomic<int> primes;
    tbb::concurrent_vector<int> all_primes;

• Local storage and reductions
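The last item, local storage and reductions, could be sketched as follows with the C++ standard library instead of TBB. Each thread collects results in a private vector and takes the lock exactly once to merge them, rather than once per result. `collect_primes` and `is_prime` are hypothetical names, and the interleaved work split is an assumption made for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Collect all primes below limit, sorted ascending.
std::vector<long> collect_primes(long limit, unsigned num_threads) {
    assert(num_threads > 0);
    const long stride = static_cast<long>(num_threads);
    std::vector<long> all_primes;
    std::mutex merge_lock;
    auto worker = [&](long first) {
        std::vector<long> local;  // thread-private, no synchronization needed
        for (long n = first; n < limit; n += stride)
            if (is_prime(n)) local.push_back(n);
        // Reduction step: a single lock acquisition per thread.
        std::lock_guard<std::mutex> guard(merge_lock);
        all_primes.insert(all_primes.end(), local.begin(), local.end());
    };
    std::vector<std::thread> threads;
    for (long t = 0; t < stride; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    std::sort(all_primes.begin(), all_primes.end());  // merge order is arbitrary
    return all_primes;
}
```

With this shape the lock is contended at most `num_threads` times in total, independent of how many primes are found.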

Microarchitectural Tuning

• Intel uArch specific tuning

• After high-level changes look at PMUs for more tuning

– Find tuning guide for your hardware at www.intel.com/vtune-tuning-guides

• Every architecture has different events and metrics

• We try to keep things as consistent as possible

• Start with the Top-Down Methodology

– Integrated with the tuning guides

Introduction to Performance Monitoring Unit (PMU)

• Registers on Intel CPUs to count architectural events

– E.g. Instructions, Cache Misses, Branch Mispredict

• Events can be counted or sampled

– Sampled events include Instruction Pointer

• Raw event counts are difficult to interpret

– Use a tool like VTune or Perf with predefined metrics

Hardware Definitions (Background)

• Front-end:

– Fetches the program code

– Decodes it into low-level hardware operations: micro-ops (uops)

– uops are fed to the Back-end in a process called allocation

– Can allocate 4 uops per cycle

• Back-end:

– Monitors when a uop’s data operands are available

– Executes the uop in an available execution unit

– The completion of a uop’s execution is called retirement, and is where results of the uop are committed to the architectural state

– Can retire 4 uops per cycle

• Pipeline Slot:

– Represents the hardware resources needed to process one uop

Therefore, modern “Big Core” CPUs have 4 “Pipeline Slots” per cycle

The Top-Down Characterization

• Each pipeline slot on each cycle is classified into 1 of 4 categories.

• For each slot on each cycle:

The Top-Down Characterization

• Determines the hardware bottleneck in an application

• The four category fractions sum to 1.0

• Unit is “Percentage of total Pipeline Slots”

• This is the core of the new Top-Down characterization

• Each category is further broken down depending on available events

• Top-Down Characterization White Paper

– http://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues
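A minimal sketch of how the four category fractions could be derived from raw counts, assuming a 4-wide core and the level-1 formulas commonly published for the Top-Down method. The struct fields are hypothetical names for event totals, not PMU event names, and the exact events differ per microarchitecture, so the tuning guide for the target CPU remains the authority.

```cpp
#include <cassert>

// Hypothetical totals collected over a region of interest.
struct SlotCounts {
    double cycles;               // unhalted core clock cycles
    double uops_issued;          // uops allocated by the front-end
    double uops_retired;         // uops that reached retirement
    double slots_not_delivered;  // slots the front-end left empty while the back-end was ready
    double recovery_cycles;      // cycles spent recovering from mis-speculation
};

struct TopDown {
    double retiring;
    double bad_speculation;
    double front_end_bound;
    double back_end_bound;
};

// Classify pipeline slots into the four Top-Down categories.
// width is the number of pipeline slots per cycle (4 per the slides).
TopDown classify(const SlotCounts& c, double width = 4.0) {
    assert(c.cycles > 0 && width > 0);
    const double slots = width * c.cycles;  // total pipeline slots
    TopDown td;
    td.retiring = c.uops_retired / slots;
    td.front_end_bound = c.slots_not_delivered / slots;
    // Issued-but-never-retired uops plus slots lost during recovery.
    td.bad_speculation =
        (c.uops_issued - c.uops_retired + width * c.recovery_cycles) / slots;
    // Whatever remains is attributed to the back-end, so the sum is 1.0.
    td.back_end_bound =
        1.0 - (td.retiring + td.front_end_bound + td.bad_speculation);
    return td;
}
```

For example, 1000 cycles give 4000 slots; with 2000 uops retired the Retiring fraction is 0.5, and the other three fractions share the remaining half.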

Tuning Guide Recommendations

Expected Range of Pipeline Slots in this Category, for a Hotspot in a Well-tuned:

Category          Client/Desktop   Server/Database/Distributed   High Performance Computing (HPC)
                  application      application                   application
Retiring          20-50%           10-30%                        30-70%
Back-End Bound    20-40%           20-60%                        20-40%
Front-End Bound   5-10%            10-25%                        5-10%
Bad Speculation   5-10%            5-10%                         1-5%

Efficiency Method: % Retiring Pipeline Slots

• Why: Helps you understand how efficiently your app is using the processors

Efficiency Method: Changes in Cycles per Instruction (CPI)

• Why: Another measure of efficiency that can be useful when comparing 2 sets of data

– Shows average time it takes one of your workload’s instructions to execute
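As a small worked sketch, CPI is the cycle count divided by the number of instructions retired, and two runs can be compared by the relative change. The function names are hypothetical. One caveat worth keeping in mind: if a change alters the instruction count itself (vectorization typically does), CPI alone can mislead, so elapsed time should be compared as well.

```cpp
#include <cassert>

// Cycles per instruction for one measurement.
double cpi(double cycles, double instructions) {
    assert(instructions > 0);
    return cycles / instructions;
}

// Relative CPI change from a baseline run to a tuned run.
// Negative values mean fewer cycles per instruction after tuning.
double cpi_change(double baseline_cpi, double tuned_cpi) {
    assert(baseline_cpi > 0);
    return (tuned_cpi - baseline_cpi) / baseline_cpi;
}
```

For instance, 3000 cycles over 2000 instructions is a CPI of 1.5, and moving from a CPI of 2.0 to 1.5 is a change of -0.25, that is, 25% fewer cycles per instruction.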

Microarchitectural Tuning - Top-Down

• This code is actually pretty good. High retiring percent.

• Let’s investigate Back-End bound

Microarchitectural Tuning - Top-Down

We’re basically hammering the compute hardware. Are we vectorizing?

Microarchitectural Tuning - Top-Down

SSE instructions! Optimize with the compiler, e.g. -xHost

Microarchitectural Tuning - Top-Down

[Screenshots: Before and After, AVX2 on Haswell]

Top-Down with a Memory Bound issue

DRAM Bound Function

Top-Down with a Memory Bound issue

Array accesses are poorly addressed

From Tuning Guide:

Top-Down with a Memory Bound issue

With a Loop-Interchange (was 97% Back-End bound)
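The source of the DRAM-bound function is not shown on the slides, so the following is only an illustrative sketch of what a loop interchange looks like on a row-major matrix. Both functions compute the same sum; only the traversal order, and therefore the memory access stride, differs. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column index in the outer loop: consecutive inner iterations touch
// addresses n elements apart, so most accesses land on a new cache line.
double sum_strided(const std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i * n + j];
    return sum;
}

// After interchanging the loops the inner loop is unit stride, which is
// friendly to caches, hardware prefetchers and the vectorizer.
double sum_contiguous(const std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += a[i * n + j];
    return sum;
}
```

The benefit only shows up when the matrix is much larger than the last-level cache; for small `n` both versions behave alike.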

Top-Down for NUMA analysis

• Multi-socket systems with NUMA require special analysis

– VTune, numastat, numactl

• Remote cache and DRAM accesses can cause stalls

• Now what?

– Memory allocation vs. access

– Temporal locality

Memory Bandwidth using PMUs

• Know your max theoretical memory bandwidth

• Locate areas of high LLC misses

• PMU events available to calculate QPI bandwidth on newer processors
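A rough sketch of the arithmetic behind the first two bullets. Peak bandwidth is channels times transfers per second times bytes per transfer, and a first-order estimate of achieved DRAM traffic is LLC misses times the cache line size divided by elapsed time. The function names, the 8-byte transfer width and the 64-byte line size are assumptions; the estimate ignores prefetch and write-back traffic, so it tends to undercount.

```cpp
#include <cassert>

// Peak theoretical DRAM bandwidth in bytes per second for one socket.
// bytes_per_transfer defaults to 8, i.e. a 64-bit wide memory channel.
double peak_bandwidth(int channels, double transfers_per_sec,
                      int bytes_per_transfer = 8) {
    assert(channels > 0 && transfers_per_sec > 0 && bytes_per_transfer > 0);
    return channels * transfers_per_sec * bytes_per_transfer;
}

// First-order estimate of achieved bandwidth from LLC miss counts:
// each miss is assumed to fetch one cache line from memory.
double estimated_bandwidth(double llc_misses, double seconds,
                           int line_bytes = 64) {
    assert(seconds > 0 && line_bytes > 0);
    return llc_misses * line_bytes / seconds;
}
```

As an assumed example configuration, four channels at 1600 million transfers per second give 4 x 1.6e9 x 8 = 51.2 GB/s; 1e9 LLC misses in 2 seconds correspond to roughly 32 GB/s, about 63% of that peak.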

Tuning Guides Have Lots of Metrics and Hints

For example:

Intel Xeon Phi

• Has its own tuning guide and metrics

Intel Xeon Phi

• Efficiency Metric: Compute to Data Access Ratio

• Measures an application’s computational density, and suitability for Intel® Xeon Phi™ coprocessors

• Increase computational density through vectorization and reducing data access (see cache issues, also, DATA ALIGNMENT!)

Metric                            Formula                                              Investigate if
Vectorization Intensity           VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED
L1 Compute to Data Access Ratio   VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE             < Vectorization Intensity
L2 Compute to Data Access Ratio   VPU_ELEMENTS_ACTIVE / DATA_READ_MISS_OR_WRITE_MISS   < 100x L1 Compute to Data Access Ratio
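The table above can be turned into a small calculation. This sketch takes the event totals named in the table as plain numbers and applies the two "investigate if" rules; the struct and function names are hypothetical, and how the counts are collected is left to the profiler.

```cpp
#include <cassert>

// Event totals, named after the events in the table.
struct PhiCounts {
    double vpu_elements_active;
    double vpu_instructions_executed;
    double data_read_or_write;
    double data_read_miss_or_write_miss;
};

double vectorization_intensity(const PhiCounts& c) {
    assert(c.vpu_instructions_executed > 0);
    return c.vpu_elements_active / c.vpu_instructions_executed;
}

double l1_compute_to_data_ratio(const PhiCounts& c) {
    assert(c.data_read_or_write > 0);
    return c.vpu_elements_active / c.data_read_or_write;
}

double l2_compute_to_data_ratio(const PhiCounts& c) {
    assert(c.data_read_miss_or_write_miss > 0);
    return c.vpu_elements_active / c.data_read_miss_or_write_miss;
}

// Rule from the table: investigate if the L1 ratio < vectorization intensity.
bool investigate_l1(const PhiCounts& c) {
    return l1_compute_to_data_ratio(c) < vectorization_intensity(c);
}

// Rule from the table: investigate if the L2 ratio < 100x the L1 ratio.
bool investigate_l2(const PhiCounts& c) {
    return l2_compute_to_data_ratio(c) < 100.0 * l1_compute_to_data_ratio(c);
}
```

With made-up counts of 8000 active elements, 1000 VPU instructions, 2000 data accesses and 10 misses, the intensity is 8, the L1 ratio is 4 (flagged, since 4 < 8) and the L2 ratio is 800 (not flagged, since 800 >= 400).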

Intel Xeon Phi

• Has its own tuning guide and metrics

• Problem Area: VPU Usage

– Indicates whether an application is vectorized successfully and efficiently

• Tuning Suggestions:

– Use the Compiler vectorization report!

– For data dependencies preventing vectorization, try using Intel® Cilk™ Plus #pragma simd (if safe!)

– Align data and tell the Compiler!

– Restructure code if possible: Array notations, AOS->SOA

Metric                    Formula                                           Investigate if
Vectorization Intensity   VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED   <8 (DP), <16 (SP)
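The AOS->SOA suggestion can be illustrated with a short sketch. In the array-of-structures layout the `x` values of neighbouring points are separated by `y` and `z`, while the structure-of-arrays layout stores each field contiguously so a loop over one field is unit stride. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: fields of one point are adjacent in memory.
struct PointAoS { float x, y, z; };

// Structure of arrays: each field is stored contiguously.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Convert from AoS to SoA layout.
PointsSoA to_soa(const std::vector<PointAoS>& pts) {
    PointsSoA out;
    out.x.reserve(pts.size());
    out.y.reserve(pts.size());
    out.z.reserve(pts.size());
    for (const PointAoS& p : pts) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}

// Unit-stride loop over one field; a layout compilers can vectorize readily.
float sum_x(const PointsSoA& p) {
    assert(p.x.size() == p.y.size() && p.y.size() == p.z.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.x.size(); ++i) sum += p.x[i];
    return sum;
}
```

Whether the conversion pays off depends on the access pattern: code that always uses all three fields of a point together may be served just as well by the original layout.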

Performance Optimization Methodology

• Follow performance optimization process

– Use the Top-down approach to performance optimization

– Use iterative optimization process

– Utilize appropriate tools (Intel’s or non-Intel)

– Apply scientific approach when analyzing collected results

• Practice!

– Performance tuning experience helps achieve better results

– Right tools help as well

Performance Profiling Tools: Technology-Wise Selection

You have a choice of many:

• From simplest and fastest…

• To very complicated and/or slow

Tool categories shown on the slide:

• Instrumentation

• Sampling

• OS embedded: Task Manager, top, vmstat

• Application/platform simulators

• Project embedded: proprietary perf. infrastructure

Always consider overhead vs. level of detail – it’s often a tradeoff

Scientific Approach to Analysis

• None of the tools provide exact results

– Data collection overhead or dropping details

– Define what results need to be precise

• Low overhead tools provide statistical results

– Statistical theory is applicable

– Think of proper sampling frequency (for data bandwidth)

– Think of proper length of data collection (for process)

– Think of proper number of experiments and results deviation

• Take into account other processes in a system

– Anti-virus

– Daemons and services

– System processes

• Start early – tune often!
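The point about the number of experiments and results deviation can be made concrete with a small sketch that summarizes repeated run times. It reports the mean, the sample standard deviation and the relative deviation; a tuning gain smaller than the run-to-run deviation is hard to distinguish from noise. `Summary` and `summarize` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Summary {
    double mean;        // average run time
    double stddev;      // sample standard deviation (n - 1 in the denominator)
    double rel_stddev;  // stddev divided by mean
};

// Summarize the elapsed times of repeated runs of the same experiment.
Summary summarize(const std::vector<double>& run_times) {
    assert(run_times.size() >= 2);  // deviation needs at least two runs
    double sum = 0.0;
    for (double t : run_times) sum += t;
    const double mean = sum / run_times.size();
    double sq = 0.0;
    for (double t : run_times) sq += (t - mean) * (t - mean);
    const double stddev = std::sqrt(sq / (run_times.size() - 1));
    assert(mean != 0.0);
    return {mean, stddev, stddev / mean};
}
```

For three runs of 10, 12 and 14 seconds the mean is 12, the sample standard deviation is 2, and the relative deviation is about 17%, which would call for more runs or a quieter system before trusting small improvements.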

References

• Top-Down Performance Tuning Methodology

– www.software.intel.com/en-us/articles/de-mystifying-software-performance-optimization

• Top-Down Characterization of Microarchitectural Bottlenecks

– www.software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues

• Intel® VTune™ Amplifier XE

– www.intel.ly/vtune-amplifier-xe

• Tuning Guides

– www.intel.com/vtune-tuning-guides

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice
