8/10/2019 Compiler Support for Multicore
1/41
Compiler Support for Multi-Core
Stephen Blair-Chappell
Agenda
Optimisations
Global Compiler Options
Inter-procedural Optimisations
Profile Guided Optimisations
Vectorisation
Parallelisation
Common Optimization Switches

Linux & Mac OS*        Windows                  Description
-O0                    /Od                      Disable optimization
-O1                    /O1                      Optimize for speed (no code size increase)
-O2                    /O2                      Optimize for speed (default)
-O3                    /O3                      High-level optimizer, including prefetch, unroll
-g                     /Zi                      Create symbols for debugging
-prof-gen, -prof-use   /Qprof-gen, /Qprof-use   Profile guided optimization (multi-step build)
-ipo                   /Qipo                    Inter-procedural optimization
-parallel              /Qparallel               Automatic parallelization
-fast                  /fast                    Optimize for speed, including IPO
-openmp                /Qopenmp                 OpenMP 2.5 support
Itanium and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
Optimisations
Global Compiler Options
Inter-procedural Optimisations
Profile Guided Optimisations
Vectorisation
Parallelisation
Interprocedural Optimizations
Enables inlining, better register usage, dead code elimination, etc.
Usage:
icpc -ip: single-file IPO
icpc -ipo: multi-file IPO (link-time code generation; increases build time)
IPO: Two Step Process
Pass 1 (compiling, produces ipo objects): icpc -c -ipo a.cxx b.cxx
Pass 2 (linking, produces the executable): icpc -ipo a.o b.o
Usability Tips:
- Try IPO on performance critical files/libs
- Don't run -ipo on 10,000s of object files; avoid unnecessarily increased build time
- Remember to link with the -ipo option
Interprocedural Optimization
Extends optimizations across file boundaries
[Figure: Without IPO, file1.c, file2.c, file3.c and file4.c each pass through their own "Compile & Optimize" step. With IPO, the four files pass through a single "Compile & Optimize" step.]
-ip: Only between modules of one source file
-ipo: Modules of multiple files/whole application
Optimisations
Global Compiler Options
Inter-procedural Optimisations
Profile Guided Optimisations
Vectorisation
Parallelisation
Profile-Guided Optimizations
Optimizing with runtime feedback
Enhances all optimizations, especially IPO, register allocation, instruction cache usage, switch statement optimization, etc.
The Code-Coverage and Test-Prioritization tools use PGO technology
Usability Tips:
- Run on typical input dataset(s)
- Each run generates a data file
- Compiler calculates averages of all runs
Profile-Guided Optimizations (PGO)
Use execution-time feedback to guide (final) optimization
Helps I-cache, paging, branch-prediction
Enabled optimizations:
Basic block ordering
Better register allocation
Better decision on which functions to inline
Function ordering
Switch-statement optimization
Optimisations
Global Compiler Options
Inter-procedural Optimisations
Profile Guided Optimisations
Vectorisation
Parallelisation
Automatic Compiler Vectorization
Processor Specific Optimizations
Automatically generate vector SSE/SSE2/SSE3/SSSE3/SSE4 instructions
Vector processing: Operate at once on:
4 single precision floating point values
2 double precision floating point values
4 integer values
Etc.
Optimal code generation and instruction scheduling
Large number of options for advanced control of vectorization
Specify trip count, ignore dependencies (ivdep), specify alignment, disable vectorization, etc.
Auto-Vectorization (IA-32 and Intel 64): Optimizing Loops with SSE/SSE2/SSE3/SSSE3/SSE4
Your Task: convert this
$ cat w.c
void work( float* a, float *b, float *c, int MAX) {
for (int I=0;I
void work( float* a, float *b, float *c, int MAX) {
for (int I=0;I
Vectorization Report
Loop was not vectorized because:
Existence of vector dependence
Non-unit stride used
Mixed Data Types
Condition too Complex
Condition may protect exception
Low trip count
Subscript too complex
Unsupported Loop Structure
Contains unvectorizable statement at line XX
Not Inner Loop
"vectorization possible but seems inefficient"
Operator unsuited for vectorization
Compiler Based Vectorization
Automatic Processor Dispatch: -ax[?]
Single executable
Optimized for Intel Core Duo processors and generic code that runs on all IA32 processors.
For each target processor it uses:
Processor-specific instructions
Vectorization
Low overhead
Some increase in code size
Processor Specific Options
Each entry lists the Windows* switch, then the Linux* and Mac OS* switch, then the processor target.
/QxO, -xO: Generates SSE3 where possible on Intel and any Intel compatible system, such as AMD* Opteron*, not using CPU dispatch. Applications will crash with an illegal instruction on systems that don't support SSE3/SSE2. Will not utilize SSE4/SSSE3, and may not be as optimal as -axT or -xT.
/QxP, -xP: Generate SSE3 on supported Intel processors.
/QxT, -xT: Generate SSSE3 on supported Intel processors with Intel Core microarchitecture.
/QxS, -xS: Generate SSE4 on future Intel processors, code name Penryn.
/QaxT, -axT: CPU Dispatch: Generate SSSE3 for supported Intel processors, and generic code for Intel 64 compatible processors, such as AMD* Opteron*, via CPU dispatch. Can use -axS to generate SSE4 instructions.
Optimisations
Global Compiler Options
Inter-procedural Optimisations
Profile Guided Optimisations
Vectorisation
Parallelisation
Auto Parallelisation
OpenMP
Auto-parallelization
Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives.
Compiler can identify easy candidates for parallelization, but large applications are difficult to analyze.
Windows*: /Qparallel, /Qpar_report[n]
Linux*: -parallel, -par_report[n]
Mac*: -parallel, -par_report[n]
Cluster OpenMP
Since Intel Compilers 9.1: Cluster OpenMP*
Extends OpenMP* from Shared Memory Processors (SMP) to Distributed Memory systems (Clusters)
Not a single system image
Minor language extensions: only one new directive (SHARABLE)
Optimization Strategy
Turn on the reporting feature of the compiler
Use Representative workload
Use VTune Analyzer to find Hot Spots
Focus effort on Hot Spots
Try advanced compiler optimizations on Hot spots
Re-Run workload
Seeing expected benefits? If not, look at optimization reports
Which Option First?
First try compiler vectorization options
Try -O3 for loop-bound hot functions
Try Interprocedural (IPO) & Profile Guided Optimization (PGO)
Recommended: use IPO on hot functions / libraries
Can use PGO on hot functions / libraries or entire application
Compiler Optimization Reports
Tells what optimizations were done and, most importantly, gives hints on what prevented a given optimization
Turn on Optimization Reports: -opt-report
Can be read by VTune Performance Analyzer
Default report is verbose; recommend selecting a specific optimization phase
Enable Vectorizer reports: -vec-report3
Enable Loop Optimizer (-O3) reports: -opt-report-phase hlo
Vectorization Example: Aliasing problem prevented vectorization:
icc hpo.c -c -O3 -xT -vec-report3
loop was not vectorized: existence of vector dependence.
vector dependence: proven FLOW dependence between a line 48, and b line 48.
HLO Example: Compiler able to optimize: generated multiple versions of loop, did loop interchange:
icc hpo.c -c -O3 -fargument-noalias -xT -opt-report-phase hlo
LOOP DISTRIBUTION in doit at line 43
LOOP INTERCHANGE in loops at line: 43 47
Loopnest permutation ( 1 2 ) --> ( 2 1 )
Security
Static Verifier
Stack Checking & Buffer Overflow
Detecting x87 FP Stack Corruption
Mudflap Support
Static Verifier
New in Intel C++ and Fortran Compilers version 10.0
Detects defects or questionable code for C, C++, Fortran & OpenMP*
- Can analyze mixed C/C++/Fortran applications
Multi-file analysis
Static Verifier analysis is done at compile/link time; it doesn't detect run-time errors, such as passing incorrect parameters to a function
Defects Detected
- Inconsistent object or function declaration in different parts of the application; verifies function arguments
- Uninitialized variables
- Memory leaks & memory corruption
- Incorrect usage of pointers and allocatable arrays
- Incorrect OpenMP usage
Detecting Buffer Overflow
Compiler generates code to detect some buffer overflows that overwrite the return address.
Helps prevent commonly exploited security vulnerabilities
Compiler Options:
Linux* and Mac OS* X: icc -fstack-security-check
Windows*: icl /GS
Buffer Overflow Example
$ cat Buffer_overflow.c
#include "string.h"
void example(char *s) {
    char buf[8];
    strcpy(buf, s);   /* no length check: overflows buf for long input */
}
int main(int argc, char **argv) {
    example(argv[1]);
}
$ icc Buffer_overflow.c
$ ./a.out AhhhBustMyBuffers
Segmentation fault
$ icc -fstack-security-check Buffer_overflow.c
$ ./a.out AhhhBustMeBuffers
Error: Buffer overrun occurred, forced exit
Improved Floating Point Model (C++)

Model: -fp:source
  Rounding: Intermediate results in source precision. Rounding after each operation
  Algebraic transform: Use FP non-associative, non-distributive algebra
  FP env access: Disabled
  FP exception: Disabled

Model: -fp:precise
  Rounding: Intermediate results evaluated at register precision. Rounding at assignment, type casting, function call
  Algebraic transform: Same as fp:source
  FP env access: Disabled
  FP exception: Disabled

Model: -fp:fast
  Rounding: Intermediate result precision, rounding determined by the compiler
  Algebraic transform: Use real algebra
  FP env access: Disabled
  FP exception: Disabled

Model: -fp:strict
  Rounding: Same as fp:precise
  Algebraic transform: Same as fp:source
  FP env access: Enabled
  FP exception: Enabled
Quality
Scott Meyers "Effective C++" Diagnostics
Porting from 32 to 64-bits
Assisting Threaded App Development
Code Coverage and Test Prioritization
10.0: Better C++ diagnostics: Effective C++
Based on:
Effective C++ Second Edition: 50 Specific Ways to Improve Your Programs and Designs (Scott Meyers)
More Effective C++: 35 New Ways to Improve Your Programs and Designs (Scott Meyers)
Enabled via -Weffc++ ( /Qeffc++ )
Examples include:
Use const and inline rather than #define
Use <iostream> rather than <stdio.h>
Use new and delete rather than malloc and free
Use delete on pointer members in destructors (diagnoses any pointer that does not have a delete)
Have a user copy constructor and assignment operator in classes containing pointers
Use initialization rather than assignment to members in constructors
etc.
Porting from 32 to 64 bit
Moving from 32 to 64 bit can result in porting errors
-Wp64: enables 64-bit porting diagnostics

Operating System   IA-32   Intel 64   IA-64
Windows*           ILP32   P64        P64
Linux*             ILP32   LP64       LP64
Mac OS* 10         ILP32   LP64       N/A
Threading Legacy Applications
Compiler Global Variable Accesses Diagnostic
Problem: Threading legacy code that contains a large number of global variables. Need to protect access to globals throughout the application.
Intel C++ Compiler has compile-time diagnostics to identify when global variables are accessed, available since the 9.0 release (2005)
Linux* / Mac OS*: Enabled via -ww1710,1711,1712 -fsyntax-only
Windows*: /Qww1710,1711,1712 /Zs
Can enable each diagnostic separately:
1710 warns about reference to statically allocated variables
1711 warns about assignment to statically allocated variables
1712 warns about address taken of statically allocated variables
Threading Legacy Applications
Identifying Global Variable Accesses
$ cat a.cpp
1: static int x;
2: void foo(int *);
3: void funcx(void){
4: int y;
5: x=2;
6: y=x;
7: foo(&x);
8: }
9:
10: extern int q;
11: int p;
12: void funcy(void) {
13: q=10;
14: p=5;
15: }
$ icc -ww1710,1711,1712 a.cpp
a.cpp(5): warning #1711: assignment to statically allocated variable "x"
x=2;
a.cpp(6): warning #1710: reference to statically allocated variable "x"
y=x;
a.cpp(7): warning #1712: address taken of statically allocated variable "x"
foo(&x);
a.cpp(13): warning #1711: assignment to statically allocated variable "q"
q=10;
a.cpp(14): warning #1711: assignment to statically allocated variable "p"
p=5;
Intel Code Coverage Tool
Example of code coverage summary for a project. The workload applied in this test exercised 34 of 143 blocks, representing 5 of 19 functions in 2 of 3 modules. In the file SAMPLE.C, 4 of 5 functions were exercised.
Clicking on SAMPLE.C produces a listing that highlights the code that was exercised. In this example, the pink-highlighted code was never exercised, the yellow was run but not exercised by any of the tests set up by the developer, and the beige was partially covered.
Intel Test Prioritization Tool
Helps guide and speed software testing
Helps produce better code more quickly
Helps improve programmer productivity
Example:
Total Number of Tests = 3
Total Block Coverage ~ 52.17%
Total Function Coverage ~ 50.00%

Number of Tests   %RatCvrg   %BlkCvrg   %FuncCvrg   Test Names @ Options
1                 87.50      45.65      37.50       Test3.dpi
2                 100.00     52.17      50.00       Test2.dpi

These 3 tests achieve 52.17% block and 50.00% function coverage
Test 3 alone covers 45.65% of basic blocks, or 87.50% of the total block coverage from all tests
By adding Test 2, cumulative block coverage goes to 52.17%, or 100% of the total block coverage of Test 1, Test 2, and Test 3
Eliminating Test 1 has no negative impact on block coverage and saves time
Compatibility
C++ Compatibility with Microsoft
Source & binary compatible with VC 2003 under /Qvc71
Source & binary compatible with VC 2005 under /Qvc8
Microsoft* & Intel OpenMP binaries are compatible. Use the option
Support for Code Targeting AMD*
Goal: Competitive on AMD*; Best on Intel
Compilers and Libraries support AMD* Opteron* processor-based systems
Our Analysis Tools (Intel VTune Analyzer and Threading Tools) do NOT support AMD* processors
May use specific features present only on Intel processors
Intel Compilers and Performance Libraries offer leadership performance on Intel processors; competitive performance on AMD*.
Linux C/C++: Intel and GNU Compatibility History
Established C++ ABI Industry Group
Intel Compiler for Linux*
Version 5.0.1
- C language binary compatibility, using glibc for C library
Versions 6.0 and 7.1
- C++ ABI compliant
- Subtle differences in ABI compliance with gcc prevent full binary compatibility
Version 8.0
- Match gcc 3.2, 3.3, & 3.4 C++ ABI
- Full C++ binary interoperability
Version 8.1
- gcc binary compatibility is the default for gcc 3.2, 3.3, & 3.4
Version 9.0
- No g++ compatibility changes required, adds gcc 4.0 support
Version 9.1
- No g++ compatibility changes required, adds gcc 4.1 support
Version 10.0
- Requires g++ compatibility (removed Intel-provided C++ libraries), adds gcc 4.2 support
10.0 OS Support Matrix
Architecture columns: IA32, Intel 64, IPF
Linux Distros:
Red Hat EL3
SuSE SLES 10
SGI Propack v4.0
SGI Propack v5.0
Red Flag DC Server 5.0
Red Hat EL4
Red Hat Fedora Core 5
Turbo Linux 10
Mandriva/Mandrake 10.1
Red Hat Fedora Core 4
Haansoft Linux 2006 Server
Miracle Linux v4.0
SuSE SLES9
Additional Resources
Intel C++ Compiler for Linux* product website - http://www.intel.com/software/products/compilers/clin
Active User Forum - http://softwarecommunity.intel.com/isn/Community/en-US/forums/1016/ShowForum.aspx
C++ Compiler White Papers - http://www3.intel.com/cd/software/products/asmo-na/eng/278608.htm
Useful White Papers
Quick Reference Guide White Paper - http://cache-www.intel.com/cd/00/00/22/23/222300_222300.pdf
Optimization Guide White Paper - http://cache-www.intel.com/cd/00/00/27/66/276615_276615.pdf
gcc/g++ compatibility White Paper - http://cache-www.intel.com/cd/00/00/28/47/284736_284736.pdf
Code Coverage White Paper - http://cache-www.intel.com/cd/00/00/21/92/219280_compiler_code-coverage.pdf
Security White Paper - http://cache-www.intel.com/cd/00/00/37/03/370307_370307.pdf