Compiler Support for Multicore


  • 8/10/2019 Compiler Support for Multicore

    1/41

    Compiler Support for Multi-Core

    Stephen Blair-Chappell


    Optimisations

    Global Compiler Options

    Inter-procedural Optimisations

    Profile Guided Optimisations

    Vectorisation

    Parallelisation

    Common Optimization Switches

    Linux* & Mac OS*        Windows*                  Description
    -O0                     /Od                       Disable optimization
    -g                      /Zi                       Create symbols for debugging
    -O1                     /O1                       Optimize for speed (no code size increase)
    -O2                     /O2                       Optimize for speed (default)
    -O3                     /O3                       High-level optimizer, including prefetch, unroll
    -prof-gen, -prof-use    /Qprof-gen, /Qprof-use    Profile guided optimization (multi-step build)
    -ipo                    /Qipo                     Inter-procedural optimization
    -parallel               /Qparallel                Automatic parallelization
    -fast                   /fast                     Optimize for speed, including IPO
    -openmp                 /Qopenmp                  OpenMP 2.5 support

    Itanium and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.


    Interprocedural Optimizations

    Enables inlining, better register usage, dead code elimination, etc.

    Usage:

    - icpc -ip: single-file IPO
    - icpc -ipo: multi-file IPO. Uses link-time code generation, which increases build time.

    IPO: Two-Step Process

    - Pass 1 (compiling): icpc -c -ipo a.cxx b.cxx produces the IPO objects
    - Pass 2 (linking): icpc -ipo a.o b.o produces the executable

    Usability Tips:

    - Try IPO on performance-critical files/libs
    - Don't run -ipo on 10,000s of object files; avoid unnecessarily increased build time
    - Remember to link with the -ipo option

    Interprocedural Optimization

    Extends optimizations across file boundaries

    Without IPO: file1.c, file2.c, file3.c and file4.c are each compiled and optimized separately.

    With IPO: file1.c, file2.c, file3.c and file4.c are compiled and optimized together.

    -ip     Only between modules of one source file
    -ipo    Modules of multiple files/whole application
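
    As a rough illustration of what crossing file boundaries buys, the sketch below shows a hot loop that calls a small helper. The functions scale and scale_all and the file names in the comments are hypothetical; both functions are shown in one listing only so that it compiles on its own. If scale lived in a separate source file, a build without IPO would normally have to emit a real call on every iteration, whereas a build with -ipo may be able to inline it into the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical file2.c: a tiny helper defined in its own source file. */
float scale(float x, float factor)
{
    return x * factor;
}

/* Hypothetical file1.c: the hot loop that calls the helper.
 * Compiled separately, the call to scale() crosses a file boundary. */
void scale_all(float *data, size_t n, float factor)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        data[i] = scale(data[i], factor);
}
```

    Following the two-step process described earlier, such a pair of files would be compiled with icc -c -ipo file1.c file2.c and linked with icc -ipo file1.o file2.o.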


    Profile-Guided Optimizations

    Optimizing with runtime feedback

    Enhances all optimizations, especially IPO, register allocation, instruction cache usage, switch statement optimization, etc.

    The Code-Coverage and Test-Prioritization tools use PGO technology

    Usability Tips:

    - Run on typical input dataset(s)
    - Each run generates a data file
    - Compiler calculates averages of all runs


    Profile-Guided Optimizations (PGO)

    Use execution-time feedback to guide (final) optimization

    Helps I-cache, paging, branch-prediction

    Enabled optimizations:

    Basic block ordering

    Better register allocation

    Better decision on which functions to inline

    Function ordering

    Switch-statement optimization
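
    To make the switch-statement and basic-block points concrete, here is a minimal sketch of code that tends to benefit from a profile. The opcode names and the function run_ops are hypothetical. If a representative run shows that one case dominates, the compiler has the information it needs to test that case first and to move rarely executed blocks away from the hot path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode values for a tiny interpreter-style dispatch. */
enum { OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_RESET = 3 };

/* Applies a stream of opcodes to an accumulator that starts at zero.
 * Which case is "hot" depends entirely on the input, which is exactly
 * what a profiling run records. */
long run_ops(const int *ops, size_t n, long operand)
{
    long acc = 0;
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        switch (ops[i]) {
        case OP_ADD:   acc += operand; break;
        case OP_SUB:   acc -= operand; break;
        case OP_MUL:   acc *= operand; break;
        case OP_RESET: acc = 0;        break;
        default:       break;          /* unknown opcodes are ignored */
        }
    }
    return acc;
}
```

    With the switches listed earlier, the multi-step build would be: compile with -prof-gen, run the instrumented program on typical input, then recompile with -prof-use.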


    Automatic Compiler Vectorization

    Processor Specific Optimizations

    Automatically generates SSE/SSE2/SSE3/SSSE3/SSE4 vector instructions

    Vector processing operates at once on:

    - 4 floating point values
    - 2 double precision floating point values
    - 4 integer values
    - Etc.

    Optimal code generation and instruction scheduling

    Large number of options for advanced control of vectorization: specify trip count, ignore dependencies (ivdep), specify alignment, disable vectorization, etc.

    Auto-Vectorization (IA-32 and Intel 64): Optimizing Loops with SSE/SSE2/SSE3/SSSE3/SSE4

    Your Task: convert this

    $ cat w.c

    void work( float* a, float *b, float *c, int MAX) {

    for (int I=0;I
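
    The listing above is cut off in this copy, so the following is a separate sketch rather than the slide's own code. It assumes a simple element-wise operation and shows the general shape of a loop that an auto-vectorizer handles well: unit stride, a single data type, a trip count known on entry, and no overlap between the arrays. The function name add_arrays is hypothetical, and the restrict qualifier requires C99.

```c
#include <assert.h>

/* Element-wise addition over unit-stride float arrays.
 * 'restrict' promises the compiler that the arrays do not overlap, which
 * removes the aliasing question that can otherwise block vectorization. */
void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict c, int max)
{
    assert(max >= 0);
    for (int i = 0; i < max; ++i)
        c[i] = a[i] + b[i];   /* each iteration is independent */
}
```

    With 4 floats per SSE register, a vectorized version of this loop can process four iterations per vector instruction, plus a short scalar remainder when max is not a multiple of four.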

    Vectorization Report

    Loop was not vectorized because:

    - Existence of vector dependence
    - Non-unit stride used
    - Mixed data types
    - Condition too complex
    - Condition may protect exception
    - Low trip count
    - Subscript too complex
    - Unsupported loop structure
    - Contains unvectorizable statement at line XX
    - Not inner loop
    - "vectorization possible but seems inefficient"
    - Operator unsuited for vectorization
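
    The most common of these messages, the vector dependence, is easiest to see side by side. The two functions below are hypothetical examples: the first carries a value from one iteration to the next, the second does not.

```c
#include <assert.h>

/* Loop-carried (flow) dependence: iteration i reads the value that
 * iteration i-1 wrote, so the iterations cannot simply run side by side. */
void running_sum(float *a, int n)
{
    assert(n >= 0);
    for (int i = 1; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}

/* No loop-carried dependence: every iteration touches only its own
 * element, so several iterations can be packed into one vector operation. */
void add_constant(float *a, int n, float k)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        a[i] = a[i] + k;
}
```

    A compiler would be expected to report a dependence for the first loop and to vectorize the second, although the exact wording and outcome depend on the compiler version and options.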

    Compiler Based Vectorization

    Automatic Processor Dispatch: -ax[?]

    Single executable

    Optimized for Intel Core Duo processors, plus generic code that runs on all IA-32 processors.

    For each target processor it uses:

    - Processor-specific instructions
    - Vectorization

    Low overhead

    Some increase in code size

    Processor Specific Options

    Windows*   Linux* and Mac OS*   Processor Target
    /QxO       -xO                  Generate SSE3 where possible on Intel and any Intel-compatible
                                    system, such as AMD* Opteron*, not using CPU dispatch.
                                    Applications will crash with an illegal instruction on systems
                                    that don't support SSE3/SSE2/SSE. Will not utilize SSE4/SSSE3,
                                    and may not be as optimal as -axT or -xT.
    /QxS       -xS                  Generate SSE4 on future Intel processors code-named Penryn
    /QxT       -xT                  Generate SSSE3 on supported Intel processors with Intel Core
                                    microarchitecture
    /QaxT      -axT                 CPU dispatch: generate SSSE3 for supported Intel processors,
                                    and generic Intel 64 processors, such as AMD* Opteron*, via
                                    CPU dispatch. Can use -axS to generate SSE4 instructions.
    /QxP       -xP                  Generate SSE3 on supported Intel processors


    Optimisations

    Global Compiler Options

    Inter-procedural Optimisations

    Profile Guided Optimisations

    Vectorisation

    Parallelisation

    Auto Parallelisation

    OpenMP

    Auto-parallelization

    Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.

    The compiler can identify easy candidates for parallelization, but large applications are difficult to analyze.

    Windows*    /Qparallel    /Qpar_report[n]
    Linux*      -parallel     -par_report[n]
    Mac*        -parallel     -par_report[n]
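
    A minimal sketch of what an "easy candidate" looks like, next to the same idea written with an explicit OpenMP directive. The function names saxpy and dot are hypothetical. The second loop needs the reduction clause because every iteration updates the same variable, which is the kind of pattern that makes automatic analysis harder.

```c
#include <assert.h>

/* Iterations are independent, so this loop is the kind of easy candidate
 * an auto-parallelizer can thread on its own. */
void saxpy(float *y, const float *x, float alpha, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

/* Explicit OpenMP version of a sum. The reduction clause tells the
 * compiler how to combine the per-thread partial sums. When OpenMP is
 * not enabled, the pragma is ignored and the loop runs serially. */
double dot(const float *x, const float *y, int n)
{
    double sum = 0.0;
    assert(n >= 0);
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += (double)x[i] * (double)y[i];
    return sum;
}
```

    The first function would be built with -parallel (or /Qparallel) and checked with -par_report[n]; the second needs -openmp (or /Qopenmp) for the directive to take effect.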


    Cluster OpenMP

    Since Intel Compilers 9.1: Cluster OpenMP*

    Extends OpenMP* from Shared Memory Processors (SMP) to distributed memory systems (clusters)

    Not a single system image

    Minor language extensions: only one new directive (SHARABLE)


    Optimization Strategy

    Turn on the reporting feature of the compiler

    Use a representative workload

    Use VTune Analyzer to find hot spots

    Focus effort on hot spots

    Try advanced compiler optimizations on hot spots

    Re-run the workload

    Seeing the expected benefits? If not, look at the optimization reports


    Which Option First?

    First try compiler vectorization options

    Try -O3 for loop-bound hot functions

    Try Interprocedural Optimization (IPO) & Profile Guided Optimization (PGO)

    Recommended: use IPO on hot functions / libraries

    Can use PGO on hot functions / libraries or entire application

    Compiler Optimization Reports

    Tell what optimizations were done and, most importantly, give hints on what prevented a given optimization

    Turn on optimization reports: -opt-report

    - Can be read by VTune Performance Analyzer
    - The default report is verbose; selecting specific optimization phases is recommended
    - Enable vectorizer reports: -vec-report3
    - Enable loop optimizer (-O3) reports: -opt-report-phase hlo

    Vectorization example: an aliasing problem prevented vectorization

    $ icc hpo.c -c -O3 -xT -vec-report3
    loop was not vectorized: existence of vector dependence.
    vector dependence: proven FLOW dependence between a line 48, and b line 48.

    HLO example: the compiler was able to optimize; it generated multiple versions of the loop and did loop interchange

    $ icc hpo.c -c -O3 -fargument-noalias -xT -opt-report-phase hlo
    LOOP DISTRIBUTION in doit at line 43
    LOOP INTERCHANGE in loops at line: 43 47
    Loopnest permutation ( 1 2 ) --> ( 2 1 )
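
    For readers who have not met loop interchange before, the sketch below shows the transformation by hand on a hypothetical 4x3 array; it is not the hpo.c code from the report. Swapping the two loops changes only the order in which elements are visited, not the result.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4
#define COLS 3

/* Before interchange: the inner loop walks down a column, so consecutive
 * accesses are COLS elements apart in memory. */
long sum_column_first(int m[ROWS][COLS])
{
    long total = 0;
    assert(m != NULL);
    for (int j = 0; j < COLS; ++j)
        for (int i = 0; i < ROWS; ++i)
            total += m[i][j];
    return total;
}

/* After interchange ( 1 2 ) --> ( 2 1 ): the inner loop walks along a
 * row, so consecutive accesses are adjacent in memory (unit stride). */
long sum_row_first(int m[ROWS][COLS])
{
    long total = 0;
    assert(m != NULL);
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            total += m[i][j];
    return total;
}
```

    The unit-stride version is usually friendlier to both the cache and the vectorizer, which is why the high-level optimizer attempts the interchange when it can prove it is safe.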

    Security

    - Static Verifier
    - Stack Checking & Buffer Overflow
    - Detecting x87 FP Stack Corruption
    - Mudflap Support

    Static Verifier

    New in Intel C++ and Fortran Compilers version 10.0

    Detects defects or questionable code for C, C++, Fortran & OpenMP*

    - Can analyze mixed C/C++/Fortran applications
    - Multi-file analysis

    Static Verifier analysis is done at compile/link time; it doesn't detect run-time errors, such as passing incorrect parameters to a function

    Defects detected:

    - Inconsistent object or function declarations in different parts of the application; also verifies function arguments
    - Uninitialized variables
    - Memory leaks & memory corruption
    - Incorrect usage of pointers and allocatable arrays
    - Incorrect OpenMP usage

    Detecting Buffer Overflow

    The compiler generates code to detect some buffer overflows that overwrite the return address. This helps prevent commonly used security vulnerabilities.

    Compiler Options:

    Linux* and Mac OS* X    icc -fstack-security-check
    Windows*                icl /GS

    Buffer Overflow Example

    $ cat Buffer_overflow.c

    #include <string.h>

    void example(char *s) {
        char buf[8];
        strcpy(buf, s);   /* unchecked copy: overflows buf when s has 8 or more characters */
    }

    int main(int argc, char **argv) {
        example(argv[1]);
        return 0;
    }

    $ icc Buffer_overflow.c
    $ ./a.out AhhhBustMyBuffers
    Segmentation fault

    $ icc -fstack-security-check Buffer_overflow.c
    $ ./a.out AhhhBustMeBuffers
    Error: Buffer overrun occurred, forced exit
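
    The stack check catches the overflow at run time, but it does not remove the bug. As a complement, here is a minimal sketch of one way to bound the copy in the source itself. The helper copy_bounded is hypothetical and uses only standard C99 library calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copies at most dst_size - 1 characters of src into dst and always
 * terminates the result. Returns 1 if src fit completely, 0 if it had
 * to be truncated. */
int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

    Called with the 8-byte buffer from the example, copy_bounded(buf, sizeof buf, s) would keep the first seven characters of an over-long argument and report the truncation through its return value, instead of overwriting the stack.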

    Improved Floating Point Model (C++)

    -fp:source
      Rounding: intermediate results in source precision; rounding after each operation
      Algebraic transform: use FP non-associative, non-distributive algebra
      FP env access: disabled
      FP exception: disabled

    -fp:precise
      Rounding: intermediate results evaluated at register precision; rounding at assignment, type casting, function call
      Algebraic transform: same as fp:source
      FP env access: disabled
      FP exception: disabled

    -fp:fast
      Rounding: intermediate result precision and rounding determined by the compiler
      Algebraic transform: use real algebra
      FP env access: disabled
      FP exception: disabled

    -fp:strict
      Rounding: same as fp:precise
      Algebraic transform: same as fp:source
      FP env access: enabled
      FP exception: enabled
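
    The algebraic transform setting matters because floating-point addition is not associative, so a compiler that is allowed to use "real algebra" may regroup an expression and change its result. The sketch below makes the two groupings explicit; the function names are hypothetical, and volatile is used only to force each intermediate to be rounded to double.

```c
#include <assert.h>

/* Evaluates (a + b) + c, storing each intermediate as a double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    volatile double r = t + c;
    return r;
}

/* Evaluates a + (b + c), again rounding each intermediate to double. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    volatile double r = a + t;
    return r;
}
```

    With a = 1e20, b = -1e20 and c = 1.0 in IEEE double precision, sum_left returns 1.0, while sum_right returns 0.0 because the 1.0 is absorbed when it is added to -1e20.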

    Quality

    - Scott Meyers "Effective C++" Diagnostics
    - Porting from 32 to 64 bits
    - Assisting Threaded App Development
    - Code Coverage and Test Prioritization

    10.0: Better C++ Diagnostics (Effective C++)

    Based on:

    - Effective C++, Second Edition: 50 Specific Ways to Improve Your Programs and Designs (Scott Meyers)
    - More Effective C++: 35 New Ways to Improve Your Programs and Designs (Scott Meyers)

    Enabled via -Weffc++ ( /Qeffc++ )

    Examples include:

    - Use const and inline rather than #define
    - Use <iostream> rather than <stdio.h>
    - Use new and delete rather than malloc and free
    - Use delete on pointer members in destructors (diagnoses any pointer that does not have a delete)
    - Have a user copy constructor and assignment operator in classes containing pointers
    - Use initialization rather than assignment to members in constructors
    - etc.

    Porting from 32 to 64 bit

    Moving from 32 to 64 bit can result in porting errors

    -Wp64: enables 64-bit porting diagnostics

    Operating System    IA-32    Intel 64    IA-64
    Windows*            ILP32    P64         P64
    Linux*              ILP32    LP64        LP64
    Mac OS*             ILP32    LP64        N/A
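
    A typical porting error under these data models is storing a pointer in an integer type that stays 32 bits wide. The sketch below contrasts a lossy conversion with a portable one; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Risky on 64-bit targets: under LP64 and P64 an int is still 32 bits,
 * so the upper half of a 64-bit pointer is lost. Under ILP32 it happens
 * to work, which is why the bug often surfaces only after porting. */
unsigned int pointer_to_uint_lossy(const void *p)
{
    return (unsigned int)(uintptr_t)p;
}

/* Portable: uintptr_t is defined to be wide enough to hold a pointer. */
uintptr_t pointer_to_uint(const void *p)
{
    return (uintptr_t)p;
}

/* long is 64 bits under LP64 but 32 bits under P64, so code that assumes
 * sizeof(long) == sizeof(void *) can break when moved to 64-bit Windows. */
int long_holds_pointer(void)
{
    return sizeof(long) >= sizeof(void *);
}
```

    Conversions like the first one are the kind of construct that 64-bit porting diagnostics are meant to point out.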

    Threading Legacy Applications: Compiler Global Variable Accesses Diagnostic

    Problem: threading legacy code that contains a large number of global variables. Access to globals needs to be protected throughout the application.

    The Intel C++ Compiler has compile-time diagnostics to identify when global variables are accessed, available since the 9.0 release (2005)

    - Linux* / Mac OS*: enabled via -ww1710,1711,1712 -fsyntax-only
    - Windows*: /Qww1710,1711,1712 /Zs

    Each diagnostic can be enabled separately:

    - 1710 warns about reference to statically allocated variables
    - 1711 warns about assignment to statically allocated variables
    - 1712 warns about address taken of statically allocated variables

    Threading Legacy Applications: Identifying Global Variable Accesses

    $ cat a.cpp
     1: static int x;
     2: void foo(int *);
     3: void funcx(void){
     4: int y;
     5: x=2;
     6: y=x;
     7: foo(&x);
     8: }
     9:
    10: extern int q;
    11: int p;
    12: void funcy(void) {
    13: q=10;
    14: p=5;
    15: }

    $ icc -ww1710,1711,1712 a.cpp
    a.cpp(5): warning #1711: assignment to statically allocated variable "x"
      x=2;
    a.cpp(6): warning #1710: reference to statically allocated variable "x"
      y=x;
    a.cpp(7): warning #1712: address taken of statically allocated variable "x"
      foo(&x);
    a.cpp(13): warning #1711: assignment to statically allocated variable "q"
      q=10;
    a.cpp(14): warning #1711: assignment to statically allocated variable "p"
      p=5;
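
    Once the diagnostics have located the accesses, one possible fix (besides protecting each access with a lock) is to move the state out of file scope and pass it explicitly. The sketch below is an illustration of that idea only; the struct counters and the changed signatures of funcx and funcy are assumptions, not part of the original example.

```c
#include <assert.h>
#include <stddef.h>

/* State that used to live in file-scope variables, gathered in one struct. */
struct counters {
    int x;
    int p;
};

/* Each function now works on state supplied by the caller, so every
 * thread can own a separate instance and no global is touched. */
void funcx(struct counters *c)
{
    assert(c != NULL);
    c->x = 2;
}

void funcy(struct counters *c)
{
    assert(c != NULL);
    c->p = 5;
}
```

    State that genuinely has to be shared between threads still needs synchronization; this refactoring only helps for data that each thread can keep to itself.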

    Intel Code Coverage Tool

    Example of a code coverage summary for a project. The workload applied in this test exercised 34 of 143 blocks, representing 5 of 19 functions in 2 of 3 modules. In the file SAMPLE.C, 4 of 5 functions were exercised.

    Clicking on SAMPLE.C produces a listing that highlights the code that was exercised. In this example, the pink-highlighted code was never exercised, the yellow was run but not exercised by any of the tests set up by the developer, and the beige was partially covered.

    Intel Test Prioritization Tool

    - Helps guide and speed software testing
    - Helps produce better code more quickly
    - Helps improve programmer productivity

    Example:

    - These 3 tests achieve 52.17% block and 50.00% function coverage
    - Test 3 alone covers 45.65% of basic blocks, or 87.50% of the total block coverage from all tests
    - By adding Test 2, cumulative block coverage goes to 52.17%, or 100% of the total block coverage of Test 1, Test 2, and Test 3
    - Eliminating Test 1 has no negative impact on block coverage and saves time

    Number of Tests   %RatCvrg   %BlkCvrg   %FuncCvrg   Test Names @ Options
    1                 87.50      45.65      37.50       Test3.dpi
    2                 100.00     52.17      50.00       Test2.dpi

    Total Number of Tests = 3
    Total Block Coverage ~ 52.17%
    Total Function Coverage ~ 50.00%
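
    The ordering in the example is what a greedy selection would produce: take the test that adds the most uncovered blocks, repeat, and drop tests that add nothing. The sketch below implements that greedy idea on hypothetical coverage bitmasks (one bit per basic block, at most 64 blocks). It is an assumption about the general approach, not a description of how the Intel tool is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts the set bits in a coverage mask (one bit per basic block). */
int count_blocks(uint64_t mask)
{
    int n = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Greedy ordering: repeatedly pick the test that covers the most blocks
 * not yet covered. Writes the chosen test indices to order[], which must
 * have room for n_tests entries, and returns how many tests were chosen.
 * Tests that add no new coverage are left out. */
size_t prioritize(const uint64_t *coverage, size_t n_tests, size_t *order)
{
    uint64_t covered = 0;
    size_t chosen = 0;
    assert(coverage != NULL && order != NULL);
    for (;;) {
        size_t best = n_tests;
        int best_gain = 0;
        for (size_t t = 0; t < n_tests; ++t) {
            int gain = count_blocks(coverage[t] & ~covered);
            if (gain > best_gain) {
                best_gain = gain;
                best = t;
            }
        }
        if (best == n_tests)
            break;              /* no remaining test adds coverage */
        covered |= coverage[best];
        order[chosen++] = best;
    }
    return chosen;
}
```

    With three hypothetical masks where the third test covers the most blocks, the second adds a few more, and the first adds nothing new, the function returns 2 and orders the third test before the second, mirroring the Test3, Test2 result above.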


    Compatibility

    C++ Compatibility with Microsoft

    Source & binary compatible with VC 2003 under /Qvc71

    Source & binary compatible with VC 2005 under /Qvc8

    Microsoft* & Intel OpenMP binaries are compatible. Use the option

    Support for Code Targeting AMD*

    Goal: Competitive on AMD*; Best on Intel

    Compilers and Libraries support AMD* Opteron* processor-based systems

    Our Analysis Tools (Intel VTune Analyzer and Threading Tools) do NOT support AMD* processors

    May use specific features present only on Intel processors

    Intel Compilers and Performance Libraries offer leadership performance on Intel processors; competitive performance on AMD*.

    Linux C/C++: Intel and GNU Compatibility History

    Established C++ ABI Industry Group

    Intel Compiler for Linux*:

    - Version 5.0.1: C language binary compatibility, using glibc for C library
    - Versions 6.0 and 7.1: C++ ABI compliant. Subtle differences in ABI compliance with gcc prevent full binary compatibility
    - Version 8.0: Match gcc 3.2, 3.3, & 3.4 C++ ABI. Full C++ binary interoperability
    - Version 8.1: gcc binary compatibility is the default for gcc 3.2, 3.3, & 3.4
    - Version 9.0: No g++ compatibility changes required, adds gcc 4.0 support
    - Version 9.1: No g++ compatibility changes required, adds gcc 4.1 support
    - Version 10.0: Require g++ compatibility (removed Intel-provided C++ libraries), adds gcc 4.2 support

    10.0 OS Support Matrix

    Architectures: IA32, Intel 64, IPF

    Linux distros:

    - Red Hat EL3
    - Red Hat EL4
    - Red Hat Fedora Core 4
    - Red Hat Fedora Core 5
    - SuSE SLES9
    - SuSE SLES 10
    - SGI Propack v4.0
    - SGI Propack v5.0
    - Red Flag DC Server 5.0
    - Turbo Linux 10
    - Mandriva/Mandrake 10.1
    - Haansoft Linux 2006 Server
    - Miracle Linux v4.0

    Additional Resources

    Intel C++ Compiler for Linux* product website: http://www.intel.com/software/products/compilers/clin

    Active User Forum: http://softwarecommunity.intel.com/isn/Community/en-US/forums/1016/ShowForum.aspx

    C++ Compiler White Papers: http://www3.intel.com/cd/software/products/asmo-na/eng/278608.htm

    Useful White Papers

    - Quick Reference Guide White Paper: http://cache-www.intel.com/cd/00/00/22/23/222300_222300.pdf
    - Optimization Guide White Paper: http://cache-www.intel.com/cd/00/00/27/66/276615_276615.pdf
    - gcc/g++ compatibility White Paper: http://cache-www.intel.com/cd/00/00/28/47/284736_284736.pdf
    - Code Coverage White Paper: http://cache-www.intel.com/cd/00/00/21/92/219280_compiler_code-coverage.pdf
    - Security White Paper: http://cache-www.intel.com/cd/00/00/37/03/370307_370307.pdf