
© 2012 IBM Corporation

Performance migration from Intel Westmere to Intel Sandy Bridge thru Advanced Vector Extensions (AVX)

Nagarajan Kathiresan

IBM India

Presented by Giri Prabhakar

Contact: [email protected]@in.ibm.com


MMX, SSE and AVX (source: Intel)


“I must have the Intel compiler, it has sped up our application by two.” - A customer when moving from version 9.1 to version 10 of the Intel compiler

Source: Intel


AVX (source: Intel)


SSE & AVX (source: Intel)


Compiler tunings (source: Intel)


The following figure illustrates the data types used in the SSE and Intel® AVX instructions. Roughly, for Intel® AVX, any multiple of a 32-bit or 64-bit floating-point type that adds up to 128 or 256 bits is allowed, as well as multiples of any integer type that add up to 128 bits.

MMX, SSE and AVX (source: Intel)


About AVX Performance - Summary

· AVX doubles the 128-bit SSE registers to 256 bits.
· It introduces an entirely new instruction encoding (VEX).
· The new encoding switches from 2-operand instructions to 3-operand instructions, allowing the destination register to be different from the source registers. Example:

addps r0, r1         # r0 = r0 + r1 (SSE, 2 operands)
vaddps r0, r1, r2    # r0 = r1 + r2 (AVX, 3 operands)

· This new encoding is used not only for the new 256-bit instructions, but also for the 128-bit AVX versions of all the old SSE instructions. This means that existing SSE code can be improved without requiring a switch to 256-bit registers.
· Switching to AVX is very easy; simply recompile with -mavx. In addition to using -mavx ...
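Code compiled for AVX only runs on a processor (and operating system) that supports AVX, so before moving a binary from Westmere to Sandy Bridge it may be worth checking the target host. The following is a minimal sketch, assuming a Linux system where /proc/cpuinfo lists CPU features on a "flags" line. The function name has_avx and its optional file argument are illustrative and not part of the original slides.

```shell
#!/bin/sh
# has_avx [FILE]: succeed if the first "flags" line of a cpuinfo-style
# file lists the avx feature. FILE defaults to /proc/cpuinfo (Linux).
# The function name and the FILE argument are hypothetical helpers.
has_avx() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches "avx" as a whole word, so "avx2" alone does not count.
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw 'avx'
}

if has_avx; then
    echo "AVX reported by this host"
else
    echo "AVX not reported by this host"
fi
```

On a Westmere node the check is expected to fail, since that generation predates AVX; on a Sandy Bridge node with a sufficiently recent kernel it should succeed.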


Source: Compiling for AVX, Intel


Intel and GNU compiler for AVX

· Intel's 12.1 uses OpenMP std. 3.1, while the CP2K source code uses OpenMP std. 2.5
· Some OpenMP classes could not be compiled with the Intel compiler
· The GNU compiler is open source, and appears to be more 'in step' with the CP2K source.
· However, it is “difficult” to get the system admin of a very large installation to make a root installation of the GNU compiler (4.3+, later version)
· Therefore, experiments were tried with a local build of GNU (Gfortran)
· While -mavx does “work”, i.e., code compiles, it doesn't “AVX vectorize”; it was found that the flags -march=corei7-avx -mtune=corei7-avx were necessary to enable AVX
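One way to check whether a given set of flags really produces 256-bit AVX code is to have the compiler emit assembly (-S) and look for the ymm registers, which SSE instructions cannot address. The sketch below counts such lines in an assembly listing. The helper name count_ymm_lines and the file names in the comments are illustrative, and the commented gfortran invocations assume a compiler version that accepts -march=corei7-avx.

```shell
#!/bin/sh
# count_ymm_lines FILE: print how many lines of an assembly listing
# reference a 256-bit ymm register (a sign of 256-bit AVX code).
# Hypothetical helper, not part of the original slides.
count_ymm_lines() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the helper usable under set -e.
    grep -c -E '%?ymm[0-9]+' "$1" || true
}

# Possible use (kernel.f90 is a placeholder for your own source file):
#   gfortran -O3 -mavx -S kernel.f90 -o kernel_mavx.s
#   gfortran -O3 -march=corei7-avx -mtune=corei7-avx -S kernel.f90 -o kernel_avx.s
#   count_ymm_lines kernel_mavx.s
#   count_ymm_lines kernel_avx.s
```

A count of 0 for a loop-heavy source file suggests that the vectorizer did not emit 256-bit instructions for that flag set. Note that 128-bit AVX instructions use xmm registers and are not counted by this helper.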


How to build Gfortran compiler locally

Gfortran dependent libraries
– GNU Multiple Precision Library (GMP)
– MPFR Library (http://www.mpfr.org/)
– MPC Library (http://www.multiprecision.org/)
– Parma Polyhedra Library (PPL)
– CLooG-PPL or CLooG (ftp://gcc.gnu.org/pub/gcc/infrastructure/ as cloog-ppl-0.15.tar.gz)


Gfortran Local Build

[Figure: dependency diagram showing MPFR, MPC, GMP, PPL and CLooG(-PPL) feeding into GFORTRAN]


FFTW_INC = /user/naga/hybrid/Endeavor/fftw/include
FFTW_LIB = /user/naga/hybrid/Endeavor/fftw/lib
CC = gcc
CPP =
FC = mpif90
LD = mpif90
AR = ar -r
CPPFLAGS =
DFLAGS = -D__GFORTRAN -D__FFTSG -D__LIBINT -D__parallel -D__SCALAPACK -D__BLACS -D__FFTW3 -D__MAX_CONTR=3 -D__GRID_CORE=2
FCFLAGS = -I$(FFTW_INC) -O3 -fopenmp -ffast-math -march=corei7-avx -mtune=corei7-avx -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS = $(FCFLAGS)
LIBS = /user/naga/hybrid/Endeavor/libint_cpp_wrapper.o \
       /user/naga/hybrid/Endeavor/libint/lib/libderiv.a \
       /user/naga/hybrid/Endeavor/libint/lib/libint.a \
       /user/naga/hybrid/Endeavor/libs/libscalapack.a \
       /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/blacsCinit_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/blacsF77init_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
       /user/naga/hybrid/Endeavor/libs/lapack_LINUX.a \
       /user/naga/hybrid/Endeavor/libs/blas_LINUX.a \
       $(FFTW_LIB)/libfftw3.a \
       -lstdc++ -lpthread
OBJECTS_ARCHITECTURE = machine_gfortran.o


CP2K Build

[Figure: dependency diagram showing BLACS, FFTW, BLAS, LAPACK and SCALAPACK feeding into CP2K]


CP2K Execution time

[Figure: bar chart of total execution time (in ratio) for SNB GF 4.5, SNB GF 4.7, SNB GF 4.7 OPT and WSM GF 4.5 (SNB: Sandy Bridge, WSM: Westmere, GF: Gfortran); y-axis from 0 to 1.4. Lower is better.]


MPI Synchronization time

[Figure: bar chart of MPI synchronization time (in ratio) by category: SNB GF 4.5, SNB GF 4.7, SNB GF 4.7 Opt and WSM GF 4.5; y-axis from 0 to 2.5. Lower is better.]


MPI PERFORMANCE

[Figure: bar chart of performance [MB/s] per MPI routine (MP_Bcast, MP_ISendRecv, MP_ISend, MP_IRecv, MP_Recv) for SDB Gfortran 4.5, SDB Gfortran 4.7, SDB Gfortran 4.7 Optimized and WSM Gfortran 4.5; y-axis from 0 to 105000. Higher is better.]


Acknowledgements / Technical advisory

Swamy Kandadai

Luigi Brochard

Raj Panda



Sandy Bridge vs Westmere

Sandy Bridge:
· 32 kB data + 32 kB instruction L1 cache (3 clocks) and 256 kB L2 cache (8 clocks) per core
· Shared L3 cache includes the processor graphics (LGA 1155)
· 64-byte cache line size
· Two load/store operations per CPU cycle for each memory channel
· Decoded micro-operation cache and enlarged, optimized branch predictor
· Improved performance for transcendental mathematics, AES encryption (AES instruction set), and SHA-1 hashing
· 256-bit/cycle ring bus interconnect between cores, graphics, cache and System Agent Domain
· Advanced Vector Extensions (AVX) 256-bit instruction set with wider vectors, new extensible syntax and rich functionality
· Intel Quick Sync Video, hardware support for video encoding and decoding
· Up to 8 physical cores or 16 logical cores through Hyper-threading

Westmere:
· Native six-core (Gulftown) and ten-core (Westmere-EX) processors.
· A new set of instructions that gives over 3x the encryption and decryption rate of Advanced Encryption Standard (AES) processes compared to before.
· Delivers seven new instructions (AES instruction set or AES-NI) that will be used by the AES algorithm. Also an instruction called PCLMULQDQ (see CLMUL instruction set) that will perform carry-less multiplication for use in cryptography. These instructions will allow the processor to perform hardware-accelerated encryption, not only resulting in faster execution but also protecting against software targeted attacks.
· Integrated graphics, added into the processor package (dual core Arrandale and Clarkdale only).
· Improved virtualization latency.
· New virtualization capability: "VMX Unrestricted mode support," which allows 16-bit guests to run (real mode and big real mode).
· Support for "Huge Pages" of 1 GB in size.

Source: Wikipedia


Gfortran Local Build

[Figure: dependency diagram of the libraries installed in the following steps: MPFR, MPC, GMP, PPL and CLooG(-PPL)]


MPFR Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
    | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install


MPC Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-mpfr=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
    | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install


PPL Install

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-libgmp-prefix=/user/naga/4.7.0/dlibs/lib \
    2>&1 | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install


cloog-ppl-0.15.11

export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-ppl=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs
make -j8 2>&1 | tee make.naga-64bit.log
make install
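The install steps above repeat the same compiler environment and the same configure / make / make install sequence, differing mainly in the configure options. A possible consolidation is sketched below. The function names build_dep and build_all, the PREFIX and DRY_RUN variables, and the source directory names are assumptions for illustration. The GMP step is also an assumption, since no GMP install slide is shown. The order follows the --with-* options used above: GMP first, then MPFR, MPC, PPL and CLooG-PPL. The logging through tee is omitted for brevity. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: one helper for all Gfortran prerequisite builds.
# PREFIX, DRY_RUN, build_dep, build_all and the directory names are
# illustrative; adjust them to the local layout.
PREFIX="${PREFIX:-/user/naga/4.7.0/dlibs}"
DRY_RUN="${DRY_RUN:-1}"

# Same compiler environment as on the install slides.
export CC=gcc CXX=g++ F77=gfortran FC=gfortran F90=gfortran
export CFLAGS="-m64 -O2" CXXFLAGS="-m64 -O2" FFLAGS="-m64 -O2"
export FCFLAGS="-m64 -O2" LDFLAGS="-m64 -O2"

# build_dep SRCDIR [CONFIGURE_OPTION...]: configure, build and install
# one library into $PREFIX, or just print the commands in dry-run mode.
build_dep() {
    srcdir="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "cd $srcdir && ./configure --prefix=$PREFIX $* && make -j8 && make install"
        return 0
    fi
    (
        cd "$srcdir" &&
        ./configure --prefix="$PREFIX" "$@" &&
        make -j8 &&
        make install
    )
}

# Build in dependency order; stop at the first failure.
build_all() {
    build_dep gmp &&
    build_dep mpfr --with-gmp="$PREFIX" &&
    build_dep mpc --with-mpfr="$PREFIX" --with-gmp="$PREFIX" &&
    build_dep ppl --with-libgmp-prefix="$PREFIX/lib" &&
    build_dep cloog-ppl --with-ppl="$PREFIX" --with-gmp="$PREFIX"
}

build_all
```

Setting DRY_RUN=0 runs the builds for real, assuming the script is started from a directory that contains the unpacked source trees under the names used in build_all.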



CP2K Build

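[Figure: dependency diagram of the libraries installed in the following steps: BLACS, FFTW, BLAS, LAPACK and SCALAPACK, feeding into CP2K]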


BLAS Install

Modify the make.inc

FORTRAN = gfortran
OPTS = -O3 -ffast-math -funroll-loops -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
OPTS = -O3
DRVOPTS = $(OPTS)
NOOPT =
LOADER = gfortran
LOADOPTS =

make
make install


Modify the Bmake.inc file

BTOPdir = /user/naga/hybrid/Endeavor/BLACS
BLACSdir = $(BTOPdir)/LIB
BLACSDBGLVL = 0
BLACSFINIT = $(BLACSdir)/blacsF77init_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT = $(BLACSdir)/blacsCinit_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB = $(BLACSdir)/blacs_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
MPIdir = /opt/intel/impi/4.0.3.008/intel64
MPILIBdir = $(MPIdir)/lib
MPIINCdir = $(MPIdir)/include
MPILIB = -L$(MPILIBdir) -lmpich
F77 = mpif90
F77NO_OPTFLAGS =
F77FLAGS = $(F77NO_OPTFLAGS) -O
F77LOADER = $(F77)
F77LOADFLAGS =
CC = mpicc
CCFLAGS = -O4 -ffast-math -funroll-loops \
          -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
CCFLAGS = -O4
CCLOADER = $(CC)
CCLOADFLAGS =


fftw-3.2.2

export CC=gcc
export CFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize \
    -ffree-form -march=corei7-avx -mtune=corei7-avx"
export CFLAGS="-O3"
export MPICC=mpicc
export F77=gfortran
export FFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize \
    -ffree-form -march=corei7-avx -mtune=corei7-avx"
export FFLAGS="-O3"
./configure --prefix=/user/naga/4.7.0/cp2k-dlibs \
    --enable-mpi 2>&1 | tee config.naga.log


Install scalapack-2.0.1

Modify SLmake.inc file

FC = mpif90
CC = mpicc
NOOPT = -O0
FCFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx
CCFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx
FCLOADER = $(FC)
CCLOADER = $(CC)
FCLOADFLAGS = $(FCFLAGS)
CCLOADFLAGS = $(CCFLAGS)
BLASLIB = /user/naga/hybrid/Endeavor/BLAS/blas_LINUX.a
LAPACKLIB = /user/naga/hybrid/Endeavor/lapack-3.4.0/liblapack.a
LIBS = $(LAPACKLIB) $(BLASLIB)