© 2012 IBM Corporation
Performance migration from Intel Westmere to Intel Sandy Bridge thru Advanced Vector Extensions (AVX)
Nagarajan Kathiresan
IBM India
Presented by Giri Prabhakar
Contact:[email protected]@in.ibm.com
Source: Intel MMX, SSE and AVX
“I must have the Intel compiler, it has sped up our application by two.” - A customer when moving from version 9.1 to version 10 of the Intel compiler
Source: Intel
Source: Intel AVX
Source: Intel SSE & AVX
Source: Intel Compiler tunings
The figure illustrates the data types used in the SSE and Intel® AVX instructions. Roughly, for Intel® AVX, any multiple of a 32-bit or 64-bit floating-point type that adds up to 128 or 256 bits is allowed, as well as multiples of any integer type that add up to 128 bits.
Source: Intel MMX, SSE and AVX
About AVX Performance - Summary
– AVX doubles the 128-bit SSE registers to 256 bits.
– It introduces an entirely new instruction encoding (VEX).
– The new encoding switches from 2-operand to 3-operand instructions, allowing the destination register to differ from the source registers. Example:
    addps r0, r1         # r0 = r0 + r1
    vaddps r0, r1, r2    # r0 = r1 + r2
– The new encoding is used not only for the new 256-bit instructions but also for the 128-bit AVX versions of all the old SSE instructions. This means that existing SSE code can be improved without requiring a switch to 256-bit registers.
– Switching to AVX is very easy: simply recompile with -mavx. In addition to using -mavx
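Before running a binary recompiled with -mavx, it can help to confirm that the host CPU actually reports AVX. The following is a minimal sketch, assuming a Linux-style cpuinfo listing; the function name `has_avx` is made up for this illustration and is not part of the original material.

```shell
# has_avx: read /proc/cpuinfo-style text on stdin and succeed (exit 0)
# if a "flags" line lists "avx" as a separate feature name.
# "avx2" on its own does not count as "avx".
has_avx() {
  awk '
    /^flags[ \t]*:/ {
      for (i = 1; i <= NF; i++)
        if ($i == "avx") found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}

# Typical use on a Linux host (assumed location of the CPU listing):
#   has_avx < /proc/cpuinfo && echo "AVX available"
```

A Westmere node would be expected to fail this check and a Sandy Bridge node to pass it, since AVX is the feature that distinguishes the two in this comparison.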
Source: Compiling for AVX, Intel
Intel and GNU compiler for AVX
– Intel's 12.1 compiler uses OpenMP std. 3.1, while the CP2K source code uses OpenMP std. 2.5.
– Some OpenMP classes could not be compiled with the Intel compiler.
– The GNU compiler is open source and appears to be more 'in step' with the CP2K source.
– However, it is “difficult” to get the system admin of a very large installation to make a root installation of a later version of the GNU compiler (4.3+).
– Therefore, experiments were tried with a local build of GNU (Gfortran).
– While -mavx does “work”, i.e., the code compiles, it doesn't “AVX vectorize”; the flags -march=corei7-avx -mtune=corei7-avx were found to be necessary to enable AVX.
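One way to see whether a given flag set really produces 256-bit AVX code is to inspect the generated assembly for ymm registers. The sketch below assumes AT&T-syntax assembly on standard input; the function name `count_vector_regs` and the file name `kernel.f90` are hypothetical and used only for illustration.

```shell
# count_vector_regs: read AT&T-syntax assembly on stdin and count the
# lines that reference 256-bit registers (%ymmN) and 128-bit registers
# (%xmmN). A line that mentions both is counted in both totals.
count_vector_regs() {
  awk '
    /%ymm[0-9]/ { ymm++ }
    /%xmm[0-9]/ { xmm++ }
    END { printf "ymm=%d xmm=%d\n", ymm, xmm }
  '
}

# Possible use (kernel.f90 is a placeholder source file):
#   gfortran -O3 -mavx -S -o - kernel.f90 | count_vector_regs
#   gfortran -O3 -march=corei7-avx -mtune=corei7-avx -S -o - kernel.f90 | count_vector_regs
```

A nonzero ymm count suggests 256-bit vectorization took place. The xmm count alone is not conclusive, because the 128-bit AVX forms of the SSE instructions also use xmm registers.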
How to build Gfortran compiler locally
Gfortran dependent libraries
– GNU Multiple Precision Library (GMP)
– MPFR Library (http://www.mpfr.org/)
– MPC Library (http://www.multiprecision.org/)
– Parma Polyhedra Library (PPL)
– CLooG-PPL or CLooG (ftp://gcc.gnu.org/pub/gcc/infrastructure/ as cloog-ppl-0.15.tar.gz)
Gfortran Local Build
[Figure: build dependency diagram with the components GMP, MPFR, MPC, PPL, CLooG(-PPL) and GFORTRAN]
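The build order implied by the dependency libraries can be derived mechanically. This is a minimal sketch, assuming the prerequisite pairs below, which are read off the `--with-...` options in the install steps of this deck (MPFR against GMP, MPC against MPFR and GMP, PPL against GMP, CLooG against PPL and GMP, and GCC/Gfortran against all of them). The function names are hypothetical.

```shell
# dep_pairs: print "prerequisite dependent" pairs, one per line.
dep_pairs() {
  cat <<'EOF'
gmp mpfr
gmp mpc
mpfr mpc
gmp ppl
gmp cloog
ppl cloog
gmp gcc
mpfr gcc
mpc gcc
ppl gcc
cloog gcc
EOF
}

# build_order: topologically sort the pairs with tsort so that every
# package is listed after the packages it is configured against.
build_order() {
  dep_pairs | tsort
}

# Possible use:
#   build_order    # starts with gmp and ends with gcc
```

Several valid orders exist for the middle packages. What matters is only that GMP comes first, MPFR precedes MPC, PPL precedes CLooG, and the compiler itself is built last.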
FFTW_INC = /user/naga/hybrid/Endeavor/fftw/include
FFTW_LIB = /user/naga/hybrid/Endeavor/fftw/lib
CC       = gcc
CPP      =
FC       = mpif90
LD       = mpif90
AR       = ar -r
CPPFLAGS =
DFLAGS   = -D__GFORTRAN -D__FFTSG -D__LIBINT -D__parallel -D__SCALAPACK -D__BLACS -D__FFTW3 -D__MAX_CONTR=3 -D__GRID_CORE=2
FCFLAGS  = -I$(FFTW_INC) -O3 -fopenmp -ffast-math -march=corei7-avx -mtune=corei7-avx -funroll-loops -ftree-vectorize -march=native -ffree-form $(DFLAGS)
LDFLAGS  = $(FCFLAGS)
LIBS     = /user/naga/hybrid/Endeavor/libint_cpp_wrapper.o \
           /user/naga/hybrid/Endeavor/libint/lib/libderiv.a \
           /user/naga/hybrid/Endeavor/libint/lib/libint.a \
           /user/naga/hybrid/Endeavor/libs/libscalapack.a \
           /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/blacsCinit_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/blacsF77init_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/blacs_MPI-LINUX-0.a \
           /user/naga/hybrid/Endeavor/libs/lapack_LINUX.a \
           /user/naga/hybrid/Endeavor/libs/blas_LINUX.a \
           $(FFTW_LIB)/libfftw3.a \
           -lstdc++ -lpthread
OBJECTS_ARCHITECTURE = machine_gfortran.o
CP2K Build
[Figure: build dependency diagram with the components BLACS, FFTW, BLAS, LAPACK, SCALAPACK and CP2K]
CP2K Execution time
[Bar chart: total execution time (in ratio) for SNB GF 4.5, SNB GF 4.7, SNB GF 4.7 OPT and WSM GF 4.5. Lower is better.]
MPI Synchronization time
[Bar chart: MPI synchronization time (in ratio) for SNB GF 4.5, SNB GF 4.7, SNB GF 4.7 Opt and WSM GF 4.5. Lower is better.]
MPI PERFORMANCE
[Bar chart: performance in MB/s of the MPI routines MP_Bcast, MP_ISendRecv, MP_ISend, MP_IRecv and MP_Recv for SDB Gfortran 4.5, SDB Gfortran 4.7, SDB Gfortran 4.7 Optimized and WSM Gfortran 4.5. Higher is better.]
Acknowledgements / Technical advisory
Swamy Kandadai
Luigi Brochard
Raj Panda
Sandy Bridge vs Westmere
Sandy Bridge:
· 32 kB data + 32 kB instruction L1 cache (3 clocks) and 256 kB L2 cache (8 clocks) per core
· Shared L3 cache includes the processor graphics (LGA 1155)
· 64-byte cache line size
· Two load/store operations per CPU cycle for each memory channel
· Decoded micro-operation cache and enlarged, optimized branch predictor
· Improved performance for transcendental mathematics, AES encryption (AES instruction set), and SHA-1 hashing
· 256-bit/cycle ring bus interconnect between cores, graphics, cache and System Agent Domain
· Advanced Vector Extensions (AVX) 256-bit instruction set with wider vectors, new extensible syntax and rich functionality
· Intel Quick Sync Video, hardware support for video encoding and decoding
· Up to 8 physical cores or 16 logical cores through Hyper-threading
Westmere:
· Native six-core (Gulftown) and ten-core (Westmere-EX) processors.
· A new set of instructions that gives over 3x the encryption and decryption rate of Advanced Encryption Standard (AES) processes compared to before.
· Delivers seven new instructions (AES instruction set or AES-NI) that will be used by the AES algorithm. Also an instruction called PCLMULQDQ (see CLMUL instruction set) that will perform carry-less multiplication for use in cryptography. These instructions will allow the processor to perform hardware-accelerated encryption, not only resulting in faster execution but also protecting against software targeted attacks.
· Integrated graphics, added into the processor package (dual core Arrandale and Clarkdale only).
· Improved virtualization latency.
· New virtualization capability: "VMX Unrestricted mode support," which allows 16-bit guests to run (real mode and big real mode).
· Support for "Huge Pages" of 1 GB in size.
Source: Wikipedia
Gfortran Local Build
[Figure: dependency libraries GMP, MPFR, MPC, PPL and CLooG(-PPL)]
MPFR Install
export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
    | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install
MPC Install
export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-mpfr=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs 2>&1 \
    | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install
PPL Install
export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-libgmp-prefix=/user/naga/4.7.0/dlibs/lib \
    2>&1 | tee config.naga-64bit.log
make -j8 2>&1 | tee make.naga-64bit.log
make install
cloog-ppl-0.15.11
export CC=gcc
export CXX=g++
export F77=gfortran
export FC=gfortran
export F90=gfortran
export CFLAGS="-m64 -O2 "
export CXXFLAGS="-m64 -O2 "
export FFLAGS="-m64 -O2 "
export FCFLAGS="-m64 -O2 "
export LDFLAGS="-m64 -O2 "
./configure --prefix=/user/naga/4.7.0/dlibs \
    --with-ppl=/user/naga/4.7.0/dlibs \
    --with-gmp=/user/naga/4.7.0/dlibs
make -j8 2>&1 | tee make.naga-64bit.log
make install
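The slides stop at the dependency libraries and do not show the configure step for the compiler itself. The following is only a sketch of what that step might look like, under stated assumptions: the source directory name `gcc-4.7.0`, the helper name `gcc_configure_cmd` and the choice of `--enable-languages` and `--disable-multilib` are not from the original. The `--with-gmp`, `--with-mpfr`, `--with-mpc`, `--with-ppl` and `--with-cloog` options are GCC configure options of that era. The helper only prints the command so it can be reviewed before being run.

```shell
# gcc_configure_cmd PREFIX DEPS: print (do not run) a GCC configure
# command that installs into PREFIX and points at the dependency
# libraries installed under DEPS.
# Assumption: GCC sources are unpacked in ../gcc-4.7.0 relative to a
# separate, empty build directory.
gcc_configure_cmd() {
  prefix=$1
  deps=$2
  printf '%s\n' "../gcc-4.7.0/configure --prefix=$prefix --enable-languages=c,c++,fortran --disable-multilib --with-gmp=$deps --with-mpfr=$deps --with-mpc=$deps --with-ppl=$deps --with-cloog=$deps"
}

# Possible use, mirroring the paths used for the dependency libraries:
#   gcc_configure_cmd /user/naga/4.7.0 /user/naga/4.7.0/dlibs
```

Depending on the CLooG flavour that was installed, further configure options may be needed, so the printed command should be treated as a starting point rather than a verified recipe.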
BLAS Install
Modify the make.inc (keep only one OPTS line; if both are present, make uses the later one):
FORTRAN  = gfortran
OPTS     = -O3 -ffast-math -funroll-loops -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
OPTS     = -O3
DRVOPTS  = $(OPTS)
NOOPT    =
LOADER   = gfortran
LOADOPTS =
make
make install
Modify the Bmake.inc file (keep only one CCFLAGS line; if both are present, make uses the later one):
BTOPdir = /user/naga/hybrid/Endeavor/BLACS
BLACSdir = $(BTOPdir)/LIB
BLACSDBGLVL = 0
BLACSFINIT = $(BLACSdir)/blacsF77init_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT = $(BLACSdir)/blacsCinit_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB = $(BLACSdir)/blacs_$(COMMLIB)-$(PLAT)-$(BLACSDBGLVL).a
MPIdir = /opt/intel/impi/4.0.3.008/intel64
MPILIBdir = $(MPIdir)/lib
MPIINCdir = $(MPIdir)/include
MPILIB = -L$(MPILIBdir) -lmpich
F77 = mpif90
F77NO_OPTFLAGS =
F77FLAGS = $(F77NO_OPTFLAGS) -O
F77LOADER = $(F77)
F77LOADFLAGS =
CC = mpicc
CCFLAGS = -O4 -ffast-math -funroll-loops \
          -ftree-vectorize -march=corei7-avx -mtune=corei7-avx
CCFLAGS = -O4
CCLOADER = $(CC)
CCLOADFLAGS =
fftw-3.2.2 (keep only one CFLAGS and one FFLAGS export; if both are present, the later one takes effect)
export CC=gcc
export CFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize \
-ffree-form -march=corei7-avx -mtune=corei7-avx"
export CFLAGS="-O3"
export MPICC=mpicc
export F77=gfortran
export FFLAGS="-O3 -ffast-math -funroll-loops -ftree-vectorize \
-ffree-form -march=corei7-avx -mtune=corei7-avx"
export FFLAGS="-O3"
./configure --prefix=/user/naga/4.7.0/cp2k-dlibs \
    --enable-mpi 2>&1 | tee config.naga.log
Install scalapack-2.0.1
Modify SLmake.inc file
FC = mpif90
CC = mpicc
NOOPT = -O0
FCFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx
CCFLAGS = -O3 -march=corei7-avx -mtune=corei7-avx
FCLOADER = $(FC)
CCLOADER = $(CC)
FCLOADFLAGS = $(FCFLAGS)
CCLOADFLAGS = $(CCFLAGS)
BLASLIB = /user/naga/hybrid/Endeavor/BLAS/blas_LINUX.a
LAPACKLIB = /user/naga/hybrid/Endeavor/lapack-3.4.0/liblapack.a
LIBS = $(LAPACKLIB) $(BLASLIB)