CPE 631 Project Presentation

Hussein Alzoubi and Rami Alnamneh

Reconfiguration of architectural parameters to maximize performance and using software techniques to reduce cache miss rate

Topics to Be Covered

Part I: Using PAPI

Finding the best blocking factor to reduce cache miss rate

Getting a complete picture of system hardware

Part II: Using SimpleScalar to find the best size of branch predictor

Part III: Using SimpleScalar to find the best TLB configuration

What is PAPI?

Performance Application Programming Interface

Developed at the University of Tennessee’s Innovative Computing Laboratory

Access the hardware performance counters found on most modern microprocessors

Easy to use, well documented, and freely available

Events

Occurrences of specific signals related to a processor’s function

Hardware performance counters exist as a small set of registers that count events while the program executes on the processor, such as:

Cache misses

Floating point operations

C calling interface

Function calls are defined in the header file “papi.h”

Calls have the following form:

return_type PAPI_function_name(arg1, arg2, …)

The return value can be a pointer to a structure or a plain value

PAPI timers

Can be used to obtain both real and virtual time

The real time clock runs all the time (e.g. a wall clock) and the virtual time clock runs only when the processor is running in user mode

Real time can be acquired in clock cycles and microseconds by calling the following low-level functions, respectively:

PAPI_get_real_cyc()

PAPI_get_real_usec()
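
As a rough sketch of how the real-time functions above might be used, the helper below times one call of a function in microseconds. The `USE_PAPI` macro and the names `real_usec`, `busy_loop`, and `time_call_usec` are hypothetical, introduced only for illustration. When `USE_PAPI` is not defined, the code falls back to the C11 `timespec_get` wall clock so that it still builds without PAPI. With PAPI, the caller is assumed to have initialized the library with `PAPI_library_init(PAPI_VER_CURRENT)` beforehand.

```c
#include <assert.h>
#include <time.h>
#ifdef USE_PAPI            /* hypothetical build flag, not part of PAPI */
#include <papi.h>
#endif

/* Wall-clock (real) time in microseconds. */
long long real_usec(void)
{
#ifdef USE_PAPI
    return PAPI_get_real_usec();   /* PAPI low-level real-time timer */
#else
    struct timespec ts;            /* fallback: C11 wall clock */
    timespec_get(&ts, TIME_UTC);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
#endif
}

/* Hypothetical workload: sums the integers below *arg. */
void busy_loop(void *arg)
{
    volatile long sum = 0;
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        sum += i;
}

/* Elapsed real time, in microseconds, of one call to work(arg). */
long long time_call_usec(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    long long start = real_usec();
    work(arg);
    return real_usec() - start;
}
```

Replacing `PAPI_get_real_usec()` with `PAPI_get_real_cyc()` would report the same interval in clock cycles instead of microseconds.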

System information

Executable information

PAPI_get_executable_info()

Information about the executable’s address space:

The beginning of the user program

The end of the user program

Hardware information

PAPI_get_hardware_info()

Information about the system hardware:

Cycle time of processor

Number of processors in the system

Finding the best blocking factor on Bragg and getting system information

Use PAPI to find the best block size for blocked matrix multiplication

Measure the number of clock cycles for each block size

Choose the block size with the minimum number of clock cycles

Report system hardware information such as processor clock rate and number of processors in the system
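
The slides do not show the matrix multiplication kernel itself. The sketch below is one common textbook form of blocked (tiled) multiplication, given here only to illustrate what the blocking factor controls. It is not necessarily the code that was measured. The names `min_size` and `matmul_blocked` are hypothetical, and matrices are assumed to be square and stored in row-major order.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Computes C += A * B for n x n row-major matrices,
 * working on bsize x bsize tiles so that each tile can stay in cache. */
void matmul_blocked(size_t n, size_t bsize,
                    const double *a, const double *b, double *c)
{
    assert(bsize > 0);
    for (size_t ii = 0; ii < n; ii += bsize) {
        for (size_t kk = 0; kk < n; kk += bsize) {
            for (size_t jj = 0; jj < n; jj += bsize) {
                /* Clamp tile edges so n need not be a multiple of bsize. */
                size_t i_end = min_size(ii + bsize, n);
                size_t k_end = min_size(kk + bsize, n);
                size_t j_end = min_size(jj + bsize, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

To reproduce the experiment, one would zero `c`, read the cycle counter before and after a call to `matmul_blocked` for each candidate `bsize`, and keep the block size with the fewest cycles.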

Results on Bragg system

Available hardware information.
-------------------------------------------------------------
Vendor string and code : SUN unknown (-1)
Model string and code : UltraSPARC I&II (1000)
CPU revision : 9.000000
CPU Megahertz : 248.000000
CPU's in an SMP node : 8
Nodes in the system : 1
Total CPU's in the system: 8
-------------------------------------------------------------
Best block size: 8
bfactor: 8 clock cycles 201801712
bfactor: 16 clock cycles 208085422
bfactor: 32 clock cycles 217125792
bfactor: 64 clock cycles 215792624
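
The selection rule (pick the block size with the fewest clock cycles) can be sketched as below and checked against the four measurements reported for Bragg. The function names `argmin_cycles` and `slowdown_percent` are hypothetical helpers written for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the smallest cycle count; ties go to the earliest entry. */
size_t argmin_cycles(const long long *cycles, size_t count)
{
    assert(count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (cycles[i] < cycles[best])
            best = i;
    return best;
}

/* Slowdown of one measurement relative to the best one, in percent. */
double slowdown_percent(long long cycles, long long best_cycles)
{
    assert(best_cycles > 0);
    return 100.0 * (double)(cycles - best_cycles) / (double)best_cycles;
}
```

Applied to the cycle counts above, `argmin_cycles` selects the entry for bfactor 8, and `slowdown_percent` puts bfactor 32 at roughly 7.6 percent more cycles than bfactor 8.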

Part II: branch predictor

Modify the SimpleScalar parameters for: L1-I cache, L1-D cache, branch predictor, and branch target buffer

Get 16 different configurations

Run four integer and four floating point SPEC2000 benchmarks with these configurations

Calculate the CPI for each benchmark and every configuration and plot the results
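
CPI is the number of simulated cycles divided by the number of committed instructions. A minimal sketch of that calculation follows, assuming the two totals are taken from the simulator's end-of-run statistics (for `sim-outorder`, the `sim_cycle` and `sim_num_insn` counters). The slides do not say how the per-configuration average was formed, so the arithmetic mean used in `mean_cpi` is an assumption, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles per instruction for one simulation run. */
double cpi(unsigned long long cycles, unsigned long long instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Arithmetic mean of per-benchmark CPI values for one configuration
 * (assumed averaging method; not stated in the slides). */
double mean_cpi(const double *values, size_t count)
{
    assert(count > 0);
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += values[i];
    return sum / (double)count;
}
```

For example, a run that takes 150 cycles to commit 100 instructions has a CPI of 1.5.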

CPI for integer benchmarks

[Figure: CPI for the integer benchmarks. x-axis: Configuration; y-axis: CPI; series: 176.gcc, 181.mcf, 254.gap, 256.bzip2]

CPI for floating point benchmarks

[Figure: CPI for the floating point benchmarks. x-axis: Configuration; y-axis: CPI; series: 171.swim, 189.lucas, 183.equake, 191.fma3d]

Average CPI for the integer and floating point benchmarks

[Figure: Average CPI for integer and floating point benchmarks. x-axis: Configuration; y-axis: CPI; series: integer, floating point]

Config. # 14

Branch predictor: 16 KB, branch target buffer: 4 KB, L1 instruction cache: 32 KB, and L1 data cache: 8 KB

Part III: TLB

Used an instruction TLB varying from 512 to 1024 entries and a data TLB varying from 512 to 1024 entries. L1I and L1D cache sizes were also varied

Get 16 different configurations

Run one integer and one floating point SPEC2000 benchmark for each of these configurations

Find the number of clock cycles for each benchmark and every configuration and plot the results

Number of clock cycles for the integer benchmark

[Figure: Number of clock cycles for the integer benchmark. x-axis: Configuration # (0 to 15); y-axis: number of cycles * 1E-9]

Number of clock cycles for the floating point benchmark

[Figure: Number of clock cycles for the floating point benchmark. x-axis: Configuration #; y-axis: number of clock cycles * 1E-8; series: 173.applu]

Average number of clock cycles of the integer and floating point benchmarks

[Figure: Average number of clock cycles of the integer and floating point benchmarks. x-axis: Configuration # (0 to 15); y-axis: number of clock cycles * 1E-8; series: Average]

16 KB L1 instruction cache, 16 KB L1 data cache, 1024-entry instruction TLB, and 512-entry data TLB

Questions?

Thank you…