Intro Chp1


  • 8/18/2019 Intro Chp1

    1/109

    B.Tech – IIIrd


    CO322: PARALLEL PROCESSING AND ARCHITECTURE

    (EIS – II)

    L – 3, T – 0, P – 0, C – 3; 1 lecture/week [from my side]

    30 Marks Midsem

    50 Marks Endsem

    10 Marks Class tests/Quizzes/Assignments

    10 Marks Attendance

    100 Marks Total



    Parallel computer model - 4

    The state of Computing

    Multiprocessors and Multicomputers

    Multivector and SIMD Computers

    Program and network Properties - 4

    Conditions of parallelism

    Program Partitioning and scheduling

    Program Flow Mechanism

    System Interconnect Architecture


    Principles of scalable performance - 4

    Performance Metrics and Measures

    Parallel Processing Applications

    Speedup Performance Laws Scalability Analysis and Approaches

    Processors and Memory Hierarchy - 4

    Advanced Processor Technology

    Superscalar and vector Processors Memory Hierarchy Technology

    Virtual Memory Technology


    Multiprocessors and Multicomputers

    Multiprocessor system Interconnects

    Cache Coherence and synchronization

    Message Passing Mechanism

    Multivector and SIMD Computers

    Vector Processing Principles,

    Multivector Multiprocessors

    Compound Vector Processing

    SIMD Computer Organization,

    The Connection Machine CM-5.


    Scalable Multithreaded and dataflow Architecture

    Latency Hiding Techniques,

    Principles Of Multithreading,

    Fine Grain MultiComputers,

    Scalable and Multithreaded Architecture,

    Dataflow and Hybrid Architectures.

    Multicore Programming

    Single Core Processor Fundamentals,

    Introduction to Multi Core Architecture, System Overview of Threading,

    Fundamental Concepts of Parallel Programming,

    Threading and Parallel Programming


    1) Kai Hwang, F. Briggs, "Computer Architecture and Parallel Processing", McGraw-Hill International Edition, Reprint 2006.

    2) M. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", 1/E, Jones and Bartlett, 1995.

    3) Harry F. Jordan, "Fundamentals of Parallel Processing", 1/E, 2002.

    4) Hesham El-Rewini and Mostafa Abd-El-Barr, "Advanced Computer Architecture and Parallel Processing", Wiley-Interscience, 2005.

    5) Shameem Akhter & Jason Roberts, "Multi-Core Programming", Intel Press, 2006.



    The State of Computing

    Multiprocessors and Multicomputer

    Multivector and SIMD Computers


    Parallel processing

    It is a form of processing in which many calculations are carried out simultaneously.

    Driven by increasing demand for higher performance, lower costs, and nonstop productivity in real-life applications.

    Concurrent events take place in today's high-performance computers

    ▪ due to the common practice of multiprogramming, multiprocessing, or multicomputing.


    Processor = a programmable computing element that runs stored programs written using a pre-defined instruction set.

    A parallel computer (or multiple-processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks.
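The idea of dividing a problem into cooperating parallel tasks can be sketched in Python. This is an illustrative assumption, not from the slides: the chunking scheme, the worker count, and the use of threads in place of real processors are all hypothetical choices made just to show the decomposition.

```python
# Sketch: divide one large summation into parallel tasks and combine results.
# Chunk size, worker count, and use of threads are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each processing element solves its own sub-problem.
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Partition the problem into one task per worker, then combine.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1, 101))))  # → 5050
```

On a real parallel machine the sub-sums would run on separate processors; the partition/compute/combine structure is the point of the sketch.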


    Parallelism appears in various forms

    ▪ lookahead, pipelining, vectorization, concurrency, data parallelism, partitioning, interleaving, overlapping, replication, timesharing, spacesharing, multitasking, multiprogramming, multithreading, and distributed computing at different processing levels.

    These forms model the physical architectures of parallel computers: vector supercomputers, multiprocessors, multicomputers, and massively parallel processors.


    Modern computers are equipped with powerful hardware facilities driven by extensive software packages.

    Historical milestones in the development of computers

    Crucial hardware and software elements

    Identifying and analyzing the performance of computers


    Computers have gone through two major stages of development: mechanical and electronic.

    Zuse's and Aiken's machines were designed for general-purpose computations.

    Computing and communication were carried out with moving mechanical parts, which limited the computing speed and reliability of mechanical computers.


    Modern computers

    Electronic components

    Moving parts in mechanical computers replaced by high-mobility electrons in electronic computers

    Information transmission by mechanical gears or levers replaced by electric signals


    Modern computer is an integrated system consisting

    of

    machine hardware, an instruction set, system software, application programs, and user interfaces

    The use of a computer is driven by real life problems

    demanding

    Numerical computing, transaction processing, and logical

    reasoning


    Computing Problems

    Numerical Computing: Science and engineering numerical problems demand intensive integer and floating-point computations

    Logical Reasoning: Artificial intelligence (AI) demands logic inferences, symbolic manipulations, and large space searches


    Algorithms and Data Structures

    Special algorithms and data structures are needed to specify the computations and communications involved in computing problems

    Most numerical algorithms are deterministic, using regular data structures

    Symbolic processing may use heuristics or non-deterministic searches

    Parallel algorithm development requires interdisciplinary interaction


    Hardware Resources

    Processors, memory and peripheral devices

    Special hardware interfaces built to I/O devices

    Software interface programs

    Processor connectivity (system interconnects, network) and memory organization influence the system architecture


    Operating System

    An effective operating system manages the allocation and

    deallocation of resources during the execution of user

    programs

    Mapping to match algorithmic structures with hardware

    architecture and vice versa: processor scheduling, memory

    mapping, interprocessor communication

    Parallelism utilization possible at:

    1- algorithm design,

    2- program writing,

    3- compilation, and

    4- run time


    System Software Support

    Needed for the development of efficient programs in high-level languages.

    HLL to object code – compiler

    Assembly code to machine code – assembler

    A loader is used to initiate the program execution


    Compiler Support – 3 approaches

    Preprocessor

    ▪ Uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs

    Precompiler

    ▪ Program flow analysis, dependence checking, and limited optimizations toward parallelism detection

    Parallelizing compiler

    ▪ A fully developed parallelizing compiler can automatically detect parallelism in source code and transform sequential code into parallel constructs


    Computing Resources and Computation Allocation

    The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.

    What portions of the computation and data are allocated or mapped to each PE

    Data access, Communication and Synchronization

    How the processing elements cooperate and communicate.

    How data is shared/transmitted between processors.

    Abstractions and primitives for cooperation/communication and

    synchronization.

    The characteristics and performance of parallel system network

    (System interconnects).


    Parallel Processing Performance and Scalability Goals

    Maximize performance enhancement of parallelism:

    Maximize Speedup.

    ▪ By minimizing parallelization overheads and balancing workload on

    processors

    Scalability of performance to larger systems


    Application demands:

    More computing cycles/memory needed

    Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ...

    General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming

    Mainstream multithreaded programs are similar to parallel programs


    Challenging Applications in Applied Science/Engineering

    Astrophysics

    Atmospheric and Ocean Modeling

    Bioinformatics

    Biomolecular simulation: Protein folding

    Computational Chemistry

    Computational Physics

    Computer vision and image understanding

    Data Mining and Data-intensive Computing

    Engineering analysis (CAD/CAM)

    Global climate modeling and forecasting

    Military applications

    Quantum chemistry

    VLSI design

    Such applications have very high computational and memory requirements that cannot be met with single-processor architectures.

    Many applications contain a large degree of computational parallelism.


    The study of architecture involves hardware organization and programming/software requirements

    Assembly language programmer point of view

    ▪ Instruction set, which includes opcodes (operation codes), addressing modes, registers, virtual memory

    Hardware implementation point of view

    ▪ CPUs, caches, buses, microcode, pipelines, physical memory

    Architecture covers the ISA plus the machine implementation


    The von Neumann architecture was built as a sequential machine executing scalar data

    Sequential computers improved from

    bit-serial to word-parallel operations

    fixed-point to floating-point operations

    The von Neumann architecture is slow due to sequential execution of instructions in programs


    Lookahead

    Techniques introduced to prefetch instructions in order to

    overlap I/E (instruction fetch/decode and execution)

    operations and to enable functional parallelism

    Functional parallelism

    To use multiple functional units simultaneously

    To practice pipelining at various processing levels


    Pipelining includes

    pipelined instruction execution

    pipelined arithmetic computations

    pipelined memory-access operations

    Pipelining is especially suited to performing identical operations repeatedly over vector data strings

    Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors


    SISD

    Conventional sequential machines


    SISD machines are also called scalar processors, i.e., they execute one instruction at a time, and each instruction has only one set of operands.

    Single instruction: only one instruction stream is

    being acted on by the CPU during any one clock

    cycle

    Single data: only one data stream is being used as

    input during any one clock cycle


    SIMD

    Vector computers are equipped with scalar and vector

    hardware


    Single instruction: All processing units execute the same instruction issued by the control unit at any given clock cycle, as shown in the figure, where multiple processors execute the instruction given by one control unit.

    Multiple data: Each processing unit can operate on a different data element, as shown in the figure, where each processing unit is supplied with its own data element.
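The SIMD idea above can be shown with a minimal pure-Python sketch (a hypothetical illustration, not the machine's actual mechanism): one "instruction" is broadcast by the control unit, and every processing element applies it to its own data element in lockstep.

```python
# SIMD sketch: one instruction stream, multiple data elements.
# The PE count and the example "add 10" instruction are illustrative assumptions.

def simd_step(instruction, data_elements):
    # The control unit broadcasts a single instruction; each PE applies it
    # to its own local data element in the same step.
    return [instruction(x) for x in data_elements]

# "Add 10" broadcast to 8 processing elements:
result = simd_step(lambda x: x + 10, [0, 1, 2, 3, 4, 5, 6, 7])
print(result)  # → [10, 11, 12, 13, 14, 15, 16, 17]
```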


    MIMD


    Multiple Instruction: every processor may be executing a different instruction stream

    Multiple Data: every processor may be working with a different data stream; as shown in the figure, the multiple data streams are provided by shared memory.

    Can be categorized as loosely coupled or tightly coupled depending on the sharing of data and control

    Execution can be synchronous or asynchronous, deterministic or non-deterministic


    MISD

    The same data stream flows through a linear array of

    processors executing different instruction streams


    A single data stream is fed into multiple processing units.

    Each processing unit operates on the data independently via independent instruction streams, as shown in the figure.

    The single data stream is forwarded to different processing units, each connected to its own control unit, and each processing unit executes the instructions given to it by the control unit to which it is attached.


    (Taxonomy)

    Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines or uniprocessors.

    Single Instruction stream over Multiple Data streams (SIMD): Vector computers, arrays of synchronized processing elements.

    Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution.

    Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers or multiprocessor systems, e.g., a distributed-memory multiprocessor system.

    CU = Control Unit, PE = Processing Element, M = Memory


    Parallel computers execute programs in MIMD mode

    Two major classes of parallel computers: shared-memory multiprocessors and message-passing multicomputers

    They differ in memory sharing and the mechanisms used for interprocessor communication


    Multiprocessor system

    Processors communicate with each other through shared variables in a common memory

    Multicomputer system

    Each computer node has a local memory, unshared with other nodes

    Interprocessor communication is done through message passing among the nodes


    Vector processors (implicit)

    Vector instructions

    Equipped with multiple vector pipelines

    Concurrently used under hardware or firmware control

    Two families of pipelined (explicit) vector processors:

    Memory-to-memory architecture

    ▪ Pipelined flow of vector operands directly from the memory to pipelines and then back to the memory

    Register-to-register architecture

    ▪ Uses vector registers to interface between the memory and functional pipelines


    Hardware configurations differ from machine to machine

    (even with the same Flynn classification)

    Address spaces of processors

    vary among different architectures, and

    depend on memory organization, and

    should match target application domain.

    The communication model and language environments

    should ideally be machine-independent to allow porting to many computers with minimum conversion costs.

    Application developers prefer architectural transparency


    Programmability depends on the programming environment provided to the users

    Conventional computers are used in a sequential programming environment with tools developed for a uniprocessor computer

    Parallel computers need

    parallel tools that allow specification or easy detection of parallelism

    operating systems that can perform parallel scheduling of concurrent events, shared memory allocation, and shared peripheral and communication links.


    Use a conventional language (like C, Fortran, Lisp, or Pascal)

    to write the program

    Use a parallelizing compiler to translate the source code into

    parallel code

    The compiler must detect parallelism and assign target

    machine resources

    Success relies heavily on the quality of the compiler.


    Programmers write explicit parallel code using parallel dialects of common languages

    The compiler has a reduced need to detect parallelism, but must still preserve existing parallelism and assign target machine resources


    (a) Implicit parallelism: source code written by the programmer in sequential languages (C, C++, FORTRAN, LISP); a parallelizing compiler produces parallel object code, executed by the runtime system.

    (b) Explicit parallelism: source code written by the programmer in concurrent dialects of C, C++, FORTRAN, LISP; a concurrency-preserving compiler produces concurrent object code, executed by the runtime system.


    Parallel extensions of conventional high-level languages

    Integrated environments provide

    different levels of program abstraction

    validation, testing, and debugging

    performance prediction and monitoring

    visualization support to aid program development and performance measurement

    graphics display and animation of computational results


    Shared-Memory Multiprocessors

    Distributed-Memory Multicomputers


    Shared-memory parallel computers generally let all processors access all memory as a global address space.

    Multiple processors can operate independently but share the same memory resources.

    Changes in a memory location effected by one processor are visible to all other processors.

    Shared-memory machines can be divided into main classes based upon memory access times: UMA, NUMA, and COMA.


    Three shared-memory multiprocessor models

    ▪ The Uniform Memory Access (UMA) model,

    ▪ The Non Uniform Memory Access (NUMA) model,

    ▪ The Cache Only Memory Architecture (COMA) model

    Models differ in how the memory and peripheral resources are shared or distributed.


    The UMA Model

    The physical memory is uniformly shared by all the processors

    All processors have equal access time to all memory words

    Each processor may use a private cache.

    Peripherals are also shared

    Called tightly coupled systems due to the high degree of resource sharing

    The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network

    Symmetric multiprocessor – all processors are equally capable of running the executive programs

    Asymmetric multiprocessor – only one or a subset of processors have executive capability


    The NUMA Model

    A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.

    The shared memory is physically distributed to all processors as local memories

    The collection of all local memories forms a global address space accessible by all processors

    Access to remote memory incurs a delay through the interconnection network


    Globally shared memory

    Three memory-access patterns

    ▪ The fastest is local memory access

    ▪ The next is global memory access

    ▪ The slowest is access of remote memory


    Hierarchically structured multiprocessor

    Processors are divided into several clusters

    Each cluster is itself a UMA or NUMA multiprocessor

    Clusters are connected to global shared-memory modules

    The entire system is considered a NUMA machine

    All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules

    All clusters have equal access to the global memory

    Access time to the cluster memory is shorter than that to the global memory


    The COMA Model

    A special case of a NUMA machine

    The distributed main memories are converted to caches

    There is no memory hierarchy at each processor node

    All the caches form a global address space

    Remote cache access is assisted by distributed cache directories (D)

    Initial data placement is not critical



    Other variants of shared-memory multiprocessors

    CC-NUMA

    Cache-coherent non-uniform memory access

    The model can be specified with distributed shared memory and cache directories

    CC-COMA

    Cache-coherent COMA


    The system consists of

    multiple computers, called nodes

    interconnected by a message-passing network

    which provides point-to-point static connections among the nodes

    Each node is an autonomous computer consisting of a processor, local memory, and attached disks or I/O peripherals

    All local memories are private and are accessible only by local processors

    NORMA - no-remote-memory-access machines

    Internode communication is carried out by passing messages through the static connection network


    Multicomputers use hardware routers to pass messages

    A computer node is attached to each router

    Boundary routers may be connected to I/O and peripheral devices

    Message passing between any two nodes involves a sequence of routers and channels

    Mixed types of nodes are allowed in a heterogeneous multicomputer

    Internode communication is achieved through compatible data representations and message-passing protocols


    The first generation was based on processor-board technology, using hypercube architecture and software-controlled message switching

    The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing


    Important issues for multicomputers

    Common topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycles, etc.

    Various communication patterns: one-to-one, broadcasting, permutations, and multicast patterns

    Message-routing schemes, network flow-control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques.


    Introduce supercomputers and parallel processors for vector processing and data parallelism

    Supercomputers are classified as pipelined vector machines, using a few powerful processors equipped with vector hardware, or SIMD computers emphasizing massive data parallelism


    A vector computer is often built on top of a scalar processor

    The vector processor is attached to the scalar processor as an optional feature

    Program and data are loaded into main memory through a host computer

    All instructions are first decoded by the scalar control unit


    If the decoded instruction is a scalar operation or a program control operation,

    it will be directly executed by the scalar processor using the scalar functional pipelines

    If the decoded instruction is a vector operation,

    it will be sent to the vector control unit (CU)

    The CU supervises the flow of vector data between the main memory and the vector functional pipelines

    A number of vector functional pipelines may be built into a vector processor


    Register-to-register architecture

    Vector registers are

    ▪ used to hold the vector operands, intermediate and final vector results

    ▪ programmable in user instructions

    ▪ each equipped with a component counter, which keeps track of the component registers used in successive pipeline cycles.

    The length of each vector register is usually fixed

    ▪ 64-bit component registers in a vector register in a Cray Series supercomputer

    ▪ The vector functional pipelines retrieve operands from and put results into the vector registers


    Memory-to-memory architecture

    Differs in the use of a vector stream unit to replace the vector registers

    Vector operands and results are directly retrieved from the main memory in superwords

    e.g., 512 bits as in the Cyber 205


    An operational model of an SIMD computer is specified by a 5-tuple: M = { N, C, I, M, R }

    N - the number of processing elements (PEs) in the machine.

    C - the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions

    I - the set of instructions broadcast by the CU to all PEs for parallel execution

    M - the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.

    R - the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
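The masking component of the 5-tuple can be made concrete with a small sketch. The PE count, operand values, and the doubling instruction below are illustrative assumptions, not from the text: a mask partitions the PEs into enabled and disabled subsets, and only enabled PEs execute the broadcast instruction.

```python
# Sketch of SIMD masking: only enabled PEs execute the broadcast instruction.
# N, the operands, and the mask are illustrative assumptions.
N = 4                                  # number of processing elements
pe_data = [5, 6, 7, 8]                 # one local operand per PE
mask = [True, False, True, False]      # masking scheme: enabled/disabled PEs

def broadcast(instruction, data, enabled):
    # Disabled PEs keep their data unchanged during this step.
    return [instruction(x) if on else x for x, on in zip(data, enabled)]

pe_data = broadcast(lambda x: x * 2, pe_data, mask)
assert len(pe_data) == N
print(pe_data)  # → [10, 6, 14, 8]
```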


    Performance Measures

    The ideal performance of a computer system

    demands a perfect match between machine

    capability and program behavior.

    Machine capability can be enhanced with better:

    ▪ Hardware technology,

    ▪ Innovative architectural features, and

    ▪ Efficient resource management.


    Program behavior is difficult to predict due to its heavy dependence on application and run-time conditions.

    There are many factors affecting program behavior, including:

    Algorithm design, Data structures,

    Language efficiency,

    Programmer skill, and

    Compiler technology.


    Introduce some fundamental factors for projecting the performance of computers

    They can be used to guide system architects in designing better machines

    And to educate programmers and compiler writers in optimizing codes for more efficient execution


    The simplest measure of program performance is the turnaround time, which includes disk and memory accesses, input and output activities, compilation time, OS overhead, and CPU time.

    In order to shorten the turnaround time, one must reduce all these time factors.

    In a multiprogrammed computer, the I/O and system overheads of a given program may overlap with the CPU times required in other programs.

    It is therefore fair to compare just the total CPU time needed for program execution.


    Performance Measures

    Response Time (Execution time, Latency): The time elapsed between the start and the completion of an event.

    Throughput (Bandwidth): The amount of work done in a given time.

    Performance: The number of events occurring per unit of time.

    ▪ Note that execution time is the reciprocal of performance: lower execution time implies higher performance.


    Performance Measures

    A system X is faster than Y if, for a given task, the response time on X is lower than on Y:

    n = Execution timeY / Execution timeX = (1 / PerformanceY) / (1 / PerformanceX) = PerformanceX / PerformanceY


    Consequently, the statement that X is n% faster than Y means:

    PerformanceX / PerformanceY = Execution timeY / Execution timeX = 1 + n/100

    and hence,

    n = 100 × (Execution timeY − Execution timeX) / Execution timeX


    Example: Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. Therefore:

    Execution timeB / Execution timeA = 1 + n/100, and hence,

    n = 100 × (Execution timeB − Execution timeA) / Execution timeA = 100 × (15 − 10) / 10 = 50

    so machine A is 50% faster than machine B.
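The same "n% faster" calculation can be sketched in Python (the function name is an illustrative assumption):

```python
# Sketch: n = 100 * (slower execution time - faster execution time) / faster time
def percent_faster(time_slow, time_fast):
    return 100 * (time_slow - time_fast) / time_fast

# Machine A: 10 s, machine B: 15 s, as in the example above:
print(percent_faster(15, 10))  # → 50.0
```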


    Clock Rate

    The processor is driven by a clock with a constant cycle time (t).

    The inverse of the cycle time is the clock rate (f = 1/t).

    CPI - cycles per instruction

    The size of a program is its instruction count (Ic) - the number of machine instructions to be executed.

    Different instructions require different numbers of clock cycles to execute


    CPI is an important parameter for measuring the time needed to execute each instruction

    For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program.

    An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time.

    CPI will be taken as an average value for a given instruction set and a given program mix.


    Let us define the average number of clock cycles per instruction (CPI) as:

    CPI = CPU clock cycles for a program / Ic
        = ( sum for i = 1..n of CPIi * Ii ) / Ic

    where Ii is the number of times instruction i is executed and CPIi is the average number of clock cycles for instruction i.
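A direct translation of the formula, with hypothetical instruction counts chosen only for illustration:

```python
# Average CPI = sum_i(CPI_i * I_i) / Ic, over instruction types i
counts = {"alu": 50_000, "load": 20_000, "branch": 10_000}  # I_i (hypothetical)
cycles = {"alu": 1, "load": 2, "branch": 3}                 # CPI_i (hypothetical)

Ic = sum(counts.values())                              # instruction count
cpi = sum(cycles[t] * counts[t] for t in counts) / Ic  # weighted average
print(cpi)  # 1.5
```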


    CPI and clock rate depend on the technology and architecture of the machine.

    Instruction count depends on the instruction set of the machine and on compiler technology.


    The CPU time (T), or execution time, is the time needed to execute a given program, excluding the waiting time for I/O or for other running programs.

    CPU time is further divided into user CPU time and system CPU time.

    The CPU time is estimated as:

    T = Ic * CPI * t = ( sum for i = 1..n of CPIi * Ii ) * t


    The execution of an instruction requires going through a cycle of events involving instruction fetch, decode, operand(s) fetch, execution, and storing the result(s):

    p is the number of processor cycles needed to decode and execute the instruction

    m is the number of memory references needed

    k is the ratio between the memory cycle time and the processor cycle time (a latency factor: how much slower the memory is with respect to the CPU)

    T = Ic * CPI * t = Ic * (p + m*k) * t
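Putting numbers into T = Ic * (p + m*k) * t (all values are hypothetical, chosen only to illustrate the decomposition):

```python
Ic = 200_000  # instructions executed (hypothetical)
p = 4         # processor cycles to decode and execute one instruction
m = 1         # memory references per instruction
k = 10        # memory cycle time / processor cycle time
t = 1 / 40e6  # cycle time of a 40 MHz clock, in seconds

T = Ic * (p + m * k) * t  # effective CPI is p + m*k = 14
print(T)      # ~0.07 seconds
```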


    Now let C be the total number of cycles required to execute a program:

    C = Ic * CPI

    The time to execute the program is then:

    T = C * t = C / f

    T = Ic * CPI * t = Ic * CPI / f


    MIPS - Million Instructions Per Second

    A measure of processor speed:

    MIPS = Ic / (T * 10^6)
         = f / (CPI * 10^6)
         = f * Ic / (C * 10^6)
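All three forms of the MIPS formula should agree; a quick consistency check with made-up values:

```python
f = 40e6      # clock rate in Hz (hypothetical)
CPI = 1.6     # average cycles per instruction (hypothetical)
Ic = 100_000  # instruction count (hypothetical)

T = Ic * CPI / f  # execution time in seconds
C = Ic * CPI      # total clock cycles

mips_from_time = Ic / (T * 1e6)
mips_from_cpi = f / (CPI * 1e6)
mips_from_cycles = f * Ic / (C * 1e6)
print(mips_from_cpi)  # ~25 MIPS; all three expressions agree
```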


    MFLOPS - Million Floating Point Operations Per Second

    Another performance measure used to evaluate computers:

    MFLOPS = Number of floating-point operations in a program / (Execution time * 10^6)
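The MFLOPS formula in code, with an invented operation count and runtime:

```python
fp_ops = 4_000_000  # floating-point operations in the program (hypothetical)
T = 0.5             # execution time in seconds (hypothetical)

mflops = fp_ops / (T * 1e6)
print(mflops)  # 8.0
```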


    Throughput Rate

    The number of programs executed per unit time.

    Ws = system throughput (programs/second), the rate at which the system as a whole completes programs.

    Wp = CPU throughput = 1 / T = f / (Ic * CPI) = MIPS * 10^6 / Ic, based on the MIPS rate and the average program length Ic.

    In general, Ws <= Wp.


    Throughput (W/Tn) - the execution rate on an n-processor system, measured in FLOPs per unit time or instructions per unit time.

    Speedup (Sn = T1/Tn) - how much faster an actual machine with n processors is compared to one with a single processor.

    Efficiency (En = Sn/n) - the fraction of the maximum speedup achieved by n processors.
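The speedup and efficiency definitions computed together (the timings are hypothetical):

```python
T1 = 100.0  # time on 1 processor, seconds (hypothetical)
Tn = 16.0   # time on n processors, seconds (hypothetical)
n = 8       # number of processors

Sn = T1 / Tn  # speedup: 6.25
En = Sn / n   # efficiency: 0.78125, i.e. ~78% of the ideal 8x speedup
print(Sn, En)
```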


    Scalability: the attributes of a computer system which allow it to be scaled linearly up or down in size, to handle smaller or larger workloads, or to obtain proportional decreases or increases in speed on a given application.

    Good scalability requires a good algorithm and a machine with the right properties.

    Thus, in general, there are five performance factors (Ic, p, m, k, t) which are influenced by four system attributes.


    System attributes versus performance factors (Ic, and CPI decomposed into p, m, k, plus the cycle time t); an X marks the factors each attribute influences:

    System Attributes                 | Ic |  p |  m |  k |  t
    ----------------------------------+----+----+----+----+----
    Instruction-set architecture      |  X |  X |    |    |
    Compiler technology               |  X |  X |  X |    |
    CPU implementation & technology   |    |  X |    |    |  X
    Memory hierarchy                  |    |    |    |  X |  X


    A benchmark program is executed on a 40 MHz processor. The benchmark program has the following statistics:

    Instruction Type   | Instruction Count | Clock Cycle Count
    -------------------+-------------------+------------------
    Arithmetic         | 45000             | 1
    Branch             | 32000             | 2
    Load/Store         | 15000             | 2
    Floating Point     | 8000              | 2

    Calculate the average CPI, the MIPS rate, and the execution time for the above benchmark program.

    Solved in class
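Since every quantity is given in the table, the answers can be checked with a few lines (a sketch of the arithmetic, not necessarily the in-class derivation):

```python
f = 40e6  # 40 MHz clock
counts = {"arithmetic": 45_000, "branch": 32_000, "load_store": 15_000, "float": 8_000}
cycles = {"arithmetic": 1, "branch": 2, "load_store": 2, "float": 2}

Ic = sum(counts.values())                              # 100000 instructions
cpi = sum(cycles[t] * counts[t] for t in counts) / Ic  # average CPI
mips = f / (cpi * 1e6)                                 # MIPS rate
T = Ic * cpi / f                                       # execution time, seconds
print(cpi, mips, T)  # 1.55, ~25.8 MIPS, ~3.9 ms
```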


    (Book Problem 1.4) Solved in class


    (Book Problem 1.6) Solved in class


    Operation  | Frequency | CPI
    -----------+-----------+-----
    ALU ops    | 35%       | 1
    Loads      | 25%       | 2
    Stores     | 15%       | 2
    Branches   | 25%       | 3

    Compute the average CPI.

    Solved in class
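With frequencies instead of raw counts, the average CPI is just a weighted sum (a sketch of the computation):

```python
freq = {"alu": 0.35, "load": 0.25, "store": 0.15, "branch": 0.25}  # from the table
cpi_per_op = {"alu": 1, "load": 2, "store": 2, "branch": 3}

# Average CPI = sum over operation types of frequency * CPI
avg_cpi = sum(freq[op] * cpi_per_op[op] for op in freq)
print(avg_cpi)  # ~1.9
```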


    For the purpose of solving a given application problem, you benchmark a program on two computer systems.

    On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.

    On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.

    In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.

    A. Compute the relative frequency of occurrence of each type of instruction executed in both systems.

    B. Find the average CPI for each system.
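The two sub-questions reduce to a few divisions, so the answers can be checked mechanically (counts in millions; a sketch, not the required solution method):

```python
cpi = {"alu": 1, "load": 3, "branch": 5}  # cycles per instruction type

def stats(counts):
    """Return (relative frequencies, average CPI) for an instruction mix."""
    total = sum(counts.values())
    freqs = {op: c / total for op, c in counts.items()}
    avg_cpi = sum(cpi[op] * c for op, c in counts.items()) / total
    return freqs, avg_cpi

counts_A = {"alu": 80, "load": 40, "branch": 25}  # millions, system A
counts_B = {"alu": 50, "load": 50, "branch": 40}  # millions, system B

freqs_A, cpi_A = stats(counts_A)  # CPI_A = 325/145, about 2.24
freqs_B, cpi_B = stats(counts_B)  # CPI_B = 400/140, about 2.86
print(freqs_A, cpi_A)
print(freqs_B, cpi_B)
```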