8/18/2019 Intro Chp1
B.Tech – IIIrd
CO322: PARALLEL PROCESSING AND ARCHITECTURE
(EIS – II)
L – 3, T – 0, P – 0, C – 3; 1 lecture/week [from my side]
30 Marks Midsem
50 Marks Endsem
10 Marks Class tests/Quiz/Assignments
10 Marks Attendance
100 Marks Total
Parallel computer model - 4
The state of Computing
Multiprocessors and Multicomputers
Multivector and SIMD Computers
Program and network Properties - 4
Conditions of parallelism
Program Partitioning and scheduling
Program Flow Mechanism
System Interconnect Architecture
Principles of scalable performance - 4
Performance Metrics and Measures
Parallel Processing Applications
Speedup Performance Laws
Scalability Analysis and Approaches
Processors and Memory Hierarchy - 4
Advanced Processor Technology
Superscalar and Vector Processors
Memory Hierarchy Technology
Virtual Memory Technology
Multiprocessors and Multicomputers
Multiprocessor system Interconnects
Cache Coherence and synchronization
Message Passing Mechanism
Multivector and SIMD Computers
Vector Processing Principles,
Multivector Multiprocessors
Compound Vector Processing
SIMD Computer Organization,
The Connection Machine CM-5.
Scalable Multithreaded and dataflow Architecture
Latency Hiding Techniques,
Principles Of Multithreading,
Fine Grain MultiComputers,
Scalable and Multithreaded Architecture,
Dataflow and Hybrid Architectures.
Multicore Programming
Single Core Processor Fundamentals,
Introduction to Multi Core Architecture, System Overview of Threading,
Fundamental Concepts of Parallel Programming,
Threading and Parallel Programming
1) Kai Hwang, F. Briggs, "Computer Architecture and Parallel Processing", McGraw Hill International Edition, Reprint 2006.
2) M. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", 1/E, Jones and Bartlett, 1995.
3) Harry F. Jordan, "Fundamentals of Parallel Processing", 1/E, 2002.
4) Hesham El-Rewini and Mostafa Abd-El-Barr, "Advanced Computer Architecture and Parallel Processing", Wiley-Interscience, 2005.
5) Shameem Akhter & Jason Roberts, "Multi-Core Programming", Intel Press, 2006.
The State of Computing
Multiprocessors and Multicomputer
Multivector and SIMD Computers
Parallel processing
It is a form of processing in which many calculations are carried out simultaneously.
There is increasing demand for higher performance, lower costs, and nonstop productivity in real-life applications.
Concurrent events take place in today's high-performance computers
▪ due to the common practice of multiprogramming, multiprocessing, or multicomputing.
Processor = a programmable computing element that runs stored programs written using a pre-defined instruction set.
A parallel computer (or multiple processor system)
is a collection of communicating processing
elements (processors) that cooperate to solve large
computational problems fast by dividing such
problems into parallel tasks.
Parallelism appears in various forms
▪ lookahead, pipelining, vectorization, concurrency, data
parallelism, partitioning, interleaving, overlapping,
replication, timesharing, spacesharing, multitasking,
multiprogramming, multithreading, and distributed
computing at different processing levels.
We model the physical architectures of parallel computers: vector supercomputers, multiprocessors, multicomputers, and massively parallel processors.
Modern computers are equipped with powerful hardware facilities driven by extensive software packages. We review
historical milestones in the development of computers,
identify crucial hardware and software elements, and
analyze the performance of computers.
Computers have gone through two major stages of development: mechanical and electronic.
Zuse's and Aiken's machines were designed for general-purpose computations.
Computing and communication were carried out with moving mechanical parts,
which limited the computing speed and reliability of mechanical computers.
Modern computers
Electronic components
Moving parts in mechanical computers were replaced by high-mobility electrons in electronic computers.
Information transmission by mechanical gears or levers was replaced by electric signals.
A modern computer is an integrated system consisting of
machine hardware, an instruction set, system software, application programs, and user interfaces.
The use of a computer is driven by real-life problems demanding
numerical computing, transaction processing, and logical reasoning.
Computing Problems
Numerical Computing: science and engineering numerical problems demand intensive integer and floating-point computations.
Logical Reasoning: artificial intelligence (AI) problems demand logic inferences, symbolic manipulations, and large space searches.
Algorithms and Data Structures
Special algorithms and data structures are needed to specify the computations and communications involved in computing problems.
Most numerical algorithms are deterministic, using regular data structures.
Symbolic processing may use heuristics or non-deterministic searches.
Parallel algorithm development requires interdisciplinary interaction.
Hardware Resources
Processors, memory, and peripheral devices.
Special hardware interfaces are built into I/O devices.
Software interface programs.
Processor connectivity (system interconnects, network) and memory organization influence the system architecture.
Operating System
An effective operating system manages the allocation and
deallocation of resources during the execution of user
programs
Mapping to match algorithmic structures with hardware
architecture and vice versa: processor scheduling, memory
mapping, interprocessor communication
Parallelism utilization possible at:
1- algorithm design,
2- program writing,
3- compilation, and
4- run time
System Software Support
Needed for the development of efficient programs in high-level languages (HLLs).
HLL source to object code - compiler
Assembly code to machine code - assembler
A loader is used to initiate the program execution.
Compiler Support - 3 approaches
Preprocessor
▪ Uses a sequential compiler and a low-level library of the target computer to implement high-level parallel constructs
Precompiler
▪ Performs program flow analysis, dependence checking, and limited optimizations toward parallelism detection
Parallelizing compiler
▪ A fully developed parallelizing compiler can automatically detect parallelism in source code and transform sequential codes into parallel constructs
Computing Resources and Computation Allocation
The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.
What portions of the computation and data are allocated or mapped to each PE.
Data Access, Communication and Synchronization
How the processing elements cooperate and communicate.
How data is shared/transmitted between processors.
Abstractions and primitives for cooperation/communication and synchronization.
The characteristics and performance of the parallel system network (system interconnects).
Parallel Processing Performance and Scalability Goals
Maximize performance enhancement of parallelism:
Maximize Speedup.
▪ By minimizing parallelization overheads and balancing workload on
processors
Scalability of performance to larger systems
Application demands:
More computing cycles/memory needed
Scientific/engineering computing: CFD, biology, chemistry, physics, ...
General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming
Mainstream multithreaded programs are similar to parallel programs
Challenging Applications in Applied Science/Engineering
Astrophysics
Atmospheric and Ocean Modeling
Bioinformatics
Biomolecular simulation: Protein folding
Computational Chemistry
Computational Physics
Computer vision and image understanding
Data Mining and Data-intensive Computing
Engineering analysis (CAD/CAM)
Global climate modeling and forecasting
Military applications
Quantum chemistry
VLSI design
Such applications have very high computational and memory requirements that cannot be met with single-processor architectures.
Many applications contain a large degree of computational parallelism.
The study of architecture involves both hardware organization and programming/software requirements.
Assembly language programmer's point of view
▪ Instruction set, which includes opcodes (operation codes), addressing modes, registers, virtual memory
Hardware implementation point of view
▪ CPUs, caches, buses, microcode, pipelines, physical memory
Architecture covers the ISA plus the machine implementation.
Figure
The von Neumann architecture was built as a sequential machine executing scalar data.
The sequential computer was improved from
bit-serial to word-parallel operations, and
fixed-point to floating-point operations.
The von Neumann architecture is slow due to the sequential execution of instructions in programs.
Lookahead
Techniques introduced to prefetch instructions in order to
overlap I/E (instruction fetch/decode and execution)
operations and to enable functional parallelism
Functional parallelism
To use multiple functional units simultaneously
To practice pipelining at various processing levels
Pipelining includes
pipelined instruction execution,
pipelined arithmetic computations, and
memory-access operations.
Pipelining is especially useful for performing identical operations repeatedly over vector data strings.
Vector operations were originally carried out implicitly by software-controlled looping using scalar pipeline processors.
SISD
Conventional sequential machines
They are also called scalar processors, i.e., one instruction at a time, and each instruction has only one set of operands.
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single data: only one data stream is being used as input during any one clock cycle.
SIMD
Vector computers are equipped with scalar and vector
hardware
Single instruction: all processing units execute the same instruction issued by the control unit at any given clock cycle, as shown in the figure, where multiple processors execute the instruction given by one control unit.
Multiple data: each processing unit can operate on a different data element; as shown in the figure, multiple data streams are supplied to the processing units.
MIMD
Multiple instruction: every processor may be executing a different instruction stream.
Multiple data: every processor may be working with a different data stream; as shown in the figure, the multiple data streams are provided by shared memory.
MIMD machines can be categorized as loosely coupled or tightly coupled, depending on the sharing of data and control.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
MISD
The same data stream flows through a linear array of
processors executing different instruction streams
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent instruction streams, as shown in the figure.
The single data stream is forwarded to different processing units, each connected to a different control unit, and each executes the instructions given to it by the control unit to which it is attached.
Single Instruction stream over a Single Data stream (SISD): conventional sequential machines or uniprocessors.
Single Instruction stream over Multiple Data streams (SIMD): vector computers, an array of synchronized processing elements.
Multiple Instruction streams and a Single Data stream (MISD): systolic arrays for pipelined execution.
Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers or multiprocessor systems, e.g., a distributed-memory multiprocessor system.
(Figure: Flynn's taxonomy; CU = Control Unit, PE = Processing Element, M = Memory.)
Parallel computers execute programs in MIMD mode.
Two major classes of parallel computers:
shared-memory multiprocessors
message-passing multicomputers
They differ in memory sharing and in the mechanisms used for interprocessor communication.
Multiprocessor system
Processors communicate with each other through shared variables in a common memory.
Multicomputer system
Each computer node has a local memory, unshared with other nodes.
Interprocessor communication is done through message passing among the nodes.
Vector processors (implicit)
Vector instructions
Equipped with multiple vector pipelines
Concurrently used under hardware or firmware control
Two families of pipelined (explicit) vector processors:
Memory-to-memory architecture
▪ Pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory
Register-to-register architecture
▪ Uses vector registers to interface between the memory and the functional pipelines
Hardware configurations differ from machine to machine (even with the same Flynn classification).
Address spaces of processors
vary among different architectures,
depend on memory organization, and
should match the target application domain.
The communication model and language environments should ideally be machine-independent to allow porting to many computers with minimum conversion costs.
Application developers prefer architectural transparency.
Programmability depends on the programming environment provided to the users.
Conventional computers are used in a sequential programming environment with tools developed for a uniprocessor computer.
Parallel computers need
parallel tools that allow specification or easy detection of parallelism, and
operating systems that can perform parallel scheduling of concurrent events, shared memory allocation, and shared peripheral and communication links.
Use a conventional language (like C, Fortran, Lisp, or Pascal)
to write the program
Use a parallelizing compiler to translate the source code into
parallel code
The compiler must detect parallelism and assign target
machine resources
Success relies heavily on the quality of the compiler.
Programmers write explicit parallel code using parallel dialects of common languages.
The compiler has a reduced need to detect parallelism, but must still preserve existing parallelism and assign target machine resources.
(Figure: two approaches to parallel programming)
(a) Implicit parallelism: source code written by the programmer in sequential languages (C, C++, FORTRAN, LISP), translated by a parallelizing compiler into parallel object code, executed by the runtime system.
(b) Explicit parallelism: source code written by the programmer in concurrent dialects of C, C++, FORTRAN, LISP, translated by a concurrency-preserving compiler into concurrent object code, executed by the runtime system.
Parallel extensions of conventional high-level
languages
Integrated environments provide
different levels of program abstraction,
validation, testing, and debugging,
performance prediction and monitoring,
visualization support to aid program development and performance measurement, and
graphics display and animation of computational results.
Shared-Memory Multiprocessors
Distributed-Memory Multicomputers
Shared-memory parallel computers generally have the ability for all processors to access all memory as a global address space.
Multiple processors can operate independently but share the same memory resources.
Changes in a memory location effected by one processor are visible to all other processors.
Shared-memory machines can be divided into classes based upon memory access times: UMA, NUMA, and COMA.
Three shared-memory multiprocessor models:
▪ The Uniform Memory Access (UMA) model,
▪ The Non-Uniform Memory Access (NUMA) model,
▪ The Cache-Only Memory Architecture (COMA) model
The models differ in how the memory and peripheral resources are shared or distributed.
The UMA Model
The physical memory is uniformly shared by all the processors.
All processors have equal access time to all memory words.
Each processor may use a private cache.
Peripherals are also shared.
Called tightly coupled systems due to the high degree of resource sharing.
The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network.
Symmetric multiprocessor - all processors are equally capable of running the executive programs.
Asymmetric multiprocessor - only one or a subset of the processors has executive capability.
The NUMA Model
A NUMA multiprocessor is a shared-memory system in which the access time varies with the location of the memory word.
The shared memory is physically distributed to all processors as local memories.
The collection of all local memories forms a global address space accessible by all processors.
Remote accesses incur a delay through the interconnection network.
(Figure: the NUMA model)
Globally shared memory
Three memory-access patterns:
▪ The fastest is local memory access
▪ The next is global memory access
▪ The slowest is access of remote memory
Hierarchically structured multiprocessor
Processors are divided into several clusters.
Each cluster is itself a UMA or a NUMA multiprocessor.
The clusters are connected to global shared-memory modules.
The entire system is considered a NUMA machine.
All processors belonging to the same cluster are allowed to uniformly access the cluster shared-memory modules.
All clusters have equal access to the global memory.
The access time to the cluster memory is shorter than that to the global memory.
The COMA Model
Is a special case of a NUMA machine.
The distributed main memories are converted to caches.
There is no memory hierarchy at each processor node.
All the caches form a global address space.
Remote cache access is assisted by the distributed cache directories (D).
Initial data placement is not critical.
(Figure: the COMA model)
Other variants of shared-memory multiprocessors
CC-NUMA
Cache-Coherent Non-Uniform Memory Access
The model can be specified with distributed shared memory and cache directories
CC-COMA
Cache-Coherent COMA
The system consists of
multiple computers, called nodes,
interconnected by a message-passing network that
provides point-to-point static connections among the nodes.
Each node is an autonomous computer consisting of a processor, local memory, and attached disks or I/O peripherals.
All local memories are private and are accessible only by the local processors.
NORMA - no-remote-memory-access machines.
Internode communication is carried out by passing messages through the static connection network.
Multicomputers use hardware routers to pass messages.
A computer node is attached to each router.
Boundary routers may be connected to I/O and peripheral devices.
Message passing between any two nodes involves a sequence of routers and channels.
Mixed types of nodes are allowed in a heterogeneous multicomputer.
Internode communication is achieved through compatible data representations and message-passing protocols.
The first generation was based on processor-board technology using hypercube architecture and software-controlled message switching.
The second generation was implemented with mesh-connected architecture, hardware message routing, and a software environment for medium-grain distributed computing.
Important issues for multicomputers:
Famous topologies include the ring, tree, mesh, torus, hypercube, cube-connected cycles, etc.
Various communication patterns: one-to-one, broadcasting, permutations, and multicast patterns.
Message-routing schemes, network flow-control strategies, deadlock avoidance, virtual channels, message-passing primitives, and program decomposition techniques.
Introduce supercomputers and parallel processors for vector processing and data parallelism.
Classify supercomputers either as
pipelined vector machines using a few powerful processors equipped with vector hardware, or as
SIMD computers emphasizing massive data parallelism.
A vector computer is often built on top of a scalar processor.
The vector processor is attached to the scalar processor as an optional feature.
Program and data are loaded into main memory through a host computer.
All instructions are first decoded by the scalar control unit.
If the decoded instruction is a scalar operation or a program control operation,
it will be directly executed by the scalar processor using the scalar functional pipelines.
If the decoded instruction is a vector operation,
it will be sent to the vector control unit (CU).
The CU supervises the flow of vector data between the main memory and the vector functional pipelines.
A number of vector functional pipelines may be built into a vector processor.
Register-to-register architecture
Vector registers are
▪ used to hold the vector operands and the intermediate and final vector results
▪ programmable in user instructions
▪ each equipped with a component counter, which keeps track of the component registers used in successive pipeline cycles
▪ usually of fixed length, e.g., sixty-four 64-bit component registers per vector register in a Cray Series supercomputer
▪ The vector functional pipelines retrieve operands from and put results into the vector registers
Memory-to-memory architecture
Differs in the use of a vector stream unit to replace the vector registers.
Vector operands and results are directly retrieved from the main memory in superwords,
e.g., 512 bits as in the Cyber 205.
An operational model of an SIMD computer is specified by a 5-tuple: M = { N, C, I, M, R }
N - the number of processing elements (PEs) in the machine.
C - the set of instructions directly executed by the control unit (CU), including scalar and program flow control instructions.
I - the set of instructions broadcast by the CU to all PEs for parallel execution.
M - the set of masking schemes, where each mask partitions the set of PEs into enabled and disabled subsets.
R - the set of data-routing functions, specifying various patterns to be set up in the interconnection network for inter-PE communications.
Performance Measures
The ideal performance of a computer system
demands a perfect match between machine
capability and program behavior.
Machine capability can be enhanced with better:
▪ Hardware technology,
▪ Innovative architectural features, and
▪ Efficient resource management.
Program behavior is difficult to predict due to its heavy dependence on application and run-time conditions.
There are many factors affecting program behavior, including:
algorithm design,
data structures,
language efficiency,
programmer skill, and
compiler technology.
Introduce some fundamental factors for projecting the performance of computers.
They can be used to guide system architects in designing better machines,
and to educate programmers or compiler writers in optimizing codes for more efficient execution.
The simplest measure of program performance is the turnaround time, which includes disk and memory accesses, input and output activities, compilation time, OS overhead, and CPU time.
In order to shorten the turnaround time, one must reduce all these time factors.
In a multiprogrammed computer, the I/O and system overheads of a given program may overlap with the CPU times required by other programs.
It is therefore fair to compare just the total CPU time needed for program execution.
Performance Measures
Response Time (Execution Time, Latency): the time elapsed between the start and the completion of an event.
Throughput (Bandwidth): the amount of work done in a given time.
Performance: the number of events occurring per unit of time.
▪ Note that execution time is the reciprocal of performance: lower execution time implies higher performance.
Performance Measures
A system X is faster than a system Y if, for a given task, the response time on X is lower than on Y.
n = Execution time_Y / Execution time_X = (1 / Performance_Y) / (1 / Performance_X) = Performance_X / Performance_Y
Consequently, the statement that X is n%
faster than Y means:
Execution time_Y / Execution time_X = 1 + n/100

and, since performance is the reciprocal of execution time,

Performance_X / Performance_Y = 1 + n/100, hence n = 100 × (Performance_X - Performance_Y) / Performance_Y
Example: Machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds. Therefore:

n = 100 × (Execution time_B - Execution time_A) / Execution time_A = 100 × (15 - 10) / 10 = 50

i.e., machine A is 50% faster than machine B.
Clock Rate
The processor is driven by a clock with a constant cycle time (τ).
The inverse of the cycle time is the clock rate (f = 1/τ).
CPI - cycles per instruction.
The size of a program is its instruction count (Ic) - the number of machine instructions to be executed.
Different instructions require different numbers of clock cycles to execute.
CPI is an important parameter for measuring the time needed to execute each instruction.
For a given instruction set, we can calculate an average CPI over all instruction types, provided we know their frequencies of appearance in the program.
An accurate estimate of the average CPI requires a large amount of program code to be traced over a long period of time.
CPI will be taken as an average value for a given instruction set and a given program mix.
Let us define the average number of clock cycles per instruction (CPI) as:

CPI = CPU clock cycles for a program / Instruction count = ( Σ (i = 1 to n) CPI_i × I_i ) / Ic

where I_i is the number of times instruction type i is executed and CPI_i is the average number of clock cycles for instruction type i.
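The weighted average can be sketched in Python; the instruction mix below is hypothetical, purely for illustration:

```python
# Average CPI as a weighted sum: CPI = (sum of CPI_i * I_i) / Ic.
def average_cpi(mix):
    """mix: list of (I_i, CPI_i) pairs, one per instruction type."""
    ic = sum(count for count, _ in mix)              # instruction count Ic
    cycles = sum(count * cpi for count, cpi in mix)  # total clock cycles
    return cycles / ic

# Hypothetical trace: 50000 one-cycle, 30000 two-cycle, 20000 four-cycle instructions.
print(average_cpi([(50_000, 1), (30_000, 2), (20_000, 4)]))  # 1.9
```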
CPI and clock rate depend on the technology and architecture of the machine.
Instruction count depends on the instruction set of the machine and on compiler technology.
The CPU time (T), or execution time, is the time needed to execute a given program, excluding the waiting time for I/O or other running programs.
CPU time is further divided into the user CPU time and the system CPU time.
The CPU time is estimated as:

T = Ic × CPI × τ = ( Σ (i = 1 to n) CPI_i × I_i ) × τ

where τ is the processor cycle time.
The execution of an instruction requires going through a cycle of events involving the instruction fetch, decode, operand(s) fetch, execution, and store of result(s):
p is the number of processor cycles needed to decode and execute the instruction,
m is the number of memory references needed, and
k is the ratio between the memory cycle time and the processor cycle time (a latency factor: how much slower the memory is with respect to the CPU).

T = Ic × CPI × τ = Ic × (p + m × k) × τ
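A minimal numeric sketch of this model (all parameter values below are hypothetical):

```python
# CPU time from the memory-reference model: T = Ic * (p + m*k) * tau.
def cpu_time(ic, p, m, k, tau):
    """ic: instruction count, p: processor cycles per instruction,
    m: memory references per instruction, k: memory/processor
    cycle-time ratio, tau: processor cycle time in seconds."""
    return ic * (p + m * k) * tau

# 100M instructions, p = 4, 0.4 memory references per instruction,
# memory 10x slower than the CPU, 10 ns cycle time:
print(cpu_time(100e6, 4, 0.4, 10, 10e-9))  # ~8.0 seconds
```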
Now let C be the total number of cycles required to execute a program:

C = Ic × CPI

The time to execute the program is then

T = C × τ = C / f = Ic × CPI × τ = Ic × CPI / f
MIPS - Million Instructions Per Second
A measure of processor speed:

MIPS = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)
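The middle form of the relation is easy to check numerically (the clock rate and CPI below are hypothetical):

```python
# MIPS rate from clock rate and average CPI: MIPS = f / (CPI * 10^6).
def mips(f_hz, cpi):
    return f_hz / (cpi * 1e6)

print(mips(40e6, 2.0))  # 20.0 MIPS for a hypothetical 40 MHz machine with CPI = 2
```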
MFLOPS - Million Floating-Point Operations Per Second
Another performance measure used to evaluate computers:

MFLOPS = Number of floating-point operations in a program / (Execution time × 10^6)
Throughput Rate
The number of programs executed per unit time.
Ws = system throughput (programs/second)
Wp = CPU throughput: Wp = 1/T = (MIPS × 10^6) / Ic, based on the MIPS rate and the average program length Ic.
In general, Ws < Wp.
Throughput (W/Tn): the execution rate on an n-processor system, measured in FLOPs/unit-time or instructions/unit-time.
Speedup (Sn = T1/Tn): how much faster an actual machine with n processors is compared to one processor.
Efficiency (En = Sn/n): the fraction of the maximum speedup achieved by n processors.
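These definitions can be sketched directly (the timings below are hypothetical):

```python
# Speedup Sn = T1/Tn and efficiency En = Sn/n for an n-processor run.
def speedup(t1, tn):
    return t1 / tn

def efficiency(t1, tn, n):
    return speedup(t1, tn) / n

# Hypothetical: 100 s on one processor, 16 s on eight processors.
print(speedup(100, 16))        # 6.25
print(efficiency(100, 16, 8))  # 0.78125
```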
The attributes of a computer system which allow it to be linearly scaled up or down in size, to handle smaller or larger workloads, or to obtain proportional decreases or increases in speed on a given application.
Good scalability requires both a good algorithm and a machine with the right properties.
Thus, in general, there are five performance factors (Ic, p, m, k, τ) which are influenced by four system attributes.
System attributes versus performance factors (CPI is decomposed into p, m, and k; X marks the factors each attribute influences):

System Attribute                | Ic | p | m | k | τ
--------------------------------|----|---|---|---|---
Instruction-set architecture    | X  | X |   |   |
Compiler technology             | X  | X | X |   |
CPU implementation & technology |    | X |   |   | X
Memory hierarchy                |    |   |   | X | X
A benchmark program is executed on a 40 MHz processor. The benchmark program has the following statistics:

Instruction Type | Instruction Count | Clock Cycle Count
-----------------|-------------------|------------------
Arithmetic       | 45000             | 1
Branch           | 32000             | 2
Load/Store       | 15000             | 2
Floating Point   | 8000              | 2

Calculate the average CPI, MIPS rate, and execution time for the above benchmark program.
Solved in class
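One way to set up the computation (a sketch, not the in-class solution; variable names are illustrative):

```python
# Benchmark statistics from the table above: {type: (count, cycles per instruction)}.
mix = {
    "Arithmetic":     (45_000, 1),
    "Branch":         (32_000, 2),
    "Load/Store":     (15_000, 2),
    "Floating Point": ( 8_000, 2),
}
f = 40e6  # 40 MHz clock rate

ic = sum(count for count, _ in mix.values())             # instruction count Ic
cycles = sum(count * cpi for count, cpi in mix.values()) # total cycles C

avg_cpi = cycles / ic            # average CPI
exec_time = cycles / f           # T = C / f
mips_rate = f / (avg_cpi * 1e6)  # MIPS = f / (CPI * 10^6)

print(avg_cpi)    # 1.55
print(exec_time)  # 0.003875 s
print(mips_rate)  # ~25.8
```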
(Book Problem 1.4) Solved in class
(Book Problem 1.6) Solved in class
Operation | Frequency | CPI
----------|-----------|----
ALU ops   | 35%       | 1
Loads     | 25%       | 2
Stores    | 15%       | 2
Branches  | 25%       | 3
Compute the Average CPI.
Solved in class
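A one-line sketch of the computation (again illustrative, not the in-class solution):

```python
# Average CPI from relative frequencies: CPI = sum of freq_i * CPI_i.
ops = [(0.35, 1), (0.25, 2), (0.15, 2), (0.25, 3)]  # (frequency, CPI) per the table
avg_cpi = sum(freq * cpi for freq, cpi in ops)
print(avg_cpi)  # ~1.9
```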
For the purpose of solving a given application problem, you benchmark a program on two computer systems.
On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.
On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.
In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.
A. Compute the relative frequency of occurrence of each type of instruction executed in both systems.
B. Find the average CPI for each system.
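A sketch of parts A and B (names are illustrative; counts are in millions, as given):

```python
# Cycle counts per instruction type, as stated in the problem.
cpi_per_type = {"ALU": 1, "Load": 3, "Branch": 5}
instruction_counts = {
    "A": {"ALU": 80, "Load": 40, "Branch": 25},  # millions of instructions
    "B": {"ALU": 50, "Load": 50, "Branch": 40},
}

results = {}
for system, counts in instruction_counts.items():
    total = sum(counts.values())
    freqs = {op: n / total for op, n in counts.items()}           # part A
    avg_cpi = sum(freqs[op] * cpi_per_type[op] for op in counts)  # part B
    results[system] = (freqs, avg_cpi)
    print(system, freqs, avg_cpi)
```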