8/22/2019 Course-module-1 _Compatibility Mode_.pdf
1
Santanu Chaudhury
Computer Architecture EEL308
2
Books:
Computer Organization and Design, The Hardware/Software Interface
Author(s) : Patterson & Hennessy
Imprint: Morgan Kaufmann
Additional References:
(i) Computer Architecture and Organisation: J.P. Hayes
(ii) Hamacher & Zaky
3
Evaluation & Attendance Policy
Minor-1: 20; Minor-2: 20
Major: 35
Tutorial, Assignments, Quiz: 25
Attendance Policy:
One grade less if attendance less than 75%
No E-grade if attendance less than 75%
4
Introduction
5
What is a computer?
An electronic device that can be programmed for solving a problem
Components of a Computer
processor
input (mouse, keyboard, scanner, camera)
output (display, printer)
memory (disk drives, DRAM, SRAM, CD)
network
Rapidly changing technology
vacuum tube -> transistor -> IC -> VLSI
doubling every 1.5 years: memory capacity, processor speed
6
Computer Architecture and Organization
Computer Architecture refers to those attributes of a system visible to a programmer
instruction set
number of bits used to represent various data types
i/o mechanisms
techniques for addressing memory
Computer Organization refers to the operational units and their interconnections that realize the architectural specifications
control signals
interfaces between the computer and peripherals; memory technology
Distinction is fundamental
manufacturers can offer computer models with the same architecture but with differences in organization
7
Structure and Function
A computer is a complex system
the hierarchic nature of complex systems is essential for their design and description
behavior at each level depends only on a simplified, abstracted characterization of the system at the lower level
At each level the designer is concerned with
structure
CPU - Central Processing Unit
Main memory: Stores data
I/O: moves data between computer and its external environment
System Interconnection
function
data processing, data storage
data movement
control
8
Structure - Top Level
[Diagram: Computer = Central Processing Unit + Main Memory + Input/Output + Systems Interconnection; external: Peripherals, Communication lines]
9
Structure - The CPU
[Diagram: CPU = Arithmetic and Logic Unit + Control Unit + Registers + Internal CPU Interconnection; the CPU connects to Memory and I/O over the System Bus]
10
Structure - The Control Unit
[Diagram: Control Unit = Sequencing Logic + Registers and Decoders + Control Memory; within the CPU it drives the ALU, Registers, and Internal Bus]
11
History
First generation computers were made with vacuum valves and used punched cards as the main (non-volatile) storage medium. A general purpose computer of this era was ENIAC (Electronic Numerical Integrator and Computer), which was completed in 1946.
The next major step in the history of computing was the invention of the transistor in 1947. Transistorized computers are normally referred to as 'Second Generation' and dominated the late 1950s and early 1960s.
12
History
'Third Generation' computers used Jack St. Clair Kilby's invention - the integrated circuit or microchip;
the first integrated circuit was produced in September 1958, but computers using them didn't begin to appear until 1963.
In 1964 IBM announced System/360 with increased storage and processing capabilities.
formed the foundation of modern computer architecture
In 1971 Intel announced the 4004 - the first chip to contain all of the components of a CPU
the microprocessor was born
Fourth generation computers used very large scale integration (VLSI) as the underlying technology
8086 and all of Intel's processors for the IBM PC and compatibles
Supercomputers of the era were immensely powerful, like the Cray-1
13
CRAY X-MP
14
Performance of computers has shown drastic improvement over time
Parameter for evaluating Performance: Response Time (latency)
How long does it take for my job to run?
How long does it take to execute a job?
How long must I wait for the database query?
Parameter for evaluating Performance: Throughput
How many jobs can the machine run at once?
What is the average execution rate?
How much work is getting done?
Performance of Computers
15
Elapsed Time covers everything (disk and memory accesses, I/O, etc.)
a useful indicator, but often not good for comparison purposes
CPU time
doesn't count I/O or time spent running other programs
can be broken up into system time and user time
User CPU time
time spent executing the lines of code that are "in" our program
Execution Time
16
For some program running on machine X,
Performance_X = 1 / Execution time_X
"X is n times faster than Y" means
Performance_X / Performance_Y = n
A Definition of Performance
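This definition can be sanity-checked with a short Python sketch; the two execution times below are made-up values for illustration, not from the slides.

```python
# Performance as the reciprocal of execution time; "n times faster"
# is then a ratio of performances. The times are hypothetical.

def performance(execution_time_s):
    return 1.0 / execution_time_s

time_x = 10.0  # hypothetical execution time on machine X (seconds)
time_y = 15.0  # hypothetical execution time on machine Y (seconds)

# X is n times faster than Y:
n = performance(time_x) / performance(time_y)
print(n)
```

With these times, X is 1.5 times faster than Y.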
17
Clock Cycles
Instead of reporting execution time in seconds, we often use cycles
The internal clock in a computer co-ordinates the execution of instructions
Clock ticks indicate when to start activities (one abstraction):
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second
seconds/program = (cycles/program) x (seconds/cycle)
18
Could assume that # of cycles = # of instructions
However, different instructions take different amounts of time on different machines.
[Timeline: the 1st, 2nd, 3rd, ... instructions executing one after another, one clock cycle each]
How many cycles are required for a program?
8/22/2019 Course-module-1 _Compatibility Mode_.pdf
10/102
19
Multiplication takes more time than addition
Floating point operations take longer than integer ones
Accessing memory takes more time than accessing registers
Different numbers of cycles for different instructions
20
A given program will require
some number of instructions (machine instructions)
some number of cycles
some number of seconds
We have a vocabulary that relates these quantities:
cycle time (seconds per cycle)
clock rate (cycles per second)
CPI (cycles per instruction)
a floating point intensive application might have a higher CPI
MIPS (millions of instructions per second)
this would be higher for a program using simple instructions
Terminology
21
Suppose we have two machines
For some program,
Machine A has a clock cycle time of 10 ns and a CPI of 2.0
Machine B has a clock cycle time of 20 ns and a CPI of 1.2
What machine is faster for this program, and by how much?
CPI Example
If the program requires the same number of
instructions, N, on both machines, then
machine A requires 2.0 x N x 10 ns = 20N ns and
machine B requires 1.2 x N x 20 ns = 24N ns, so machine A is 1.2 times faster
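The arithmetic above can be reproduced in a small Python sketch; N is the common instruction count, which cancels in the ratio.

```python
# Execution time = instruction count x CPI x cycle time.

def exec_time_ns(n_instructions, cpi, cycle_time_ns):
    return n_instructions * cpi * cycle_time_ns

N = 1_000_000                       # any common instruction count works
time_a = exec_time_ns(N, 2.0, 10)   # machine A: 10 ns cycle, CPI 2.0
time_b = exec_time_ns(N, 1.2, 20)   # machine B: 20 ns cycle, CPI 1.2
print(time_b / time_a)              # machine A is 1.2 times faster
```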
22
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? By how much? What is the CPI for each sequence?
Number of Instructions Example
First sequence: 10 cycles, CPI = 2.0
Second sequence: 9 cycles, CPI = 1.5
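The answer can be verified mechanically; a minimal Python sketch using the cycle costs from the slide:

```python
# Cycles per instruction class, as given on the slide.
CYCLES = {"A": 1, "B": 2, "C": 3}

def total_cycles(mix):  # mix maps instruction class -> count
    return sum(CYCLES[c] * n for c, n in mix.items())

def cpi(mix):
    return total_cycles(mix) / sum(mix.values())

seq1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

print(total_cycles(seq1), cpi(seq1))  # 10 2.0
print(total_cycles(seq2), cpi(seq2))  # 9 1.5
```

The second sequence executes more instructions but finishes in fewer cycles.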
23
Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time?
MIPS example
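A Python sketch of this computation (instruction counts in millions, cycle costs from the slide) shows why the two metrics disagree:

```python
CLOCK_HZ = 100e6                   # 100 MHz machine
CYCLES = {"A": 1, "B": 2, "C": 3}  # cycles per instruction class

def time_and_mips(counts_millions):
    instr = sum(counts_millions.values()) * 1e6
    cycles = sum(CYCLES[c] * n for c, n in counts_millions.items()) * 1e6
    time_s = cycles / CLOCK_HZ
    mips = instr / (time_s * 1e6)  # millions of instructions per second
    return time_s, mips

t1, mips1 = time_and_mips({"A": 5, "B": 1, "C": 1})   # first compiler
t2, mips2 = time_and_mips({"A": 10, "B": 1, "C": 1})  # second compiler
print(mips1, mips2)  # the second compiler wins on MIPS
print(t1, t2)        # but the first compiler wins on execution time
```

The second compiler rates 80 MIPS against 70, yet takes 0.15 s against 0.1 s: a higher MIPS rating does not mean a faster program.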
24
Performance is best determined by running a real application
Use programs typical of the expected workload
Or, typical of the expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
SPEC (System Performance Evaluation Cooperative)
companies have agreed on a set of real programs and inputs
valuable indicator of performance (and compiler technology)
Benchmarks
25
SPEC95
Benchmark   Description
go          Artificial intelligence; plays the game of Go
m88ksim     Motorola 88k chip simulator; runs test program
gcc         The Gnu C compiler generating SPARC code
compress    Compresses and decompresses file in memory
li          Lisp interpreter
ijpeg       Graphic compression and decompression
perl        Manipulates strings and prime numbers
vortex      A database program
tomcatv     A mesh generation program
swim        Shallow water model with 513 x 513 grid
su2cor      Quantum physics; Monte Carlo simulation
hydro2d     Astrophysics; hydrodynamic Navier-Stokes equations
mgrid       Multigrid solver in 3-D potential field
applu       Parabolic/elliptic partial differential equations
turb3d      Simulates isotropic, homogeneous turbulence in a cube
apsi        Solves problems regarding temperature, wind velocity, and distribution of pollutants
fpppp       Quantum chemistry
wave5       Plasma physics; electromagnetic particle simulation
26
SPEC95
Can a machine with a slower clock rate have better performance?
[Plot: SPECfp rating (0-10) vs. clock rate (50-250 MHz) for the Pentium and Pentium Pro]
27
Uniprocessor to Multiprocessor
Multiple processors on a single chip: the multi-core microprocessor
Impact more on throughput than on response time
To improve response time, one may need to rewrite the code to take advantage of multiple cores
28
Instruction Set Architecture
29
Instruction Set Architecture
A very important abstraction
interface between hardware and low-level software
standardizes instructions, machine language bit patterns, etc.
There can be different implementations of the same architecture
30
Instructions
Language of the Machine
More primitive than higher level languages
Very restrictive
The variety of functions a CPU may perform is reflected in its instruction set
31
Instruction Set of Computers: Complex Instruction Set
Each instruction in a CISC instruction set might perform a series of operations inside the processor.
Reduces the number of instructions required to implement a given program, and allows the programmer to learn a small but flexible set of instructions.
You can even have a single instruction computer
Since earlier memory was slow and expensive, the CISC philosophy made sense
Most common microprocessor designs --- including the Intel 80x86 and Motorola 68K series --- also follow the CISC philosophy
Later, it was discovered that, by reducing the full set to only the most frequently used instructions, the computer would get more work done in a shorter amount of time for most applications - RISC
32
Instruction Set of Computers: Reduced Instruction Set
Background
With advances in semiconductor technology, the difference in speed between main memory and processor reduced.
a sequence of simple instructions produces the same results as a sequence of complex instructions, but can be implemented with simpler (and faster) hardware - assuming that memory can keep up
RISC forms the basis for modern design
33
Instruction Set of Computers: Reduced Instruction Set
RISC characteristics
Simple instruction set.
In a RISC machine, the instruction set contains simple, basic instructions, from which more complex instructions can be composed.
Same length instructions.
Each instruction is the same length, so that it may be fetched in a single operation.
1 machine-cycle instructions.
Most instructions complete in one machine cycle.
34
RISC Architecture
We'll be working with the MIPS instruction set architecture
similar to other architectures developed since the 1980's
used by NEC, Nintendo, Silicon Graphics, Sony
35
Instructions are bits
Programs are stored in memory
to be read or written just like data
Fetch & Execute Cycle
Instructions are fetched and put into a special register in the processor
Bits in the register "control" the subsequent actions
Fetch the next instruction and continue
[Diagram: Processor connected to Memory holding data, programs, compilers, editors, etc.]
Stored Program Concept
36
Elements of Instruction
Elements of an instruction
Operation Code
Source Operand Reference
one or two
Result Operand Reference
(may be) Next Instruction Reference
37
MIPS arithmetic Instructions
All instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code: A = B + C
MIPS code: add $s0, $s1, $s2
$si indicates a REGISTER: storage inside the CPU, associated with variables by the compiler
Operands can only be registers: 32 registers provided
Principle of Regularity
38
Registers vs. Memory
[Diagram: Processor (Control + Datapath) connected to Memory and to I/O (Input, Output)]
Arithmetic instruction operands must be registers; only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables?
39
Memory Organization
Memory is viewed as a large, single-dimension array, with an address.
A memory address is an index into the array
"Byte addressing" means that the index points to a byte of memory.
[Diagram: byte-addressed memory; addresses 0, 1, 2, 3, ... each holding 8 bits of data]
40
Memory Organization
Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
2^32 bytes with byte addresses from 0 to 2^32 - 1
2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
[Diagram: word-aligned memory; byte addresses 0, 4, 8, 12, ... each holding 32 bits of data]
Registers hold 32 bits of data
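The byte/word relationship above amounts to a divide-by-4 with an alignment check; a minimal sketch:

```python
WORD_BYTES = 4  # a MIPS word is 4 bytes

def word_index(byte_addr):
    # words live at byte addresses 0, 4, 8, ...; anything else is unaligned
    if byte_addr % WORD_BYTES != 0:
        raise ValueError("unaligned word address")
    return byte_addr // WORD_BYTES

print(word_index(12))  # byte address 12 holds word 3
```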
41
More Instructions
MIPS: loading words but addressing bytes; arithmetic on registers only
Instruction Meaning
add $s1, $s2, $s3 $s1 = $s2 + $s3
sub $s1, $s2, $s3 $s1 = $s2 - $s3
lw $s1, 100($s2) $s1 = Memory[$s2+100]
sw $s1, 100($s2) Memory[$s2+100] = $s1
42
Instructions
Load and store instructions
Example:
C code: A[8] = h + A[8];
MIPS code: lw $t0, 32($s3)
           add $t0, $s2, $t0
           sw $t0, 32($s3)
Store word has destination last
Remember arithmetic operands are registers, not memory!
43
Our First Example
Can we figure out the code?
swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

swap: muli $2, $5, 4
      add  $2, $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31
44
Instructions, like registers and words of data, are also 32 bits long
Example: add $t0, $s1, $s2
registers have numbers: $t0=8, $s1=17, $s2=18
Instruction Format:
op rs rt rd shamt funct
op: 6 bits, opcode field
rs: 5 bits, first register source operand
rt: 5 bits, second register source operand
rd: 5 bits, register destination operand
shamt: 5 bits, shift amount used in shift instructions
funct: 6 bits, selects the specific variant of the operation in the opcode field
This is R-type instruction format
Machine Language
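Packing the R-type fields can be sketched by shifting each field into place. The field widths are from the slide; the register numbers follow the register-convention table later in the module, and the add funct value (32) follows the example.

```python
def r_type(op, rs, rt, rd, shamt, funct):
    # fields are 6/5/5/5/5/6 bits, most significant first
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2 -> op=0, rs=$s1(17), rt=$s2(18), rd=$t0(8), shamt=0, funct=32
word = r_type(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")  # 00000010001100100100000000100000
```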
45
Consider the load-word and store-word instructions.
What would the regularity principle have us do?
New principle: Good design demands a compromise
Introduce a new type of instruction format
I-type for data transfer instructions
the other format was R-type for register operations
Example: lw $t0, 32($s2)
35 | 18 | 8 | 32
op | rs | rt | 16 bit number
16 bit offset: +/- 2^15 bytes from the address in the base register rs
Where's the compromise?
Instructions of fixed length but of different formats, to take care of different functional requirements
Multiple formats complicate the hardware
Machine Language
46
Decision making instructions
alter the control fl ow,
i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
bne $t0, $t1, Label
Go to the statement at address Label if the value in register $t0 does not equal the value in register $t1
beq $t0, $t1, Label
Example: if (i==j) h = i + j;
bne $s0, $s1, Label
add $s3, $s0, $s1
Label: ....
Control
47
MIPS unconditional branch instruction:
j Label
Example:
if (i!=j)          beq $s4, $s5, Lab1
    h=i+j;             add $s3, $s4, $s5
else                   j Lab2
    h=i-j;         Lab1: sub $s3, $s4, $s5
                   Lab2: ...
Format of Branch instructions
Branch - I format
The address corresponding to Label is given by a 16 bit offset
Unconditional branch - Jump instruction
J format - 26 bit address field
Control
48
Instructions:
bne $t4,$t5,Label   Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5
Formats:
Use a register (as lw and sw do) and add the 16 bit address to its contents
use the Instruction Address Register (PC = program counter)
most branches are local (principle of locality)
Jump instructions just use the high order bits of the PC: address boundaries of 256 MB
I: op rs rt 16 bit address
Addresses in Branches
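The PC-relative scheme described above can be sketched directly: the 16-bit field counts words and is added to the address of the instruction after the branch (PC + 4).

```python
def branch_target(pc, offset_words):
    # the word offset is relative to the instruction after the branch
    return (pc + 4) + 4 * offset_words

print(hex(branch_target(0x1000, 3)))   # forward branch over 3 instructions
print(hex(branch_target(0x1000, -2)))  # backward branch, e.g. a loop
```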
49
So far:
Instruction Meaning
add $s1,$s2,$s3 $s1 = $s2 + $s3
sub $s1,$s2,$s3 $s1 = $s2 - $s3
lw $s1,100($s2) $s1 = Memory[$s2+100]
sw $s1,100($s2) Memory[$s2+100] = $s1
bne $s4,$s5,L Next instr. is at L if $s4 != $s5
beq $s4,$s5,L Next instr. is at L if $s4 = $s5
j Label Next instr. is at Label
Formats:
R: op rs rt rd shamt funct
I: op rs rt 16 bit address
J: op 26 bit address
50
We have beq and bne; what about branch-if-less-than?
New instruction: slt $t0, $s1, $s2 means
    if $s1 < $s2 then $t0 = 1
    else $t0 = 0
Can use this instruction to build branch-if-less-than:
slt $t0,$s0,$s1
bne $t0,$zero,Less
Register $zero always contains 0
Control Flow
51
While loop
Example C code:
while (save[i] == k)
    i = i + j;
Corresponding MIPS code
Assume i, j, k correspond to registers $s3, $s4, $s5 respectively, and the base of the array save is in $s6.
Loop: add $t1,$s3,$s3  # reg $t1 = 2*i
      add $t1,$t1,$t1  # reg $t1 = 4*i
      add $t1,$t1,$s6  # $t1 = address of save[i]
      lw  $t0, 0($t1)  # $t0 = save[i]
      bne $t0,$s5,Exit # go to Exit if save[i] != k
      add $s3,$s3,$s4  # i = i + j
      j   Loop         # go to Loop
Exit:
52
Register Use Conventions
Name     Register number  Usage
$zero    0                the constant value 0
$v0-$v1  2-3              values for results and expression evaluation
$a0-$a3  4-7              arguments
$t0-$t7  8-15             temporaries
$s0-$s7  16-23            saved
$t8-$t9  24-25            more temporaries
$gp      28               global pointer
$sp      29               stack pointer
$fp      30               frame pointer
$ra      31               return address
53
Small constants are used quite frequently (50% of operands)
e.g., A = A + 5;
      B = B + 1;
      C = C - 18;
Possible mechanisms:
put 'typical constants' in memory along with instructions and load them
create hard-wired registers (like $zero) for constants like one
MIPS instructions:
addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori  $29, $29, 4
How do we make this work?
Instructions are I-type
16 bit field for the constant
Constants
54
We'd like to be able to load a 32 bit constant into a register
Must use two instructions; first, a new "load upper immediate" instruction:
lui $t0, 1010101010101010
The upper half is loaded and the lower half is filled with zeros:
1010101010101010 0000000000000000
Then we must get the lower order bits right:
ori $t0, $t0, 1010101010101010
    1010101010101010 0000000000000000
ori 0000000000000000 1010101010101010
  = 1010101010101010 1010101010101010
How about larger constants?
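The lui/ori pair above is just a shift and an OR on the two 16-bit halves; a quick sketch on Python integers:

```python
def lui(const16):
    return (const16 & 0xFFFF) << 16   # upper half set, lower half zero-filled

def ori(reg, const16):
    return reg | (const16 & 0xFFFF)   # fill in the lower 16 bits

t0 = lui(0b1010101010101010)
t0 = ori(t0, 0b1010101010101010)
print(f"{t0:032b}")  # 10101010101010101010101010101010
```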
55
Supporting Procedures in Hardware
Execution of a procedure requires:
Place parameters so that the procedure can access them
Transfer control to the procedure
Acquire storage resources for the procedure
Place the result so that the calling procedure can access it
Return control to the point of origin
In the MIPS architecture
$a0 - $a3: four argument registers in which to pass parameters
$v0 - $v1: two value registers in which to return results
$ra: one return address register to return to the point of origin
Special instruction:
jal ProcedureAddress
Control jumps to the address and simultaneously saves the address of the following instruction in $ra
56
Assembly Language vs. Machine Language
57
Other Issues
58
simple instructions, all 32 bits wide
very structured, no unnecessary baggage
only three instruction formats
rely on the compiler to achieve performance
what are the compiler's goals?
help the compiler where we can
R: op rs rt rd shamt funct
I: op rs rt 16 bit address
J: op 26 bit address
Summarising
59
MIPS addressing modes:
1. Immediate addressing: op | rs | rt | Immediate; the operand is a constant within the instruction itself
2. Register addressing: op | rs | rt | rd | ... | funct; the operand is a register
3. Base addressing: op | rs | rt | Address; the operand is in memory (byte, halfword, or word) at the address given by a register plus the 16 bit constant
4. PC-relative addressing: op | rs | rt | Address; the branch address is the PC plus the 16 bit constant
5. Pseudodirect addressing: op | Address; the jump address is the 26 bit constant concatenated with the upper bits of the PC
60
Design alternative:
provide more powerful operations
goal is to reduce the number of instructions executed
danger is a slower cycle time and/or a higher CPI
Sometimes referred to as RISC vs. CISC
virtually all new instruction sets since 1982 have been RISC
VAX: minimize code size, make assembly language easy
instructions from 1 to 54 bytes long!
We'll look at PowerPC and 80x86
Alternative Architectures
61
PowerPC
Indexed addressing
example: lw $t1,$a0+$s3  # $t1 = Memory[$a0+$s3]
What do we have to do in MIPS?
Update addressing
update a register as part of a load (for marching through arrays)
example: lwu $t0,4($s3)  # $t0 = Memory[$s3+4]; $s3 = $s3+4
What do we have to do in MIPS?
Others:
load multiple / store multiple
a special counter register: bc Loop decrements the counter, and if it is not 0, goes to Loop
62
80x86
1978: The Intel 8086 is announced (16 bit architecture)
1980: The 8087 floating point coprocessor is added
1982: The 80286 increases the address space to 24 bits, adds instructions
1985: The 80386 extends to 32 bits, new addressing modes
1989-1995: The 80486, Pentium, Pentium Pro add a few instructions (mostly designed for higher performance)
1997: MMX is added
This history illustrates the impact of the "golden handcuffs" of compatibility
adding new features as someone might add clothing to a packed bag
an architecture that is difficult to explain and impossible to love
63
A dominant architecture: 80x86
See your textbook for a more detailed description
Complexity:
Instructions from 1 to 17 bytes long
one operand must act as both a source and destination
one operand can come from memory
complex addressing modes, e.g., base or scaled index with 8 or 32 bit displacement
Saving grace:
the most frequently used instructions are not too difficult to build
compilers avoid the portions of the architecture that are slow
what the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective
64
Instruction complexity is only one variable
lower instruction count vs. higher CPI / lower clock rate
Design Principles:
simplicity favors regularity
smaller is faster
good design demands compromise
make the common case fast
Instruction set architecture
a very important abstraction indeed!
Summary
65
Computer Arithmetic
66
Arithmetic
Basic computation involves arithmetic operations
Instructions for arithmetic operations
Arithmetic operations implemented in hardware
Arithmetic-Logical Unit (ALU)
[Diagram: ALU with 32-bit inputs a and b, an operation select input, and a 32-bit result output]
67
Bits are just bits (no inherent meaning)
conventions define the relationship between bits and numbers
Binary numbers (base 2):
0000 0001 0010 0011 0100 0101 0110 0111 1000 1001...
decimal: 0 ... 2^n - 1
Problems:
With a fixed set of bits you can represent only a finite set of numbers, but the set of possible numbers (even just the integers) is infinite
how to represent fractions and real numbers?
how to represent negative numbers?
Which bit patterns will represent which numbers?
Numbers
68
Sign Magnitude    One's Complement    Two's Complement
000 = +0          000 = +0            000 = +0
001 = +1          001 = +1            001 = +1
010 = +2          010 = +2            010 = +2
011 = +3          011 = +3            011 = +3
100 = -0          100 = -3            100 = -4
101 = -1          101 = -2            101 = -3
110 = -2          110 = -1            110 = -2
111 = -3          111 = -0            111 = -1
Issues: balance, number of zeros, ease of operations
Which one is best? Why?
Possible Representations of Negative Integers
Answer: two's complement; it has a single, consistent zero
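The three conventions in the table can be decoded programmatically; a sketch for n-bit patterns:

```python
def sign_magnitude(bits, n=3):
    mag = bits & ((1 << (n - 1)) - 1)        # low n-1 bits are the magnitude
    return -mag if bits >> (n - 1) else mag  # high bit is the sign

def ones_complement(bits, n=3):
    if bits >> (n - 1) == 0:
        return bits
    return -((~bits) & ((1 << (n - 1)) - 1))  # negative: complement the bits

def twos_complement(bits, n=3):
    return bits - (1 << n) if bits >> (n - 1) else bits

# the "100" row of the table:
print(sign_magnitude(0b100), ones_complement(0b100), twos_complement(0b100))
# prints 0 -3 -4 (sign-magnitude "-0" is numerically 0)
```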
69
32 bit signed numbers (MIPS):
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = +1ten
0000 0000 0000 0000 0000 0000 0000 0010two = +2ten
...
0111 1111 1111 1111 1111 1111 1111 1110two = +2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = +2,147,483,647ten  (maxint)
1000 0000 0000 0000 0000 0000 0000 0000two = -2,147,483,648ten  (minint)
1000 0000 0000 0000 0000 0000 0000 0001two = -2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0010two = -2,147,483,646ten
...
1111 1111 1111 1111 1111 1111 1111 1101two = -3ten
1111 1111 1111 1111 1111 1111 1111 1110two = -2ten
1111 1111 1111 1111 1111 1111 1111 1111two = -1ten
70
Negating a two's complement number:
invert all bits and add 1
remember: negate and invert are quite different!
Converting n bit numbers into numbers with more than n bits:
the MIPS 16 bit immediate gets converted to 32 bits for arithmetic
copy the most significant bit (the sign bit) into the other bits
0010 -> 0000 0010
1010 -> 1111 1010
"sign extension" (lbu vs. lb)
Two's Complement Operations
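Sign extension as described can be sketched on plain integers, reproducing the 4-bit to 8-bit examples:

```python
def sign_extend(bits, from_n, to_n):
    # copy the most significant (sign) bit into the new upper bits
    if (bits >> (from_n - 1)) & 1:
        bits |= ((1 << (to_n - from_n)) - 1) << from_n
    return bits & ((1 << to_n) - 1)

print(f"{sign_extend(0b0010, 4, 8):08b}")  # 00000010
print(f"{sign_extend(0b1010, 4, 8):08b}")  # 11111010
```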
71
Just like in school:
  0111      0111      0110
+ 0110    - 0110    - 0101
Two's complement operations are easy
subtraction using addition of negative numbers:
  0111
+ 1010
Overflow (result too large for the finite computer word):
e.g., adding two n-bit numbers does not always yield an n-bit number
  0111
+ 0001
  1000  (becomes negative!)
note that the term overflow is somewhat misleading; it does not mean a carry overflowed
Addition & Subtraction
72
No overflow when adding a positive and a negative number
No overflow when the signs are the same for subtraction
Overflow occurs when the value affects the sign:
overflow when adding two positives yields a negative
or, adding two negatives gives a positive
or, subtracting a negative from a positive gives a negative
or, subtracting a positive from a negative gives a positive
Consider the operations A + B and A - B
Can overflow occur if B is 0?
Can overflow occur if A is 0?
Detecting Overflow
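The sign rule above can be written down directly; a sketch for 4-bit two's complement addition:

```python
N = 4  # word width for this sketch

def add_overflows(a_bits, b_bits):
    r_bits = (a_bits + b_bits) & ((1 << N) - 1)    # keep only N bits
    sa, sb, sr = a_bits >> (N - 1), b_bits >> (N - 1), r_bits >> (N - 1)
    # overflow iff the operands share a sign and the result's sign differs
    return sa == sb and sa != sr

print(add_overflows(0b0111, 0b0001))  # True: 7 + 1 does not fit in 4 bits
print(add_overflows(0b0111, 0b1010))  # False: mixed signs never overflow
```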
73
An exception (interrupt) occurs in MIPS
Control jumps to a predefined address for the exception
The interrupted address is saved for possible resumption
Handling is based on the requirements of the software
Don't always want to detect overflow
Unsigned arithmetic: new MIPS instructions addu, addiu, subu
note: addiu still sign-extends!
note: sltu, sltiu for unsigned comparisons
Effects of Overflow
74
Bit-wise AND, OR, Invert
Shift left
Shift right
Additional Operations
Bit-wise XOR, XNOR
Logical Operations
75
76
Let's build an ALU to support the andi and ori instructions
we'll just build a 1 bit ALU, and use 32 of them
Possible implementation (sum-of-products):
[Diagram: 1-bit logic unit with inputs a and b and an operation select producing result; truth table with columns op, a, b, res]
An ALU (arithmetic logic unit)
77
Selects one of the inputs to be the output, based on a control input
Let's build our ALU using a MUX:
[Diagram: multiplexor with data inputs A (select = 0) and B (select = 1), select line S, and output C]
The Multiplexor
note: we call this a 2-input mux even though it has 3 inputs!
78
Desirable features:
Do not want too many inputs to a single gate
Do not want to have to go through too many gates
Let's look at a 1-bit ALU for addition:
How could we build a 1-bit ALU for add, and, and or?
How could we build a 32-bit ALU?
Different Implementations
cout = a·b + a·cin + b·cin
sum = a xor b xor cin
[Diagram: 1-bit full adder with inputs a, b, CarryIn and outputs Sum, CarryOut]
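The sum-of-products equations on this slide (cout = a·b + a·cin + b·cin, sum = a xor b xor cin) translate line-for-line into a 1-bit full adder; a sketch, checked exhaustively against integer addition:

```python
def full_adder(a, b, cin):
    cout = (a & b) | (a & cin) | (b & cin)  # carry out, sum-of-products form
    s = a ^ b ^ cin                         # sum bit
    return s, cout

# exhaustively compare against ordinary addition
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin  # the pair encodes the 2-bit sum
print("full adder agrees with a + b + cin on all 8 cases")
```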
79
Building a 32 bit ALU
[Diagram: a 1-bit ALU (inputs a, b, CarryIn and an Operation select; a multiplexor chooses among AND, OR, and add; outputs Result and CarryOut), replicated 32 times as ALU0..ALU31 with each CarryOut feeding the next stage's CarryIn, producing Result0..Result31]
80
What about subtraction (a - b)?
Two's complement approach: just negate b and add.
How do we negate? A very clever solution:
[Diagram: the 1-bit ALU extended with a Binvert control; a multiplexor selects either b or its complement as the adder input]
81
Need to support the set-on-less-than instruction (slt)
remember: slt is an arithmetic instruction
produces a 1 if rs < rt and 0 otherwise
use subtraction: (a - b) < 0 implies a < b
Need to support test for equality (beq $t5, $t6, $t7)
use subtraction: (a - b) = 0 implies a = b
Adding more Operations to ALU
82
Supporting slt
Can we figure out the idea?
[Diagram: (a) a 1-bit ALU with Binvert, CarryIn, an Operation select, and an extra Less input as multiplexor option 3; (b) the ALU for the most significant bit, which additionally produces a Set output and contains the overflow detection logic]
83
[Diagram: the 32-bit ALU built from ALU0..ALU31, with the Set output of ALU31 fed back to the Less input of ALU0; Binvert, CarryIn, and Operation control lines; Result0..Result31 and an Overflow output]
84
Test for equality
Notice control lines:
000 = and
001 = or
010 = add
110 = subtract
111 = slt
Note: zero is a 1 when the result is zero!
[Diagram: the final 32-bit ALU with a combined Bnegate control, Operation lines, Result0..Result31, a Zero output, Overflow, and the Set output of ALU31 feeding the Less input of ALU0]
85
Recap
We can build an ALU to support the MIPS instruction set
key idea: use a multiplexor to select the output we want
we can efficiently perform subtraction using two's complement
we can replicate a 1-bit ALU to produce a 32-bit ALU
Important points about hardware
all of the gates are always working
the speed of a gate is affected by the number of inputs to the gate
the speed of a circuit is affected by the number of gates in series (on the critical path, or the deepest level of logic)
Note
Clever changes to organization can improve performance (similar to using better algorithms in software)
we'll look at two examples, for addition and multiplication
86
Is a 32-bit ALU as fast as a 1-bit ALU?
There is a sequential dependence between the stages of the 32 bit ALU
Fast carry: compute all carries in parallel
c1 = b0·c0 + a0·c0 + a0·b0
c2 = b1·c1 + a1·c1 + a1·b1  (substitute c1 to expand c2)
c3 = b2·c2 + a2·c2 + a2·b2  (substitute c2 to expand c3)
c4 = b3·c3 + a3·c3 + a3·b3  (substitute c3 to expand c4)
Not feasible! Why?
fully expanding each carry gives a large hardware requirement
Problem: the ripple carry adder is slow
87
An approach in-between our two extremes
Motivation: if we didn't know the value of carry-in, what could we do?
When would we always generate a carry? gi = ai·bi
When would we propagate the carry? pi = ai + bi
ci+1 = gi + pi·ci
When gi is 1: ci+1 = gi + pi·ci = 1 + pi·ci = 1, so the adder generates ci+1 independent of ci
When gi = 0 and pi = 1: ci+1 = 0 + 1·ci = ci, so the adder propagates
Did we get rid of the ripple?
c1 = g0 + p0·c0
c2 = g1 + p1·c1 = g1 + p1·g0 + p1·p0·c0
c3 = g2 + p2·c2 = g2 + p2·g1 + p2·p1·g0 + p2·p1·p0·c0
Feasible! Why?
Can use generate and propagate for larger building blocks: a 4-bit adder
Carry-Lookahead adder
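The generate/propagate recurrence above (g = a·b, p = a + b, c[i+1] = g[i] + p[i]·c[i]) can be unrolled for a 4-bit slice; a sketch:

```python
def carries(a_bits, b_bits, c0, n=4):
    g = [(a_bits >> i & 1) & (b_bits >> i & 1) for i in range(n)]  # generate
    p = [(a_bits >> i & 1) | (b_bits >> i & 1) for i in range(n)]  # propagate
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))  # c[i+1] = g[i] + p[i]·c[i]
    return c  # carries c0..c4

print(carries(0b1011, 0b0110, 0))
```

For a = 1011 and b = 0110 this yields carries [0, 0, 1, 1, 1]; the carry out of 1 matches 11 + 6 = 17 overflowing 4 bits.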
88
Four 4-bit adders combined to make a 16 bit adder
Carries come from the carry-lookahead unit
The carry-lookahead adder is faster because the carry generation and propagation logic starts working the moment the clock cycle begins;
the carry goes through a smaller number of gates
Typically this 16-bit adder is 6 times faster than a ripple carry adder
Build bigger adders
[Figure: 16-bit adder from four 4-bit ALUs producing Result0-3, Result4-7, Result8-11 and Result12-15; each block supplies Pi and Gi to a carry-lookahead unit that produces the block carries C1-C4]
89
More complicated than addition; accomplished via shifting and addition
More time and more area
Simplest scheme (paper-and-pencil):
    0010  (multiplicand)
  x 1011  (multiplier)
  ------
    0010
   0010
  0000
 0010
 -------
 0010110  (product: 2 x 11 = 22)
Negative numbers: convert to positive, multiply, then convert the result back
Multiplication
90
Multiplication implementation
Start
1. Test Multiplier0
   Multiplier0 = 1: 1a. Add multiplicand to product and place the result in the Product register
   Multiplier0 = 0: skip the add
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): Done
91
Multiplication: Implementation
[Figure: first version hardware: 64-bit Multiplicand register (shift left), 64-bit ALU, 64-bit Product register (write), 32-bit Multiplier register (shift right), control test]
92
2nd Version
Start
1. Test Multiplier0
   Multiplier0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
2. Shift the Product register right 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition? No (< 32 repetitions): go back to step 1. Yes (32 repetitions): Done
93
Second Version
[Figure: second version hardware: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), 32-bit Multiplier register (shift right), control test]
94
Final Version
[Figure: final version hardware: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), control test; the multiplier starts in the right half of the Product register]
Start
1. Test Product0
   Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
2. Shift the Product register right 1 bit
32nd repetition? No: go back to step 1. Yes: Done
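The final-version algorithm is easy to model in software (a sketch using Python integers in place of registers; names are mine):

```python
def multiply(multiplicand, multiplier, n=32):
    """Final-version shift-add multiply: the multiplier starts in the
    low half of the 2n-bit product register; each step tests Product0,
    conditionally adds the multiplicand to the upper half, then shifts
    the whole product register right one bit."""
    product = multiplier & ((1 << n) - 1)   # lower half holds the multiplier
    for _ in range(n):
        if product & 1:                     # step 1: test Product0
            product += multiplicand << n    # step 1a: add to the left half
        product >>= 1                       # step 2: shift right 1 bit
    return product
```

After n iterations the multiplier bits have all been shifted out and the register holds the full 2n-bit product.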
95
Efficient Multiplication: Booth's Multiplication
Motivation
Use of addition and subtraction permits product computation in a variety of ways
E.g. 2 x 6: 0010 x 0110
6 = -2 + 8, i.e. 0110 = -0010 + 1000
2x6 = -(2x2) + (2x8) = -4 + 16 = 12
We can replace a string of 1s in the multiplier with an initial subtract when we see the first 1 and a later add for the bit after the last 1
Goal: reduce the number of additions (subtractions)
96
Booth's Algorithm
Works with signed integers in two's complement form
Looks at two bits at a time, scanning from right to left
Steps
1. Depending on the current and previous bits, do:
00 : Middle of a string of 0s, so no arithmetic operation
01 : End of a string of 1s, so add the multiplicand to the left half of the product
10 : Beginning of a string of 1s, so subtract the multiplicand from the left half of the product
11 : Middle of a string of 1s, so no arithmetic operation
Start with an imaginary 0 to the right of the rightmost bit for the first stage
2. Shift the product register right 1 bit
Simulated Example
97
Booth's Algorithm: Example
   10011100  (-100, multiplicand)
 x 01100011  (  99, multiplier)
 -----------------------------
   00000000 00000000
 - 11111111 10011100   (10 pair: subtract M at position 0)
 = 00000000 01100100
 + 11111110 01110000   (01 pair: add M at position 2)
 = 11111110 11010100
 - 11110011 10000000   (10 pair: subtract M at position 5)
 = 00001011 01010100
 + 11001110 00000000   (01 pair: add M at position 7)
 = 11011001 01010100   (-9900)
Note that the multiplicand and multiplier are 8-bit two's complement numbers, but the result is understood as a 16-bit two's complement number. Be careful about the proper alignment of the columns: a 10 pair causes a subtraction aligned with the 1, and a 01 pair causes an addition aligned with the 0; in both cases the operation aligns with the bit on the left of the pair. The algorithm starts with the 0th bit; we should assume there is a (-1)th bit having value 0.
98
Booth's Algorithm: Hardware
The hardware consists of a 32-bit register M for the multiplicand, a 64-bit product register P, a 1-bit register C, a 32-bit ALU, and control.
Initially, M contains the multiplicand, P contains the multiplier in its lower half (the upper half Ph = 0), and C contains bit 0. The algorithm is the following, repeated 32 times:
1. If the (P0, C) pair is:
10: Ph = Ph - M
01: Ph = Ph + M
00: do nothing
11: do nothing
2. Arithmetic shift P right 1 bit. The shifted-out bit goes into C.
Arithmetic shift preserves the sign of a two's complement number; thus shift right arithmetic (sra):
0100...111 -> 00100...11
1100...111 -> 11100...11
Shift right arithmetic performed on P is equivalent to shifting the multiplicand left with sign extension.
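The hardware description above can be modeled directly (a toy sketch of the P/C registers using masked Python integers; names are mine):

```python
def booth_multiply(m, q, n=8):
    """Booth's algorithm on n-bit two's-complement operands, modeling a
    2n-bit product register P and the 1-bit register C."""
    mask_n = (1 << n) - 1
    mask_2n = (1 << 2 * n) - 1
    M = (m & mask_n) << n              # multiplicand aligned with upper half Ph
    p = q & mask_n                     # lower half holds the multiplier; Ph = 0
    c = 0                              # imaginary bit to the right of bit 0
    for _ in range(n):
        pair = ((p & 1) << 1) | c
        if pair == 0b10:               # beginning of a run of 1s: Ph -= M
            p = (p - M) & mask_2n
        elif pair == 0b01:             # end of a run of 1s: Ph += M
            p = (p + M) & mask_2n
        c = p & 1                      # shifted-out bit goes into C
        sign = p >> (2 * n - 1)
        p = (p >> 1) | (sign << (2 * n - 1))   # arithmetic shift right
    return p - (1 << 2 * n) if p >> (2 * n - 1) else p
```

Because M's low n bits are zero, adding or subtracting M modulo 2^(2n) touches only the upper half Ph, exactly as in the hardware.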
99
Floating Point Numbers
We need a way to represent
numbers with fractions, e.g., 3.1416
very small numbers, e.g., .000000001
very large numbers, e.g., 3.15576 x 10^9
Representation: sign, exponent, significand: (-1)^sign x significand x 2^exponent
more bits for significand gives more accuracy
more bits for exponent increases range
IEEE 754 floating point standard:
single precision: 8-bit exponent, 23-bit significand
double precision: 11-bit exponent, 52-bit significand
100
IEEE 754 floating-point standard
Leading 1 bit of significand is implicit
Exponent is biased to make sorting easier
all 0s is the smallest exponent, all 1s is the largest
bias of 127 for single precision and 1023 for double precision
summary: (-1)^sign x (1 + significand) x 2^(exponent - bias)
Example:
decimal: -.75 = -3/4 = -3/2^2
binary: -11/2^2 = -.11 = -1.1 x 2^-1
floating point: exponent = -1 + 127 = 126 = 01111110
IEEE single precision: 1 01111110 10000000000000000000000
Representation of zero: all zero bits in the exponent is reserved and used for indicating zero.
A pattern of all 1 bits in the exponent indicates values and situations outside the scope of normal representation (infinity, NaN)
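The -0.75 example can be checked mechanically by pulling the bit pattern out of a real single-precision value (a sketch; helper names are mine):

```python
import struct

def float_bits(x):
    """IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack('>I', struct.pack('>f', x))[0]

def fields(x):
    """Split the pattern into (sign, biased exponent, significand)."""
    b = float_bits(x)
    return b >> 31, (b >> 23) & 0xFF, b & 0x7FFFFF
```

For -0.75 this yields sign 1, biased exponent 126, and a significand whose only set bit is the leading fraction bit (the implicit 1 is not stored).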
101
Floating Point Addition
Align the binary point of the number with the smaller exponent by shifting its significand to the right (so that the smaller exponent matches the larger one)
Add the significands
Normalize the result and adjust the exponent accordingly (shifting right and incrementing the exponent, or shifting left and decrementing the exponent)
Generate an exception in case of underflow or overflow
If necessary, round (or truncate) the significand
102
Floating Point Multiplication
Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent
Multiply the significands
Normalize the product if necessary by shifting right and incrementing the exponent
Round the significand
Set the sign of the product correctly
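The multiply steps can be followed on (sign, exponent, significand) triples; the sketch below is a toy model that uses Python floats for the significand arithmetic, not a bit-exact implementation:

```python
import math

def fp_mul(a, b):
    """Multiply via the steps above: add exponents, multiply
    significands, renormalize, set the sign."""
    def decompose(x):
        frac, exp = math.frexp(abs(x))        # frac is in [0.5, 1)
        return (x < 0), exp - 1, frac * 2.0   # significand in [1, 2)
    sa, ea, fa = decompose(a)
    sb, eb, fb = decompose(b)
    exp = ea + eb                             # add the exponents
    sig = fa * fb                             # multiply the significands
    if sig >= 2.0:                            # normalize: shift right, bump exponent
        sig, exp = sig / 2.0, exp + 1
    sign = -1.0 if sa != sb else 1.0          # set the sign of the product
    return sign * sig * 2.0 ** exp
```

Since both significands lie in [1, 2), their product lies in [1, 4), so at most one normalization shift is ever needed.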
103
Floating Point instructions in MIPS
Addition, subtraction, multiplication, division, comparison; single and double precision
Separate floating-point registers: $f0, $f1, $f2, ... and separate load and store instructions for floating-point registers
Registers used either as single or double precision; a double-precision register is really an even-odd pair of single-precision registers, using the even register's number as its name
104
Accurate Arithmetic?
Floating point numbers, unlike integers, are approximations
Between 0 and 1 there is an infinite number of real numbers, of which only 2^53 can be exactly represented in double-precision form
Rounding provides the mechanism for the desired approximation
Extra bits are required because if every intermediate result had to be truncated to the exact number of digits, there would be no opportunity to round
IEEE 754 keeps 2 extra bits on the right during intermediate calculations, called guard and round
A DECIMAL EXAMPLE:
with a 3-significant-digit significand and the 2 extra digits (round and guard):
2.56 x 10^0 + 2.34 x 10^2
Alignment: 2.3400 + 0.0256 (5 in guard, 6 in round)
Result: 2.3656 x 10^2, after rounding 2.37 x 10^2
Without guard or round digits: 2.34 + 0.02 = 2.36 x 10^2
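The decimal example can be replayed with a toy fixed-point model (a sketch under the stated assumptions; the function name and round-half-up choice are mine):

```python
def fp_add_decimal(a_sig, a_exp, b_sig, b_exp, digits=3, extra=2):
    """Toy decimal floating-point add: significands such as 2.56 carry
    `digits` significant digits; `extra` guard digits are kept while
    aligning, then the sum is rounded back to `digits` digits."""
    if a_exp < b_exp:                    # make a the larger-exponent operand
        a_sig, a_exp, b_sig, b_exp = b_sig, b_exp, a_sig, a_exp
    shift = a_exp - b_exp
    scale = 10 ** (digits - 1 + extra)   # fixed-point scale incl. guard digits
    a_fix = round(a_sig * scale)
    b_fix = round(b_sig * scale) // 10 ** shift   # align; digits beyond the
                                                  # guard positions fall off
    total = a_fix + b_fix
    q = 10 ** extra                      # drop the guard digits, rounding half up
    result = (total + q // 2) // q / 10 ** (digits - 1)
    return result, a_exp
```

With two guard digits the 0.0256 survives alignment and the sum rounds to 2.37 x 10^2; with none, it is truncated to 0.02 and the answer is 2.36 x 10^2.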
105
Floating Point Complexities: Summary
Operations are somewhat more complicated
In addition to overflow we can have underflow
Accuracy can be a big problem
IEEE 754 keeps two extra bits, guard and round
four rounding modes
positive divided by zero yields infinity
zero divided by zero yields "not a number" (NaN)
other complexities
Implementing the standard can be tricky
Not using the standard can be even worse
106
107
108
Through implementation of a simplified version of MIPS
Simplified to contain only:
memory-reference instructions: lw, sw
arithmetic-logical instructions: add, sub, and, or, slt
control flow instructions: beq, j
Generic implementation:
use the program counter (PC) to supply the instruction address
get the instruction from memory
read registers
use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
The Processor: Datapath & Control
109
Conceptual View of the Processor
[Figure: conceptual view of the processor: the PC addresses instruction memory; the fetched instruction supplies register numbers to the register file, whose outputs feed the ALU and data memory, with results written back]
Two types of functional units:
elements that operate on data values (combinational)
elements that contain state (sequential)
110
Unclocked vs. Clocked
Clocks used in synchronous logic
when should an element that contains state be updated?
cycle time
rising edge
falling edge
Recap: State Elements
111
The set-reset latch output depends on present inputs and also on past inputs
An un-clocked state element
112
Output is equal to the stored value inside the element
Change of state (value) is based on the clock
Latches: state changes whenever the inputs change and the clock is asserted
Flip-flop: state changes only on a clock edge (edge-triggered methodology)
"Logically true" could mean electrically low
A clocking methodology defines when signals can be read and written; we wouldn't want to read a signal at the same time it was being written
Latches and Flip-flops
113
Two inputs:
the data value to be stored (D)
the clock signal (C) indicating when to read and store D
Two outputs:
the value of the internal state (Q) and its complement
D-latch
[Figure: D-latch gate-level schematic and its timing waveform for C, D and Q]
114
D flip-flop
Output changes only on the clock edge
[Figure: D flip-flop built from two D latches in a master-slave arrangement, with its timing waveform for C, D and Q]
115
Our Implementation
An edge-triggered methodology
Typical execution:
read contents of some state elements,
send values through some combinational logic,
write results to one or more state elements
[Figure: state element 1 feeds combinational logic, which feeds state element 2; one transfer per clock cycle]
116
Built using D flip-flops
Register File
[Figure: register file read ports: two read-register numbers select among registers 0 to n-1 through multiplexors to produce Read data 1 and Read data 2; the write port takes Write register, Write data and a Write enable]
117
Register File
Use the real clock to determine when to write
[Figure: register file write port: an n-to-1 decoder of the register number, gated with the Write signal and the clock, drives the C input of each register 0 to n-1; Register data feeds every D input]
118
Building the Datapath
Use multiplexors to stitch functional components together
[Figure: single-cycle datapath: PC and instruction memory, register file, sign-extend and shift-left-2 units, ALU, data memory, two adders for PC+4 and the branch target, and multiplexors controlled by RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg, PCSrc and a 3-bit ALU operation]
119
Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Decode information that comes from the 32 bits of the instruction
Example: add $8, $17, $18
Instruction format:
000000 10001 10010 01000 00000 100000
  op     rs    rt    rd  shamt funct
ALU's operation is based on instruction type and function code
120
What should the ALU do with this instruction?
Example: lw $1, 100($2)
 35   2   1      100
 op   rs  rt  16-bit offset
ALU control input:
000 AND
001 OR
010 add
110 subtract
111 set-on-less-than
Why is the code for subtract 110 and not 011?
Control
121
Must describe hardware to compute the 3-bit ALU control input given the instruction type:
00 = lw, sw
01 = beq
11 = arithmetic
and, for arithmetic, the function code
Describe it using a truth table (which can be turned into gates):
ALUOp is computed from the instruction type

ALUOp1 ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
  0      0    |  X  X  X  X  X  X |    010
  X      1    |  X  X  X  X  X  X |    110
  1      X    |  X  X  0  0  0  0 |    010
  1      X    |  X  X  0  0  1  0 |    110
  1      X    |  X  X  0  1  0  0 |    000
  1      X    |  X  X  0  1  0  1 |    001
  1      X    |  X  X  1  0  1  0 |    111
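The ALU-control truth table can be written directly as a small lookup function (a sketch, not the gate-level implementation; the helper name is mine):

```python
def alu_control(alu_op, funct):
    """Derive the 3-bit ALU control from the 2-bit ALUOp and the 6-bit
    function field (the funct field matters only for R-type)."""
    if alu_op == 0b00:            # lw/sw: address calculation
        return 0b010              # add
    if alu_op & 0b01:             # beq: compare via subtraction
        return 0b110              # subtract
    # ALUOp1 = 1: R-type, decode the low four funct bits
    return {0b0000: 0b010,        # add
            0b0010: 0b110,        # subtract
            0b0100: 0b000,        # AND
            0b0101: 0b001,        # OR
            0b1010: 0b111}[funct & 0b1111]   # set-on-less-than
```

Only the low four bits of the function field are examined, matching the don't-cares on F5 and F4 in the table.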
122
Control
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   X    |   1    |    X     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   X    |   0    |    X     |    0     |    0    |    0     |   1    |   0    |   1
123
[Figure: single-cycle datapath with control: Instruction[31-26] feeds the Control unit, which generates RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; Instruction[25-21] and Instruction[20-16] address the register file, Instruction[15-11] is the alternative write-register number, Instruction[15-0] is sign-extended, and Instruction[5-0] feeds the ALU control]
124
Control
Simple combinational logic (truth tables)
[Figure: the ALU control block (inputs ALUOp and F5-F0, outputs Operation2-0) and the main control as simple combinational logic decoding Op5-Op0 into RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1 and ALUOp0 for R-format, lw, sw and beq]
125
All of the logic is combinational
We wait for everything to settle down and for the right thing to be done
the ALU might not produce the right answer right away
we use write signals along with the clock to determine when to write
Cycle time is determined by the length of the longest path
Our Simple Control Structure
We are ignoring some details like setup and hold times
126
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
memory (2 ns), ALU and adders (2 ns), register file access (1 ns)
[Figure: the complete single-cycle datapath with control signals, used for the cycle-time calculation]
127
Analysis
Single-cycle problems:
what if we had a more complicated instruction like floating point?
wasteful of area: repetition of functional units if they are needed more than once in an instruction
One solution:
use a smaller cycle time
have different instructions take different numbers of cycles
a multicycle datapath:
128
Multi-Cycle Data Path
[Figure: high-level multicycle datapath: one Memory for instructions or data, Instruction register, Memory data register, register file, a single ALU, and internal registers A, B and ALUOut]
129
We will be reusing functional units:
the ALU is used to compute addresses and to increment the PC
the Memory is used for both instructions and data
Our control signals will not be determined solely by the instruction
e.g., what should the ALU do for a subtract instruction?
We'll use a finite state machine for control
Multicycle Approach
130
Finite state machines:
a set of states and
a next-state function (determined by current state and the input)
an output function (determined by current state and possibly input)
We'll use a Moore machine (output based only on current state)
Review: finite state machines
[Figure: FSM block diagram: inputs and the current state feed the next-state function; the output function produces the outputs; a clocked state register holds the current state]
131
Break up the instructions into steps, each step taking one cycle
balance the amount of work to be done
restrict each cycle to use only one major functional unit
At the end of a cycle
store values for use in later cycles (easiest thing to do)
introduce additional internal registers
Multicycle Approach
132
Multi-Cycle Path
[Figure: detailed multicycle datapath: PC, single Memory, Instruction register, Memory data register, register file, sign-extend and shift-left-2 units, and one ALU with internal registers A, B and ALUOut; multiplexors select PC or ALUOut as the memory address, PC or A as the first ALU input, and B, 4, the sign-extended immediate or the shifted immediate as the second ALU input]
133
Instruction Fetch
Instruction Decode and Register Fetch
Execution, Memory Address Computation, or Branch Completion
Memory Access or R-type instruction completion
Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Five Execution Steps
134
Use the PC to get the instruction and put it in the Instruction Register.
Increment the PC by 4 and put the result back in the PC.
Can be described succinctly using RTL ("Register-Transfer Language"):
IR = Memory[PC];
PC = PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 1: Instruction Fetch
135
Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a branch
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
137
Loads and stores access memory:
MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;
R-type instructions finish:
Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle, on the clock edge
Step 4 (R-type or memory-access)
138
Reg[IR[20-16]]= MDR;
What about all the other instructions?
Write-back step
139
Illustrations
Implementation of instructions in the multicycle design: add, beq, j
140
Summary:

Step name | Action
Instruction fetch (all classes): IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch (all classes): A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
[Table truncated in the source: the execution, memory-access and write-back rows differ for R-type, memory-reference, branch and jump instructions]
141
How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label   # assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
Simple Questions
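With the cycle counts implied by the five execution steps (lw = 5; sw and R-type = 4; beq and j = 3), a straight-line count is easy to automate (a sketch under those assumptions; names are mine):

```python
# Cycles per instruction class in the multicycle design, from the
# five-step breakdown: beq/j finish after 3 steps, R-type and sw after 4,
# lw needs all 5.
CYCLES = {"lw": 5, "sw": 4, "r-type": 4, "beq": 3, "j": 3}

def total_cycles(instructions):
    """Total cycles for a straight-line sequence of instruction classes."""
    return sum(CYCLES[i] for i in instructions)

code = ["lw", "lw", "beq", "r-type", "sw"]   # the fragment above, branch not taken
```

Summing 5 + 5 + 3 + 4 + 4 gives 21 cycles for the fragment.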
142
Value of control signals is dependent upon:
what instruction is being executed
which step is being performed
Use the information we've accumulated to specify a finite state machine
specify the finite state machine graphically, or
use microprogramming
Implementation can be derived from the specification
Implementing the Control
143
How many state bits will we need?
FSM
[Figure: ten-state finite state diagram (states 0-9) for the multicycle control: state 0 instruction fetch (MemRead, IorD=0, IRWrite, ALUSrcA=0, ALUSrcB=01, ALUOp=00, PCWrite, PCSource=00); state 1 instruction decode/register fetch (ALUSrcA=0, ALUSrcB=11, ALUOp=00); then memory address computation for lw/sw (ALUSrcA=1, ALUSrcB=10, ALUOp=00), execution for R-type (ALUSrcA=1, ALUSrcB=00, ALUOp=10), branch completion for beq (ALUSrcA=1, ALUSrcB=00, ALUOp=01, PCWriteCond, PCSource=01) and jump completion (PCWrite, PCSource=10); followed by memory access (MemRead or MemWrite with IorD=1), R-type completion (RegDst=1, RegWrite, MemtoReg=0) and the write-back step (RegDst=0, RegWrite, MemtoReg=1)]
144
Implementation:
Finite State Machine for Control
[Figure: control logic block: inputs are the opcode field Op5-Op0 from the instruction register and the current state S3-S0 from the state register; outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst) and the next-state bits NS3-NS0]
145
PLA Implementation
[Figure: PLA implementation of the control function: an AND plane over inputs Op5-Op0 and S3-S0 feeds an OR plane producing IorD, IRWrite, MemRead, MemWrite, PCWrite, PCWriteCond, MemtoReg, PCSource1-0, ALUOp1-0, ALUSrcB1-0, ALUSrcA, RegWrite, RegDst and the next-state bits NS3-NS0]
146
ROM = "Read-Only Memory"
values of memory locations are fixed ahead of time
A ROM can be used to implement a truth table
if the address is m bits, we can address 2^m entries in the ROM
our outputs are the n bits of data that the address points to
ROM Implementation

Example with m = 3 address bits and n = 4 data bits:
000 -> 0011
001 -> 1100
010 -> 1100
011 -> 1000
100 -> 0000
101 -> 0001
110 -> 0110
111 -> 0111
147
How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs
ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
Rather wasteful, since for lots of the entries the outputs are the same
i.e., the opcode is often ignored
ROM Implementation
148
Break up the table into two parts:
4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
10 bits tell you the 4 next-state bits: 2^10 x 4 bits of ROM
Total: 4.3K bits of ROM
PLA is much smaller
can share product terms
only needs entries that produce an active output
can take into account don't-cares
Size is (#inputs x #product-terms) + (#outputs x #product-terms)
For this example, with 17 product terms: (10 x 17) + (20 x 17) = 510 PLA cells
A PLA cell is usually about the size of a ROM cell (slightly bigger)
ROM vs PLA
149
Complex instructions: the "next state" is often current state + 1
Another Implementation Style
[Figure: sequencer-based control unit: a PLA or ROM produces the datapath control outputs; address select logic chooses the next state from an incrementer (state + 1) or from a dispatch on the instruction register's opcode field, under control of AddrCtl]
150
Details
Dispatch ROM 1:
Op = 000000 (R-format) -> 0110
Op = 000010 (jmp)      -> 1001
Op = 000100 (beq)      -> 1000
Op = 100011 (lw)       -> 0010
Op = 101011 (sw)       -> 0010

Dispatch ROM 2:
Op = 100011 (lw) -> 0011
Op = 101011 (sw) -> 0101

State number | Address-control action    | Value of AddrCtl
0            | Use incremented state     | 3
1            | Use dispatch ROM 1        | 1
2            | Use dispatch ROM 2        | 2
3            | Use incremented state     | 3
4            | Replace state number by 0 | 0
5            | Replace state number by 0 | 0
6            | Use incremented state     | 3
7            | Replace state number by 0 | 0
8            | Replace state number by 0 | 0
9            | Replace state number by 0 | 0

[Figure: address select logic: a multiplexor controlled by AddrCtl chooses among the incremented state, dispatch ROM 1, dispatch ROM 2, and state 0]
151
Microprogramming
What are the microinstructions?
[Figure: microprogrammed control unit: a microcode memory produces the datapath control outputs; a microprogram counter with address select logic (an incrementer plus dispatch ROMs on the opcode field, selected by AddrCtl) chooses the next microinstruction]
152
A specification methodology
appropriate if there are hundreds of opcodes, modes, cycles, etc.
signals specified symbolically using microinstructions
Will two implementations of the same architecture have the same microcode?
What would a microassembler do?
Microprogramming

Label    | ALU control | SRC1 | SRC2    | Register control | Memory    | PCWrite control | Sequencing
Fetch    | Add         | PC   | 4       |                  | Read PC   | ALU             | Seq
         | Add         | PC   | Extshft | Read             |           |                 | Dispatch 1
Mem1     | Add         | A    | Extend  |                  |           |                 | Dispatch 2
LW2      |             |      |         |                  | Read ALU  |                 | Seq
         |             |      |         | Write MDR        |           |                 | Fetch
SW2      |             |      |         |                  | Write ALU |                 | Fetch
Rformat1 | Func code   | A    | B       |                  |           |                 | Seq
         |             |      |         | Write ALU        |           |                 | Fetch
BEQ1     | Subt        | A    | B       |                  |           | ALUOut-cond     | Fetch
JUMP1    |             |      |         |                  |           | Jump address    | Fetch
153
Microinstruction format
Field name | Value | Signals active | Comment

ALU control:
  Add       | ALUOp = 00 | Cause the ALU to add.
  Subt      | ALUOp = 01 | Cause the ALU to subtract; this implements the compare for branches.
  Func code | ALUOp = 10 | Use the instruction's function code to determine ALU control.
SRC1:
  PC | ALUSrcA = 0 | Use the PC as the first ALU input.
  A  | ALUSrcA = 1 | Register A is the first ALU input.
SRC2:
  B       | ALUSrcB = 00 | Register B is the second ALU input.
  4       | ALUSrcB = 01 | Use 4 as the second ALU input.
  Extend  | ALUSrcB = 10 | Use the output of the sign extension unit as the second ALU input.
  Extshft | ALUSrcB = 11 | Use the output of the shift-by-two unit as the second ALU input.
Register control:
  Read      | | Read two registers using the rs and rt fields of the IR as the register numbers, putting the data into registers A and B.
  Write ALU | RegWrite, RegDst = 1, MemtoReg = 0 | Write a register using the rd field of the IR as the register number and the contents of ALUOut as the data.
  Write MDR | RegWrite, RegDst = 0, MemtoReg = 1 | Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
Memory:
  Read PC   | MemRead, IorD = 0 | Read memory using the PC as address; write the result into the IR (and the MDR).
  Read ALU  | MemRead, IorD = 1 | Read memory using ALUOut as address; write the result into the MDR.
  Write ALU | MemWrite, IorD = 1 | Write memory using ALUOut as address and the contents of B as the data.
PC write control:
  ALU          | PCSource = 00, PCWrite | Write the output of the ALU into the PC.
  ALUOut-cond  | PCSource = 01, PCWriteCond | If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
  Jump address | PCSource = 10, PCWrite | Write the PC with the jump address from the instruction.
Sequencing:
  Seq        | AddrCtl = 11 | Choose the next microinstruction sequentially.
  Fetch      | AddrCtl = 00 | Go to the first microinstruction to begin a new instruction.
  Dispatch 1 | AddrCtl = 01 | Dispatch using ROM 1.
  Dispatch 2 | AddrCtl = 10 | Dispatch using ROM 2.
154
No encoding:
1 bit for each datapath operation
faster, requires more memory (logic)
used for the VAX 780: an astonishing 400K of memory!
Lots of encoding:
send the microinstructions through logic to get control signals
uses less memory, slower
Historical context of CISC:
too much logic to put on a single chip with everything else
use a ROM (or even RAM) to hold the microcode
it's easy to add new instructions
Maximally vs. Minimally Encoded
155
Microcode: Trade-offs
Distinction between specification and implementation is sometimes blurred
Specification advantages:
easy to design and write
design architecture and microcode in parallel
Implementation (off-chip ROM) advantages:
easy to change since values are in memory
can emulate other architectures
can make use of internal registers
Implementation disadvantages, SLOWER now that:
control is implemented on the same chip as the processor
ROM is no longer faster than RAM
there is no need to go back and make changes
156
The Big Picture
Initial representation:   finite state diagram | microprogram
Sequencing control:       explicit next-state function | microprogram counter + dispatch ROMs
Logic representation:     logic equations | truth tables
Implementation technique: programmable logic array | read-only memory
157
158
SRAM:
value is stored on a pair of inverting gates
very fast but takes up more space than DRAM (4 to 6 transistors)
DRAM:
value is stored as a charge on a capacitor (must be refreshed)
very small but slower than SRAM (factor of 5 to 10)
Memories: Review
[Figure: SRAM cell as cross-coupled inverters on bit lines A and B; DRAM cell with word line, pass transistor, capacitor and bit line]
159
Users want large and fast memories!
SRAM access times are 2-25 ns at a cost of $100 to $250 per MByte.
DRAM access times are 60-120 ns at a cost of $5 to $10 per MByte.
Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per MByte.
(1997 figures)
Try and give it to them anyway:
build a memory hierarchy
Exploiting Memory Hierarchy
[Figure: memory hierarchy pyramid, Level 1 through Level n: access time increases with distance from the CPU while the size of the memory at each level grows]
160
Locality
A principle that makes having a memory hierarchy a good idea
If an item is referenced,
temporal locality: it will tend to be referenced again soon
spatial locality: nearby items will tend to be referenced soon.
Why does code have locality?
Our initial focus: two levels (upper, lower)
block: minimum unit of data
hit: data requested is in the upper level
miss: data requested is not in the upper level
161
Caches, Memory and Processor
[Figure: CPU connected through a cache controller and cache to main memory, with address and data paths at each interface]
162
Two issues:
How do we know if a data item is in the cache?
If it is, how do we find it?
Our first example:
block size is one word of data
"direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be
e.g., lots of items at the lower level share locations in the upper level
Cache
163
Cache operation
Many main memory locations are mapped onto one cache entry.
May have caches for:
instructions;
data;
data + instructions (unified).
Memory access time is no longer deterministic.
164
Terms
Cache hit: required location is in cache.
Cache miss: required location is not in cache.
Working set: set of locations used by a program in a time interval.
165
Types of misses
Compulsory (cold): location has never been accessed.
Capacity: working set is too large.
Conflict: multiple locations in the working set map to the same cache entry.
166
Memory system performance
h = cache hit rate.
t_cache = cache access time, t_main = main memory access time.
Average memory access time:
t_av = h * t_cache + (1 - h) * t_main
167
Multiple levels of cache
CPU L1 cache L2 cache
168
Multi-level cache access t ime
h1 = L1 cache hit rate.
h2 = rate of accesses that miss in L1 but hit in L2.
Average memory access time:
t_av = h1 * t_L1 + h2 * t_L2 + (1 - h1 - h2) * t_main
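Both the single-level and two-level averages are one-liners to compute (a sketch; function names and the sample numbers are mine, with h2 taken as the fraction of all accesses that miss L1 but hit L2):

```python
def amat(h, t_cache, t_main):
    """Single-level average access time: h*t_cache + (1-h)*t_main."""
    return h * t_cache + (1 - h) * t_main

def amat2(h1, h2, t_l1, t_l2, t_main):
    """Two-level average access time: h1 is the L1 hit rate, h2 the
    fraction of all accesses that miss L1 but hit L2."""
    return h1 * t_l1 + h2 * t_l2 + (1 - h1 - h2) * t_main
```

For example, with a 90% L1 hit rate, an 8% L1-miss/L2-hit rate, and times of 2, 10 and 60 ns, the two-level average is 0.9*2 + 0.08*10 + 0.02*60 = 3.8 ns.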
169
Cache performance benefits
Keep frequently-accessed locations in the fast cache.
Cache retrieves more than one word at a time.
Sequential accesses are faster after first access.
170
Replacement policies
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
Random.
Least-recently used (LRU).
171
Write operations
Write-through: immediately copy the write to main memory.
Write-back: write to main memory only when the location is removed from the cache.
172
Cache organizations
Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).
Direct-mapped: each memory location maps onto exactly one cache entry.
N-way set-associative: each memory location can go into one of n sets.
173
Mapping: address is modulo the number of blocks in the cache
Direct Mapped Cache
[Figure: eight-block direct-mapped cache (indices 000-111): memory addresses with the same low three bits map to the same cache block, e.g. 00001, 01001, 10001 and 11001 all map to index 001, while 00101, 01101, 10101 and 11101 map to index 101]
174
Direct Mapped Cache
[Figure: 32-bit address split into 20-bit tag, 10-bit index, and 2-bit byte offset; a 1024-entry cache holds valid, tag, and data fields, and a tag compare produces the hit signal and 32-bit data]
175
Direct-mapped cache
[Figure: direct-mapped cache entry with valid bit, tag, and a multi-byte cache block; the address tag is compared against the stored tag to produce hit and value]
176
Taking advantage of spatial locality
Direct Mapped Cache
[Figure: direct-mapped cache with 4K entries and four-word (128-bit) blocks; 16-bit tag, 12-bit index, 2-bit block offset; a multiplexor selects one of the four 32-bit words]
177
Hits vs. Misses
Read hits: this is what we want!
Read misses:
stall the CPU, fetch block from memory, deliver to cache, restart
Write hits:
can replace data in cache and memory (write-through)
write the data only into the cache (write-back the cache later)
Write misses:
read the entire block into the cache, then write the word
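The write-policy trade-off can be illustrated with a toy one-block cache that counts trips to main memory (an illustrative sketch, not a hardware model):

```python
def memory_writes(write_addrs, policy):
    """Count main-memory writes for a sequence of word writes through a
    one-block toy cache, under 'through' or 'back' policy."""
    cached, dirty, writes = None, False, 0
    for addr in write_addrs:
        if addr != cached:                  # miss: bring in the new block
            if policy == "back" and dirty:
                writes += 1                 # evicted block copied back
            cached, dirty = addr, False
        if policy == "through":
            writes += 1                     # every write goes to memory
        else:
            dirty = True                    # defer until eviction
    if policy == "back" and dirty:
        writes += 1                         # final flush
    return writes

# three writes to the same block, then one to another block
print(memory_writes([0, 0, 0, 1], "through"))  # 4
print(memory_writes([0, 0, 0, 1], "back"))     # 2
```

Write-back wins when the same block is written repeatedly; write-through keeps memory always up to date.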
178
Hardware Issues
Make reading multiple words easier by using banks of memory.
It can get a lot more complicated...
[Figure: three memory organizations: (a) one-word-wide memory; (b) wide memory with a multiplexor at the cache; (c) four-way interleaved memory banks. Each connects CPU, cache, bus, and memory]
179
Performance
Increasing the block size tends to decrease miss rate:
Use split caches because there is more spatial locality in code:
[Figure: miss rate (0%–40%) vs. block size (4–256 bytes) for cache sizes 1 KB to 256 KB]
Program  Block size in words  Instruction miss rate  Data miss rate  Effective combined miss rate
gcc      1                    6.1%                   2.1%            5.4%
gcc      4                    2.0%                   1.7%            1.9%
spice    1                    1.2%                   1.3%            1.2%
spice    4                    0.3%                   0.6%            0.4%
180
Direct-mapped cache locations
Many locations map onto the same cache block.
Conflict misses are easy to generate:
Array a[] uses locations 0, 1, 2, ...
Array b[] uses locations 1024, 1025, 1026, ...
Operation a[i] + b[i] generates conflict misses.
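The a[i] + b[i] conflict can be reproduced with a toy direct-mapped simulator (one word per block, 1024 blocks assumed):

```python
def count_misses(trace, num_blocks=1024):
    """Miss count for a toy direct-mapped cache with one-word blocks."""
    tags = [None] * num_blocks
    misses = 0
    for addr in trace:
        index, tag = addr % num_blocks, addr // num_blocks
        if tags[index] != tag:   # miss: replace the resident block
            tags[index] = tag
            misses += 1
    return misses

# a[i] at address i, b[i] at 1024 + i: the two map to the same index,
# so each access evicts the other array's word
conflict_trace = [x for i in range(8) for x in (i, 1024 + i)]
print(count_misses(conflict_trace))          # 16: every access misses
print(count_misses(list(range(8)) * 2))      # 8: second pass all hits
```

The same 16 accesses would fit easily in the cache; the misses come purely from the mapping, which is why they are called conflict misses.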
181
Performance
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
Two ways of improving performance:
decreasing the miss ratio
decreasing the miss penalty
What happens if we increase block size?
182
Decreasing miss ratio with associativity
Compared to direct mapped, give a series of references that:
results in a lower miss ratio using a 2-way set associative cache
results in a higher miss ratio using a 2-way set associative cache
assuming we use the least recently used replacement strategy
[Figure: cache configurations for eight blocks: direct mapped (8 one-block entries), two-way set associative (4 sets), four-way set associative (2 sets), and eight-way set associative (fully associative)]
183
Set-associative cache
A set of direct-mapped caches:
[Figure: Set 1 through Set n searched in parallel; the matching set supplies hit and data]
184
An implementation
[Figure: four-way set-associative cache with 256 sets; 22-bit tag and 8-bit index; four parallel tag comparators feed a 4-to-1 multiplexor that produces hit and data]
185
Performance
[Figure: miss rate (0%–15%) vs. associativity (one-way to eight-way) for cache sizes 1 KB to 128 KB]
186
Example: direct-mapped vs. set-associative
address data
000 0101
001 1111
010 0000
011 0110
100 1000
101 0001
110 1010
111 0100
187
Direct-mapped cache behavior
After 001 access:
block  tag  data
00     -    -
01     0    1111
10     -    -
11     -    -
After 010 access:
block  tag  data
00     -    -
01     0    1111
10     0    0000
11     -    -
188
Direct-mapped cache behavior, contd.
After 011 access:
block  tag  data
00     -    -
01     0    1111
10     0    0000
11     0    0110
After 100 access:
block  tag  data
00     1    1000
01     0    1111
10     0    0000
11     0    0110
189
Direct-mapped cache behavior, contd.
After 101 access:
block  tag  data
00     1    1000
01     1    0001
10     0    0000
11     0    0110
After 111 access:
block  tag  data
00     1    1000
01     1    0001
10     0    0000
11     1    0100
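The trace can be replayed in a few lines; the final per-block tags match the last table (3-bit addresses, 4 blocks, so the low two bits are the index and the high bit is the tag):

```python
def final_tags(trace, index_bits=2):
    """Replay a trace on a small direct-mapped cache and return the
    final tag stored in each block (None if never filled)."""
    blocks = 1 << index_bits
    tags = [None] * blocks
    for addr in trace:
        tags[addr % blocks] = addr >> index_bits  # index = low bits
    return tags

# accesses from the slides: 001, 010, 011, 100, 101, 111
result = final_tags([0b001, 0b010, 0b011, 0b100, 0b101, 0b111])
print(result)  # tags for blocks 00..11 -> [1, 1, 0, 1]
```

Accesses 101 and 111 evict the earlier 001 and 011, since each pair shares a block index.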
190
2-way set-associative cache behavior
Final state of cache (twice as big as direct-mapped):
set  blk0 tag  blk0 data  blk1 tag  blk1 data
00   1         1000       -         -
01   0         1111       1         0001
10   0         0000       -         -
11   0         0110       1         0100
191
2-way set-associative cache behavior
Final state of cache (same size as direct-mapped):
set  blk0 tag  blk0 data  blk1 tag  blk1 data
0    01        0000       10        1000
1    10        0001       11        0100
192
Decreasing miss penalty with multilevel caches
Add a second level cache:
often primary cache is on the same chip as the processor
use SRAMs to add another cache above primary memory (DRAM)
miss penalty goes down if data is in 2nd level cache
Example:
CPI of 1.0 on a 500 MHz machine with a 5% miss rate, 200ns DRAM access
Adding 2nd level cache with 20ns access time decreases miss rate to 2%
Using multilevel caches:
try and optimize the hit time on the 1st level cache
try and optimize the miss rate on the 2nd level cache
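One common reading of the example's numbers (a 2 ns cycle gives a 100-cycle DRAM penalty and a 10-cycle L2 penalty; the 5% of L1 misses go to L2, and 2% of accesses still reach DRAM):

```python
cycle = 1 / 500e6                       # 500 MHz -> 2 ns cycle time
dram_penalty = round(200e-9 / cycle)    # 200 ns DRAM -> 100 cycles
l2_penalty = round(20e-9 / cycle)       # 20 ns L2   -> 10 cycles

cpi_l1_only = 1.0 + 0.05 * dram_penalty                       # 6.0
cpi_with_l2 = 1.0 + 0.05 * l2_penalty + 0.02 * dram_penalty   # 3.5

print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)
```

Under this reading, adding the L2 cache improves effective CPI from 6.0 to 3.5, roughly a 1.7× speedup.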
193
Example caches
194
Memory management units
Memory management unit (MMU) translates addresses:
[Figure: CPU issues logical addresses to the memory management unit, which sends physical addresses to main memory]
195
Memory management tasks
Allows programs to move in physical memory during execution.
Allows virtual memory:
memory images kept in secondary storage;
images returned to main memory on demand during execution.
Page fault: request for location not resident in memory.
196
Address translation
Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses.
Two basic schemes:
segmented;
paged.
Segmentation and paging can be combined (x86).
197
Segments and pages
[Figure: memory holding two variable-size segments and two fixed-size pages]
198
Segment address translation
[Figure: segment base address + logical address → physical address; a range check against the segment lower and upper bounds raises a range error]
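A minimal sketch of the base-plus-bound scheme in the figure (the exact point at which real MMUs perform the range check varies by design; the addresses below are illustrative):

```python
def segment_translate(base, lower, upper, logical):
    """Segmented translation: add the segment base to the logical
    address, then range-check against the segment bounds."""
    physical = base + logical
    if not (lower <= physical <= upper):
        raise MemoryError("range error: address outside segment")
    return physical

# a 4 KB segment based at 0x4000
print(hex(segment_translate(0x4000, 0x4000, 0x4FFF, 0x123)))  # 0x4123
```

An access past the segment's upper bound raises the range error shown in the figure.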
199
Page address translation
[Figure: logical address split into page number and offset; the page number indexes the page table to get page i's base, which is concatenated with the offset to form the physical address]
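The split-index-concatenate flow can be written out directly (4 KB pages assumed, and the tiny page table below is illustrative):

```python
def translate(logical, page_table, offset_bits=12):
    """Paged translation: split the logical address into page number
    and offset, look up the frame, concatenate frame and offset."""
    page = logical >> offset_bits
    offset = logical & ((1 << offset_bits) - 1)
    frame = page_table[page]          # missing entry would be a page fault
    return (frame << offset_bits) | offset

page_table = {0: 5, 1: 2}             # page number -> frame number
print(hex(translate(0x1234, page_table)))  # page 1 -> frame 2: 0x2234
```

Because pages are a power of two in size, the "concatenate" step is just a shift and OR; no addition or bounds check is needed, unlike segmentation.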
200
Page table organizations
[Figure: flat vs. tree-structured page table organizations, each ending in page descriptors]
201
Caching address translations
Large translation tables require main memory access.
TLB: cache for address translation.
Typically small.
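A toy direct-mapped TLB in front of a page table (the entry count and mapping scheme are assumptions for illustration):

```python
class TLB:
    """Toy direct-mapped TLB caching page-table entries."""
    def __init__(self, page_table, size=16):
        self.page_table = page_table
        self.size = size
        self.entries = {}                 # slot -> (page, frame)
        self.hits = self.misses = 0

    def lookup(self, page):
        """Return the frame for a page, counting hits and misses."""
        slot = page % self.size
        entry = self.entries.get(slot)
        if entry is not None and entry[0] == page:
            self.hits += 1
            return entry[1]
        self.misses += 1                  # walk the table in main memory
        frame = self.page_table[page]
        self.entries[slot] = (page, frame)
        return frame

tlb = TLB({0: 5, 1: 2})
frames = [tlb.lookup(p) for p in [0, 0, 1, 0]]
print(frames, tlb.hits, tlb.misses)  # [5, 5, 2, 5] 2 2
```

Because programs exhibit locality, even a small TLB catches most translations and avoids the extra memory access of a table walk.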
202
ARM memory management
Memory region types:
section: 1 Mbyte block;
large page: 64 kbytes;
small page: 4 kbytes.
An address is marked as section-mapped or page-mapped.
Two-level translation scheme.
203
ARM address translation
[Figure: virtual address split into 1st index, 2nd index, and offset; the translation table base register combined with the 1st index selects a 1st-level table descriptor, which combined with the 2nd index selects a 2nd-level table descriptor, which is concatenated with the offset to form the physical address]