Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
CS61C:GreatIdeasinComputerArchitecture
ControlandPipelining
1
Instructors:VladimirStojanovicandNicholasWeaverhttp://inst.eecs.Berkeley.edu/~cs61c/sp16
Datapath ControlSignals• ExtOp: “zero”,“sign”• ALUsrc: 0⇒ regB;
1⇒ immed• ALUctr: “ADD”,“SUB”,“OR”
• MemWr: 1⇒writememory• MemtoReg:0⇒ ALU;1⇒Mem• RegDst: 0⇒ “rt”;1⇒ “rd”• RegWr: 1⇒writeregister
32
ALUctr
clk
busW
RegWr
32
32busA
32
busB
5 5
Rw Ra Rb
RegFile
Rs
Rt
Rt
RdRegDst
Extender 3216Imm16
ALUSrcExtOp
MemtoReg
clk
DataIn32
MemWr01
0
1
ALU 0
1WrEn Adr
DataMemory
5
Imm16
clk
PC
00
4nPC_sel &Equal
PCExt
AdderAdder
Mux
InstAddress
0
1
2
SummaryoftheControlSignals(1/2)inst Register Transfer
add R[rd] ← R[rs] + R[rt]; PC ← PC + 4
ALUsrc=RegB, ALUctr=“ADD”, RegDst=rd, RegWr, nPC_sel=“+4”
sub R[rd] ← R[rs] – R[rt]; PC ← PC + 4
ALUsrc=RegB, ALUctr=“SUB”, RegDst=rd, RegWr, nPC_sel=“+4”
ori R[rt] ← R[rs] + zero_ext(Imm16); PC ← PC + 4
ALUsrc=Im, Extop=“Z”, ALUctr=“OR”, RegDst=rt,RegWr, nPC_sel=“+4”
lw R[rt] ← MEM[ R[rs] + sign_ext(Imm16)]; PC ← PC + 4
ALUsrc=Im, Extop=“sn”, ALUctr=“ADD”, MemtoReg, RegDst=rt, RegWr, nPC_sel = “+4”
sw MEM[ R[rs] + sign_ext(Imm16)] ← R[rs]; PC ← PC + 4
ALUsrc=Im, Extop=“sn”, ALUctr = “ADD”, MemWr, nPC_sel = “+4”
beq if (R[rs] == R[rt]) then PC ← PC + sign_ext(Imm16)] || 00else PC ← PC + 4
nPC_sel = “br”, ALUctr = “SUB”
3
SummaryoftheControlSignals(2/2)
add sub ori lw sw beq jumpRegDstALUSrcMemtoRegRegWriteMemWritenPCselJumpExtOpALUctr<2:0>
1001000xAdd
1001000x
Subtract
01010000Or
01110001Add
x1x01001Add
x0x0010x
Subtract
xxx00?1xx
op targetaddress
op rs rt rd shamt funct061116212631
op rs rt immediate
R-type
I-type
J-type
add,sub
ori,lw,sw,beq
jump
funcop 000000 000000 001101 100011 101011 000100 000010AppendixA
100000See 100010 WeDon’tCare:-)
4
BooleanExpressionsforControllerRegDst = add + subALUSrc = ori + lw + swMemtoReg = lwRegWrite = add + sub + ori + lw MemWrite = swnPCsel = beqJump = jump ExtOp = lw + swALUctr[0] = sub + beq (assume ALUctr is 00 ADD, 01 SUB, 10 OR)ALUctr[1] = or
Where:
rtype = ~op5 • ~op4 • ~op3 • ~op2 • ~op1 • ~op0, ori = ~op5 • ~op4 • op3 • op2 • ~op1 • op0lw = op5 • ~op4 • ~op3 • ~op2 • op1 • op0sw = op5 • ~op4 • op3 • ~op2 • op1 • op0beq = ~op5 • ~op4 • ~op3 • op2 • ~op1 • ~op0jump = ~op5 • ~op4 • ~op3 • ~op2 • op1 • ~op0
add = rtype • func5 • ~func4 • ~func3 • ~func2 • ~func1 • ~func0sub = rtype • func5 • ~func4 • ~func3 • ~func2 • func1 • ~func0
How do we implement this in
gates?
5
ControllerImplementation
addsuborilwswbeqjump
RegDstALUSrcMemtoRegRegWriteMemWritenPCselJumpExtOpALUctr[0]ALUctr[1]
“AND” logic “OR” logic
opcode func
6
P&HFigure4.17
7
Summary:Single-cycleProcessor• Fivestepstodesignaprocessor:
1.Analyzeinstructionsetàdatapathrequirements
2.Selectsetofdatapathcomponents&establishclockmethodology
3.Assembledatapathmeetingtherequirements
4.Analyzeimplementationofeachinstructiontodeterminesettingofcontrolpointsthateffectstheregistertransfer.
5.Assemblethecontrollogic• FormulateLogicEquations• DesignCircuits
Control
Datapath
Memory
ProcessorInput
Output
8
SingleCyclePerformance• Assumetimefor actionsare
– 100psforregisterreadorwrite;200psforother events
• Clockperiodis?Instr Instr fetch Register
readALU op Memory
accessRegister write
Total time
lw 200ps 100 ps 200ps 200ps 100 ps 800ps
sw 200ps 100 ps 200ps 200ps 700ps
R-format 200ps 100 ps 200ps 100 ps 600ps
beq 200ps 100 ps 200ps 500ps
• Clock rate (cycles/second = Hz) = 1/Period (seconds/cycle)
9
SingleCyclePerformance• Assumetimefor actionsare
– 100psforregisterreadorwrite;200psforother events
• Clockperiodis?Instr Instr fetch Register
readALU op Memory
accessRegister write
Total time
lw 200ps 100 ps 200ps 200ps 100 ps 800ps
sw 200ps 100 ps 200ps 200ps 700ps
R-format 200ps 100 ps 200ps 100 ps 600ps
beq 200ps 100 ps 200ps 500ps
• What can we do to improve clock rate?• Will this improve performance as well?
Want increased clock rate to mean faster programs10
LevelsofRepresentation/Interpretation
lw $t0,0($2)lw $t1,4($2)sw $t1,0($2)sw $t0,4($2)
HighLevelLanguageProgram(e.g.,C)
AssemblyLanguageProgram(e.g.,MIPS)
MachineLanguageProgram(MIPS)
HardwareArchitectureDescription(e.g.,blockdiagrams)
Compiler
Assembler
MachineInterpretation
temp=v[k];v[k]=v[k+1];v[k+1]=temp;
0000 1001 1100 0110 1010 1111 0101 10001010 1111 0101 1000 0000 1001 1100 0110 1100 0110 1010 1111 0101 1000 0000 1001 0101 1000 0000 1001 1100 0110 1010 1111
LogicCircuitDescription(CircuitSchematicDiagrams)
ArchitectureImplementation
Anythingcanberepresentedasanumber,
i.e.,dataorinstructions
11
NoMoreMagic!
12
CS61ACS61BCS61C ✔CS61C ✔CS61C ✔CS61C çCS61C ✔EE40Phys 7B
I/O systemProcessor
CompilerOperatingSystem(Mac OSX)
Application (ex: browser)
Digital DesignCircuit Design
Instruction Set Architecture
Datapath & Control
transistors
MemoryHardware
Software Assembler
Administrivia
• Project2-2due3/8@23:59:59(Tue)• GuerrillaSessions:MIPSCPU
– Wed3/093- 5PM@241Cory– Sat3/121- 3PM@651@611Soda
13
GottaDoLaundry• Ann,Brian,Cathy,Daveeachhaveoneloadofclothestowash,dry,fold,andputaway– Washertakes30minutes
– Dryertakes30minutes
– “Folder”takes30minutes
– “Stasher”takes30minutestoputclothesintodrawers
A B C D
14
SequentialLaundry
• Sequentiallaundrytakes8hoursfor4loads
Task
Order
B
CD
A30Time
30 30 3030 30 3030 30 30 3030 30 30 3030
6 PM 7 8 9 10 11 12 1 2 AM
15
PipelinedLaundry
• Pipelinedlaundrytakes3.5hoursfor4loads!
Task
Order
BC
D
A
12 2 AM6 PM 7 8 9 10 11 1
Time3030 30 3030 30 30
16
• Pipeliningdoesn’thelplatencyofsingletask,ithelpsthroughputofentireworkload
• Multiple tasksoperatingsimultaneouslyusingdifferentresources
• Potentialspeedup=Numberpipestages
• Timeto“fill”pipelineandtimeto“drain”itreducesspeedup:2.3x(8/3.5)v.4x(8/2)inthisexample
6 PM 7 8 9Time
BC
D
A3030 30 3030 30 30
Task
Order
PipeliningLessons(1/2)
17
• SupposenewWashertakes20minutes,newStashertakes20minutes.Howmuchfasterispipeline?
• Pipelineratelimitedbyslowest pipelinestage
• Unbalancedlengthsofpipestagesreducesspeedup
6 PM 7 8 9Time
BC
D
A3030 30 3030 30 30
Task
Order
PipeliningLessons(2/2)
18
1)IFtch:InstructionFetch,IncrementPC2)Dcd:InstructionDecode,ReadRegisters3)Exec:
Mem-ref: CalculateAddressArith-log:PerformOperation
4)Mem:Load:ReadDatafromMemoryStore:WriteDatatoMemory
5)WB:WriteDataBacktoRegister
ExecutionStepsinMIPSDatapath
19
PC
inst
ruct
ion
mem
ory
+4
rtrsrd
regi
ster
s
ALU
Dat
am
emor
y
imm
1. InstructionFetch
2. Decode/Register Read
3. Execute 4. Memory 5. WriteBack
SingleCycleDatapath
20
PC
inst
ruct
ion
mem
ory
+4
rtrsrd
regi
ster
s
ALU
Dat
am
emor
y
imm
1. InstructionFetch
2. Decode/Register Read
3. Execute 4. Memory 5. WriteBack
Pipelineregisters
• Needregistersbetweenstages– Toholdinformationproducedinpreviouscycle
21
MoreDetailedPipeline
22
IFforLoad,Store,…
23
IDforLoad,Store,…
24
EXforLoad
25
MEMforLoad
26
WBforLoad– Oops!
Wrongregisternumber!
27
CorrectedDatapathforLoad
28
PipelinedExecutionRepresentation
• Everyinstructionmusttakesamenumberofsteps,sosomestageswillidle– e.g.MEMstageforanyarithmeticinstruction
IF ID EX MEM WBIF ID EX MEM WB
IF ID EX MEM WBIF ID EX MEM WB
IF ID EX MEM WBIF ID EX MEM WB
Time
29
GraphicalPipelineDiagrams
• Usedatapath figurebelowtorepresentpipeline:IF ID EX Mem WB
ALUI$ Reg D$ Reg
1.InstructionFetch
2.Decode/RegisterRead
3.Execute 4.Memory 5.WriteBack
PC
inst
ruct
ion
mem
ory
+4
RegisterFilert
rsrd
ALU
Dat
am
emor
y
imm
MU
X
30
Instr
Order
Load
Add
Store
Sub
Or
I$
Time (clock cycles)
I$
ALU
Reg
Reg
I$
D$
ALU
ALU
Reg
D$
Reg
I$
D$
Reg
ALU
Reg Reg
Reg
D$
Reg
D$
ALU
• RegFile: left half is write, right half is read
Reg
I$
GraphicalPipelineRepresentation
31
PipeliningPerformance(1/3)
• UseTc (“timebetweencompletionofinstructions”)tomeasurespeedup
–
– Equalityonlyachievedifstagesarebalanced(i.e.takethesameamountoftime)
• Ifnotbalanced,speedupisreduced• Speedupduetoincreasedthroughput
– Latency foreachinstructiondoesnotdecrease
32
PipeliningPerformance(2/3)
• Assumetimeforstagesis– 100psforregisterreadorwrite– 200psforotherstages
• Whatispipelinedclockrate?– Comparepipelineddatapath withsingle-cycledatapath
Instr Instrfetch
Register read
ALU op Memory access
Register write
Total time
lw 200ps 100 ps 200ps 200ps 100 ps 800ps
sw 200ps 100 ps 200ps 200ps 700ps
R-format 200ps 100 ps 200ps 100 ps 600ps
beq 200ps 100 ps 200ps 500ps
33
PipeliningPerformance(3/3)
Single-cycleTc = 800 psf = 1.25GHz
PipelinedTc = 200 ps
f = 5GHz
34
Clicker/PeerInstructionLogicinsomestagestakes200psandinsome100ps.Clk-Qdelayis30psandsetup-timeis20ps.Whatisthemaximumclockfrequencyatwhichapipelineddesigncanoperate?• A:10GHz• B:5GHz• C:6.7GHz• D:4.35GHz• E:4GHz
35