SOCSA Slides: Microprocessors
© Institute for Integrated Systems Technische Universität München www.lis.ei.tum.de
Case Study: Microprocessors
System-on-Chip
Solutions & Architectures A. Herkersdorf
Microprocessors
Motivation
Classification and Characteristics
Look Inside
How to Increase CPU Performance
Motivation (1)
Processor-based Digital Systems: Computers with fully programmable, general-purpose processors (PCs, laptops, workstations)
Primary purpose / function is data processing (incl. Web servers, bank servers)
Hardware & software evolve rather independently
However, most processors are deployed in „embedded systems“
Game consoles, PDAs, cell phones, printers, household appliances, …
Cars, industrial robots, …
Motivation (2)
Network Equipment: Internet Router
Routing table entries grow exponentially
Link rates:
2.5 Gb/s: 6.5 Mpps
10 Gb/s: 25 Mpps
Megabyte-sized memories with gigabytes/s of access bandwidth!
[Figure: number of BGP routing table entries (axis 40 k to 120 k) over the years '90 to '01. Source: http://telstra.net/ops/bgptable.html]
Motivation (3)
Network Equipment: Internet Router
MIPS processing requirements per packet vary substantially depending on the application
Tens of thousands of effective MIPS!
Processors in the tens-of-GHz class
Source: Jenkins, "NPU Co-Processors", 2000
Link       OC-3     OC-12    OC-48    OC-192
b / s      155 M    622 M    2.5 G    10 G
pkts / s   420 K    1.7 M    6.8 M    27 M
s / pkt    2.4 µ    600 n    150 n    37 n
NP case study will tell us how to tackle this challenge!
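The time budget per packet in the table is simply the inverse of the packet rate. A minimal Python sketch of that arithmetic, using the packet rates from the table; the value `ASSUMED_MIPS = 10_000` is only an assumed illustration of "tens of thousands of MIPS", not a figure from the slides:

```python
# Packet rates per link type, taken from the table above (packets per second).
PACKET_RATES = {
    "OC-3": 420e3,
    "OC-12": 1.7e6,
    "OC-48": 6.8e6,
    "OC-192": 27e6,
}

def time_per_packet_ns(pkts_per_s: float) -> float:
    """Time available to process one packet, in nanoseconds."""
    return 1e9 / pkts_per_s

def instructions_per_packet(mips: float, pkts_per_s: float) -> float:
    """Instruction budget per packet for a processor delivering `mips` MIPS."""
    return mips * 1e6 / pkts_per_s

ASSUMED_MIPS = 10_000  # hypothetical processing capacity, for illustration only
for link, rate in PACKET_RATES.items():
    print(f"{link}: {time_per_packet_ns(rate):.0f} ns/pkt, "
          f"{instructions_per_packet(ASSUMED_MIPS, rate):.0f} instr/pkt")
```

Even at the assumed capacity, only a few hundred instructions are available per packet at OC-192.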
Microprocessor SoC: PowerPC 405GP
[Block diagram: PowerPC 405GP system-on-chip]
CPU: 32K I-Cache, 32K D-Cache, MMU, Trace, JTAG
Timers, Interrupt Controller (13 external interrupts)
Processor Local Bus (PLB): 128-bit, up to 133 MHz; arbiter (Arb), PLB Monitor
On-chip Peripheral Bus (OPB): 33-66 MHz; OPB Bridge, UART (2), I2C (2), GPIO, GPT
DMA Controller
SRAM Ctlr. with 128 KB SRAM
DDR266 SDRAM Controller: 266 MHz, 32/64-bit with ECC
PCI-X Bridge: 66-133 MHz 64-bit PCI-X, 33-66 MHz 32/64-bit PCI; 128-bit master, 128-bit slave
10/100 Ethernet MAC with MAL: 1 MII or 2 RMII interfaces
RAM/ROM/Peripheral controller and external bus master cntlr.: up to 66 MHz, 32-bit address / 32-bit data
[Inset: memory hierarchy] CPU, Cache, Local Bus, fast & small SRAM, slower & larger (S)DRAM, I/O Subsystem (SCSI, PCI, etc.), Disk, Tape
Real-World Case Studies
Sonet/SDH Transmission LAN/SAN
Switch
Internet Router
Sonet/SDH Transmission
Control Processors
Classification and Characteristics
Type         Application                  Characteristics                            Remarks
CISC         Personal Computer            Complex, variable-length instructions      Intel x86-based
RISC         Embedded control             Load/store instructions for memory access  MIPS, PowerPC
DSP          xDSL Modem                   HW multiply for digital filters            TI
VLIW         Set Top Box                  Instruction parallelism at compile time    Parallel video pixel processing
Superscalar  Network Protocol Processing  Instruction parallelism at run time
ASIP         Embedded control             Application-specific instructions          Tensilica
Implementation Strategies for SOC (1)
System on Board: „real“ components
System on Silicon: „virtual“ components
Implementation Strategies for SOC (2)
Soft VC Firm VC Hard VC
VHDL
Architectural extensions
Speed/Area optimized
Soft VC CPU in FPGA SOC
Example: XILINX MicroBlaze CPU
SDRAM Ctrl.
RS-232
GPIO (buttons)
UserLogic (OPB-Master)
GPIO (LEDs)
Debug Logic
Local SRAM
MicroBlaze: 32-bit RISC, 200 MHz, 166 DMIPS; extensions: I-Cache, D-Cache, HW multiplier
For comparison:
Hard VC PowerPC 405: 32-bit RISC, 400 MHz, 600 DMIPS
MicroBlaze Core
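To compare the two cores independently of their clock rates, the quoted figures can be normalized to DMIPS per MHz. A small Python sketch using only the numbers from the slide; the helper name `dmips_per_mhz` is chosen here for illustration:

```python
def dmips_per_mhz(dmips: float, mhz: float) -> float:
    """Dhrystone MIPS delivered per MHz of clock frequency."""
    return dmips / mhz

# (DMIPS, clock in MHz) as quoted on the slide.
cores = {
    "MicroBlaze (soft VC)": (166, 200),
    "PowerPC 405 (hard VC)": (600, 400),
}
for name, (dmips, mhz) in cores.items():
    print(f"{name}: {dmips_per_mhz(dmips, mhz):.2f} DMIPS/MHz")
```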
Today’s primary focus
What is “Machine Structure”?
Coordination of many levels of abstraction:
Software: Applications, Operating System, Compiler, Assembler
Instruction Set Architecture
Hardware: Processor, Memory, I/O system
Datapath & Control
Digital Design
Circuit Design
transistors
Levels of Representation
High Level Language Program (e.g., C)
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  Compiler
Assembly Language Program (e.g., MIPS)
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
  Assembler
Machine Language Program (MIPS)
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  Machine Interpretation
Control Signal Specification
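The effect of the four load/store instructions can be traced with a tiny interpreter. This is only a sketch under assumptions: memory is modeled as a Python dict, the address 100 and the array values are hypothetical, and only `lw` and `sw` are supported.

```python
def run(program, regs, mem):
    """Execute a list of (op, reg, offset, base_reg) load/store tuples."""
    for op, reg, offset, base in program:
        addr = regs[base] + offset
        if op == "lw":      # load word: register <- memory
            regs[reg] = mem[addr]
        elif op == "sw":    # store word: memory <- register
            mem[addr] = regs[reg]
        else:
            raise ValueError(f"unsupported instruction: {op}")

# The swap from the slide: $2 holds the address of v[k], words are 4 bytes.
swap = [
    ("lw", "$t0", 0, "$2"),
    ("lw", "$t1", 4, "$2"),
    ("sw", "$t1", 0, "$2"),
    ("sw", "$t0", 4, "$2"),
]

regs = {"$2": 100, "$t0": 0, "$t1": 0}   # 100 is a hypothetical address
mem = {100: 7, 104: 9}                    # v[k] = 7, v[k+1] = 9
run(swap, regs, mem)
print(mem)
```

After the run, the two memory words have exchanged their values, which is what the three C statements express.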
Instruction Set Architecture (ISA)
Defines the interface between software & hardware
Visible hardware state (registers & memory)
A set of instructions that operate on that state
Given an ISA
The hardware implements it
The software uses it
Old SW can use new HW and vice versa
Keep in mind the difference: ISA vs. HW implementation
x86: 80x86, Pentium
Hardware
Software & OS
Instruction Set
ISA: What Programmers See
Instruction Set Registers Memory Address Space
FFFxxx000
000xxxFFF
Intel‘s most frequently used instructions [Hennessy], out of a total instruction set of ~140:
• Load
• Conditional branch
• Compare
• Store
• Add
• And
• Sub
• Move reg-reg
i386 register set [Intel]
Basic System Architecture
L1 cache
Memory access:
Registers / L1 cache: 1 cycle
L2 cache: 10 cycles
ext mem: 50 cycles
Spatial and temporal locality of data and code are the reasons why memory hierarchies perform well!
RAM
L2 cache
Look Inside
[Block diagram: control, ALU, accumulator, register block, status and program counter on the internal address bus and internal data bus; instr cache and data cache; address i/o and data i/o connect to the external address bus and external data bus]
Microprocessor Architecture
[Block diagram of the microprocessor: same components as under "Look Inside"]
Microprocessor Architecture
[Block diagram of the microprocessor: same components as under "Look Inside", annotated with the instruction phases]
Instruction fetch (IF)
Instruction decode (ID)
Operand fetch (OF)
Execution (EX)
Write back (WB)
Microprocessor Architecture
[Block diagram of the microprocessor: same components as under "Look Inside", annotated with the memory phases]
Memory load (ld)
Memory store (st)
Pipelining
Sequential machine (CPI = 4 - 5):
add r3,r2,r1   IF ID OF EX WB
ld r1,0(r0)                   IF ID M  WB
st r3, 4(r0)                              IF ID OF M
Pipelined processor:
add r3,r2,r1   IF ID OF EX M  WB
ld r1,0(r0)       IF ID OF EX M  WB
st r3, 4(r0)         IF ID OF EX M  WB
Individual instruction may take longer, … multiple instructions execute faster: CPI → 1
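The CPI figures above can be reproduced with a short cycle count. A minimal Python sketch, assuming an ideal pipeline without stalls; the function names are chosen here for illustration:

```python
def sequential_cycles(phase_counts):
    """Total cycles when each instruction finishes before the next one starts."""
    return sum(phase_counts)

def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline without stalls: fill time plus one cycle per further instruction."""
    return n_stages + n_instructions - 1

def cpi(total_cycles, n_instructions):
    """Average clock cycles per instruction."""
    return total_cycles / n_instructions

# Sequential machine: add needs 5 phases, ld and st need 4 each.
print(cpi(sequential_cycles([5, 4, 4]), 3))

# Ideal 6-stage pipeline (IF ID OF EX M WB): CPI approaches 1 for long programs.
for n in (3, 100, 10_000):
    print(n, cpi(pipelined_cycles(n, 6), n))
```

For three instructions the pipeline fill time still dominates; the benefit shows for long instruction sequences.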
CPU Pipeline
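[Figure: pipeline stages IF, ID, EXE, MEM, WB under common pipeline control, separated by buffers (D flip-flops) clocked by clk; the combinational delay of each stage is ΣT_logic; T_stp and T_c2q are marked at the flip-flop inputs (D) and outputs (Q)]
$T_{clk} = T_{c2q} + T_{logic}^{max} + T_{stp}$
$f_{max} = 1 / T_{clk}$
Single-scalar = 1 ALU, $CPI_{min} = 1.0$
instr._rate [MIPS] = f [MHz] / CPI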
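The relation between stage delay, clock frequency and instruction rate can be checked numerically. A sketch under assumptions: the three timing values below are hypothetical, and the clock period is taken as the sum of clock-to-output delay, the slowest stage's logic delay, and setup time.

```python
def f_max_mhz(t_c2q_ns, t_logic_max_ns, t_stp_ns):
    """Maximum clock frequency in MHz from the minimum clock period in ns."""
    t_clk_ns = t_c2q_ns + t_logic_max_ns + t_stp_ns
    return 1e3 / t_clk_ns

def instr_rate_mips(f_mhz, cpi):
    """Instruction rate in MIPS = f [MHz] / CPI."""
    return f_mhz / cpi

# Hypothetical timing values, chosen only for illustration.
f = f_max_mhz(t_c2q_ns=0.3, t_logic_max_ns=2.0, t_stp_ns=0.2)
print(f, instr_rate_mips(f, cpi=1.0))
```

Shortening the slowest stage (deeper pipelining) raises the achievable clock frequency, while the register overhead of clock-to-output delay and setup time stays constant per stage.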
Pipelining
Prerequisite for effective pipelining: regularity in the sequence of individual instruction phases
  Small, regular instruction set
  Few, simple addressing modes
Deep pipelining
  Eases processor speed scaling
  Increases vulnerability to pipeline problems
    Data hazards
    Branch conflicts
Data Hazard
add r3,r2,r1   IF ID OF EX M  WB
sub r7,r3,r1      IF ID OF EX M  WB
and r6,r3,r2         IF ID OF EX M  WB
Dependencies back in time cause data hazards: sub and and fetch operand r3 (OF) before add has written it back (WB)
Eliminate reverse time dependency by stalling
[Pipeline diagram: the dependent instructions are delayed by stall cycles until r3 has been written back]
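How many stall cycles a read-after-write dependency costs can be worked out with a small scheduler. This is a simplified sketch, not the mechanism of any particular processor: it assumes the six stages IF ID OF EX M WB, no forwarding, that a register can be read in OF only in the cycle after WB, and it models a stall by delaying the start of the dependent instruction.

```python
STAGES = ["IF", "ID", "OF", "EX", "M", "WB"]
OF, WB = STAGES.index("OF"), STAGES.index("WB")

def schedule(program):
    """Return the start cycle of each instruction, inserting stalls for RAW hazards.

    program: list of (dest, sources) register tuples, in program order.
    """
    starts = []
    ready = {}  # register -> first cycle in which its new value can be read
    for dest, sources in program:
        start = starts[-1] + 1 if starts else 1      # in-order issue, one per cycle
        for src in sources:
            if src in ready:
                # Delay so that the OF cycle is not before the value is readable.
                start = max(start, ready[src] - OF)
        starts.append(start)
        ready[dest] = start + WB + 1                 # readable the cycle after WB
    return starts

program = [
    ("r3", ("r2", "r1")),   # add r3,r2,r1
    ("r7", ("r3", "r1")),   # sub r7,r3,r1
    ("r6", ("r3", "r2")),   # and r6,r3,r2
]
starts = schedule(program)
delays = [s - (i + 1) for i, s in enumerate(starts)]
print(starts, delays)
```

Under these assumptions the `sub` is delayed until its operand fetch falls after the write back of `add`, and the `and` follows one cycle later.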
Branching
bcctr r3       IF ID OF EX M  WB
shr r7, r1        IF ID OF EX M  WB
and r6,r3,r2         IF ID OF EX M  WB
Deviation from sequential program execution
Stall, or exploit advanced concepts like “branch prediction”
If r3 points back in address space, it‘s more likely that the branch is taken
[Figure: bcctr r3 branches to the address held in r3; address labels addr1 and addr2]
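The heuristic that backward branches are likely taken (they typically close loops) can be written as a static predictor. A minimal Python sketch; the branch addresses and the trace are hypothetical, and real processors usually combine this with dynamic prediction:

```python
def predict_taken(branch_addr: int, target_addr: int) -> bool:
    """Static heuristic: backward branches (typically loops) are predicted taken."""
    return target_addr < branch_addr

def accuracy(trace):
    """trace: list of (branch_addr, target_addr, actually_taken) tuples."""
    hits = sum(predict_taken(b, t) == taken for b, t, taken in trace)
    return hits / len(trace)

# Hypothetical trace: a loop branch at 0x120 jumping back to 0x100,
# taken 9 times and falling through on the 10th iteration.
loop_trace = [(0x120, 0x100, True)] * 9 + [(0x120, 0x100, False)]
print(accuracy(loop_trace))
```

The predictor is wrong only once per loop, at the final fall-through iteration.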
Performance
What is performance? Example: Porsche vs. bus from Munich to Stuttgart
Vehicle   Top speed [km/h]   Distance [km]   Travel time [h]   Capacity [person]   Throughput [pkm/h]
Porsche   260                200             0.77              2                   520
Bus       120                200             1.67              46                  5520
What matters in CPU performance:
Fastest possible execution of a single instruction?
Shortest program execution time (many instructions)?
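The table separates latency (travel time of one trip) from throughput (person-kilometres per hour). A short Python sketch that recomputes both columns from the speed, distance and capacity values of the example:

```python
def travel_time_h(distance_km, speed_kmh):
    """Latency of a single trip in hours."""
    return distance_km / speed_kmh

def throughput_pkm_per_h(capacity_persons, speed_kmh):
    """Person-kilometres transported per hour."""
    return capacity_persons * speed_kmh

DISTANCE_KM = 200
vehicles = {"Porsche": (260, 2), "Bus": (120, 46)}  # (top speed, capacity)
for name, (speed, capacity) in vehicles.items():
    print(name,
          round(travel_time_h(DISTANCE_KM, speed), 2),
          throughput_pkm_per_h(capacity, speed))
```

The Porsche wins on latency, the bus on throughput by about a factor of ten; the same distinction applies to single-instruction latency versus program execution time.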
Processor Performance
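Ultimately interested in CPU execution time: the time the CPU needs to complete a certain program, task or function
$$\text{CPU time} = \frac{\text{Clock cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
Instructions / Program: specific for your application; estimate/count after compilation
Clock cycles / Instruction (CPI): processor architecture and memory hierarchy dependent
Seconds / Clock cycle = 1 / f_cpu: processor data sheet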
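The three-factor product translates directly into code. A minimal Python sketch; the instruction count, CPI and clock frequency below are hypothetical example values:

```python
def cpu_time_s(instructions, cpi, f_cpu_hz):
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    return instructions * cpi / f_cpu_hz

# Hypothetical program: 1 million instructions, CPI of 1.5, 200 MHz clock.
print(cpu_time_s(1_000_000, 1.5, 200e6))
```

Halving any one factor (fewer instructions, lower CPI, or a shorter clock cycle) halves the execution time, provided the other two stay unchanged.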
Processor Performance
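$CPI = CPI_{CPU} + CPI_{MEM}$
$CPI_{MEM} = CPI_{Iaccess} + CPI_{Daccess}$
$= \text{IFreq} \times \text{L1miss\_rate} \times (\text{L1miss\_penalty} + \text{L2miss\_rate} \times \text{L2miss\_penalty})$
$+ \text{DaccFreq} \times \text{L1miss\_rate} \times (\text{L1miss\_penalty} + \text{L2miss\_rate} \times \text{L2miss\_penalty})$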
Processor Performance - Example
Pipelined RISC CPU: $CPI_{CPU} = 1.2$
Two-level cache hierarchy:
L1miss_rate = 5%; L1miss_penalty = 10 cycles
L2miss_rate = 3%; L2miss_penalty = 50 cycles
DaccFreq = 20%
$CPI_{MEM} = 0.69$ (with IFreq = 100%, since every instruction is fetched)
0.15% of instr./data accesses go to system memory, yet overall performance (CPU execution time) degrades by 57%:
$CPI_{miss} / CPI_{no\_miss} = 1.89 / 1.2 \approx 1.57$
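The example numbers can be verified by evaluating the CPI formula directly. A Python sketch using the values from the slide; the instruction fetch frequency of 1.0 (one fetch per instruction) is an assumption that the slide does not state explicitly but that reproduces its result:

```python
def cpi_mem(ifreq, dacc_freq, l1_miss_rate, l1_miss_penalty,
            l2_miss_rate, l2_miss_penalty):
    """Memory stall cycles per instruction for a two-level cache hierarchy."""
    # Average penalty per access: an L1 miss costs the L1 penalty plus,
    # for the fraction that also misses in L2, the L2 penalty.
    per_access = l1_miss_rate * (l1_miss_penalty + l2_miss_rate * l2_miss_penalty)
    return (ifreq + dacc_freq) * per_access

CPI_CPU = 1.2
mem = cpi_mem(ifreq=1.0, dacc_freq=0.20,          # ifreq = 1.0 is assumed
              l1_miss_rate=0.05, l1_miss_penalty=10,
              l2_miss_rate=0.03, l2_miss_penalty=50)
total = CPI_CPU + mem
print(round(mem, 2), round(total, 2), round(total / CPI_CPU, 3))
```

The fraction of accesses that reach system memory is the product of both miss rates, 5% x 3% = 0.15%.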
Microprocessor Performance
[Xilinx]
How to Increase CPU Performance ?
• Pipelining
• Application specific ISA extensions
• Multiple ALUs and Control units
• Superscalar
• VLIW (Very Long Instruction Word)
• Multithreading
• Memory hierarchy design
DSP Architecture
HW multiply unit
[Block diagram of the microprocessor: same components as under "Look Inside", with a Multiply unit in place of the ALU]
SIMD / MIMD
[Block diagram of the microprocessor: same components as under "Look Inside", with a Datapath block in place of the ALU]
Single Instruction Multiple Data:
• single control / multiple datapaths
Multiple Instruction Multiple Data:
• multiple controls
VLIW – Very Long Instruction Word
[Figure: an optimizing compiler packs the instructions of a sequential program (…, Instr i-2, Instr i-1, Instr i, Instr i+1, Instr i+2, …) into one very long instruction word with one slot per datapath (InstrDP1, InstrDP2, …, InstrDPn); the datapaths DP 1 … DP n share the registers]
Parallelism is determined at compile time
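What "determined at compile time" means can be illustrated with a toy bundle packer. This is a deliberately simplified sketch, not the algorithm of a real VLIW compiler: it assumes every datapath can execute every instruction, that each register is written at most once, and it honors only read-after-write dependencies.

```python
def pack_bundles(program, width):
    """Greedily pack instructions into VLIW bundles of at most `width` slots.

    program: list of (name, dest, sources) in sequential program order.
    An instruction is placed in the earliest non-full bundle that comes
    after the bundles producing its source registers.
    """
    bundles = []        # list of lists of instruction names
    produced_in = {}    # register -> index of the bundle that writes it
    for name, dest, sources in program:
        earliest = 0
        for src in sources:
            if src in produced_in:
                earliest = max(earliest, produced_in[src] + 1)
        slot = earliest
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1   # bundle is full, try the next one
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        produced_in[dest] = slot
    return bundles

program = [
    ("add", "r3", ("r2", "r1")),
    ("sub", "r7", ("r3", "r1")),   # depends on add
    ("and", "r6", ("r5", "r2")),   # independent
    ("or",  "r8", ("r7", "r6")),   # depends on sub and and
]
print(pack_bundles(program, width=2))
```

With two datapaths the four instructions fit into three long instruction words; the hardware needs no dependency checking at run time because the compiler has already resolved it.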
Superscalar
[Figure: Instr fetch (IF i … IF i+3), then Instr pre-decode (ID1 i … ID1 i+3), then Instr distribute to four datapaths DP 1 … DP 4; each datapath runs its own ID2, OF, EX and WB stages for one of the instructions i … i+3]
Parallelism is decided at run time
Multithreading in Hardware
[Block diagram of the microprocessor: same components as under "Look Inside"]
Multiple register banks
Multithreading in Software
[Block diagram of the microprocessor: same components as under "Look Inside", plus three saved copies of register block, status and program counter]
Load/save register status
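Loading and saving the register status on a thread switch can be sketched as follows. The classes `Context` and `Cpu`, the register count of 8 and the addresses are all hypothetical; the sketch only illustrates that a software context switch copies register block, status and program counter to and from memory.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Saved CPU state of one thread: register block, status, program counter."""
    registers: list = field(default_factory=lambda: [0] * 8)
    status: int = 0
    pc: int = 0

class Cpu:
    def __init__(self):
        self.registers = [0] * 8
        self.status = 0
        self.pc = 0

    def save(self) -> Context:
        """Copy the live CPU state into a context record."""
        return Context(list(self.registers), self.status, self.pc)

    def load(self, ctx: Context) -> None:
        """Restore the CPU state from a context record."""
        self.registers = list(ctx.registers)
        self.status = ctx.status
        self.pc = ctx.pc

def switch(cpu, contexts, current, nxt):
    """Save the running thread's state and resume thread `nxt`."""
    contexts[current] = cpu.save()
    cpu.load(contexts[nxt])
    return nxt

cpu = Cpu()
contexts = [Context(pc=0x100), Context(pc=0x200)]  # two hypothetical threads
cpu.load(contexts[0])
cpu.registers[1] = 42      # thread 0 does some work
cpu.pc = 0x104
current = switch(cpu, contexts, 0, 1)
print(hex(cpu.pc), contexts[0])
```

Every switch costs the copy operations in `save` and `load`; multiple register banks in hardware avoid that cost by keeping several contexts resident in the processor.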