Processor Design - Embedded Processors
Professor Jari Nurmi
Institute of Digital and Computer Systems
Tampere University of Technology, Finland
email [email protected]
Embedded Processors
- Embedded processor = "not a computer processor"
  - implements control and/or communication functionality of a device
  - not user-programmable (programmed by the application developer)
  - may be a microcontroller
    - management of peripherals to access sensors and actuators
  - or a full-fledged RISC/CISC/DSP processor
- Different goals compared to high-end workstations
  - low power consumption (in many applications)
  - small silicon area of the processor
  - small memory footprint
  - low to moderate performance may be sufficient
  - small interrupt latency (and interrupt overhead)
  - real-time requirements
  - price, price, price
Embedded Processors (cont'd)
- Embedded application characteristics vary a lot
- Examples of different kinds of applications
  - Game console (high-end stream processing power with special graphics enhancements)
  - Mobile phone (lots of DSP and moderate control processing)
  - Home appliances (low speed control)
  - Printer (stream control and computation, no real-time requirement)
Embedded Processors (cont'd)
- Emphasis here on embedded RISC
  - (embedded) DSP will be discussed next
- How to achieve the design goals, especially
  - low power consumption
  - small silicon area of the processor
  - small memory footprint
  - small interrupt latency and overhead
- Examples of embedded RISC solutions for this
  - ARM
  - MIPS
  - CompactRISC
Power Consumption
- Basic power consumption formula in digital CMOS: $P = V^2 \times f \times c$, where
  - $V$ = voltage swing (usually equals supply voltage)
  - $f$ = clock frequency
  - $c$ = capacitance switched at each clock cycle
- Three things to minimize
  - $V$ has the largest effect (it enters squared), but can be affected least by the design(er)
    - reducing $V$ also reduces $f$ (desired or not)
  - $f$ can be reduced if less performance is sufficient
  - $c$ can actually be factored into $c_{eff} = \sum_i c_i \times a_i$, where $a_i$ is the activity factor of node $i$
    - the $c_i$ are minimized by less circuitry (or slower circuitry)
    - the $a_i$ are minimized by activity control
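As a rough illustration of the formula above, the following sketch computes the effective capacitance from per-node capacitances and activity factors, then compares the effect of halving the voltage with halving the clock. All node values, voltages, and frequencies are made-up numbers chosen for easy arithmetic, not data for any real processor.

```python
def effective_capacitance(nodes):
    """Sum c_i * a_i over (capacitance_in_farads, activity_factor) pairs."""
    return sum(c * a for c, a in nodes)


def dynamic_power(v, f, c_eff):
    """P = V^2 * f * c, in watts for volts, hertz and farads."""
    return v ** 2 * f * c_eff


# Hypothetical nodes: (capacitance in farads, activity factor)
nodes = [(2e-12, 0.5), (10e-12, 0.1), (1e-12, 1.0)]
c_eff = effective_capacitance(nodes)

p_full = dynamic_power(3.3, 50e6, c_eff)     # reference operating point
p_low_v = dynamic_power(1.65, 50e6, c_eff)   # half the voltage
p_low_f = dynamic_power(3.3, 25e6, c_eff)    # half the clock frequency

print(f"half V: {p_low_v / p_full:.2f} of reference power")
print(f"half f: {p_low_f / p_full:.2f} of reference power")
```

Halving the voltage cuts the power to a quarter, while halving the frequency only cuts it to a half, which is why the voltage is listed first among the things to minimize.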
Power Consumption (cont'd)
- $c_i$ worst in output nodes
  - minimizing the bandwidth of off-chip accesses by
    - caches (or bringing memory on-chip)
    - small instruction length
- $c_i$ in other nodes can be minimized by
  - accepting slower operation
  - having less complexity (fewer nodes)
    - no superscalar issue, out-of-order execution, etc.
    - simple cache control (low associativity)
    - short buses (and otherwise small dimensions)
- $a_i$ minimized by activity control
  - different power-down modes
  - partial power-down of currently unused parts (clock gating)
  - latched inputs on blocks connected to buses
  - attention to timing to avoid multiple transitions during a cycle
Processor (Core) Area
- Giving up extreme performance requirements also saves area
  - slower logic is smaller
  - fewer pipeline registers, less stall & forward control
  - fewer advanced features (branch prediction, speculative execution)
- Giving up accuracy or range saves area
  - 8, 16, 32, 64-bit processors for different segments of embedded
  - single-cycle multipliers etc. as application-specific extensions
  - typically no floating-point hardware
Memory Optimization
- For low cost, the amount of memory is crucial
  - on-chip memory and/or cache (chip cost)
  - off-chip memory (system/board cost)
- Mainly two things affect the program memory size
  - program length
  - instruction word length
- Program length shortened by powerful instructions
- Instruction word length shortened by simple instructions!
- Compromise of these goals needs to be found
- One solution is to have two instruction modes
  - full-length powerful instructions
  - compressed instructions
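The trade-off between program length and instruction word length can be sketched numerically. The instruction counts below are hypothetical; in particular, the assumption that a program needs 30% more instructions in a 16-bit compressed mode than in a 32-bit full-length mode is only an illustrative figure.

```python
def program_bytes(instruction_count, word_bits):
    """Program memory size = program length x instruction word length."""
    return instruction_count * word_bits // 8


# Hypothetical program: 1000 full-length 32-bit instructions.
full = program_bytes(1000, 32)

# Assumed for illustration: the same program needs 30% more
# instructions when only simpler 16-bit instructions are available.
compressed = program_bytes(1300, 16)

saving = 1 - compressed / full
print(f"full-length: {full} bytes, compressed: {compressed} bytes")
print(f"memory saving: {saving:.0%}")
```

Under these assumptions the compressed program is longer in instructions but still clearly smaller in bytes, which is the compromise the two-mode approach aims for.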
Interrupt Latency and Overhead
- The latency and overhead of interrupts are crucial in (reactive) embedded systems
- The latency of an inter-instruction interrupt is made up of
  - time to synchronize the request
  - time to complete or abort the (longest) instruction in execution
  - time to enter the interrupt service mode (with possible state saving)
  - (and possibly time spent waiting for interrupts to be enabled)
- Overhead consists of
  - mode switching (with state saving and restoring)
  - interrupt processing
- Instructions as short as possible (in cycles) for less latency
- Possibly long instructions reversible to enable aborting and re-issue
- Different register sets for interrupt processing mode(s)
  - improves both latency and overhead
- Efficient interrupt processing instructions (e.g. not using compressed instructions in interrupts)
ARM Instruction Set
ARM Architecture
- small (32x8) multiplier
- barrel shifter
- 16 (31) GP registers
- 6 status registers
- many dedicated buses
- compressed instructions (Thumb)
Thumb
- Compressed mode of ARM processors
Thumb Instruction Set
ARM Caches
- Separate instruction and data caches
  - 4 kbytes each
- Organization
  - four cache segments
  - 64-way set-associative (each segment fully associative)
  - four words per block (4 segments x 64 lines x 4 words x 4 bytes = 4 kbytes)
  - word-aligned cache access
- Regions of data cache can be marked uncacheable
- Flexible cleaning and flushing utilities
- 8-word write buffer, configurable region-wise as write-through, write-back, or disabled
[Figure: cache address format. Tag = bits 31-6, segment = bits 5-4, word = bits 3-2, bits 1-0 unlabeled]
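The cache organization above can be checked with a small sketch that splits a 32-bit address into its fields. The field positions (tag in bits 31-6, segment in bits 5-4, word in bits 3-2) are read from the address-format figure; the function name and the sample address are illustrative assumptions.

```python
SEGMENTS = 4          # selected by address bits 5-4
LINES_PER_SEGMENT = 64  # fully associative within a segment
WORDS_PER_LINE = 4    # selected by address bits 3-2
BYTES_PER_WORD = 4    # bits 1-0, ignored for word-aligned access


def split_address(addr):
    """Return (tag, segment, word) for a 32-bit address."""
    word = (addr >> 2) & 0x3
    segment = (addr >> 4) & 0x3
    tag = (addr & 0xFFFFFFFF) >> 6
    return tag, segment, word


# Capacity check: 4 segments x 64 lines x 4 words x 4 bytes
capacity = SEGMENTS * LINES_PER_SEGMENT * WORDS_PER_LINE * BYTES_PER_WORD

tag, segment, word = split_address(0x00001234)
print(capacity, hex(tag), segment, word)
```

Because the segment is fully associative, the tag has to be compared against all 64 lines of the selected segment; only two address bits are used for indexing.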
ARM Solutions for . . .
- Memory footprint
  - compressed instruction mode (Thumb)
  - some additional arithmetic efficiency (barrel shifter, small multiplier)
  - conditional instructions (fewer delay slots to be filled)
- Processor size
  - not full-size multiplier
  - (only) three-stage pipeline (in ARM7, five-stage in ARM9)
  - small on-chip caches (in ARM7 no cache by default)
  - only physical addresses, no address mapping
  - no branch prediction or other fancy features
ARM Solutions for . . . (cont'd)
- Power consumption
  - Thumb instruction compression
  - caches
  - short dedicated buses in the core, small core
  - low-depth pipeline
  - everything simple but working
- Interrupt latency and overhead
  - FIQ, Fast Interrupt Request for data transfers
  - total of six (partially overlapping) sets of registers for different modes
  - always handles the interrupts in the (non-compressed) ARM mode
ARM Register Sets
The MIPS Approach
- Discrete R3000 and R4000 (R2000) processors from multiple manufacturers
- Three core families
  - MIPS 32 (32-bit RISC)
    - one R3000 & R4000 compatible low-power core
    - one with fast (= single-cycle) multiply-accumulate added
    - one additionally optimized for Windows CE and other OSs
  - MIPS 64 (64-bit RISC)
    - one synthesizable high-performance core
    - one with 3D graphics extensions added
  - MIPS 16
    - "code compression providing 40% reduction in memory footprint"
- MIPS compared to ARM
  - seems to target a broader range (also high performance market)
  - is lagging in some embedded-specific solutions (like code compression)
MIPS Instructions
- Three instruction formats
- Nothing very special
MIPS Code Compression
- Limited opcodes
- Limited register set
- Short immediates
- Decompression as in ARM Thumb
MIPS Register Mapping
- Register set in compressed mode
  - not very straightforward mapping
  - access to other registers by special register-to-register moves
- Moving between modes
  - JALX instruction calls a subroutine and toggles the mode
  - on return, the mode of the caller is restored (merged with the address)
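The idea of merging the mode with the address can be sketched as follows. This is a simplified model, not the actual MIPS instruction semantics: it assumes the mode is carried in the lowest bit of the saved return address, and the function names are hypothetical.

```python
ISA_MODE_BIT = 0x1  # assumed: lowest address bit carries the mode


def link_address(return_pc, compressed_mode):
    """Merge the caller's mode into the saved return address."""
    return (return_pc & ~ISA_MODE_BIT) | (ISA_MODE_BIT if compressed_mode else 0)


def jalx(target_pc, return_pc, compressed_mode):
    """Call a subroutine and toggle the mode: returns (pc, mode, link)."""
    return target_pc, not compressed_mode, link_address(return_pc, compressed_mode)


def return_via_link(link):
    """Return to the caller: split the link into (pc, restored mode)."""
    return link & ~ISA_MODE_BIT, bool(link & ISA_MODE_BIT)


# Call from compressed mode into a full-length subroutine and back.
pc, mode, link = jalx(0x8000, 0x2004, compressed_mode=True)
back_pc, back_mode = return_via_link(link)
print(hex(link), hex(back_pc), back_mode)
```

Since instruction addresses are at least halfword-aligned, the lowest address bit is otherwise unused, which is what makes this kind of merging possible without extra state.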
MIPS Solutions for . . .
- Memory footprint
  - code compression
- Processor size
  - multiply-accumulate as a peripheral option only
  - smallish on-chip caches (0-16 kbytes 4-way set-associative separately for I and D)
  - the basic design is simple and enables compact implementation
  - however, pipeline varies (8-stage pipeline in R4000!)
- Power consumption
  - caches
  - code compression
- Interrupt latency and overhead
  - nothing specific
National CompactRISC
- Scalable RISC architecture with 8/16/32/64-bit data
- Available as cores
- Variable instruction word length 16/32-bit (/48-bit in CR32)
- Three-stage pipeline
- Interrupt stack in hardware
- Barrel shifter
- Multi-cycle multiply operation
- 12-13 truly GP registers (total 16), dedicated registers
CompactRISC Registers
CompactRISC Solutions for . . .
- Memory footprint
  - dynamic instruction size
- Processor size
  - no caches by default
  - simple basic design enables compact implementation
  - multi-cycle multiplier
  - different word length core implementations
  - three-stage pipeline
- Power consumption
  - dynamic instruction size
  - however, no caches included in the core design
- Interrupt latency and overhead
  - separate interrupt stack
  - barrel shifter
Summary
- Key things in embedded processors are
  - keeping the memory footprint small
  - keeping the processor area small
  - keeping the power consumption low (in most cases)
  - keeping the interrupt latency and overhead low (in most cases)
  - keeping the price/performance ratio as low as possible
- The means to achieve these goals vary
End of Embedded Processors
Next we will look at DSP processors