derrick-john-mccoy
Motivation
• Mobile embedded systems are present in:
  – Cell phones
  – PDAs
  – MP3 players
  – GPS units
Mobile Computing Design Considerations
• Low power
• Real-time data processing
• Small size
• Low cost
• Quick time to market
Metric Introduction
• Processor specialization
• Instruction set
• Interconnect
• Memory specialization
• Functional & Data path units
• Power Specialization
Metric: Processor Specialization
• Central controlling point of embedded system
• Examples:
  – VLIW to perform multiple instructions in parallel
  – RISC architecture
Metric: Instruction Set Specialization
• Introduction of new instructions to extract optimal performance from the processor
• Examples:
  – Multiply-accumulate
  – Vector operations
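As a rough illustration of why a multiply-accumulate (MAC) instruction is a common specialization, the sketch below writes a dot product as a chain of MAC steps. The function names `mac` and `dot_product` are hypothetical and do not come from any of the architectures discussed; on a real DSP each call to `mac` would map to a single instruction.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: returns acc + a * b."""
    return acc + a * b


def dot_product(xs, ys):
    """Dot product written as a chain of MAC steps, one per element pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


print(dot_product([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```

Filters and transforms are dominated by loops of this shape, which is why fusing the multiply and the add into one instruction pays off.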
Metric: Interconnect
• Provides means for different modules to communicate
• Optimizations can lead to reduced complexity, cost, and power consumption
Metric: Memory Specialization
• Specialization is achieved by optimizing the number and size of memory banks and the number and size of access ports
• Optimizations can improve performance, power consumption, and chip area
Metric: Functional & Data Path Units
• Functional units are often specialized hardware units implementing a frequently used software algorithm
• Examples:
  – DSP co-processors, interrupt priority co-processors, memory access modules, and timer modules
Metric: Power Specialization
• Major concern in mobile systems
• Kept under control by:
  – Using low voltage
  – Slow clock speed
  – Custom circuit solutions
Architectures to be discussed
• M*CORE
• D30V/MPEG
• SuperENC
• 1.3-GOPS Parallel DSP
• IA-32 w/ Enhanced Data Streaming
M*CORE
• Low power embedded applications
• Wireless mobile devices
• Cellular phones
M*CORE Processor Specialization
• Simple RISC architecture
• 4 stage pipeline
• 16-bit instruction word length
• Compiler designed in parallel with architecture
• Barrel shifter built into ALU
M*CORE Instruction Set Specialization
• Multimedia instructions
  – Multiple data transfers from memory to register and register to memory
  – Fast register saves
• FF1 (Find First 1)
  – Finds the highest-priority interrupt in hardware
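A minimal sketch of what a find-first-one operation computes, assuming a 32-bit pending-interrupt word in which a higher bit position means a higher priority. Both the bit ordering and the return convention (bit index, or `None` when no bit is set) are assumptions made for illustration; the slides do not give the exact M*CORE encoding.

```python
def ff1(word):
    """Return the index of the most-significant set bit in a 32-bit word.

    Returns None when no bit is set. Assumption: higher bit = higher priority.
    """
    word &= 0xFFFFFFFF          # keep only the low 32 bits
    if word == 0:
        return None
    return word.bit_length() - 1


pending = 0b0000_0100_0001_0000   # interrupts 4 and 10 are pending
print(ff1(pending))               # 10
```

Without such an instruction, software has to find the same bit with a loop of shifts and tests, which adds latency to every interrupt dispatch.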
M*CORE Interconnect Specialization
• 16-bit data bus to match the 16-bit word length
  – Reduces memory bandwidth, complexity, chip layout area, and power consumption
• MDI (MCU-to-DSP Interface)
  – Dual-access memory messaging unit
• General I/O bus for peripherals
M*CORE Memory Specialization
• Alternate register bank
  – Fast register saves for context switches
M*CORE Functional & Data Path Units
• 32 channel programmable interrupt controller
• Protocol timer
• DSP core
M*CORE Power Specialization
• 1.8 Volts
• Uses 0.5 Watts
• Power-aware pipeline
• Programmable power states
  – Stop
  – Wait
  – Doze
  – Normal
M*CORE Summary
• Low power and programmable power states make it ideal for mobile devices
• Interface to the built-in DSP core makes it ideal for cell phone applications
650 MHz IA-32
• Microprocessor designed to accelerate data-streaming applications
• Three-dimensional graphics
• Video encode/decode
650 MHz IA-32 Processor Specialization
• IA-32 architecture
• 70 new instructions
• SIMD floating point data type
• Improvements in regard to circuit implementation
650 MHz IA-32 Instruction Set Specialization
• 70 new instructions
  – SIMD FP operations
  – Control for new 8-entry register file
  – Multimedia extension
• 12 new integer instructions
650 MHz IA-32 Interconnect Specialization
• Front Side Bus of 66, 100, 133 MHz
• Back Side Bus
  – Half the clock frequency for mobile and desktop applications
  – Full clock frequency for server/workstation applications
650 MHz IA-32 Memory Specialization
• 3 new non-temporal store instructions with write combining buffers
  – Burst write protocol
  – Write data throughput of 1.066 Gbytes/sec on a 133 MHz bus
• 4 new data pre-fetch instructions
  – Overlap memory accesses with computation, reducing cache miss penalties
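The quoted write throughput is consistent with a bus that transfers 8 bytes on every clock. The short calculation below is a sketch under that assumption: the 64-bit bus width is not stated on the slide, and the constant names are made up for illustration.

```python
BUS_CLOCK_HZ = 133.33e6   # nominal "133 MHz" front side bus
BUS_WIDTH_BYTES = 8       # assumption: 64-bit data bus (not stated on the slide)

# Peak rate if one full-width transfer completes on every bus clock.
peak_bytes_per_sec = BUS_CLOCK_HZ * BUS_WIDTH_BYTES

print(f"{peak_bytes_per_sec / 1e9:.3f} Gbytes/sec")
```

Under these assumptions the peak works out to roughly 1.07 Gbytes/sec, so the 1.066 Gbytes/sec figure suggests that the burst write protocol keeps the bus busy on essentially every clock.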
650 MHz IA-32 Functional Specialization
• 8 entry register file
  – Reduces register starvation for SIMD unit
  – 128 bits wide
    • Four independent single precision elements packed in parallel
• Dedicated table based lookup unit for reciprocal operations
  – Completes reciprocal operations in one clock cycle
  – Error of 1.5 * 2^-12
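To put the reciprocal error figure in perspective, the sketch below converts it into an equivalent number of accurate bits. It assumes the figure is a bound on relative error; the helper name `accurate_bits` is hypothetical.

```python
import math

MAX_REL_ERROR = 1.5 * 2 ** -12   # figure quoted on the slide


def accurate_bits(rel_error):
    """Number of binary digits that stay correct for a given relative error."""
    return -math.log2(rel_error)


print(MAX_REL_ERROR)
print(round(accurate_bits(MAX_REL_ERROR), 2))
```

This comes to a relative error of about 3.7e-4, or roughly 11.4 accurate bits, compared with the 24-bit significand of a single precision value. The lookup unit therefore trades precision for its one-cycle latency.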
650 MHz IA-32 Low Power Usage
• 1.4 V to 2.2 V at 650 MHz, close to room temperature
650 MHz IA-32 Performance
• 1.5X to 2.0X performance boost for 3-D transform and lighting kernels
• Real-time MPEG-2 video/audio encoding at 30 frames per second
  – Achieved through improvement to the SIMD unit, at a cost of only a 2% increase in unit area
D30V/MPEG
• Multimedia applications
  – Decoding MPEG-2
D30V/MPEG Processor Specialization
• 2 way VLIW
• Dual issue RISC pipeline
• 2 way assigned SIMD module
• Pipeline has ability to re-route data through execution path
D30V/MPEG Instruction Set Specialization
• Saturate and Add
• DSP instructions built in
  – Modular addressing
  – Block repeat
  – Multiply accumulate
• Half word instructions
  – Effectively double number of useable registers
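The saturate-and-add instruction can be pictured as follows. This is a minimal sketch that assumes signed 16-bit (half word) operands; the slides do not state the operand width, and the function names are hypothetical. A wrapping add is included for comparison.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def sat_add16(a, b):
    """Add two signed 16-bit values, clamping the result instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def wrap_add16(a, b):
    """Ordinary two's-complement 16-bit add, shown for comparison."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s


print(sat_add16(30000, 10000))    # 32767
print(wrap_add16(30000, 10000))   # -25536
```

In pixel and audio arithmetic a clamped result is a small distortion, whereas a wrapped result flips sign and shows up as a visible or audible glitch. Doing the clamp in hardware removes the compare-and-branch sequence that software would otherwise need.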
D30V/MPEG Interconnect Specialization
• Chip layout specialized for decoding streaming MPEG data
D30V/MPEG Memory Specialization
• 32 Kbyte data RAM
• 64 Kbyte instruction RAM
• 4 Kbyte RAM for Variable Length Coder/Decoder (VLC/VLD) tables
• Special Registers
  – MOD_S & MOD_E for modulo addressing
  – RPT_S, RPT_E, and RPT_C for looping
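Modulo addressing with a start and end register can be sketched as below. The register names MOD_S and MOD_E come from the slide, but their exact semantics are assumed here: the sketch treats them as the inclusive first and last addresses of a circular buffer, and `next_address` is a hypothetical helper.

```python
def next_address(addr, step, mod_s, mod_e):
    """Advance addr by step, wrapping inside the inclusive range [mod_s, mod_e]."""
    size = mod_e - mod_s + 1
    return mod_s + (addr - mod_s + step) % size


MOD_S, MOD_E = 0x100, 0x103   # a four-entry circular buffer (illustrative values)

addr = MOD_S
visited = []
for _ in range(6):
    visited.append(addr)
    addr = next_address(addr, 1, MOD_S, MOD_E)

print([hex(a) for a in visited])
# ['0x100', '0x101', '0x102', '0x103', '0x100', '0x101']
```

With the wrap handled by the address unit, a filter loop can walk a circular sample buffer without spending instructions on bounds checks.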
D30V/MPEG Functional Specialization
• VLC/VLD (Variable Length Coding/Decoding) units
D30V/MPEG Low Power Usage
• 2.5 Volts at 243 MHz
• Uses 2.0 Watts
D30V/MPEG Performance
• 12 % speedup from inter-pipe bypasses
• Special VLC/VLD functional blocks speed up MPEG decoding
1.3 GOPS Parallel DSP
• Achieve real-time image processing capability
• Employ data parallelism to achieve goal
  – High level algorithms, non-parallelizable
    • Arithmetic encoding
  – Medium level algorithms, moderately parallelizable
    • Contour tracking of binary images
  – Low level algorithms, highly parallelizable
    • Filters and transforms
    • Data independent control and data flow
    • 80% of MPEG-2, 60% of MPEG-4
1.3 GOPS Parallel DSP Processor Specialization
• Central control unit
  – RISC based
  – Controls multiple SIMD units
1.3 GOPS Parallel DSP Instruction Set Specialization
• VLIW instructions
  – 3 instructions per issue
    • 1 load/store of 16-bit data
    • 2 arithmetic operations on 16/32-bit data
1.3 GOPS Parallel DSP Interconnect Specialization
• DMA/MCU (Direct Memory Access/Memory Control Unit)
  – Handles cache misses
  – Performs prefetch operations from matrix memory
  – Interfaces with external 64-bit data bus and 32-bit address bus for SRAM and DRAM modules
1.3 GOPS Parallel DSP Memory Specialization
• Memory tailored to image processing needs
  – Provides parallel high bandwidth access to shared data with matrix shaped access patterns
• Individual Cache Memory
  – Services irregular memory requests
1.3 GOPS Parallel DSP Functional Specialization
• Multiple SIMD units
  – Currently 4 units for prototype
  – 16 units planned for future versions
  – SIMD approach has been extended with ASIMD, autonomous instruction selection capability
    • Improves handling of conditional branches
1.3 GOPS Parallel DSP Low Power Usage
• 3.3 Volts
• Uses 650 milliwatts
1.3 GOPS Summary
• Sustained performance 380 MIPS
  – Around 90% utilization
SuperENC
• MPEG-2 video encoder
SuperENC Processor Specialization
• Software implemented RISC architecture
  – 5 stage pipeline
  – 81 MHz, 32 bit wide data/instruction path
• Software implemented SIMD/SDIF (SDRAM Interface) modules
SuperENC Instruction Set Specialization
• There is no instruction set specialization mentioned in the paper.
SuperENC Interconnect Specialization
• SDIF
  – All memory access goes through SDIF
  – Relays data without going to external memory
    • Reduces memory bandwidth and power consumption
SuperENC Memory Specialization
• Uses external RAM
  – Can access two 16 Mbit SDRAMs or one 64 Mbit SDRAM
SuperENC Functional Specialization
• MPEG algorithm is broken up into hardware functional blocks
  – Examples
    • DCT, Discrete Cosine Transform
    • IDCT, Inverse Discrete Cosine Transform
    • ME, Motion Estimation
    • MC, Motion Compensation
SuperENC Low Power Usage
• 2.5 Volts internal
• 3.3 Volts I/O
• 1.5 Watts
SuperENC Summary
• SuperENC makes use of many hardware functional blocks to implement the MPEG-2 encoding algorithm
Metric Results
• D30V/MPEG highest rated
Architecture            Processor  Functional Unit  Power  Instruction Set  Interconnect  Memory  Total
D30V/MPEG               5          4                1      5                5             3       23
M*CORE                  1          1                5      4                4             2       17
1.3-GOPS Parallel DSP   4          3                4      1                0             5       17
SuperENC                3          5                3      0                3             1       15
Enhanced IA-32          2          2                2      3                0             4       13
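The totals and the ranking in the results table can be reproduced by summing the six per-metric scores for each architecture. The sketch below copies the scores from the table; the variable names are illustrative only.

```python
METRICS = ["Processor", "Functional Unit", "Power",
           "Instruction Set", "Interconnect", "Memory"]

# Per-metric scores, in the order listed in METRICS.
SCORES = {
    "D30V/MPEG":             [5, 4, 1, 5, 5, 3],
    "M*CORE":                [1, 1, 5, 4, 4, 2],
    "1.3-GOPS Parallel DSP": [4, 3, 4, 1, 0, 5],
    "SuperENC":              [3, 5, 3, 0, 3, 1],
    "Enhanced IA-32":        [2, 2, 2, 3, 0, 4],
}

totals = {name: sum(scores) for name, scores in SCORES.items()}

# Sort from highest to lowest total; ties keep their table order.
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for name, total in ranking:
    print(f"{name:<22} {total}")
```

The sums match the Total column, with D30V/MPEG first at 23 and M*CORE tied with the 1.3-GOPS Parallel DSP at 17.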