Low Power Multimedia Reconfigurable Platforms
Jun-Dong Cho, SungKyunKwan Univ.
Dept. of ECE, Vada Lab. http://vada.skku.ac.kr
VLSI Algorithmic Design Automation Lab. at SKKU
What are the Challenges? [ST Microelectronics, MorphICs, Dataquest, eASIC]
[Figure: growth factor versus time in months (axis ticks 0, 10, 12, 18; factor 1, 2; a 4-year mark). Curves: communication bandwidth (Hansen’s law); integration density, 1.4/year (Moore’s law); µprocessor integration density, 1.2/year.]
Reconfigurable System
Reconfigurable systems are suitable for the dynamic application and communication environment of wireless multimedia devices such as software-defined radio (SDR).
A hierarchical system model is used in which Quality of Service and energy consumption play a crucial role.
Dynamically partition tasks of an application.
Reconfigurable SOC
As technology and supply voltage scale down, logic (transistors) is virtually free, while interconnect becomes both the bottleneck and a major power consumer.
Parallel execution of nested DO-loop algorithms by an array of localized processing elements at a moderate clock frequency is a viable solution.
It offers a compromise among three orthogonal issues: design time, power consumption, and performance.
Context: SoC and Customizable Platform-Based Design
Specifications
Processing power
Area
Power consumption
etc.
[Figure: candidate blocks (ASIC 1, ASIC 2, DSP, coarse-grain reconfigurable hardware, fine-grain reconfigurable hardware) for the open slot (?) of a platform that already contains a controller CPU, RAM, ROM, and Flash.]
We need metrics to compare!
First choose the right architecture …
[Figure (Jan Rabaey): flexibility versus area or power for embedded processors (lpArm), DSPs (e.g. TI 320CXX), reconfigurable processors (Maia), embedded FPGAs, and direct-mapped hardware. Efficiency labels on the chart: .5-5 MIPS/mW, 10-100 MOPS/mW, 100-1000 MOPS/mW; the gap across the range is a factor of 100-1000.]
Design Space of Reconfigurable Architecture
RECONFIGURABLE ARCHITECTURES (R-SOC)
FINE GRAIN (FPGA)
MULTI GRANULARITY (Heterogeneous)
COARSE GRAIN (Systolic)
Processor + Coprocessor
Tile-Based Architecture
Coarse Grain Coprocessor
Fine Grain Coprocessor
Island Topology
Hierarchical Topology
Linear Topology
Hierarchical Topology
Mesh Topology
• Chameleon • REMARC • Morphosys
• Pleiades • Garp • FIPSOC • Triscend E5 • Triscend A7 • Xilinx Virtex-II Pro • Altera Excalibur • Atmel FPSIC
• Xilinx Virtex • Xilinx Spartan • Atmel AT40K • Lattice ispXPGA
• Altera Stratix • Altera Apex • Altera Cyclone
• Systolic Ring • RaPiD • PipeRench
• DART • FPFA
• RAW • CHESS • MATRIX • KressArray • Systolix Pulsedsp
• aSoC • E-FPFA
Semiconductor Revolutions: “Mainstream Silicon Application is switching every 10 Years”
Makimoto’s Wave
[Figure: Makimoto’s Wave, swinging between standard and custom silicon at ten-year marks (1957, 1967, 1977, 1987, 1997, 2007). Wave labels: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; reconfigurable. Annotations: 1st design crisis; 2nd design crisis; hardware people, new breed (M&C); software people, new breed needed; communication gap: terminology clean-up; instruction streams; data streams; structured VLSI design.]
3 different mind sets
[Figure: the same timeline (1957, 1967, 1977, 1987, 1997, 2007) labelled TTL; LSI, MSI; µproc., memory; ASICs, accel’s; FPGAs; coarse grain; soft CPUs. Three groups are marked: hardware people, CS people, and a new breed that is needed.]
Common terminology needed
Machine paradigms
[Figure: two machine paradigms. von Neumann (instruction-stream machine): a CPU with an instruction sequencer, memory (M), and I/O, driven by an instruction stream and programmed by Software. Data-stream machine: a DPU or rDPU, or an array ((r)DPA), surrounded by memories (M), each with a data address generator (data sequencer), with data-stream I/O; programmed by Flowware and Configware. Footnoted labels: asM*, embedded memory architecture*.]
FPGA Chip vs. DSP Chip
Criterion | FPGA chip | DSP chip
Programming language | VHDL, Verilog | C, assembly language
Ease of software programming | Fairly easy, but the programmer needs to understand the hardware architecture before programming | Easy
Performance | Can be very fast if an appropriate architecture is designed | Speed is limited by the clock speed of the DSP chip
Reconfigurability | SRAM-type FPGAs can be reconfigured an unlimited number of times | Can be reconfigured by changing the program memory contents
Reconfiguration method | Downloading configuration data to the chip electronically | Reading a program at a memory address
Area (applications) | FIR filter, IIR filter, correlator, convolver, FFT | A signal processing program
Power consumption | Can be minimized if the circuit is designed to save power | Power consumption does not change
Speed of MAC | Can be fast if a parallel algorithm is used | Limited by the speed of the DSP chip
Parallelism | Can be parallelized to achieve high performance | DSP chip programming is usually sequential
Architecture Choices for Real-time Embedded System
Greg Delagi, TI
Fine-Grained RSOCs: Xilinx Virtex-II Pro
Xilinx, Inc., San Jose, CA
Up to 4 PowerPC 405 processor cores
Up to 160k reconfigurable logic cells (4-input, 1-output lookup table)
Up to 216 dedicated 18-bit x 18-bit multipliers
Up to 216 18-kbit on-chip distributed memory blocks
Up to 852 I/O pins
www.xilinx.com
Xilinx’s Xtreme
Fine-Grained RSOCs: Altera Excalibur
Altera, San Jose, CA
32-bit ARM9 Based Microprocessor @200 MHz
Up to 256kbytes SRAM
Up to 1M programmable logic gates
200 MHz Bus
Built-in SDRAM Controller
Fine-Grained RSOCs: Triscend A7 CSOC
A7 Family, Triscend
32-bit ARM7 with 8 kB cache
3200 logic cells max. (40K gates)
Up to 3800 flip-flops
Up to 300 programmable I/O pins
www.triscend.com
Chameleon Structure: Coarse-Grained RSOCs
Chameleon Systems Inc.
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters
32-bit ARC control processor
Up to 84 32-bit Datapath Units (DPU)
  DPU = a 32-bit ALU + a 32-bit barrel shifter
Up to 24 16x24-bit multipliers
Up to 48 128x32-bit local memory modules
Up to 160 programmable I/O pins
Targeted at 3rd-generation wireless base stations, wireless local loop, SW radio, etc.
www.chameleonsystems.com
Architectural Rationale and Motivation
Configurable processors have shown orders of magnitude performance improvements
Tensilica has shown ~2x to ~50x performance improvements
  Specialized functional units
  Memory configurations
Tensilica matches the architecture with software development tools
[Figure: a base processor (FU, register file, memory, I-cache) next to a configured one with additional FUs plus DCT and Huffman (HUF) blocks. Configuration: set memory parameters; add DCT and Huffman blocks for a JPEG app.]
Scott Weber, University of California at Berkeley
Architectural Rationale and Motivation
In order to continue this performance improvement trend
  Architectural features which exploit more concurrency are required
  Heterogeneous configurations need to be made possible
  Software development tools support new configuration options
[Figure: four panels. A configured processor with several FUs, DCT, and HUF blocks: “...begins to look like a VLIW...”. An array of PEs: “...concurrent processes are required in order to continue performance improvement trend...”. A regular mesh of PEs: “...generic mesh may not suit the application’s topology...”. An irregular PE network: “...configurable VLIW PEs and network topology...”.]
IXP1200 Network Processors
Six micro-engines
Support 24 contexts
Hash instructions
StrongArm core
Bus and memory controllers
Example of an architecture we want to be able to configure to
[Figure: IXP1200 Network Processor (Intel) block diagram: six MicroEng blocks, SA Core with ICache, DCache, and Mini DCache, SDRAM Ctrl, SRAM Ctrl, PCI Interface, IX Bus Interface, Hash Engine, ScratchPad SRAM.]
Architecture Goals
Provide template for the exploration of a range of architectures
Retarget compiler and simulator to the architecture
Enable compiler to exploit the architecture
Concurrency
  Multiple instructions per processing element
  Multiple threads per and across processing elements
  Multiple processes per and across processing elements
Support for efficient computation
  Special-purpose functional units, intelligent memory, processing elements
Support for efficient communication
  Configurable network topology
  Combined shared memory and message passing
Architecture Template
Prototyping template for array of processing elements
  Configure processing element for efficient computation
  Configure memory elements for efficient retiming
  Configure the network topology for efficient communication
[Figure: a base PE (FUs, register file, memory, I-cache) refined in three steps: “...configure PE...” (more FUs plus DCT and HUF units), “...configure memory elements...”, and “...configure PEs and network to match the application...”.]
Architecture Template
Templates provide prototyping platform for constrained refinement
Estimators feed back system performance and guide configuration
System designer refines configuration or the process is automated
Refined elements have a compatible interface in the system
[Figure: tool flow linking the programmer’s model, the uArch designer, the generated (gen) compiler and simulator, .o files, and estimation.]
Synthesis of Architectures
Not inventing new architectures
We are providing a tool for the prototyping and synthesis of a family of architectures
  Gives a micro-architecture, ISA, compiler, and simulator
  Refine within an instance to improve characteristics of the design
Most existing architectures are a point in the architecture spectrum
We want to allow a wide range of architectures to be realized
  Each coupled with supporting software development tools
Initial Processing Element
VLIW class architecture
  HPL-PD architecture
  Exploit ILP
Malleable elements
  Memory size
  Cache size
  Register file size
  Number of functional units
  Specialized functional units
[Figure: processing element with four FUs and one specialized FU (SFU), a register file, a memory system, and an instruction cache.]
Future Processing Element
Specialized memory systems for efficient memory utilization
  Multi-ported, banked, levels, and intelligent memory
Split register file allows greater register bandwidth to FUs
  Groups of functional units have dedicated register files
Sticky state for specialized FUs saves register file reads and writes
Multiple contexts for a processing element provide latency tolerance
  Hardware for efficient context switching to fill empty instruction slots
Specialized functional units and processing elements
  SIMD instructions
  Re-configurable fabrics for bit-level operations
  Re-use IP blocks for more efficient computation
  Custom hardware for the highest performance
Initial Distributed Architecture
Array of concurrent PEs and supporting network
Malleable network topology
  Topology matches application
  Efficient communication
[Figure: two arrays of PEs with different interconnect topologies.]
Initial Distributed Architecture
Array of concurrent PEs and supporting network
Malleable network and PEs
  Topology matches application
  Refine to meet system constraints
Memory organized around a PE
  Each PE has physical memory
  Message passing between PEs
[Figure: two arrays of PEs, the second with a PE removed to fit the application.]
Future Distributed Architecture
Multiple processing elements share a memory space
  Shared memory communication
  Snooping cache coherency protocol
  Directory-based protocol required if the number of PEs in a shared memory space is large
Introspective processing elements
  Use processing elements to analyze the computation or communication
  Identify dynamic bottlenecks and remove them on the fly
  Reschedule and bind tasks as the introspective elements report
Communication Models
Shared memory
  Hardware handles loads and stores from PEs to a common memory
  Synchronization is separate from communication
  Interacting threads on a single processing element or a group of processing elements
Message passing
  Hardware to send and receive messages and invoke a handler
  Synchronization and communication are together
  Interacting processes between single processing elements or groups of processing elements
Memory Model
Relax the consistency model
  Hardware implements lock and unlock mutex instructions
  Synchronization instructions inserted in program
  Loads and stores before a lock must complete before loads and stores after the lock are started
Relaxes the ordering of reads and writes in order to increase memory utilization
Compiler is constrained on reordering around synchronization barriers
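The lock/unlock discipline above can be illustrated in software. This is only an analogy, not the MESCAL hardware: Python's `threading.Lock` stands in for the lock and unlock mutex instructions, and the names `SharedCounter` and `run_workers` are made up for the illustration.

```python
import threading


class SharedCounter:
    """A shared memory location guarded by a mutex (illustrative only)."""

    def __init__(self):
        self.value = 0
        self._mutex = threading.Lock()  # stands in for hardware lock/unlock

    def increment(self):
        # Acquiring the lock is the synchronization point: accesses issued
        # before it are complete before the critical section starts.
        with self._mutex:
            self.value += 1  # load, add, store inside the critical section
        # Leaving the with-block releases the lock (the "unlock").


def run_workers(num_threads=4, iterations=1000):
    """Run several threads that all update the same counter."""
    counter = SharedCounter()

    def worker():
        for _ in range(iterations):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Accesses outside the critical section carry no ordering guarantee in this sketch, which is the freedom a relaxed consistency model leaves to the hardware and the compiler.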
Range of Architectures
Scalar Configuration
EPIC Configuration
EPIC with special FUs
Mesh of HPL-PD PEs
Customized PEs, network
Supports a family of architectures
  Plan to extend the family with the micro-architectural features presented
[Figures, one per build of this slide: a scalar PE (one FU, register file, memory system, instruction cache); an EPIC PE with several FUs; an EPIC PE with special FUs (FFT, DCT, DES); a mesh of PEs; a customized PE network.]
Range of Architectures (Future)
Template support for such an architecture
Prototype architecture
Software development tools generated
Generate compiler
Generate simulator
[Figure: IXP1200 Network Processor (Intel) block diagram: six MicroEng blocks, SA Core with ICache, DCache, and Mini DCache, SDRAM Ctrl, SRAM Ctrl, PCI Interface, IX Bus Interface, Hash Engine, ScratchPad SRAM.]
The Research Playground
[Figure: the research playground. Labels: Application; Algorithm; Software Implementation; Compilation and SW Environment; What is the Programmer’s Model?; Architecture; Microarchitecture; Component Assembly and Synthesis; Verification and Manufacture Test.]
Mescal Compiler
Manish Vachharajani, Princeton University
Outline
Compiler goals
Compiler research issues
Compiler infrastructure requirements
Trimaran 2.0 compiler infrastructure
Ongoing work
Summary
So What’s Different?
General purpose compiler hand tuned to:
  SPEC benchmarks
  A particular general purpose machine
Need compiler tuned to:
  Specific application
  A particular application specific machine
And…
  Meet code density, real-time, and power constraints
  Do this automatically for a range of applications/architectures
So What’s Different?
Traditional application hw/sw design requires
  Hand selection of traditional general purpose OS components
  Hand written customization of
    device drivers
    memory management
    …
Instead…
  Application specific synthesis of traditional OS components
    scheduling
    synchronization
    …
  Automatic synthesis of hardware specific code from specifications
    device drivers
    memory management
    …
Compiler Goals
Develop a retargetable compiler infrastructure that enables a set of interesting applications to be efficiently mapped onto a family of fully programmable architectures and microarchitectures.
10 Year Vision:
  Will have fully automatically-retargetable compilation, OS synthesis, and simulation for a class of architectures consisting of multiple heterogeneous processing elements with specialized functional units / memories
  Compiled code size and performance will be within 10% of hand-coding
Compiler Research Issues
Synthesis of RTOS elements in the compiler
  On the application side: Generation of an efficient application-specific static/run-time scheduler and synchronization
  On the hardware side: Generation of device drivers, memory management primitives, etc. using hardware specifications
Automatic retargetability for family of target architectures while preserving aggressive optimization
Automatic application partitioning
  Mapping of process/task-level concurrency onto multiple PEs using programmer guidance in programmer’s model
Effective visualization for family of target architectures
Compiler Infrastructure Requirements
High level of usability
  good documentation, well coded
Large suite of machine-independent code optimizations
Significant level of retargetability
Strong support for instruction-level parallelism
Support for memory as a first-class citizen
Simulation tools
Preferably
  visualization tools
  a good support team
Trimaran 2.0 Compiler Overview
IMPACT/ELCOR features strong VLIW data structure and algorithm support
Data structures
  basic, hyper, super blocks
  loop analysis
  procedure analysis
  miscellaneous, e.g. lists, sets
Algorithms
  if-conversion
  software pipelining
  scheduling/register allocation
[Figure: Trimaran flow from C through the IMPACT front-end and the ELCOR back-end (driven by MDES) to the simulator and visualization tools. Contributing groups: U. of Illinois IMPACT Group, HP Labs CAR Group, NYU ReaCT-ILP Group.]
www.trimaran.org
Trimaran 2.0 Overview: Simulator and Visualization Tools
Cycle-level simulator easily extensible to support new specialized operations
  Simply augment table specifying operation semantics
Visualization tools visualize assortment of useful static / dynamic information
  Instruction schedule
  Data-dependency graphs
  Total cycles per function / region
  Percentage of total function operations that are branches, loads, stores, integer ALU, floating-point ALU, etc.
Trimaran 2.0 Overview: Machine Description (MDES)
Target specified in high-level machine-description language
  Translated into low-level language
ELCOR supports Playdoh
  Parameterized non-clustered VLIW architecture
  Support for speculative/predicated execution, software pipelining
User may modify the following Playdoh parameters:
  number of registers
  number of integer, floating-point, memory, branch FUs
  operation latencies
[Figure: TRIMARAN flow (C, IMPACT front-end, ELCOR back-end, simulator and visualization) with a high-level PlayDoh MDES translated into a low-level PlayDoh MDES.]
Extensions to Trimaran 2.0: Support for Multiple PEs
ELCOR does not provide MDES and data structure support for multiple Playdoh PEs
New MDES format has been devised to support multiple PEs with varying connectivity
Array of MDES data structures maintained, one per PE
Each code region must be associated with an MDES PE prior to code generation
Communication channels between PEs currently not modeled
MESCAL Machine Description
  PE1: machine description
  PE2: machine description
  ...
  PEm: machine description
  Channel1: from PE1 to PE2
  Channel2: from PE1 to PE3
  ...
  Channeln: from PEi to PEj
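A minimal sketch of how such a multi-PE machine description could be held in memory: one description per PE, a list of directed channels, and a binding from code regions to PEs. The class names, field names, and parameter values below are hypothetical and are not the MESCAL MDES format; channels are recorded as connectivity only, matching the note that they are not yet modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PEDescription:
    """Per-PE parameters (hypothetical subset of a Playdoh-style MDES)."""
    name: str
    num_registers: int
    num_int_fus: int
    num_mem_fus: int


@dataclass
class MachineDescription:
    """Array of PE descriptions plus directed channels between PEs."""
    pes: Dict[str, PEDescription] = field(default_factory=dict)
    channels: List[Tuple[str, str]] = field(default_factory=list)
    region_to_pe: Dict[str, str] = field(default_factory=dict)

    def add_pe(self, pe: PEDescription) -> None:
        self.pes[pe.name] = pe

    def add_channel(self, src: str, dst: str) -> None:
        # Only connectivity is stored; latency and bandwidth are not modeled.
        if src not in self.pes or dst not in self.pes:
            raise KeyError("channel endpoints must be known PEs")
        self.channels.append((src, dst))

    def bind_region(self, region: str, pe_name: str) -> None:
        # Each code region is associated with one PE before code generation.
        if pe_name not in self.pes:
            raise KeyError(f"unknown PE: {pe_name}")
        self.region_to_pe[region] = pe_name

    def mdes_for(self, region: str) -> PEDescription:
        """Return the PE description the code generator should use."""
        return self.pes[self.region_to_pe[region]]


def example_machine() -> MachineDescription:
    m = MachineDescription()
    m.add_pe(PEDescription("PE1", num_registers=64, num_int_fus=4, num_mem_fus=2))
    m.add_pe(PEDescription("PE2", num_registers=32, num_int_fus=2, num_mem_fus=1))
    m.add_channel("PE1", "PE2")
    m.bind_region("control_loop", "PE1")
    m.bind_region("fir_loop", "PE2")
    return m
```

A code generator would call `mdes_for(region)` to pick the register count and FU mix for each region.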
Extensions to Trimaran 2.0: Support for Specialized FUs and Operations
ELCOR lacks support for specialized FUs and operations
MESCAL supports specialized FUs and operations via function intrinsics, which are translated into special operations.
Special operations only require a map from the intrinsic for implementation.
[Figure: the intrinsic x = NORM(y) maps to the assembly operation NORM B, executed by normalization hardware.]
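The intrinsic-to-operation map can be sketched as a lookup table. Everything here is illustrative: the table `INTRINSICS`, the helper `lower_call`, and the emitted text format are hypothetical, and the slides do not define what NORM computes. The reference function assumes a common DSP meaning, the number of redundant sign bits in a 32-bit value.

```python
# Intrinsic name -> special operation mnemonic (hypothetical table).
INTRINSICS = {"NORM": "NORM"}


def lower_call(dest: str, func: str, arg: str) -> str:
    """Emit one pseudo-assembly line for `dest = func(arg)`."""
    if func in INTRINSICS:
        # An intrinsic needs only the map entry: it becomes a single
        # special operation instead of a function call.
        return f"{INTRINSICS[func]} {dest}, {arg}"
    return f"CALL {func}, {arg} -> {dest}"


def norm32(y: int) -> int:
    """Software reference for NORM under the assumed semantics: the number
    of left shifts that normalize a 32-bit two's-complement value."""
    y &= 0xFFFFFFFF
    if y & 0x80000000:
        y = ~y & 0xFFFFFFFF  # for negative values, count leading ones
    # Leading zeros minus one, because the sign bit itself is not redundant.
    return 31 - y.bit_length()
```

A software reference like `norm32` is the kind of entry a simulator's operation-semantics table would need for the new operation.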
Mescal Compiler Framework
MESCAL source code layer exists on top of ELCOR
All Trimaran source code needing modification is copied over to the MESCAL layer
MESCAL source code is compatible with future Trimaran releases
[Figure: TRIMARAN flow (C, IMPACT front-end, ELCOR back-end with ELCOR MDES, simulator and visualization) with the MESCAL layer and MESCAL MDES on top of ELCOR.]
What Do You Get?
Mescal Compiler will feature:
  Automatic retargetability for architectures consisting of multiple heterogeneous PEs and a configurable communication topology
  Mapping of coarse-grain parallelism onto multiple PEs via guidance from programmer’s model
  Programmer’s model will allow code-generation with size and performance comparable to hand-coding
  Synthesis of RTOS elements and synchronization that are tuned to the application
[Figure: Mescal Compiler. Inputs: application code in the programmer’s model and a hardware description. Stages: RTOS synthesis, compiler front end, compiler back end. Output: system code.]
Ongoing Work
Automatic device driver synthesis from a system specification
  Xiaoling Xu, Minxi Gao: UCB
Support for additional classes of processors (e.g. DSPs) within Mescal framework
  Subbu Rajagopalan: Princeton
  involves adding support for memory as a first-class citizen
Tuning of front/back end code optimizations, based on application and micro-architectural characteristics
  Manish Vachharajani: Princeton
Automatic synthesis of RTOS elements in compiler
  Shaojie Wang: Princeton
Dynamically-reconfigurable computing for systems-on-a-chip
  Zhining Huang: Princeton
MESCAL compiler overview
  Niraj Shah, Michael Shilman: UCB
The Research Playground
[Figure: the research playground. Labels: Application; Algorithm; Software Implementation; Compilation; What is the Programmer’s Model?; Architecture; Microarchitecture; Component Assembly and Synthesis; Verification and Manufacture Test.]
MESCAL Programmer’s Model
Niraj Shah, University of California at Berkeley
Outline
Motivation
Goals
Our Approach
Initial Model
Ongoing Research
Motivation
Silicon integration is allowing for high micro-architectural complexity on a die (e.g. Intel IXP1200)
  multiple processors
  specialized execution units
  hardware context swap
[Figure: Intel IXP1200 block diagram: StrongArm core with I-cache, D-cache, and mini D-cache; six engines; SDRAM controller; SRAM controller; PCI interface; IX bus interface; hash engine; scratchpad SRAM.]
Circuit architects are designing more complex devices
How do we program these architectures?
Example: C Language
C compilers of the early 70’s were not good, but C became the standard for writing efficient code.
C provided an abstraction (programmer’s model) of standard processors that allowed programmers to write efficient code
They found the 20% of the assembler capability that captures 80% of program efficiency:
  register keyword
  pointer arithmetic
  bit-level operations
Goals
Capture the 20% of architectural features of new architectural platforms to get 80% of the performance
  Concurrency
    processor level
    functional unit level
    bit level
  Memory
    useful characteristics of specialized memories
    address generation units
Present the programmer with an abstraction of the architecture while giving them the power to write efficient code
Our Approach
Combine bottom-up and top-down views
Bottom-up: create an abstraction of the architecture
  visibility - sufficient detail of the architecture to allow the programmer to improve the efficiency of the program
  opacity - hide micro-architectural details from programmer
Top-down: expressive enough for the programmer to relay all the information he/she knows about the program to the compiler
Bottom-Up View
Visible
  Specialized hardware
    FU’s
    PE’s
  Communication
    explicit message passing
    shared address space
Opaque
  Micro-architectural features
    pipelines
    cache details
[Figure: a PE with four FUs and an SFU, and a network of PEs.]
Top-Down View
Parallelism at different levels
  Process level - communicate via message passing
  Task/thread level - communicate via shared memory
OS capabilities
  Scheduling
  Binding
  Synchronization
Initial Programmer’s Model
Start with C
View specialized FU’s through intrinsics (e.g. normalization)
Model process level concurrency through a hybrid communication model
  Processes - subset of Message Passing Interface (MPI)
  Threads - shared memory
[Figure: the intrinsic x = NORM(y) maps to the assembly operation NORM B.]
Message Passing Interface (MPI)
A standard interface for communication on multiprocessor systems
Messages are passed between processes, which the user must specify
“Push” style communication – sender specifies data rate
Types of Communication
  Blocking: stall until send/receive buffer can be used
  Non-blocking: allows overlap of computation and communication
Simulator included
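The difference between blocking and non-blocking communication can be sketched without an MPI installation. The code below is not MPI: it uses Python threads and a bounded `queue.Queue` as a stand-in channel, and the names `Channel`, `send`, `isend`, `recv`, `pipeline`, and `overlapped` are chosen for the illustration only.

```python
import queue
import threading


class Channel:
    """A bounded point-to-point channel between two 'processes' (threads)."""

    def __init__(self, capacity=1):
        self._q = queue.Queue(maxsize=capacity)

    def send(self, msg):
        # Blocking send: stalls until the buffer has room.
        self._q.put(msg)

    def isend(self, msg):
        # Non-blocking send: returns a handle at once; join() it later.
        handle = threading.Thread(target=self._q.put, args=(msg,))
        handle.start()
        return handle

    def recv(self):
        # Blocking receive: stalls until a message is available.
        return self._q.get()


def pipeline(items):
    """Producer pushes items; consumer squares them (blocking style)."""
    ch, out = Channel(capacity=1), []

    def producer():
        for x in items:
            ch.send(x)
        ch.send(None)  # end-of-stream marker

    def consumer():
        while (msg := ch.recv()) is not None:
            out.append(msg * msg)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


def overlapped(values):
    """Start a send, compute while it is in flight, then wait for it."""
    ch = Channel(capacity=1)
    request = ch.isend(sum(values))  # communication starts
    local_max = max(values)          # computation overlaps with the send
    request.join()                   # wait for the send to complete
    return ch.recv(), local_max
```

In `pipeline` the producer stalls whenever the one-slot buffer is full, which is the blocking case; `overlapped` shows the start, compute, wait pattern that non-blocking calls allow.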
Ongoing Research
The programmer’s model is the Holy Grail of the MESCAL project
Right abstraction for memory
Incorporate bit level concurrency
Compiler for Intel IXP1200 - test initial programmer’s model
The Research Playground
[Figure: the research playground. Labels: Application; Algorithm; Software Implementation; Compilation; What is the Programmer’s Model?; Architecture; Microarchitecture; Component Assembly and Synthesis; Verification and Manufacture Test.]
Scalable Self-Test for Designs with Embedded Programmable Components
Tim Cheng, University of California, Santa Barbara
Test and diagnosis are applications of a highly programmable system!!
Goals
Reuse of on-chip programmable components for test
  Processor/DSP/FPGA cores for on-chip test generation, measurement, response analysis and even diagnosis
Self-test a processor/DSP using its instruction set for high structural fault coverage
Use the tested processor/DSP to test buses, interfaces and other components, including analog and mixed-signal components
Extend for self-diagnosis
Functional Self-Test for Structural Faults - Motivation
At-speed testing of GHz IC’s increasingly difficult with external testers
  Growing gap between IC and tester performance
  Growing cost of high performance testers
  Increasing yield loss caused by inherent tester inaccuracy
Self-testing using instructions enables natural application of at-speed test of GHz processors and SoC’s
Potential advantages over structural BIST (such as scan-based BIST) include: area, performance, design time, power consumption during test
Functional Self-Test vs. Structural BIST
Good understanding of the capability and limitations of functional self-test could support further new development of hybrid solutions combining strengths of functional and structural self-test
Lesson from memory self-test: from functional, to structural, now back to functional self-test
Logic self-test?
Initial Projects on Processor Functional Self-Test
Self-Testing of Embedded Processor Cores and SoC (UCSD)
  Delivering deterministic tests using instruction set
  Automatic synthesis of programs for:
    on-chip test generation (constraint-aware software LFSR)
    test pattern delivery
    test response analysis
Self-Testing of Processor Cores for Delay Faults (UCSB)
  Automatic synthesis of test programs for path delay faults
  Applying deterministic delay tests by execution of test program
  Tests generated by integrated process combining structural ATPG and instruction-level ATPG
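A software LFSR of the kind mentioned for on-chip test generation can be sketched in a few lines. This is not the UCSD tool: the 16-bit width, the tap positions (a widely used maximal-length choice, x^16 + x^14 + x^13 + x^11 + 1), and the `allowed` predicate that models a constraint on patterns are all assumptions made for the illustration.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def constrained_patterns(seed: int, count: int, allowed=lambda p: True):
    """Return up to `count` pseudo-random 16-bit patterns accepted by `allowed`.

    `allowed` stands in for a constraint that the instruction set imposes
    on usable test patterns; patterns that violate it are skipped.
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an LFSR seed must be non-zero")
    patterns = []
    for _ in range(0xFFFF):  # at most one full period
        state = lfsr16_step(state)
        if allowed(state):
            patterns.append(state)
            if len(patterns) == count:
                break
    return patterns
```

For example, `constrained_patterns(0xACE1, 8, lambda p: p & 1 == 0)` keeps only even patterns, as a toy stand-in for an operand-encoding constraint.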
Self-Test for Embedded Processor Components
[Figure: external tester, CPU, instruction memory, and data memory connected by the processor bus. Programs shown: on-chip test generation program, test pattern delivery program, test response analysis program. Data shown: test patterns, test response, response signature, and the self-test signature.]
Functional Self-Testing of Processor Cores for Path Delay Faults
[Figure: flow from the instruction set architecture, µ-architecture, and netlist through automatic constraint extraction (spatial and temporal constraints between registers and control signals), path classification, constrained structural ATPG, and test program synthesis to the test program.]
Some structurally testable paths are not functionally testable by instructions
Identifying functionally testable paths
Vector generation for functionally testable paths
Mapping test vectors to instruction sequences
Path Classification: DLX - A 32-bit RISC Processor
[Figure: path counts and testability for the DLX datapath and controller. Labels, in source order: No. of paths ~430K; datapath; No. of paths ~18K; controller; structurally testable ~97%; structurally testable ~51%; functionally testable ~40%; functionally testable ~46%.]
Automatic identification of paths testable by instructions
Structurally testable but functionally untestable paths need not be tested.
Self-Test for Analog and Mixed-Signal Components in Highly Programmable Systems
Reuse on-chip digital programmable components and A/D and/or D/A converters for test signal generation, on-chip measurement and response analysis for analog/mixed signal components
  To relieve the need for expensive mixed-signal testers
  To avoid noisy external measurement
  To provide maximum flexibility for customized/optimized self-test solutions for different types of analog components
Analog/Mixed-Signal Self-Test Approaches
DSP-based analog self-test
  Targeting systems with both DAC and ADC
Pulse-Density-Modulation-based analog self-test
  Targeting systems without an ADC and/or a DAC
VLSI Algorithmic Design Automation Lab. at SKKU78
DSP-Based Self-Testing
[Figure: D/A, analog component under test, A/D, DSP/programmable components, synchronization]
Test signal:
  digitized sinusoid
  digitized multi-tone
  pseudo random
Response analysis:
  FFT
  IEEE 1057 sinewave fitting
  cross-correlation
  auto-correlation
Pros
  more efficient
  single setup for multiple types of tests
Cons
  limited measurement resolution (improving)
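As a rough illustration of the sinewave-fitting style of response analysis listed above, the sketch below fits amplitude, phase and offset of a known-frequency tone to digitized samples. It is a simplified three-parameter least-squares fit that assumes the record holds a whole number of periods, so the fit reduces to correlations. The function name and the test values are hypothetical, and this is not the full IEEE 1057 procedure.

```python
import math

def fit_sine_known_freq(samples, cycles):
    """Least-squares fit of y[k] = a*cos(w k) + b*sin(w k) + c.

    Assumes `cycles` is a whole number of periods in the record and
    0 < cycles < len(samples) / 2, so the basis functions are orthogonal
    and the fit reduces to simple correlations.
    """
    n = len(samples)
    w = 2.0 * math.pi * cycles / n
    a = 2.0 / n * sum(y * math.cos(w * k) for k, y in enumerate(samples))
    b = 2.0 / n * sum(y * math.sin(w * k) for k, y in enumerate(samples))
    c = sum(samples) / n
    amplitude = math.hypot(a, b)
    phase = math.atan2(-b, a)  # y = amplitude * cos(w k + phase) + c
    return amplitude, phase, c

# Hypothetical digitized response: 64 samples covering 4 periods.
N, CYCLES = 64, 4
response = [0.5 + 1.2 * math.cos(2 * math.pi * CYCLES * k / N + 0.3)
            for k in range(N)]
amplitude, phase, offset = fit_sine_known_freq(response, CYCLES)
```

A pass/fail decision could then compare the fitted amplitude and offset against the specification limits of the component under test.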
VLSI Algorithmic Design Automation Lab. at SKKU79
Pulse-Density-Modulation-Based Self-Test
Targeting designs without a DAC and/or an ADC
Use simple yet highly tolerant DA & AD conversion techniques
Use DSP techniques for test synthesis and response analysis
Excellent flexibility
[Figure: test synthesis (software 1-bit modulator, bit stream ..0101... stored in memory as test stimulus, 1-bit DAC & low-pass filter, analog CUT) and response analysis (analog CUT, 1-bit modulator, bit stream ..0101..., DSP, spec., pass/fail); regions labeled SOC and ATE]
VLSI Algorithmic Design Automation Lab. at SKKU80
PDM-Based Analog Self-Test: Current Status
A general self-test architecture for mixed-signal systems
Use modulation principle for stimulus generation and signal acquisition
Characterization and calibration of 1-bit first-order modulator for on-chip signal analysis
For compensating the error caused by the imperfections associated with the modulator
A self-test scheme for testing on-chip ADC and DAC
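The modulation principle mentioned above can be sketched with a first-order 1-bit modulator in software. This is only a minimal model under stated assumptions: the input is normalized to the range (-1, 1), the low-pass filter is replaced by a plain average, and the function names are hypothetical.

```python
def pdm_encode(samples):
    """First-order 1-bit modulator: integrate the error, quantize to 1 bit."""
    integrator = 0.0
    feedback = 0.0  # previous output mapped to -1.0 / +1.0
    bits = []
    for x in samples:
        integrator += x - feedback
        bit = 1 if integrator >= 0.0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits

def pdm_decode_mean(bits):
    """Crude low-pass filter: the pulse density gives the average input level."""
    return sum(1.0 if b else -1.0 for b in bits) / len(bits)

# Encode a constant level of 0.25 and recover it from the bit stream.
stream = pdm_encode([0.25] * 4000)
recovered = pdm_decode_mean(stream)
```

The recovered level approaches the input as the stream gets longer, which is why a simple 1-bit DAC followed by a low-pass filter can serve as the stimulus source.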
VLSI Algorithmic Design Automation Lab. at SKKU81
Directions for the Next Three Years
Processor self-test and self-diagnosis
  Adding new “test instructions” to aid self-test and self-diagnosis
  Test program synthesis for response analysis and self-diagnosis
Analog/mixed signal self-test
  Hardware validation of proposed PDM-based schemes
  High-frequency applications
  Defect-oriented test synthesis and response analysis
Full-chip self-test using self-tested processors
  Testing buses, interfaces and other digital components
  Reconfiguration of bus arbiters and communication protocols for test delivery
VLSI Algorithmic Design Automation Lab. at SKKU82
The Research Playground
[Figure: Component Assembly and Synthesis; Microarchitecture; Architecture; Verification and Manufacture Test; What is the Programmer’s Model?; Algorithm; Software Implementation; Compilation; Application]
Functional Verification for a Family of Microarchitectures
Serdar Tasiran, University of California at Berkeley
VLSI Algorithmic Design Automation Lab. at SKKU84
Outline
Verification goal
State-of-the-art in processor verification
Our strategy
  Rationale
  Implementation
Current projects
Future extensions
  Three year goals
VLSI Algorithmic Design Automation Lab. at SKKU85
Verification Goal
Develop comprehensive, focused functional verification support for identified microarchitectural family
Ideally, the verification approach…
…must be adaptable: must not require new theory and tools for
  different configurations
  different environments for design
  different verification requirements (cache coherence, consistency with programmer’s model, …)
…must lend itself to incremental changes in design
…must degrade gracefully
VLSI Algorithmic Design Automation Lab. at SKKU86
Processor Verification: State-of-the-art
Heated research activity on verification of pipelines, superscalar processors, out-of-order and speculative execution.
Datapath abstraction
  Reduce width
  Symbolic representations (e.g. multiway decision graphs)
Symbolic simulation
Theorem proving
  Verifying functional units (ALUs, FPUs, etc.)
Compositional (assume-guarantee) reasoning
  Divide verification problem into pieces
  Can use a variety of methods for each piece
Reduce problem to equivalence checking of formulas
  Propositional logic with uninterpreted functions and predicates
VLSI Algorithmic Design Automation Lab. at SKKU87
Processor Verification: State-of-the-art
Formal verification valuable when applicable, but
  each technique addresses only one aspect of the problem
  verification expertise required from designer
  methods not incremental or adaptable
  difficult to use in large design groups
  capacity much short of current processor complexity
Validation relies heavily on simulation
  Even more likely to be the case for complex, highly programmable systems
VLSI Algorithmic Design Automation Lab. at SKKU88
Our Verification Strategy
Validation of complex, highly programmable systems will require semi-formal methods
The natural way to verify these systems is to simulate and debug
Practical goal: Make “optimal” use of simulation resources
IDEAL: Comprehensive validation with minimal redundant effort
OUR APPROACH: Use coverage analysis to guide verification
  Identify good verification coverage metrics
  Develop corresponding vector generation methodology
VLSI Algorithmic Design Automation Lab. at SKKU89
Validation using Simulation: Current Picture
[Figure: simulation driver, simulation engine, monitors]
SHORTCOMINGS:
Vector generation
  Manual: A lot of user effort, ad hoc
  Random: Little control over what gets exercised
Quantifying comprehensiveness
  Low bug detection rate is the main criterion
  Likely interpretation: Not generating quality vectors any more.
[Figure: bugs per week versus weeks of functional testing, with “Tapeout” and “Purgatory” marked. Courtesy Prof. Dill]
VLSI Algorithmic Design Automation Lab. at SKKU90
Verification Using Intelligent Simulation
[Figure: simulation driver, simulation engine, monitors, symbolic simulation, coverage analysis, diagnosis of non-verified portions, vector generation; legend: conventional, novel]
VLSI Algorithmic Design Automation Lab. at SKKU91
Verification Using Intelligent Simulation – Rationale
[Figure: simulation driver, simulation engine, monitors, symbolic simulation, coverage analysis, diagnosis of non-verified portions, vector generation; legend: conventional, novel]
Need formal means to:
  Gauge status and progress of verification
  Automate generation of quality vectors
VLSI Algorithmic Design Automation Lab. at SKKU92
Coverage Analysis – Why?
What aspects of design haven’t been exercised?
  Guides vector generation
How comprehensive is the verification so far?
  A heuristic stopping criterion
Coordinate and compare
  Separate sets of simulation runs
  Model checking, symbolic simulation, …
Helps allocate verification resources
[Figure: simulation driver, simulation engine, monitors, symbolic simulation, coverage analysis, diagnosis of unverified portions, vector generation; legend: conventional, novel]
VLSI Algorithmic Design Automation Lab. at SKKU93
Observability and Coverage Analysis
Portion of design covered only when
  it is exercised (controllability)
  a discrepancy originating there causes discrepancy in a monitored variable (observability)
We initially focus on tag coverage [Devadas, Keutzer, Ghosh ’96]
  Code coverage metrics + observability requirement.
  All other verification metrics overlook observability
Tag coverage: Bugs modeled as errors in assignments.
A buggy assignment may be stimulated, but still missed
  Wrong value generated speculatively, but never used.
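The difference between exercising an assignment and observing its effect can be shown with a toy model. The sketch below is only an illustration of the idea, not the tool from the cited work: the design, the injected error and all names are hypothetical. A tag counts as covered only if the perturbed assignment changes a monitored output.

```python
def toy_design(a, b, use_speculative, inject_tag=False):
    """Toy design: a speculative sum is always computed but only sometimes used."""
    # Tagged assignment: the tag models a bug as an error in this value.
    speculative = a + b + (1 if inject_tag else 0)
    # Monitored output: the speculative value is used only when selected.
    return speculative if use_speculative else a

def tag_covered(vector):
    """A tag is covered when the error propagates to the monitored output."""
    a, b, use_speculative = vector
    good = toy_design(a, b, use_speculative, inject_tag=False)
    bad = toy_design(a, b, use_speculative, inject_tag=True)
    return good != bad

# Both vectors execute the tagged assignment (controllability) ...
masked = tag_covered((1, 2, False))    # ... but here the value is never used
observed = tag_covered((1, 2, True))   # ... and here it reaches the output
```

Plain statement coverage would report the assignment as covered by both vectors, while the observability requirement accepts only the second one.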
VLSI Algorithmic Design Automation Lab. at SKKU94
Biased-Random Vector Generation - Rationale
Vector generation methods trade off between
  Time to find “good” vectors
  Time to simulate vectors
Typically > 50% of time spent on biased random simulation.
  Improved random vectors lead to improved overall validation quality
Less intelligence for selecting next step but many more vectors
  Can explore deeper into state space
  Deterministic methods bad at “deep errors”
  Example: 8-bit counter must expire for bug to be exercised
[Figure: portion of computation time, 0% to 100%, split between “Find” and “Simulate”]
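The 8-bit counter example can be made concrete with a small simulation. This is a sketch under assumptions that are not in the slides: the counter has hypothetical `enable` and `reset` inputs, reset has priority, and the bug is exercised when the count reaches 255.

```python
import random

def cycles_until_expiry(p_enable, p_reset, max_cycles, seed=0):
    """Drive a toy 8-bit counter with random inputs.

    Returns the cycle at which the count reaches 255, or None if that
    never happens within max_cycles.
    """
    rng = random.Random(seed)
    count = 0
    for cycle in range(1, max_cycles + 1):
        reset = rng.random() < p_reset
        enable = rng.random() < p_enable
        if reset:
            count = 0
        elif enable:
            count += 1
        if count == 255:
            return cycle
    return None

# Uniform random inputs: frequent resets keep the counter near zero.
uniform = cycles_until_expiry(p_enable=0.5, p_reset=0.5, max_cycles=5000)
# Biased inputs: rare resets and frequent enables reach the deep state.
biased = cycles_until_expiry(p_enable=0.9, p_reset=0.001, max_cycles=5000)
```

With uniform inputs the counter would need a run of hundreds of cycles without a reset, which is practically impossible, whereas the biased weights reach expiry after a few hundred cycles.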
VLSI Algorithmic Design Automation Lab. at SKKU95
Contrast with Alternatives
Elaborate vector generation methods justified
  if they yield better verification quality for given computation time, or
  if they exercise difficult corner cases
  BUT: Hard to judge “quality” of test vectors a priori.
Heavyweight methods have limited application
  Can’t handle large sequential depth.
  Too costly to use all the time
We spend most effort on initial determination of weights
  Can run many simulation/emulation cycles fast
Our target: Get all but the most difficult bugs out.
[Figure: computation time, 0% to 100%, split between “Find” and “Simulate”]
VLSI Algorithmic Design Automation Lab. at SKKU96
Our Approach to Biased Random Vector Generation
Primary inputs at each clock cycle selected according to a probability distribution
  Distributions are functions of circuit state
Distributions (“weights”) determined prior to simulation
  Faster simulation
Algorithm chooses weights based on
  Set of tags targeted
  A structural netlist describing the circuit
Goal of weight determination algorithm:
  Maximize expected number of tags covered in a given # of simulation cycles
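A minimal sketch of state-dependent weighted input selection follows. The weight table, the two-state machine and the input names are hypothetical stand-ins; in the approach described above the weights would come from the weight determination algorithm rather than being written by hand.

```python
import random

# Hypothetical precomputed weights: P(input = 1) for each circuit state.
WEIGHTS = {
    "idle": {"req": 0.9, "abort": 0.05},
    "busy": {"req": 0.2, "abort": 0.3},
}

def next_state(state, vec):
    """Toy state machine standing in for the simulated circuit."""
    if state == "idle":
        return "busy" if vec["req"] else "idle"
    return "idle" if vec["abort"] else "busy"

def generate_vectors(weights, n_cycles, seed=0):
    """Pick each primary input per cycle from the distribution of the current state."""
    rng = random.Random(seed)
    state = "idle"
    trace = []
    for _ in range(n_cycles):
        vec = {name: int(rng.random() < p) for name, p in weights[state].items()}
        trace.append((state, vec))
        state = next_state(state, vec)
    return trace

trace = generate_vectors(WEIGHTS, n_cycles=100)
```

Because the table is fixed before simulation starts, the per-cycle cost is one lookup and a few random draws, which matches the "faster simulation" point above.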
VLSI Algorithmic Design Automation Lab. at SKKU97
Current Projects
Biased-Random Vector Generation for Tag Coverage (Chinnery, Jin, Keutzer, Tasiran, Weber, UCB)
Select primary input distributions based on
  (1) Circuit structure
  (2) Current state
  (3) Tags to be covered
Heuristic based on circuit structure and set of tags
  IDEA: At each gate, bias inputs towards pins with more tags in their transitive fan-in.
Estimate and optimize detectability of tags
  Propagate input probability distributions across circuit
  Estimate steady-state distributions of latches
  Estimate detectability of tags along “most likely” paths
  Modify input weights to maximize expected number of detected tags
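The fan-in heuristic can be sketched as a count of tags in the transitive fan-in of each input pin of a gate. The netlist encoding, the tag set and the normalization into a bias are assumptions made for illustration; the slides do not say how the counts are turned into input weights.

```python
def fanin_cone(netlist, net, seen=None):
    """Return the set of nets in the transitive fan-in of `net`, including itself."""
    if seen is None:
        seen = set()
    if net in seen:
        return seen
    seen.add(net)
    for source in netlist.get(net, ()):
        fanin_cone(netlist, source, seen)
    return seen

def pin_bias(netlist, tags, gate):
    """Bias each input pin of `gate` in proportion to the tags in its fan-in cone."""
    counts = {pin: len(fanin_cone(netlist, pin) & tags) for pin in netlist[gate]}
    total = sum(counts.values())
    if total == 0:
        # No tags upstream: fall back to a uniform bias.
        return {pin: 1.0 / len(counts) for pin in counts}
    return {pin: count / total for pin, count in counts.items()}

# Hypothetical netlist: each net maps to the nets that drive it.
NETLIST = {"out": ["n1", "n2"], "n1": ["a", "b"], "n2": ["c"]}
TAGS = {"a", "b", "n1", "c"}
bias = pin_bias(NETLIST, TAGS, "out")
```

Here pin `n1` has three tags in its cone and `n2` has one, so the sketch favors `n1` three to one.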
VLSI Algorithmic Design Automation Lab. at SKKU98
Current Projects
Vector Generation for Tag Coverage of Processor Datapaths (Keutzer, Meyerowitz, Tasiran, UCB)
Identify commonly encountered structures in processor datapaths
Determine input distributions that increase tag coverage of these structures
(In collaboration with configurable processor IP provider)
Initial approach: Model control by hand-written abstract machine
[Figure: control shown as a state machine with states sinit, s2, s3, s4, s5, s6, next to the datapath]
VLSI Algorithmic Design Automation Lab. at SKKU99
Directions for the Next Three Years
Now: Biased-random vector generation
  Initial focus: Configurable processor control and datapaths
  Topology-based heuristics with tag coverage goal
Later:
  More sophisticated methods for bias selection
  Methods that address control and datapath together
Overall: “Closed feedback loop” that integrates a variety of
  Coverage metrics, analysis and feedback methods
  Coverage guided, automatic vector generation methods
[Figure: simulation driver, simulation engine, monitors, symbolic simulation, coverage analysis, diagnosis of unverified portions, vector generation]
VLSI Algorithmic Design Automation Lab. at SKKU100
The Research Playground
[Figure: Component Assembly and Synthesis; Microarchitecture; Architecture; Verification and Manufacture Test; What is the Programmer’s Model?; Algorithm; Software Implementation; Compilation and Software Environment; Application]
VLSI Algorithmic Design Automation Lab. at SKKU101
Evaluation Strategy
Quantify quality of results of final implementation according to:
  Speed
  Power
  Area/cost
  Design time
  Design cost
Compare to:
  Other purely programmable solutions: FPGA, microprocessor, specialized processor
  ASIC solutions
VLSI Algorithmic Design Automation Lab. at SKKU102
Ten Year Vision Elaborated
Significant percentage of embedded system applications fielded using only fully programmable components.
Supporting efficient but fully programmable solutions in areas of emerging standards.
Design-time brought within acceptable limits to achieve time-to-market goals.
Enabling new applications: Supporting greater complexity. Reducing overall design cost.
VLSI Algorithmic Design Automation Lab. at SKKU103
What Will Get Us There?…
Flexible architectural templates covering a large design space.
  Multiple levels of support for concurrency
Automated software development environment.
  Retargetable compilers/assemblers/debuggers
  Architectural simulators
  Run-time environments – schedulers/synchronizers
  Analysis tools – design visualization, performance monitoring, power analysis…
VLSI Algorithmic Design Automation Lab. at SKKU104
First Year Progress Against Strategies
Identified and assembled key application – VPN router.
Identified and assembled compiler infrastructure: Trimaran 2.0.
Initiated multiple compiler/run-time environment projects.
(Mostly) identified initial architectural family.
Simulator for one processing element of the architectural family assembled.
Test strategy for one processing element determined.
VLSI Algorithmic Design Automation Lab. at SKKU105
Further Progress Against Strategies
In two years:
  Automatic retargeting onto a family of architectures and microarchitectures from a hardware-description language.
  Automatically generated performance estimator, simulator.
  Automatic generation of assembler, compiler, run-time system.
  Automatically generated hardware for special purpose units?
In five years:
  Much like the above, but across a much broader range of architectures/microarchitectures.
  Real breakthrough will be in the development of a natural programmer’s model
VLSI Algorithmic Design Automation Lab. at SKKU106
Field Programmable Function Array: Chameleon
[Figure: Reconfigurable computing (FPFA); Energy-efficient wireless communication; System architecture for mobile multimedia computers; Security]
VLSI Algorithmic Design Automation Lab. at SKKU107
Montium Processing Tile
VLSI Algorithmic Design Automation Lab. at SKKU108
Montium Tile Processor
VLSI Algorithmic Design Automation Lab. at SKKU109
U-P vs XPP
VLSI Algorithmic Design Automation Lab. at SKKU110
A SDR/Multimedia Solution
VLSI Algorithmic Design Automation Lab. at SKKU111
PACT’s SDR XPP
VLSI Algorithmic Design Automation Lab. at SKKU112
PACT’s SDR XPP
VLSI Algorithmic Design Automation Lab. at SKKU113
Current Multimedia Processors
Digital Signal Processor => Multimedia Processor
RISC instruction set and pipelining to gain higher clock frequency
Instruction level parallelism (ILP)
Growing concern with data movement and I/O interface
More attention paid to low power design
VLSI Algorithmic Design Automation Lab. at SKKU114
Current Multimedia Processors
Name                               TMS320C82     Mpact 2                    Trimedia TM1   MSP
Architecture                       Multiproc.    VLIW                       VLIW           Multiproc.
CMOS Technology (µm)               0.5           0.35                       0.35           0.35
Vcc (Volts)                        3.3           3.3                        3.3            3.3
Power (Watts)                      3 (@50 MHz)   4.45                       4              4
Clock frequency (MHz)              50, 60        125                        100            100
Performance (BOPS, 8-bit integer)  1.5           6                          4              6.4
Manufacturer                       TI            Toshiba & Chromatic Res.   Philips        Samsung
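For a quick comparison, the performance and power rows of the table above can be combined into performance per watt. The sketch below only divides the listed figures and assumes the column alignment read from the table; note that the TMS320C82 power figure is quoted at 50 MHz, so the ratios are indicative at best.

```python
# (performance in BOPS, power in watts) as listed in the table above.
PROCESSORS = {
    "TMS320C82": (1.5, 3.0),
    "Mpact 2": (6.0, 4.45),
    "Trimedia TM1": (4.0, 4.0),
    "MSP": (6.4, 4.0),
}

# Performance per watt for each processor.
bops_per_watt = {name: bops / watts for name, (bops, watts) in PROCESSORS.items()}

# Processor with the highest performance per watt in this table.
best = max(bops_per_watt, key=bops_per_watt.get)
```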
VLSI Algorithmic Design Automation Lab. at SKKU115
TMS320C6x VelociTI
Highest Performance (1 GFLOPS) Floating point DSP
6-ns Instruction Cycle Time
167-MHz Clock Rate
Eight 32-Bit Instructions/Cycle
Instruction packing
Complex programming model
Poor energy and memory efficiency
600 MHz, $110
Good tools and third party support
VLSI Algorithmic Design Automation Lab. at SKKU116
StarCore SC140, Infineon
6-issue 16-bit fixed-point architecture
Up to four 16-bit MACs per cycle
5-stage pipeline with single-cycle latency
Strong performance on most metrics
Multi-vendor architecture: Motorola, Agere and now Infineon
Limited product offerings: poor cost-efficiency, 300 MHz, $132
VLSI Algorithmic Design Automation Lab. at SKKU117
Analog Devices TigerSHARC
4-issue fixed- and floating-point hierarchical SIMD architecture
Up to 8 16-bit fixed point MACs per cycle
Special CDMA-oriented instructions
High memory bandwidth (8Gb/s)
250 MHz, $175
2-level SIMD complicates programming
Good tools
VLSI Algorithmic Design Automation Lab. at SKKU118
LSI Logic ZSP400
A 4-Way Superscalar DSP Core
Up to 2 16-bit MACs per cycle
Five-stage pipeline with single-cycle latencies
Available as core, ASIC library component, ASSP
200 MHz, $36
Cost, energy and memory efficient
Superscalar architecture simplifies, complicates programming
Unproven tools and third party support
VLSI Algorithmic Design Automation Lab. at SKKU119
Target Applications
Video - DVD, MPEG 1 & 2 decoding
Audio - Dolby AC-3, 3D Audio, MPEG Decode, Wavetable Synthesis
Graphics - 2D & 3D acceleration
Communication
  Vocoder
  ADSL, Fax/MODEM: V.34, 56k
  Echo canceller
  Desktop Videoconferencing
VLSI Algorithmic Design Automation Lab. at SKKU120
Advanced DSP
130 nm Copper Technology
Greg Delagi, TI
VLSI Algorithmic Design Automation Lab. at SKKU121
Reconfigurable Computing Research Group
DARPA’s Adaptive Computing Systems Project
Virginia Tech
University of California at Berkeley
Brigham Young University
Chameleon Systems Inc.
Morphic Inc.
Quicksilver Technology Inc.
Sirius Inc.
VLSI Algorithmic Design Automation Lab. at SKKU122
Quicksilver’s ACM
VLSI Algorithmic Design Automation Lab. at SKKU123
SDR-processing requirements for Mobile Communications (GSM)
Modem w/ basic equalizer 2 MFLOPS for CDMA sector 2.5 MFLOPS for a wideband CDMA 4 MFLOPS for a G4
Requires high-performance devices such as
  PowerPC G4
  PowerPC with AltiVec CPUs
  TMS320-C6x
  SHARC/Tiger-SHARC DSPs
VLSI Algorithmic Design Automation Lab. at SKKU124
The need for a software configurable platform
A platform that can handle standards like AM, FM, GSM, UMTS, digital broadcasting standards (DAB, Sirius, XM-Sat Radio), analog and digital television and other data links.
A fully software reconfigurable multi-channel broadband sampling receiver for standards in the 100 MHz band
VLSI Algorithmic Design Automation Lab. at SKKU125
Versatile Reconfigurable Block Array
[Figure: IN and OUT ports, blocks VRB1 to VRB6, Master MPU, F1 to F6, high-speed and low-speed paths, power management]
Advantages
  No wait latency; requires little silicon area.
  Data transfer compatible with IP through a simple wrapper
Disadvantages
  Timing accuracy decreases in large systems
  Testing is difficult for complex systems
  Arbiter delay increases as the number of masters grows
VLSI Algorithmic Design Automation Lab. at SKKU126
Comparisons
Only 1 cycle to (re)configure the DSP
Few cycles to (re)configure coarse grain RA (8)
Many cycles to (re)configure fine grain RA

Name           Type              NPE    Nc     R       F (MHz)
ARDOISE        Fine Grain RA     2304   0.14   16457   33
Systolic Ring  Coarse Grain RA   24     4      6       200
DART           Coarse Grain RA   24     4      6       130
MorphoSys      Coarse Grain RA   128    16     8       100
TMS320C62      DSP VLIW          8      8      1       300

$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
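The R column can be checked against the formula on this slide, R = (N_PE · F_e) / (N_c · F_c). The sketch below reads N_PE as the number of processing elements, N_c as the number of elements configurable per cycle, F_e as the operating frequency and F_c as the configuration frequency; these readings are an assumption, since the slide does not define the symbols. It also assumes F_e = F_c, which is what reproduces the tabulated values.

```python
def remanence(n_pe, n_c, f_e=1.0, f_c=1.0):
    """R = (N_PE * F_e) / (N_c * F_c): cycles needed to reconfigure the whole array."""
    return (n_pe * f_e) / (n_c * f_c)

# (N_PE, N_c) taken from the comparison table above.
ARCHITECTURES = {
    "ARDOISE": (2304, 0.14),
    "Systolic Ring": (24, 4),
    "DART": (24, 4),
    "MorphoSys": (128, 16),
    "TMS320C62": (8, 8),
}

r_values = {name: remanence(n_pe, n_c) for name, (n_pe, n_c) in ARCHITECTURES.items()}
```

The results line up with the three statements on the slide: one cycle for the DSP, a handful for the coarse grain arrays and many thousands for the fine grain array.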
VLSI Algorithmic Design Automation Lab. at SKKU127
Multi-DSP Tree Structure
A. K. Salkintzis, N. Hong and P. T. Mathiopoulos
VLSI Algorithmic Design Automation Lab. at SKKU128
Multi-DSP Network Structure
[Figure: network of DSP tasks: Multiplexing & Burst Construction, Encryption, Channel Coding, Interleaving, Data Processing, CRC insertion, Modulation, Sequencer, Spreading, Equalization, Rate matching, Channelization, Segmentation, Radio Resource]
Data traffic is reduced with each connection
VLSI Algorithmic Design Automation Lab. at SKKU129
Platform Classification
Application platform:
  Multimedia platform: Nexperia, TI’s OMAP
  3G wireless platform: Infineon’s M-gold
  Bluetooth platform: Parthus
  Wireless platform: ARM’s PrimeXsys
Process-centric platform: Improv System, ARC, Tensilica, Triscend
Communication-centric platform: Sonics, Palmchip
VLSI Algorithmic Design Automation Lab. at SKKU130
Recent Computing Machines
ACM (Adaptive Computing Machine) – Quicksilver: www.qstech.com (image appl.)
RCF (Reconfigurable Compute Fabric) – Motorola (SDR base-station), array of DSP cores connected through high-bandwidth interconnect and high-speed local memory, controlled by a RISC.
VLSI Algorithmic Design Automation Lab. at SKKU131
What is Software Radio
A transceiver in which all aspects of its operation are determined using versatile general purpose hardware whose configuration is under software control
Flexible all-purpose radios that can implement new and different standards or protocols through reprogramming.
Same hardware for all air interfaces and modulation schemes
VLSI Algorithmic Design Automation Lab. at SKKU132
Key Technological Constraints
High speed wide band ADCs.
High speed DSPs.
Real Time Operating Systems (isochronous software)
Power Consumption
VLSI Algorithmic Design Automation Lab. at SKKU133
Applications
User Applications and Base Station Applications
Evolve as a universal terminal
Spectrum management: Reconfigurability is a big advantage
Application updates, service enhancements and personalization
VLSI Algorithmic Design Automation Lab. at SKKU134
Research and Commercialization
DARPA’s Adaptive computing system project
Virginia Tech – algorithms and architecture; multi-user receiver based on reconfigurable computing; generic soft radio architecture for reconfigurable hardware
UC Berkeley – Pleiades, ultra low power, high performance multimedia computing ; high power efficiency by providing programmability
Sirius Inc – Software Reconfigurable Code Division Multiple Access (CDMAx)
VLSI Algorithmic Design Automation Lab. at SKKU135
Research and Commercialization
Brigham Young University – Development of JHDL to facilitate hardware synthesis in reconfigurable processors
Chameleon Systems – Reconfigurable Platform Architecture for wireless base station
MorphIC Inc – Programmable hardware reconfigurable code using DRL
Quicksilver Tech. Inc – Universal Wireless `Ngine (WunChip) baseband algorithms
VLSI Algorithmic Design Automation Lab. at SKKU136
Programmable OFDM-CDMA Transceiver
CDMA suffers from Multiple access interference and ISI.
OFDM reduces interference and helps better spectrum utilization and attainment of satisfactory BER.
It is proposed that this might be implemented by using SDR.
VLSI Algorithmic Design Automation Lab. at SKKU139
SDR Architecture
[Figure: SDR architecture. RF unit: two receive/transmit chains, each with Rx SYN, Tx SYN, RX, TX, EX., PA and LNA. Signal processing/control unit: data converter, quadrature MODEM, baseband MODEM, interface/control, connected by the C-PCI bus. HMI terminal; input/output. Source: Hitachi Kokusai Electric Inc.]
VLSI Algorithmic Design Automation Lab. at SKKU140
Signal processing/control unit
The signal processing/control unit consists of the following modules:
  Data converter
  Quadrature Modem
  Baseband Modem
  Interface/Control
Every module is connected to the others by a PCI bus, and a CPU is provided in addition to the FPGA and DSP devices.
VLSI Algorithmic Design Automation Lab. at SKKU141
Quadrature modem module
The quadrature modem uses FPGAs to perform
  Quadrature modulation
  Quadrature detection
  Sampling rate conversion (to generate the baseband sampling rate)
  Filtering
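Quadrature modulation and detection can be sketched in a few lines of software. The example below is a floating-point model with hypothetical parameters, not the FPGA implementation of the prototype: the carrier frequency, sample rate and function names are assumptions, and the low-pass filter is replaced by an average over a whole number of carrier periods.

```python
import math

def quadrature_modulate(i, q, fc, fs, n):
    """s[k] = I*cos(2*pi*fc*k/fs) - Q*sin(2*pi*fc*k/fs) for a constant symbol."""
    w = 2.0 * math.pi * fc / fs
    return [i * math.cos(w * k) - q * math.sin(w * k) for k in range(n)]

def quadrature_detect(signal, fc, fs):
    """Mix with cos/sin and average over the record to recover I and Q.

    Assumes the record spans a whole number of carrier periods, so the
    double-frequency mixing products average to zero.
    """
    n = len(signal)
    w = 2.0 * math.pi * fc / fs
    i = 2.0 / n * sum(s * math.cos(w * k) for k, s in enumerate(signal))
    q = -2.0 / n * sum(s * math.sin(w * k) for k, s in enumerate(signal))
    return i, q

# Hypothetical numbers: carrier at 10 units, sampled at 80 units, 10 periods.
tx = quadrature_modulate(0.6, -0.8, fc=10.0, fs=80.0, n=80)
rx_i, rx_q = quadrature_detect(tx, fc=10.0, fs=80.0)
```

In hardware the averaging step would be a decimating filter, which is where the sampling rate conversion and filtering items of the list come in.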
VLSI Algorithmic Design Automation Lab. at SKKU142
Baseband modem module
The baseband modem processes
  Multi-channel modulation
  Multi-channel demodulation
using four floating-point DSP devices.
An individual DSP is assigned to each channel. Therefore, even while one channel is processing, a program can be downloaded to another channel.
VLSI Algorithmic Design Automation Lab. at SKKU143
Specification of Prototype
RF range                   2~500 MHz
Waveform                   SSB, AM, FM, BPSK, QPSK, 8PSK, 16QAM
Number of channels         Four full-duplex
Radio relay                Repeat/Bridge
Frequency accuracy         <0.1 ppm
Rx IF frequency            70 MHz
Tx IF frequency            25 MHz
Dynamic range              14 bits
Rx IF sampling frequency   40 MHz
Tx IF sampling frequency   100 MHz
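The IF and sampling figures above suggest that the receive path samples its 70 MHz IF below the Nyquist rate, while the transmit path does not. That reading is an inference from the numbers, not something the specification states. Under that assumption, the apparent frequency after sampling can be worked out as follows (the function name is hypothetical):

```python
import math

def alias_frequency(f_signal_mhz, f_sample_mhz):
    """Apparent frequency (MHz) of a tone after sampling at f_sample_mhz."""
    # Subtract the nearest integer multiple of the sampling frequency.
    nearest = math.floor(f_signal_mhz / f_sample_mhz + 0.5)
    return abs(f_signal_mhz - nearest * f_sample_mhz)

rx_alias = alias_frequency(70.0, 40.0)    # 10.0: the 70 MHz IF folds to 10 MHz
tx_alias = alias_frequency(25.0, 100.0)   # 25.0: below Nyquist, no folding
```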
VLSI Algorithmic Design Automation Lab. at SKKU144
Specification of Prototype
Signal processing    FPGA: Quadrature MODEM; DSP: Baseband MODEM
FPGA                 XCV2000E x 3
DSP                  TMS320C6701 x 4
CPU                  Control module: Celeron; Peripheral module
System bus           cPCI
Operating system     Linux
HMI                  Operates from web browser
Interface            Audio I/O, Serial I/O, Ethernet (100BASE-TX)