ENG6530 RCS ENG6530 Reconfigurable Computing Systems Part I: Memory Technologies and Programmable Logic Devices (A Review)

Slide 2 Topics
o History of Reconfigurable Computing
o Memory Technology
o Simple Programmable Logic Devices (SPLD)
o Programmable Logic Devices (PLD)
o Programmable Logic Arrays (PLA)
o Programmable Array Logic (PAL)
o Complex Programmable Logic Devices (CPLD)

Slide 3 References
o Architectures of FPGAs and CPLDs: A Tutorial, by S. Brown and J. Rose (on the Web)
o FPGA-Based System Design, by Wayne Wolf
o The Design Warrior's Guide to FPGAs, by Clive Maxfield
o Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications, by Bobda
o http://www.xilinx.com
o http://www.xilinx.com/publications/xcellonline/

Slide 4 Gerald Estrin: Fix-Plus Machine
[Figure: Estrin at work. Substantial efforts on reconfiguration.]
Attempts to have a flexible hardware structure that can be dynamically modified at run-time to compute a desired function are almost as old as the development of other computing paradigms.
In 1959, Gerald Estrin, at UCLA, introduced the concept of reconfigurable computing with the Fix-Plus Machine.

Slide 5 Fixed plus Variable structure computer (Gerald Estrin, Fix-Plus Machine)
o Proposed by G. Estrin in 1959
o Consists of three parts:
1. A high speed general purpose computer (the fix part, F).
2. A variable part (V) consisting of various size high speed digital substructures which can be reorganized into problem oriented special purpose configurations.
3. The supervisory control (SC), which coordinates operations between the fix module and the variable module.
o Speed gain over the IBM 7090: 2.5 to 1000

Slide 6 Current Programmable Logic Devices

Slide 7 Programmable Logic Devices (PLDs)
Boolean functions can be implemented as
1. Sum of Minterms (two level logic)
2.
Product of Maxterms (two level logic)
Decoders can realize Boolean functions since they consist of an array of AND gates, which provide all the necessary minterms. An OR gate can then be used to sum the minterms.

Slide 8 Decoders: Implementing Logic
Example: Implement the following Boolean function:
S(A2, A1, A0) = Σ m(1, 2, 4, 7)
1. Since there are three inputs, we need a 3-to-8 line decoder.
2. The decoder generates the eight minterms for inputs A0, A1, A2.
3. An OR gate forms the logical sum of the minterms required.

Slide 9 Programmable Boolean Functions
Multiplexers can also be used to realize Boolean functions since they consist of an array of AND gates followed by an OR gate.

Slide 10 Multiplexers: Implementing Logic
A 2^n:1 multiplexer implements any function of n variables:
1. With the variables used as control inputs and
2. Data inputs tied to 0 or 1
3. In essence, a lookup table
Example: F(A,B,C) = m0 + m2 + m6 + m7 = A'B'C' + A'BC' + ABC' + ABC
[Figure: 8:1 MUX with select lines S2, S1, S0 driven by A, B, C; data inputs 0 to 7 tied to 1, 0, 1, 0, 0, 0, 1, 1; output F]

Slide 11 Programmable Boolean Functions
Memory units can be used to implement a Boolean function by storing the output column of the truth table in the memory and accessing the values by using the variables of the truth table as address lines.
[Figure: LUT implementation and gate implementation of Z from inputs A, B, C, D]

Slide 12 Memory/Programmable Systems
It is important to understand the different technologies behind memory systems since they are used extensively to design programmable systems.
Electronic systems in general, and computers in particular, make use of two major classes of memory devices:
1. Non-Volatile (ROM, EEPROM, FLASH) and
2. Volatile (RAM, SRAM, DRAM).
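
The decoder and multiplexer examples on Slides 8 and 10 can be checked with a small behavioural sketch. This is only an illustration in Python, not part of the original slides; the function names (decoder_3to8, s_decoder, mux_8to1, f_mux) are hypothetical, and the bit ordering (A2 or A as the most significant bit) is an assumption taken from the minterm numbering.

```python
def decoder_3to8(a2, a1, a0):
    """Return the eight decoder outputs; output i is 1 only for minterm i."""
    selected = (a2 << 2) | (a1 << 1) | a0
    return [1 if i == selected else 0 for i in range(8)]


def s_decoder(a2, a1, a0):
    """S(A2, A1, A0) = SUM m(1, 2, 4, 7): OR the chosen minterm lines."""
    m = decoder_3to8(a2, a1, a0)
    return m[1] | m[2] | m[4] | m[7]


def mux_8to1(data, a, b, c):
    """8:1 multiplexer; A is assumed to be the most significant select line."""
    return data[(a << 2) | (b << 1) | c]


# Data inputs 0..7 tied to constants so that minterms 0, 2, 6, 7 give 1.
F_DATA = [1, 0, 1, 0, 0, 0, 1, 1]


def f_mux(a, b, c):
    """F(A, B, C) = m0 + m2 + m6 + m7, realised as a lookup in F_DATA."""
    return mux_8to1(F_DATA, a, b, c)


if __name__ == "__main__":
    for n in range(8):
        a, b, c = (n >> 2) & 1, (n >> 1) & 1, n & 1
        print(a, b, c, "S =", s_decoder(a, b, c), "F =", f_mux(a, b, c))
```

The multiplexer version makes the "lookup table" remark concrete: changing the function only means changing the constants in F_DATA, not the structure.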
Slide 13 Memory Classification

Slide 14 SRAM Cell

Slide 15 DRAM Cell

Slide 16 SRAM based Programmable Cell
o RAM based devices can be used to control NMOS transistors to be on/off (control switch).
o RAM can also be very useful to control multiplexers, routing, etc.

Slide 17 SRAM vs. DRAM Summary
        Transistors  Access  Needs     Needs
        per bit      time    refresh?  EDC?   Cost  Applications
SRAM    4 or 6       1X      No        Maybe  100X  Cache memories
DRAM    1            10X     Yes       Yes    1X    Main memories, frame buffers
EDC: Error Detection & Correction
DRAM technology is of little interest with regard to programmable logic.
One disadvantage of having a programmable device based on SRAM cells is that each cell consumes a significant amount of silicon real estate (6 transistors). Another disadvantage is that the configuration data will be lost when power is removed from the system.
However, such devices have the corresponding advantage that they can be reprogrammed quickly and repeatedly as required.

Slide 18 Mask-Programmed Devices
The entire ROM consists of a number of row (word) and column (data) lines forming an array. Each column has a single pull-up resistor attempting to hold that column to a weak logic 1 value. Every row-column intersection has an associated transistor and, potentially, a mask-programmed connection.

Slide 19 Fusible-Link Based PROM (OTP)
The problem with mask-programmed devices is that creating them is very expensive unless you intend to produce them in large quantities.
For this reason, the first programmable read-only memory (PROM) devices were developed at Harris Semiconductor in 1970.
The designer can program each link by either blowing the fuse or leaving it intact!

Slide 20 Standard MOS vs.
EEPROM Transistor
An EPROM transistor has the same basic structure as a standard MOS transistor, but with the addition of a second polysilicon floating gate isolated by layers of oxide.

Slide 21 EEPROM Cell: Functionality
[Figure: I_ds versus V_gs]
V_t is pushed from 0.7 volts towards 5-7 volts, so the transistor will be off unless it is reprogrammed.

Slide 22 EEPROM Cell: Programming

Slide 23 EEPROM Transistor-based Memory Cell
Observe how the cell functions:
o In its un-programmed state, all the floating gates in the EPROM transistors are uncharged (normal NMOS transistors). In this case, placing a row line in its active state will turn on all of the transistors, and the column data lines are pulled to logic 0.
o When a transistor is programmed, applying the row (word) line will not activate the transistor, and therefore the data line is logic 1.

Slide 24 Memory: Comparison
Memory type  Density  Speed    Size      Cost    Volatility
DRAM         V. High  Fast     Small     Cheap   Y
SRAM         Low      V. fast  Large     Costly  Y
ROM          High     V. fast  Small     Cheap   N
PROM         High     V. fast  Moderate  Cheap   N
EPROM        High     Fast     Small     Cheap   N
EEPROM       Medium   Fast     Moderate  Cheap   N
Flash        High     Fast!    Medium    Cheap   N

Slide 25 PLDs: Classification
The first programmable ICs were generically referred to as programmable logic devices (PLDs).
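
The read behaviour of the floating-gate cell on Slides 21 and 23 can be sketched as a threshold comparison. This is a rough behavioural model only, not a circuit simulation: the 0.7 V threshold comes from the slide, while the programmed threshold (6.0 V, the middle of the 5-7 V range), the 5.0 V row drive level, the pull-up on each column (borrowed from the mask-ROM description on Slide 18) and the names read_column and read_word are all assumptions made for illustration.

```python
V_T_UNPROGRAMMED = 0.7  # volts, threshold of an uncharged floating gate (from the slide)
V_T_PROGRAMMED = 6.0    # volts, assumed midpoint of the 5-7 V range on the slide
V_ROW_ACTIVE = 5.0      # volts, assumed drive level of an active row (word) line


def read_column(programmed, row_active=True):
    """Return the logic value seen on one column (data) line.

    The column is assumed to be weakly pulled up to logic 1. A conducting
    transistor pulls it down to logic 0.
    """
    v_gate = V_ROW_ACTIVE if row_active else 0.0
    v_t = V_T_PROGRAMMED if programmed else V_T_UNPROGRAMMED
    conducts = v_gate > v_t
    return 0 if conducts else 1


def read_word(programmed_bits):
    """Read one active row; programmed_bits[i] is True if cell i is programmed."""
    return [read_column(p) for p in programmed_bits]


if __name__ == "__main__":
    print(read_word([False, False, False, False]))  # un-programmed row
    print(read_word([True, False, True, False]))    # partly programmed row
```

With these assumed voltages, an un-programmed row reads as all zeros and each programmed cell reads as 1, which matches the description on Slide 23.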
Slide 26 Simple PLDs
[Figure: programmable AND array feeding a programmable OR array]

Slide 27 Programmable Logic Array (PLA)
The output stage acts like a programmable inverter:
o Tied to 0: F1 is not inverted
o Tied to 1: F1 is inverted

Slide 28 Complex PLDs (CPLDs)
Integration of several PLD blocks with a programmable interconnect on a single chip.
[Figure: PLD blocks and I/O blocks arranged around an interconnection matrix]

Slide 29 Complex PLDs (CPLDs)
o Complex PLDs (CPLDs) typically combine PAL combinational logic with flip-flops
o Organized into logic blocks connected by an interconnect matrix
o Combinational or registered output
o Usually enough logic for simple counters, state machines, decoders, etc.
o CPLD logic is not enough for complex operations
o FPGAs have much more logic than CPLDs
o Example CPLD: Xilinx CoolRunner-II, etc.

Slide 30 Conclusion
o Attempts to overcome the inefficiency of the Von Neumann architecture go back to the late 50s.
o Memory is an important component in designing programmable logic (especially current FPGAs).
o Programmable logic devices are very well suited to implement two level circuits (i.e., sum of products).
o The main limitation of PLAs and PALs is their low capacity.
o CPLDs overcome the limitation of SPLDs.
o CPLDs are still too small for usage in reconfigurable computing systems and are mainly used as glue logic or to implement small functions.

Slide 31 ENG6530 Reconfigurable Computing Systems Part II: Programmable Logic Devices, Field Programmable Gate Arrays

Slide 32 Topics
o Field Programmable Gate Arrays
o Internal Structure
o Look Up Tables
o Input/Output Blocks
o Programmable Interconnect
o Fine Grain vs. Medium Grain FPGA
o Soft Cores and Hard Cores
o Clock Managers, I/O Transceivers, ...
Slide 33 FPGAs
Around the beginning of the 1980s, it became apparent that there was a gap in the digital IC continuum.
1. At one end, there were programmable devices like SPLDs and CPLDs, which were highly configurable but could not support large designs.
2. At the other end of the spectrum were ASICs, which can support complex functions but were expensive, time consuming, ...

Slide 34 Simple Generic FPGA Arch
FPGAs can successfully bridge the gap between PLDs and ASICs.
o On the one hand, they are highly configurable and have the fast design and modification times associated with PLDs.
o On the other hand, they can be used to implement large and complex functions that had previously been the domain only of ASICs.
NOTE: ASICs are still required for the really large, complex, high-performance designs, but as FPGAs increased in sophistication, they started to encroach further and further into ASIC design space.

Slide 35 FPGAs
In order to address this gap, Xilinx developed a new class of IC called a Field Programmable Gate Array, or FPGA, which they made available to the market in 1984.
The first FPGAs were based on CMOS and used SRAM cells for configuration purposes.
Although these early devices were comparatively simple and contained few gates by today's standards, many aspects of their underlying architecture are still employed to this day.

Slide 36
SRAM (LUT-based)
o Flexible and fast
o Ideal technology for manufacturers (all CMOS technology)
o Security issues related to intellectual property
o Requires additional circuitry for reconfiguration
o Susceptible to upsets due to collisions with high energy particles (radiation)!
EEPROM (Flash)
Advantages:
o Non-volatile (retains information after power down)
o More tolerant to radiation (space applications)
Drawbacks:
o Requires multiple voltage sources
o Slower reconfiguration time
o Less dense
FPGA Programming Technologies

Slide 37 SRAM Based FPGAs
The majority of FPGAs are based on the use of SRAM configuration cells.
Advantages:
1. Can be configured over and over again.
2. Can take advantage of new RAM technology.
Disadvantages:
1. Must be reconfigured every time the system is powered on.
2. Slow reconfiguration time for run-time reconfiguration (RTR) applications.
3. It can be difficult to protect the intellectual property (IP) in the form of a design.
4. Susceptible to radiation effects.

Slide 38 Programming Technologies

Slide 39 The Programmable Marketplace
Q1 Calendar Year 2007. Source: company reports. Latest information available; computed on a 4-quarter rolling basis.
Two dominant suppliers, indicating a maturing market.
[Figure: two pie charts. PLD Segment: Xilinx, Altera, Lattice, Actel, QuickLogic: 2%, Other: 2% (remaining chart values 51%, 33%, 5%, 7%). FPGA Sub-Segment: Xilinx, Altera, All Others (chart values 58%, 31%, 11%).]

Slide 40 FPGA Organization
Typical organization:
1. Symmetrical array
2. Row based
3. Sea of gates
4. Hierarchical

Slide 41 Generic FPGA architecture
[Figure labels: Configurable Logic Block (CLB), Connection Block, Switch Block, Routing Channels, I/O pad, Wire segments]

Slide 42 SRAM Cell (Pass Transistor)
An SRAM cell can drive the gate (G) terminal of an NMOS transistor. If SRAM (M) = 1, then the signal passes from S to D.
An SRAM cell can also be attached to the select line of a MUX to control it.
Configuration time depends on the download speed and the amount of data to be downloaded.

Slide 43 Look-Up Tables
o Combinatorial logic is stored in Look-Up Tables (LUTs)
o Also called Function Generators (FGs)
o Capacity is limited by the number of inputs, not by the complexity
o Delay through the LUT is constant
[Figure: combinatorial logic with inputs A, B, C, D and output Z, and its truth table]
A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
...
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1

Slide 44 Programmable Logic Block
Early devices were based on the concept of a programmable logic block, which comprised:
1. a 3-input lookup table (LUT),
2. a register that could act as a flip-flop or a latch,
3. a multiplexer, along with a few other elements.

Slide 45 FPGA Look Up Table (LUT)
LUT Example: Implement the function using:
o 2-input LUTs
o 3-input LUTs
o 4-input LUTs
[Figure: function F with product terms over (A, B, D), (B, C, D) and (A, B, C), shown mapped onto 2-input, 3-input and 4-input LUTs]

Slide 46 3-, 4-, 5-, or 6-input LUTs?
The key feature of an n-input LUT is that it can implement any possible n-input combinational logic function.
Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells!
The first FPGAs were based on 3-input LUTs.
FPGA vendors and researchers studied the relative merits of 3-, 4-, 5- and even 6-input LUTs.
The current consensus is that 6-input LUTs offer the optimal balance of pros and cons.
In the past, some devices were created using a mixture of different LUT sizes because this offered the promise of optimal device utilization.
However, current logic synthesis tools prefer uniformity and regularity.

Slide 47 Major Differences between Xilinx Families
                                Spartan 3, Virtex 4   Virtex 5, Virtex 6, Spartan 6
Look-Up Tables                  4-input               6-input
Number of CLB slices per CLB    4                     2
Number of LUTs per CLB slice    2                     4

Slide 48 Input/Output Blocks (IOB)
o To communicate with the outside world, programmable I/O blocks are provided.
o An I/O pin can be programmed to act as an input or an output.

Slide 49 Programmable Interconnect: Switch Matrix
Connections between Configurable Logic Blocks (CLBs) and IOBs are made using wiring segments in both horizontal and vertical channels lying between the various blocks.
Where four segments meet, there are 6 pass transistors.

Slide 50 Segmented Routing
o Signal delay is proportional to the number of switches (pass transistors) that the signal passes through.
o Segmented routing allows you to reduce the number of switches on the signal path, to speed it up.
Example: Xilinx Segmented Routing Structure
[Figure: length 1 wires, length 2 wires and long wires; vertical channels not shown]

Slide 51 Early Xilinx FPGAs (XC4000)
o LUTs and flip-flops allow both combinational and sequential circuits to be implemented.
o Two functions of up to four variables and selected functions of up to nine variables can be implemented.
o Positive/negative edge triggered
o Clock enable

Slide 52 Programming an FPGA?
[Figure: functions f1, f2, f3 of inputs A to F taken through Technology Mapping, Placement and Routing]

Slide 53 Remember!
o Programmable Lookup Tables (LUTs)
o Programmable routing structure
The main bottleneck with state-of-the-art fine grain FPGAs is the routing enabled by pass transistors!

Slide 54 Remember!
o Programmable Lookup Tables (LUTs)
o Programmable routing structure
[Figure: LUT with inputs x, y, z and output f, built from SRAM cells holding the truth-table values]
Look-up tables are flexible but require a lot of configuration data and suffer from power dissipation!
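
The LUT properties stated on Slides 43, 46 and 54 (any n-input function fits, and each extra input doubles the number of SRAM cells) can be illustrated with a minimal sketch. The LUT class and the example function below are hypothetical and chosen only for illustration; they are not a vendor primitive or anything taken from the slides.

```python
from itertools import product


class LUT:
    """Behavioural n-input look-up table: one stored bit per truth-table row."""

    def __init__(self, n_inputs, func):
        self.n_inputs = n_inputs
        # "Configuration": evaluate func once for every input combination and
        # store the result, giving 2**n_inputs cells (first input is the MSB).
        self.cells = [int(bool(func(*bits)))
                      for bits in product((0, 1), repeat=n_inputs)]

    def __call__(self, *inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.cells[address]


def example_function(a, b, c, d):
    """Hypothetical 4-input function used only to exercise the LUT."""
    return (a and b) or (c and d)


lut4 = LUT(4, example_function)

# Number of SRAM cells for 3-, 4-, 5- and 6-input LUTs.
cell_counts = {n: len(LUT(n, lambda *x: 0).cells) for n in (3, 4, 5, 6)}

if __name__ == "__main__":
    print(cell_counts)
    print(lut4(1, 1, 0, 0), lut4(0, 1, 0, 1))
```

Evaluating the LUT is a single indexed read regardless of the stored function, which is the software analogue of the constant LUT delay noted on Slide 43.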
Slide 55 LUT vs. distributed RAM vs. SR
The main function of a LUT is to realize any Boolean function.
In addition to this primary role, some vendors allow the cells forming the LUT to be used as a small block of RAM (16x1 RAM).
This is referred to as distributed RAM because the LUTs are distributed across the surface of the chip; the term also differentiates it from the larger chunks of block RAM that are currently embedded in more advanced FPGAs.
Some vendors also allow the SRAM cells forming a LUT to be treated and used as a shift register.

Slide 56 Xilinx Logic Cell
The core building block in a modern FPGA from Xilinx is called a Logic Cell (LC). An LC comprises:
1. a 4-input LUT (which can also act as a 16x1 RAM or a 16-bit shift register),
2. a multiplexer,
3. a flip-flop.
The equivalent building block in an FPGA from Altera is a Logic Element (LE).

Slide 57 Slicing and Dicing
The next step up the hierarchy is what Xilinx calls a slice.
A slice contains two logic cells.
Although each LC (LUT, MUX, register) has its own data inputs and outputs, the slice has one set of clock, clock enable, and set/reset signals common to both logic cells.

Slide 58 Configurable Logic Blocks
Moving one more level up the hierarchy, we come to what Xilinx calls a configurable logic block (CLB).
Some Xilinx FPGAs have two slices in each CLB while others have four.
The reason for having this type of logic block hierarchy is that it is complemented by an equivalent hierarchy in the interconnect.
There is fast interconnect between LCs in a slice, then slightly slower interconnect between slices in a CLB, followed by the interconnect between CLBs.
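
The three roles of the same 16 LUT cells described on Slides 55 and 56 (function generator, 16x1 distributed RAM, 16-bit shift register) can be sketched behaviourally as follows. The class LutCells and its method names are hypothetical; real vendor primitives differ in detail (for example in how the shift-register output tap is selected), so this is an illustration under simplifying assumptions rather than a model of a specific device.

```python
class LutCells:
    """Sixteen SRAM cells reused in three ways: LUT, 16x1 RAM, shift register."""

    def __init__(self):
        self.cells = [0] * 16

    def load_function(self, truth_table):
        """LUT role: store the 16 output bits of a 4-input truth table."""
        if len(truth_table) != 16:
            raise ValueError("a 4-input LUT needs exactly 16 bits")
        self.cells = [bit & 1 for bit in truth_table]

    def read(self, address):
        """LUT lookup or RAM read: the 4 inputs form the address."""
        return self.cells[address & 0xF]

    def write(self, address, bit):
        """Distributed RAM role: overwrite one cell during operation."""
        self.cells[address & 0xF] = bit & 1

    def shift_in(self, bit):
        """Shift-register role: push a bit in at cell 0, return the bit
        that falls out of cell 15."""
        shifted_out = self.cells[-1]
        self.cells = [bit & 1] + self.cells[:-1]
        return shifted_out


if __name__ == "__main__":
    ram = LutCells()
    ram.write(9, 1)
    print(ram.read(9), ram.read(8))

    sr = LutCells()
    sr.shift_in(1)
    for _ in range(15):
        sr.shift_in(0)
    print(sr.cells)
```

The point of the sketch is that only the access pattern changes between the three roles; the storage is the same small array of cells.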
Slide 59 Slices and CLBs
o Each Virtex-II CLB contains four slices.
o Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs.
o A switch matrix provides access to general routing resources.
[Figure: CLB with slices S0 to S3, local routing, switch matrix, BUFT, CIN/COUT carry signals and SHIFT]

Slide 60 Evolution of the FPGA
o Early FPGAs were used mainly for glue logic between other components
  o Simple CLBs, small number of inputs
  o Focus was on implementing random logic efficiently
o As capacities grew, other applications emerged
  o FPGAs as an alternative to custom ICs for entire applications
  o Emulation of ASICs
  o Computing with FPGAs
o FPGAs have changed to meet new application demands
  o Carry chains, better support for multi-bit operations
  o Integrated memories, such as the Block RAMs
  o Specialized units, such as multipliers, to implement functions that are slow/inefficient in CLBs
  o Clock Managers to control the frequency of operation
  o Newer devices incorporate entire CPUs: Xilinx Virtex II Pro has 1-4 PowerPC CPUs

Slide 61 Embedded RAM Blocks
A lot of applications require the use of memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or Block RAM (BRAM).
Depending on the architecture of the component, these blocks might be positioned around the periphery of the device or organized as columns.
These blocks can be used for a variety of purposes, such as implementing
1. Standard single/dual port RAMs,
2. FIFOs,
3. Queues, etc.

Slide 62 Memory Types in Xilinx
o Memory: Distributed (MLUT-based) or Block RAM-based (BRAM-based)
o Memory: Inferred or Instantiated
o Instantiated: Manually or Using Core Generator

Slide 63 Embedded Multipliers
Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (arithmetic based applications).

Slide 64 Dedicated Multiplier Blocks
o 18-bit two's complement signed operation
o Optimized to implement multiply and accumulate functions
o Multipliers are physically located next to block SelectRAM memory
[Figure: 18 x 18 multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and Output (36 bits); supported sizes 4 x 4, 8 x 8, 12 x 12 and 18 x 18 signed]

Slide 65 Xilinx XtremeDSP
o Starting with the Virtex 4 family, Xilinx introduced the DSP48 block for high-speed DSP on FPGAs
o Essentially a multiply-accumulate core with many other features
o Now also in Spartan-3A, Spartan 6, Virtex 5, and Virtex 6

Slide 66 Multiply/Accumulate Units
One operation that is very common in DSP type applications is called a multiply-and-accumulate (MAC).
Some FPGAs provide entire MACs as embedded functions.
Granularity is starting to become coarse rather than fine!

Slide 67 DSP48 Slice: Virtex 4

Slide 68 Mathematical Functions
The DSP48 can perform mathematical functions such as:
o Add/Subtract
o Accumulate
o Multiply
o Multiply-Accumulate
o Multiplexer
o Barrel Shifter
o Counter
o Divide (multi-cycle)
o Square Root (multi-cycle)
It can also create filters such as:
o Serial FIR Filter (Xilinx calls these MACC filters)
o Parallel FIR Filter
o Semi-Parallel FIR Filter
o Multi-rate FIR Filters

Slide 69 Embedded Processor Cores (Hard and Soft)
The majority of designs make use of microprocessors. These appeared as discrete devices on the circuit board.
Lately, high-end FPGAs have become available that contain one or more embedded microprocessors (referred to as microprocessor cores).
There are two types of cores:
o A hard core is implemented as a dedicated predefined block (PowerPC, ARM processor).
o A soft core is implemented by configuring a group of programmable logic blocks to act as a microprocessor (MicroBlaze).

Slide 70 Soft Core
As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
Soft cores are simpler (more primitive) and slower than their hard-core counterparts.
ADVANTAGE?
1. The user need only implement a core if he/she needs it.
2. Also, the user can instantiate as many cores as they require until they run out of resources!

Slide 71

Slide 72 PowerPC Cores: PowerPC System

Slide 73 Zynq - Extensible Processing Platform

Slide 74 Xilinx FPGA Families
High-performance families:
o Virtex (220 nm)
o Virtex-E, Virtex-EM (180 nm)
o Virtex-II (130 nm)
o Virtex-II PRO (130 nm)
o Virtex-4 (90 nm)
o Virtex-5 (65 nm)
o Virtex-6 (40 nm)
o Virtex-7 (28 nm)
Low cost family:
o Spartan/XL, derived from XC4000
o Spartan-II, derived from Virtex
o Spartan-IIE, derived from Virtex-E
o Spartan-3 (90 nm)
o Spartan-3E (90 nm), logic optimized
o Spartan-3A (90 nm), I/O optimized
o Spartan-3AN (90 nm), non-volatile
o Spartan-3A DSP (90 nm), DSP optimized
o Spartan-6 (45 nm)
o Artix-7 (28 nm)

Slide 75 Virtex FPGA Family (II, 4, 5, 6, 7)
o 18x18 multipliers and 18 Kbit block RAMs introduced
o Gbit serial I/O communications and PowerPC processors introduced
o Complex floating point algorithm implementation now possible
[Table columns: Virtex Family, Logic Slices, 18 Kbit BRAMs, 18x18 Multipliers, PowerPC Processors, Gbit I/O, User I/O]

Slide 76 Clock Trees, Clock Managers
Clock signals typically originate in the outside world and come into the FPGA via a special clock input pin.
The main clock signal branches to form a clock tree.
This structure is used to ensure that all of the flip-flops see their versions of the clock signal as close together in time as possible.

Slide 77 Digital Clock Managers (DCM)
The clock pin is usually connected to a special hard-wired function called a clock manager that generates daughter clocks.
The daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board.
There might be multiple clock managers, each supporting only a subset of features (jitter removal, frequency synthesis, ...).

Slide 78 DCM: Frequency Synthesis
The frequency of the clock signal being presented to the FPGA from the outside world might not be exactly what the design engineer wishes for.
The clock manager can be used to generate daughter clocks with frequencies that are derived by multiplying or dividing the original signal.

Slide 79 Configurable I/O Impedances
The signals used to connect devices on today's circuit boards often have fast edge rates.
In order to prevent signals reflecting back, it is necessary to apply appropriate terminating resistors to the FPGA input and output pins.
In the past, resistors were applied as discrete components (outside the FPGA).
Today's FPGAs allow the use of internal terminating resistors whose value can be configured by the user.

Slide 80 Gigabit Transceivers
The traditional way to move large amounts of data between devices is to use a bus.
However, buses require a lot of pins/tracks on the device.
Routing these tracks so that they all have the same length and impedance becomes increasingly painful as boards grow in complexity!
Solution?

Slide 81 Gigabit Transceivers (Cont.)
Today's high-end FPGAs include special hard-wired gigabit transceiver blocks.
These blocks use one pair of differential signals to transmit (TX) data and another pair to receive (RX) data.
These transceivers operate at very high speeds, allowing them to transmit and receive billions of bits of data per second.

Slide 82 Using FPGA to Interface Between Multiple Standards

Slide 83 Xilinx FPGA Devices
Technology   Low-cost    High-performance
120/150 nm               Virtex 2, 2 Pro
90 nm        Spartan 3   Virtex 4
65 nm                    Virtex 5
45 nm        Spartan 6
40 nm                    Virtex 6
28 nm                    Virtex 7

Slide 84 Latest Devices: Capacity & Features
Xilinx Virtex-5/6/7/Zynq:
o 65/40/28 nm process
o Up to 960 I/Os
o >200000 logic cells
o Up to 552 18 Kb block RAMs (~10 Mb RAM)
o 450 DSP slices (18x18 multiplier-accumulator)
o 20 digital clock managers (DCM)
o 24 high-speed serial transceivers (622 Mb/s to 11.1 Gb/s)
o Up to four PowerPC 405 cores
Altera Stratix-II:
o 90 nm process
o Up to 1170 I/Os
o 179000 logic elements
o 9.6 Mb embedded RAM
o 96 DSP blocks: 380 18x18 multipliers
o 12 PLLs
o Serial I/O up to 1 Gb/s
o No hard processor cores

Slide 85 Zynq-7000 AP: ZED Board

Slide 86 Inside the APU
o Dual ARM Cortex-A9 MPCore with NEON extensions
  o Up to 800-MHz operation
  o 2.5 DMIPS/MHz per core
  o Separate 32KB instruction and data caches
o Snoop control unit
  o L1 cache snoop control
  o Accelerator coherency port
o Level 2 cache and controller
  o Shared 512 KB cache with parity

Slide 87 Zynq-7000 AP

Slide 88 Zynq-7000 Family Highlights
o Complete ARM-based processing system
  o Application Processor Unit (APU): dual ARM Cortex-A9 processors, caches and support blocks
  o Fully integrated memory controllers
  o I/O peripherals
o Tightly integrated programmable logic
  o Used to extend the processing system
  o Scalable density and performance
o Flexible array of I/O
  o Wide range of external multi-standard I/O
  o High-performance integrated serial transceivers
  o Analog-to-digital converter inputs

Slide 89 FPGA Shortcomings
Circuit Delay: Delay increases due to programmable switches in the FPGA routing
architecture
Area: Configuration cells and programmable resources incur a substantial area penalty
Power: Typically not suited for low power applications
[Figure: ASIC vs. FPGA compared on Performance, Cost and Time to market; FPGA performance and cost marked "Need to improve"]

Slide 90 Conclusion
o FPGAs are the main enabler of Reconfigurable Computing Systems.
o FPGAs fill the gap between Instruction Set Processors (GPs) and ASICs.
  o Advantages: flexible, programmable
  o Disadvantages: power dissipation, performance w.r.t. ASICs
o Applicability of FPGAs relies on CAD tools provided by different vendors such as Xilinx and Altera.
o RCS can be realized with several technologies:
  o FPGAs: Fine/Medium Grain
  o Coarse Grain Reconfigurable Architectures: CGRAs

Slide 91 ENG6530 Reconfigurable Computing Systems Part III: Programmable Logic Devices, Coarse Grain Reconfigurable Arrays

Slide 92 Topics
o What is wrong with current FPGAs?
o Coarse Grain Devices
o Classification
o Academic Examples
o Industrial Examples
o Advantages and Disadvantages
o Conclusion

Slide 93 References
o Coarse Grain Reconfigurable Architectures, by R. Hartenstein
o The Design Warrior's Guide to FPGAs, by C. Maxfield
o Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications, by Bobda

Slide 94 General vs. Special purpose
With LUTs as function generators, FPGAs can be seen as general purpose devices.
General purpose devices are flexible but inefficient:
o Flexible because any n-variable Boolean function can be implemented in an n-input LUT
o Inefficient since complex functions must be implemented in many LUTs at different locations
  o Routing overhead (signal delay, more power dissipation)
  o Low silicon area efficiency of FPGAs
  o High volume of configuration data (large memory, power)
o Mapping of applications from HLLs is difficult, as the granularity of the FPGA does not match that of the operations in the source code

Slide 95 General vs.
Special purpose
Example: Implement the function using 2-input LUTs.
o LUTs are grouped in logic blocks (LBs).
o Connections inside an LB are efficient (direct).
o Connections outside LBs are slow (connection matrix).
[Figure: F built from 2-input LUTs over inputs (A, B, D), (A, C, D) and (A, B, C), connected through the connection matrix. DELAY!]

Slide 96 General vs. Special purpose
Idea: Implement frequently used blocks as hard-core modules in the device.
[Figure: the same function F with inputs A, B, C, D implemented in a hard-core block beside the connection matrix]

Slide 97 CGRA devices: Features
o Overcome the inefficiency of FPGAs by providing efficiently implemented coarse grained functional units
o A processing element (PE) is usually an 8-bit, 16-bit or 32-bit tiny ALU which can be configured to execute only one operation in a given period (until the next configuration)
o Memory exists between and inside the PEs.
o The functional units communicate via busses or can be directly connected using programmable routing matrices
o Communication among the PEs can be either packet oriented (on busses) or point-to-point (using crossbar switches)

Slide 98 CGRAs
o Try to overcome the disadvantages of FPGA-based computing solutions by providing multiple-bit wide datapaths and complex operators instead of bit-level configurability.
o A wide datapath allows the efficient implementation of complex operators in silicon.
o The routing overhead generated by having to compose complex operators from bit-level processing units is avoided.
o Fewer processing elements (LUTs vs. PEs) result in less time to configure and reconfigure the devices.

Slide 99 CGRAs: Advantages
1. Very efficient in terms of speed (no need for connections over the connection matrix for basic operators)
2. Direct functional units instead of LUT implementations
3. Massive reduction of configuration time
4. Drastic complexity reduction of the P&R problem.

Slide 100 Fine-grained Vs.
Coarse-grained
[Figure: an ALU as a coarse grained element]
Fine grained
o Advantages: fine control over bit-width, bit-level operations, CAD tools available, flexible
o Drawbacks: speed, power consumption, time to configure
Coarse grained
o Advantages: time to configure, less routing, better instruction density, better cycle times, small configuration sizes
o Drawbacks: little CAD support, less flexible!

Slide 101 FPGAs vs. CGRAs
[Figure: ASIC, ISP, DSP, FPGA and CGRA placed on Flexibility and Performance axes]

Slide 102 Coarse-grained Architectures
Since 1990, several approaches for coarse grained reconfigurable architectures have been published.
We can classify the different architectures according to three properties:
1. The basic interconnect structure (crossbar, mesh)
2. The width of the datapaths (4, 8, 16, ...)
3. The reconfiguration model (static, dynamic, ...)

Slide 103 Coarse-grained Architectures

Slide 104 The Chess
o HP Labs, Bristol, England, in 1999
o Intended for the implementation of multimedia applications.
o 2-D array of processing elements
o Contains more FPGA-like routing resources.
o No reported software or application results

Slide 105 Chess Architecture
o The architecture consists of ALUs and switchboxes, with each switchbox being surrounded by four ALUs and each ALU being surrounded by four switchboxes.
o The components are arranged in a chessboard-like pattern.
o The ALUs and all routing resources are four bits wide.
o The routing structure consists of segmented four-bit buses of different lengths.
o Connections are sufficient to link an ALU to all of its eight surrounding neighbors.

Slide 106 Chess Basic Block
The ALU features two inputs and one output for four-bit data words.
The instruction set features 16 operations, including add and subtract, nine logical operations, and several test operations.
In order to enable operations which cannot be mapped onto an ALU, or to make fine-grained interconnections not supported by the four-bit wiring, the RAMs in the switchboxes can also be used as a 4-input, 4-output LUT.

Slide 107 Summary
o FPGAs have several disadvantages for computational tasks (routing overhead, configuration time).
o Architectures are moving in the direction of coarse-grained blocks.
o CGRAs provide multiple-bit wide datapaths and complex operators instead of bit-level configurability.
o CGRAs are a promising platform
  o High throughput, power efficient computation
o Applicability of CGRAs critically hinges on the compiler
  o Software support is still a major issue.
