Field Programmable Gate Arrays
TIE-50206 Logic Synthesis
Arto Perttula
Tampere University of Technology
Spring 2016
Outline
• FPGA Architectures
– Logic, interconnects, clocking, integrated macros
– Selection criteria
• Snippets from commercial FPGA architecture:
Stratix III
– Details regarding logic, interconnects etc.
– DRAM interface case study
10.2.2016 Arto Perttula 2
SPLDs
• First PLDs were PROMs in 1970
– OR gates were programmable
• Evolution led to Programmable
Logic Arrays (PLA) in 1975
– Both ANDs and ORs programmable
• These are classified as Simple
Programmable Logic Devices
[Figure: PROM with inputs a, b, c, a predefined AND array decoding addresses 0-7 (!a&!b&!c ... a&b&c), and a programmable OR array driving outputs w, x, y]
[Figure: PLA with inputs a, b, c, a programmable AND array, and a programmable OR array driving outputs w, x, y]
CPLDs
• Complex Programmable Logic Devices were introduced circa 1980
• Main idea: most of the building blocks do not need to be (or cannot be) connected to each other
• Usually not every link is required, and some pins are unidirectional
• Significant savings in interconnection area
– => Programmable interconnections nonetheless
• Often non-volatile
• None/few hard macros
[Figure: CPLD with SPLD-like blocks and input/output pins around a programmable interconnect matrix]
FPGAs
• Xilinx developed the first in 1984
• The AND- and OR-arrays are replaced by Programmable Logic Blocks
• Contains essentially a LUT and a flip-flop
• Look-up Table (LUT) implements a truth table
– For example, a 4-input LUT can implement any function that has four inputs and one output
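To make the truth-table idea concrete, here is a minimal Python sketch of a LUT. The helper names `make_lut` and `lut_read` and the example function are illustrative assumptions, not part of any vendor tool: the stored list plays the role of the LUT's configuration bits, and the inputs simply form an address into it.

```python
def make_lut(func, n_inputs):
    """Precompute the truth table of func; the list models the LUT's stored bits."""
    table = []
    for address in range(2 ** n_inputs):
        # Bit i of the address is the value of input i.
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the LUT: the input values form an address into the table."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


# Arbitrary 4-input example function: y = (a AND b) XOR (c OR d).
lut4 = make_lut(lambda a, b, c, d: (a & b) ^ (c | d), 4)
print(len(lut4))                    # number of configuration bits
print(lut_read(lut4, 1, 1, 0, 0))   # output for a=1, b=1, c=0, d=0
```

Any other 4-input, 1-output function only changes the 16 stored bits, which is why the same hardware cell can implement all of them.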
[Figure: A very simple programmable logic block: a 3-input LUT (inputs a, b, c), a flip-flop (d, q, clock, reset), and a mux selecting the output y]
FPGA Basic Logic Cells
• Include fixed amount of combinational
logic and registers
a) The LUT is the prevailing structure; it is flexible
b) MUX-based structures could also do the trick
• Usually FPGAs contain 1-4 programmable registers per logic cell
• In some architectures, the LUTs can also be used as tiny memory banks (Xilinx, Altera Stratix III)
[Figure: (a) LUT-based and (b) MUX-based logic cell]
FPGA Architecture
• The logic cells are typically grouped into larger arrays of logic blocks
– Altera Stratix 2: Logic Array Block (LAB)
• Equals 8 ALM (Adaptive logic modules)
• Equals ~16 basic logic cells (LC, each a 4-LUT + FF)
– Xilinx Configurable Logic Block (CLB)
• Equals 4 Slices = 8 Logic cells
• And what’s best, these names tend to change with every new device and also new terms are introduced…
• E.g., Xilinx CLB is 2-4 slices depending on the device…
• However, FPGA architectures differ more and more which makes the direct
comparison a bit harder
• Always report #LUT, #FFs, memory bits, #MUL from your own design
Clock Networks in FPGAs
• FPGAs are designed for synchronous logic
– This is the case with 99% of FPGAs even if some exotic devices exist
• FPGAs include clock networks and support different clock domains within the device
• Clock networks are hierarchical
1. Global clocks (Gclk)
2. Regional clocks (Rclk) (may have several)
– Number of Rclk >> Gclk (e.g., 100 Rclks and 16 Gclks)
• Gclks provide a zero-skew clock network spanned over the whole chip
• Rclks provide a zero-skew network within some portion of the chip
– Naturally, all the blocks using given Rclk must reside in same portion
• Someone should tell that to the EDA tools also…
Generating Clocks
• The mystical ”clk” signal is generated by
a) Input (crystal) oscillator
– Input to the FPGA device (dedicated pins) which buffers the signal
– This can be directly used
b) DLL/PLL circuitry that multiplies/divides the input clock
– Locks to the required frequency, may provide phase shift
– E.g., create a stable 200 MHz clock from 50 MHz input clock
c) Internal feedback-loop clocks or clock dividers
– The most hazardous way
– Doable, but don’t use this
• Prone to variations on process, voltage, and temperature
• Static timing analysis often cannot be used (verification very difficult)
• Place-and-route may change timing (exact timing cannot be set by tools)
• The pulse width will change if you migrate to a different device
FPGA Devices 1: SRAM-Based
• SRAM is used to configure the interconnection switches and LUTs
• Majority of FPGAs nowadays
• Usually implemented with leading-edge technology
• Can be re-programmed arbitrarily many times
– Ideal for prototyping and rapid development
• Since SRAMs lose their contents when powered off, an external device (+non-volatile
memory) is required to program them during boot-up
• One concern is security: the device configuration bitstream can be copied during the
programming
– Bitstream encryption can prevent this
• Manufacturers include Altera, Atmel, Lucent, and Xilinx
FPGA Devices 2: Antifuse
• Need special on-chip programming circuitry (may be big), but retain the program when powered off (non-volatile)
– Fast boot, good security, low power
– One-time programmable (OTP) only
• No need for external circuitry
• They are rad-hard (quite immune to radiation effects)
– Good for, e.g., space applications
• Compared to the SRAM-based with same technology, antifuses have
– Better density (logic gates/mm2)
– Lower interconnect delay
– Faster
– BUT! Available chips are usually several technology generations behind their SRAM counterparts due to the extra processing steps required
• Cancels some of the benefits
• Manufacturers include, e.g., Actel
FPGA Devices 3: EEPROM/FLASH
• Programming is similar to SRAM-based, but non-volatile
– Both re-programmable and fast boot
• Good security
• EEPROM and FLASH 1-bit cells need two (special) transistors
– Typical 1-bit SRAM implementation requires 6 transistors
– => Smaller cells than in SRAM devices
– Faster, higher density
• BUT! Also few generations behind the leading edge
• Some devices integrate small Flash memory but LUT and wire configuration is done
with SRAM
• Manufacturers include, e.g., Actel, Xilinx
FPGA Device Technologies: Summary
Feature | SRAM | Antifuse | E2PROM / FLASH
Technology node | State-of-the-art | One or more generations behind | One or more generations behind
Reprogramming speed (inc. erasing) | Fast | ---- | 3x slower than SRAM
Volatile (must be programmed on power-up) | Yes | No | No (but can be if required)
Power consumption | Medium | Low | Medium
IP Security | Acceptable (especially when using bitstream encryption) | Very Good | Very Good
Size of configuration cell | Large (six transistors) | Very small | Medium-small (two transistors)
Rad Hard | No | Yes | Not really
Instant-on | No | Yes | Yes
Requires external configuration file | Yes | No | No
Good for prototyping | Yes (very good) | No | Yes (reasonable)
Reprogrammable | Yes (in system) | No | Yes (in-system or offline)
Configurable SRAM-Based FPGA
• The device needs a programming file (also called a bit file or bitstream)
– Includes the programming info for each cell of the FPGA
– Usually proprietary format
• Again, a lot of variation across manufacturers and devices
• Each cell and interconnection needs to be configured at start-up
– Programming file size from several kilobytes to megabytes
– Takes time in the order of milliseconds or more
SRAM-Based FPGA Configuration (2)
• Common procedure is to use serial configuration circuit in order to save the PCB area and precious I/O pins
• The process can be visualized as a shift register chain of cells and on every clock tick one cell is programmed
– Millions of cells, slow, similar idea as in scan chain
• The internal implementation of the ”register chain” varies
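The shift-register view of serial configuration can be sketched in a few lines of Python. This is only a behavioural illustration: the function names, the 8-bit bitstream, and the bitstream size and clock used in the time estimate are hypothetical, and real devices implement the chain differently.

```python
def shift_in_bitstream(bitstream, chain_length):
    """Shift a bitstream into a serial chain of cells, one bit per clock tick."""
    chain = [0] * chain_length
    for bit in bitstream:
        # On each tick every cell takes its neighbour's value
        # and the new bit enters the first cell of the chain.
        chain = [bit] + chain[:-1]
    return chain


def configuration_time_s(n_bits, clock_hz):
    """One bit per tick, so the time is simply bits divided by clock rate."""
    return n_bits / clock_hz


bitstream = [1, 0, 1, 1, 0, 0, 1, 0]           # hypothetical tiny bitstream
cells = shift_in_bitstream(bitstream, len(bitstream))
print(cells)                                    # first bit ends up in the last cell

# Hypothetical numbers: 1 Mbit bitstream, 20 MHz configuration clock.
print(configuration_time_s(1_000_000, 20_000_000))
```

The first bit shifted in travels furthest, so after the last tick the chain holds the bitstream in reverse order. With one bit per tick, configuration time grows linearly with the number of cells.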
[Figure: FPGA with I/O pins/pads and SRAM cells linked into a serial chain from "Configuration data in" to "Configuration data out"]
HARD MACROS AND
GIGABIT TRANSCEIVERS
Transceiver is a device that has both a transmitter and a receiver
which are combined and share common circuitry
Integrated Hard Macros
• The devices have an increasing number of integrated hard macros
– Not built from LUTs (much faster and smaller)
– Included in every device regardless of usage, so everyone pays for them
– Includes the most common functions
– Allow some configuration even if they are “hard”
• E.g.,
– PLL/DLLs for clock manipulation
– Memories
– High-speed multipliers with accumulate (MAC)
– Integrated microprocessors (e.g., ARM, PowerPC)
– High speed I/O link controllers
Basics for I/O Transceivers
• Parallel buses have long been the prevailing data transmission type, but high-speed parallel
wiring is very hard to manage
– Signal integrity issues (crosstalk, susceptibility to noise etc., track length on PCB)
• Serial communication simplifies many things
– Unidirectional point-to-point links, only two devices instead of multi-master (compare to shared bus)
– Necessitates higher frequency than parallel communication
[Figure: FPGA transceiver block connected through differential pairs: transmit (TX) to the other device, receive (RX) from the other device]
Differential Signaling
• Only the difference between the signal levels matters
– The two lines always carry complementary values
• If the tracks are close to each other, noise will affect both lines similarly, so the difference stays the same
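A small numeric sketch shows why the difference survives noise that is common to both tracks. The voltage levels (0 V / 1 V), the noise values, and the function names are hypothetical and chosen only for illustration.

```python
def single_ended_receive(samples, threshold=0.5):
    """Decide each bit by comparing one line against a fixed threshold."""
    return [1 if v > threshold else 0 for v in samples]


def differential_receive(p_samples, n_samples):
    """Decide each bit from the sign of the difference between the two lines."""
    return [1 if p - n > 0 else 0 for p, n in zip(p_samples, n_samples)]


bits = [0, 1, 0, 1, 1, 0]
noise = [0.0, -0.7, 0.8, 0.0, -0.6, 0.9]    # hypothetical noise spikes in volts

# Single-ended line: signal level plus noise.
single = [b + e for b, e in zip(bits, noise)]
# Differential pair: complementary levels, the same noise coupled to both tracks.
rxp = [b + e for b, e in zip(bits, noise)]
rxn = [(1 - b) + e for b, e in zip(bits, noise)]

print(single_ended_receive(single))      # corrupted by the spikes
print(differential_receive(rxp, rxn))    # noise cancels in the difference
```

The sketch assumes the noise on both tracks is identical; in practice the cancellation is only as good as the coupling between the tracks.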
[Figure: (a) Traditional standard input (IN) and (b) differential pair (RXP, RXN) between the outside world and the FPGA, with noise spikes on the 0/1 signal levels]
Integrated Gigabit Transceivers
• E.g., Stratix 3 supports speeds up to 1.25 Gbps
– Fastest implementations are 3-4x faster
• However, we can group a set of transceivers so we can further improve the data rate
– Using 8 transceivers would result in, e.g., 10 Gbps speed
– Extra logic is required to pack and unpack the data being sent from device to device
• One should try to utilize the FPGA board’s capabilities as much as possible instead of
developing own proprietary solutions
• Sidenote: e.g., 4 Gbps serial link => data transfer rate 4 GHz = 0.25 ns period
– Speed of light is 299,792,458 m/s. Light traverses 7.5 cm during one period, electrons
somewhat less…
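The numbers on this slide are easy to re-derive. The following Python sketch (function names are illustrative) computes the aggregated raw rate of grouped transceivers, the bit period of a serial link, and the distance light travels during one bit period.

```python
SPEED_OF_LIGHT_M_S = 299_792_458


def aggregate_rate_gbps(lanes, rate_per_lane_gbps):
    """Raw aggregated rate when several transceivers are grouped."""
    return lanes * rate_per_lane_gbps


def bit_period_ns(rate_gbps):
    """Duration of one bit on a serial link (1 Gbps corresponds to 1 ns per bit)."""
    return 1.0 / rate_gbps


def light_distance_cm(period_ns):
    """Distance light travels in vacuum during the given time."""
    return SPEED_OF_LIGHT_M_S * period_ns * 1e-9 * 100


print(aggregate_rate_gbps(8, 1.25))            # 8 lanes at 1.25 Gbps each
print(bit_period_ns(4))                        # 4 Gbps link
print(round(light_distance_cm(bit_period_ns(4)), 2))
```

This only covers raw line rates; the packing and unpacking logic mentioned above is not modelled.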
FPGA PERFORMANCE AND
SELECTION CRITERIA
[Table: comparison of implementation alternatives (including TTA) with unit price [$] ranges 10-100, 400-600, and 1k-18k]
Orig. table: [P. Jääskeläinen, et al., "TCEMC: A Co-Design Flow for Application-Specific Multicores", SAMOS XI, July 2011, pp. 85-92]
Typical Application Domains
ASIC:
– Mass products, consumer electronics
– Mobile phones
– Computers
– MP3-players
– Digital cameras
FPGA:
– Industrial (/military) electronics
– Some consumer products (e.g., DVB)
– Cell phone base stations
– Factory automation
– Internet routers
– ”Glue logic”
F-16 AN/APG-68 radar Programmable Signal Processor uses Altera Stratix II
Mars rover project used Actel and Xilinx FPGAs
See also: http://www.altera.com/corporate/cust_successes/customer_showcase/view_industry/csh-vindustry.jsp
Tools
• You need a simulator, synthesizer, place-and-route, timing analyzer, and programmer
– In practice, also virtual logic analyzer and design viewers (schematic, RTL, technology, chip level) are
invaluable
• The basic set of tools is provided by the FPGA vendor
• Typically these have sufficient features and are good enough
• Most of all, they’re cheap!
• Development boards can be obtained fairly cheaply (~few hundred to few thousand $)
• The major players like Mentor, Synopsys, and Cadence also offer tools for synthesis (and recently
for physically aware synthesis also)
– May have extra features / better performance
– Not necessarily required
Design Performance: Speed
• Total delay in an FPGA is the sum of three factors:
1. Delay from FF clock to FF Q (constant)
2. Interconnect delay
3. Logic cell delay (LUT)
• Interconnect delay and #LUTs in path vary depending on logic function
• Interconnect delay depends on the number of switches in the path (which form the path from source to
destination) and the route length
• Typically, routing delay is 60-80% of total delay of critical path!
• Maximum operating frequency of the FPGA (generally)
1. Big designs ~100 MHz
2. Small designs ~up to 200 MHz
– Note that most SoCs operate around 1 GHz
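The three-term delay sum can be turned into a quick estimate of the maximum clock frequency. In this Python sketch all delay values are hypothetical, picked so that the result lands near the figures quoted above. Like the list above, it ignores other terms such as the setup time of the receiving flip-flop.

```python
def path_delay_ns(t_clk_to_q_ns, lut_delays_ns, route_delays_ns):
    """Sum of the three factors: FF clock-to-Q, logic cell (LUT) and interconnect delays."""
    return t_clk_to_q_ns + sum(lut_delays_ns) + sum(route_delays_ns)


def fmax_mhz(delay_ns):
    """Highest clock frequency whose period still covers the path delay."""
    return 1000.0 / delay_ns


def routing_share(route_delays_ns, total_ns):
    """Fraction of the path delay spent in the interconnect."""
    return sum(route_delays_ns) / total_ns


# Hypothetical critical path: 4 LUTs and 5 routing segments.
luts = [0.5, 0.5, 0.5, 0.5]
routes = [1.5, 1.5, 1.5, 1.5, 1.5]
total = path_delay_ns(0.5, luts, routes)

print(total)                         # path delay in ns
print(fmax_mhz(total))               # resulting clock frequency in MHz
print(routing_share(routes, total))  # share of routing in the path delay
```

With these assumed values, routing accounts for three quarters of the path, which is in line with the 60-80% rule of thumb.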
Design Performance: Area
• Very dependent on the application
– FPGA is good for register-heavy designs
• The more area the design takes, the more difficult it is to route => lower clock frequency
• Largest high-end FPGAs can hold very complex architectures,
comprising several soft RISC processors and other hardware
– ”Multi-million ASIC gates”
• Design with small area can be fitted into cheaper FPGA
• The 3rd basic measure, power, is getting more important
Separation of Targets
• Strong separation between high-end and low-end FPGA devices
1. Low-end
– Low cost, lower logic capacity, less memory, fewer integrated hard macros
– Target is the traditional cost-sensitive consumer products and glue-logic domain with possible fancy features, such as a single simple soft processor
– Price from few to tens of euros, cheaper for high quantities
2. High-end are highly optimized, usually for speed and large capacity
– Pricing thousands of euros/device, up to 10k-range for the best (depends again on the volume)
– Target is the traditional ASIC domain
– When high performance is required but not enough products are manufactured to compensate for ASIC’s higher
NRE costs
3. Emerging trend is also to offer structured ASIC of the design
– The design of an FPGA is ”burned” into a structured ASIC that cannot be re-programmed. Altera calls this ”HardCopy” and Atmel uses the term ”ULC”.
– Saves power and area, increases speed due to removal of the programming resources
– EETimes: power -40%, area -70%, performance +50-100%, price -30%
[http://www.eetimes.com/electronics-news/4124922/Altera-Unveils-HardCopy-for-Stratix]
[http://www.altera.com/products/devices/hardcopy-asics/about/migration/hrd-migration.html]
Hard-Copy FPGA
Figure: [V. Betz, "Will Power Kill FPGAs?," ACM/SIGDA International Symposium on FPGAs, Monterey, CA, 2006]
http://www.eecg.toronto.edu/~vaughn/papers/fpga2006_power_panel.pdf
Table: [Generating Functionally Equivalent FPGAs and ASICs With a Single Set of RTL and Synthesis/Timing Constraints, Altera white paper, WP-01095-1.2, February 2009, ver. 1.2]
• NRE is reduced from that of an ASIC, e.g., by 2x-3x, and consequently the cost break-even between FPGA and hard-copy might be as low as 5k-10k units. [programmablelogicZONE Products for the week of May 19, 2008, http://www.en-genius.net/site/zones/programmablelogicZONE/product_reviews/plp_051908]
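The break-even point follows from comparing total costs: the structured ASIC adds a one-time NRE cost but saves money on every unit. The Python sketch below uses entirely hypothetical prices (only the roughly 30% unit-price reduction echoes the EETimes figure quoted earlier) and assumes the FPGA itself carries no NRE.

```python
import math


def break_even_units(nre_structured, unit_price_fpga, unit_price_structured):
    """Smallest volume at which NRE + n * structured price <= n * FPGA price."""
    saving_per_unit = unit_price_fpga - unit_price_structured
    if saving_per_unit <= 0:
        raise ValueError("structured ASIC must be cheaper per unit to break even")
    return math.ceil(nre_structured / saving_per_unit)


# Hypothetical numbers: 200 k$ NRE, 100 $ FPGA, 70 $ structured ASIC (-30%).
units = break_even_units(200_000, 100, 70)
print(units)
```

With these assumed inputs the break-even volume falls inside the 5k-10k range mentioned above; different prices shift it accordingly.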
FPGA Device Selection Criteria #1
1. Circuit capacity
– Amount of logic elements and registers, logic element size, (routing resources)
– Amount of RAM, types of RAM
– Required hard macros
– I/O signal routing (How the location of an I/O pin affects the routing)
2. Number of I/O signals and supported standards
3. Pricing
– Unit price in volume production
– Development cost
– Ranges a lot depending on the amount, specific device and package (and the client)
– Prices are subject to rapid changes => long-term contracts should be carefully considered
– FPGAs are rather expensive, e.g., 5-150 euros, and cheapest microcontrollers are ~0.95-5 euros
4. Temperature range, radiation-hardness
5. Power consumption
FPGA Device Selection Criteria #2
6. Programming style
– Re-programming, flexibility vs. security
– External components required and their price
7. Future
– Availability of the chips in volume and in time
– Structured ASICs available?
– Compatible pin/package mapping between different flavors of the device
8. Voltage levels, inside the chip and for I/O
– Compatibility with PCB and adequate noise margins
9. Circuit speed
– Basic cell speed, routing speed, routing delay predictability
– Affects only the most high-performance designs
10. Global signals – signals that go to every cell (clk, reset)
– Clock networks, clock generation inside the chip, dedicated clock I/O pins
– Dedicated global reset pin
11. Development environment
– CAD tools, usability, support
12. Packaging (suitability for chosen PCB assembly etc.)
Availability and Life Span
• The digital CMOS technology develops rapidly
– New devices are introduced faster and faster
• The life span of a certain device is dictated by its demand
– Widely used devices are more certain to stick around for years
– Very widely used devices may live quite long (even 10 years, e.g., Xilinx XC3000, Altera Flex 10k)
• The old device may be convertible to a new device without modifications
– Package, pins, operating voltage, configuration
– Operating voltage tends to change between technology generations and that causes most of the problems
with compatibility
• The manufacturer may give some guarantees of life span
• Choosing between different vendors may be complicated. Experience with a certain manufacturer's devices may be the dominant factor.
• Relying purely on soft, FPGA-vendor-independent IP cores helps in porting the system to another device
Physical Size
• The actual size of the IC is not available
• Examples…
• 8:1:1 user I/O/Gnd/V ratio to reduce the loop inductance in the package
Logic Array Block (LAB)
• Each Logic Array Block (LAB) consists of ten Adaptive Logic Modules (ALM) + interconnection lines
• Some LABs can be implemented as Memory LAB (MLAB)
– ALM is used as 64x1 or 32x2 RAM block
• LABs may operate in low-power or high-performance mode; the synthesis tool automatically sets non-critical paths to low power and critical paths to high performance
Source: http://www.altera.com/products/devices/stratix3/overview/power/st3-power.html
ALM Contents
• ALM operating modes
1. Normal
2. Extended LUT mode
3. Arithmetic
4. Shared Arithmetic
5. LUT-Register
• There are 8 general-purpose data inputs, carry in and shared arithmetic
connector from previous ALM or LAB, and register chain connection
• LAB-wide signals
– Clock, async clear, sync clear, sync load, clock enable
ALM Modes
• Usually dictated by the synthesis software and does not need
manual tweaking
• Modes other than normal can be used to implement special structures, such as fast arithmetic
– Circuits that need a lot of arithmetic, e.g., all the counters and comparators
• Extended LUT mode allows specific set of 7-input functions to be
implemented (a mux-function)
• LUT-register mode forms one DFF from the 2 LUTs of ALM (so the
ALM has 1+2=3 flip-flops)
Register Packing
• Device can use the register and the combinational logic for unrelated functions
• Improves utilization
Hard Macros: TriMatrix Memory
• Configurable, fast (up to 600MHz) on-chip SRAM memories
• Various bit widths supported, can be grouped together to form different sized
memories
TriMatrix Memories
• Packed mode: pack two single-port
memories to one physical dual-port memory
• Simple dual port: simultaneous read and
write
• True dual-port: any combination of
simultaneous two operations of read and
write supported
– e.g., rd+rd, wr+wr, wr+rd, rd+wr
Hard Macros: DSP Blocks
• High-performance, power-optimized, fully registered and pipelined multiplication
• Number of DSPs range from 27 to 112 (>54 36x36 multipliers or more)
– Not to be confused with DSP processors…
• Natively supported
– 9-bit, 12-bit, 18-bit, 36-bit word lengths
– 18-bit complex multiplications
– Floating-point arithmetics: 24-bit for single precision and 53-bit for double precision
– Signed and unsigned input support
• Built-in addition, subtraction and accumulation units to combine multiplication results
• Cascading 18-bit input bus to form tap-delay line for filtering applications
• Cascading 44-bit output bus to propagate output results from one block to the next block without
external logic support
• Rich and flexible arithmetic rounding and saturation units
• Efficient barrel shifter support, loopback capability for adaptive filtering
LAB Interconnect
• The 10 ALMs within a LAB are connected with local interconnect
• Moreover, there are three
dedicated paths between ALMs:
1. Register Cascade – for a fast
shift register
2. Carry-chain – for fast
addition/subtraction
3. Shared Arithmetic chain – for
fast adder trees
C4 Interconnect
• Spans 4 interfaces in
the same column
– 4 LABs
– 1 DSP block
– ½ M144K memory
• LAB may drive C4 both
on its left and right side
(From fig 3-3)
DSP Blocks
• A DSP block is divided into four blocks
– Interface with four LAB rows on the left
and right
• Can be cascaded by fast local links
• One DSP block corresponds to roughly
60-100 LEs, depending on parameter
widths and types
Clock Resources
• The clock networks are zero-skew networks (i.e., heavily buffered and delay-compensated)
• The clock lines can also be used to drive other high-fanout signals such as device-wide reset
(Notes 1-4: depends on device type)
Global and Regional Clock Networks
• Global clocks can be used to drive logic and other blocks throughout the device
– 16 GCLKs
• Regional clocks can only be used in one device quadrant
• Only certain input pins can be connected to clock network
PLL Properties
• Main goal of a PLL is to synchronize the phase and frequency of an internal or
external clock to an input reference clock
• Counters for divide and multiplication to get required frequency
– E.g., 50 MHz clk * 2/3 => ~33 MHz clk
– Parameters m and n in range 1-512 (f_out = f_in*m/n)
• Lock time: how long it takes to get the required frequency stabilized (~1ms)
• Jitter: how much the duty cycle/frequency varies
– E.g., cycle-to-cycle jitter: two consecutive cycles’ periods differ at most by 17.5 ps
– E.g., period jitter: with 99.99% probability clock edge time differs at most by ±175 ps from
ideal clock (when measured over 10k cycles)
• Duty cycle: up/down times (e.g., 50/50)
• Phase shift: relation between input and output clock edges
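The relation f_out = f_in*m/n can be explored with a short Python sketch. The function names are illustrative, and the search ignores any VCO frequency limits a real PLL imposes; it only picks counter values m and n within the 1-512 range stated above.

```python
from fractions import Fraction


def pll_output_mhz(f_in_mhz, m, n):
    """Output frequency for multiplier m and divider n: f_out = f_in * m / n."""
    return f_in_mhz * m / n


def find_pll_ratio(f_in_mhz, f_target_mhz, max_count=512):
    """Find m/n (each at most max_count) closest to the requested ratio."""
    ratio = Fraction(f_target_mhz) / Fraction(f_in_mhz)
    # limit_denominator returns the closest fraction with a bounded denominator.
    ratio = ratio.limit_denominator(max_count)
    m, n = ratio.numerator, ratio.denominator
    if not 1 <= m <= max_count:
        raise ValueError("required multiplier is outside the counter range")
    return m, n


print(find_pll_ratio(50, 200))            # 50 MHz in, 200 MHz out
print(round(pll_output_mhz(50, 2, 3), 2)) # the 2/3 example above
```

Using exact fractions avoids rounding problems when the target is not an integer multiple of the input clock.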
Input/Output Pins
• The way to interface external components, such as displays, buttons, and
memories
• Number of I/O pins depends on the package and device
– 296-1120 user I/O pins available in Stratix III
– Many pins are required for voltage and ground (not accounted in the above)
• A pin can be in, out, or three-stated (programmable)
• Stratix device also includes dynamic series and parallel on-chip termination
to provide I/O impedance matching and termination capabilities
• The I/Os are configurable and support a wide range of standards
I/O Standards and Properties
• Single-ended, non-voltage-referenced and voltage-referenced I/O standards
• Low-voltage differential signaling (LVDS), reduced swing differential signal (RSDS), mini-LVDS, high-speed
transceiver logic (HSTL), and stub series terminated logic (SSTL)
• Single data rate (SDR) and half data rate (HDR – half frequency and twice the data width of SDR) input and output
options
• Up to 132 full duplex 1.25 Gbps true LVDS channels (132 Tx + 132 Rx) on the row I/O banks
• Hard DPA block with serializer/deserializer (SERDES)
• De-skew, read and write leveling, and clock-domain crossing functionality
• Programmable
– output current strength, e.g., 4-16 mA/pin
– slew rate – how fast voltage changes, e.g., 50 Volt/µsec
– delay, e.g., 0-1000 ps
– bus-hold – keeps the state of three-state bus until someone drives it
– pull-up resistor – provides default value if no-one drives, e.g., 25 kΩ
– Hysteresis/toggle point
• Open-drain output
I/Os During Configuration
• Configuration has 3 phases: reset, configuration and initialization
• Before and during configuration, all user I/O pins are tri-stated
– Stratix, Arria, and Cyclone series have weak pull-up resistors on the I/O pins, which are on before and during configuration
• Init phase initializes the internal logic and registers and enables I/O buffers
• User can delay configuration by holding the nCONFIG low
[Configuring Altera FPGAs, Configuration Devices Vol. 1, Altera Corporation, Ver. 3.1, CF51001-3.1, Aug. 2013]
Design Security in Stratix III
• Configuration bitstream may be encrypted with 256b AES
– The stream that is stored in the Flash
– Available with only certain device configuration modes
• The key is stored in FPGA device and cannot be read out
– The key is also scrambled
• The configuration-file read-back is not supported
• Tamper Protection bit
– Once set, only a bitstream encrypted with the stored key may be used to program the FPGA
• Volatile and non-volatile key supported
– Volatile needs an external battery
– Non-volatile is one time programmable (fuses)
Stratix III Family Features
• Compare sizes to: Nios II/f CPU core ~2000 ALUT, SDRAM ctrl 300 ALUT, motion
estimation 6900 ALUT, DCT-Quant 2100 ALUT
Stratix II LAB Parameters
• Stratix II data, did not find for Stratix III
Requirements in a Nutshell
• SDRAM is synchronous, hence we must provide the clock for SDRAM
• A controller is constructed in FPGA
– Fetches and stores data; refreshes memory periodically
• Both the SDRAM and the controller must be usable with a configurable frequency up to 133 MHz
– Requires (static) computation of the timing parameters
• CAS latency (column access latency) increases with frequency
• Required refresh period
• Note that the example is specific to the SDRAM chip, FPGA device, and PCB
– Basics apply in general, but one has to adapt the actual values to one's own environment
Practical Matters in VHDL
• Register the outputs of the FPGA
– Pins include special I/O registers; you should instruct the place-and-route tool to use these
– Can be specified as VHDL attributes (useioff)
entity sdram_controller is
...
data_to_sdram2hibi_out : out std_logic_vector(31 downto 0);
...
attribute useioff : boolean;
attribute useioff of data_to_sdram2hibi_out : signal is true;
attribute useioff of sdram_data_inout : signal is true;
end;
• 3.3V LVTTL I/O standard used
– Default setting
– Defined in Quartus II
Clock for the SDRAM
• We must provide a clock to the SDRAM controller
• No need to synchronize the data if we set the timing constraints correctly because both have the same frequency
• PLL is used to generate the clock for FPGA SDRAM controller and, e.g., 180° phase shifted clock for SDRAM (to obtain high frequencies)
– Typically memories have large setup and hold time requirements
– Thus we want that the clock rising edge is in the middle of the data valid period
• However, we must take into account several factors that affect timing
– Parameters of the FPGA and SDRAM I/O pins – timing varies with device family and speed grade
– Pin location on the FPGA – I/O pins connected to row routing have different timing than column routing
– Logic options used during the Quartus II compilation – Logic options such as the Fast Input Register and Fast Output Register logic affect the
design fit. The location of logic and registers inside the FPGA affects the propagation delays of signals to the I/O pins.
– SDRAM CAS latency
[Figure: setup time ts and hold time th relative to the clock edge]
Example SDRAM Timing
Note that the required tds and tdh (=ts and th) may have different durations
Notes on DRAM Timing
• Each transaction takes several cycles
– Might be hundreds of cycles in high-end CPUs
– Bank selection/row address first, and column address after few cycles
– Data fetch is several cycles, improves very little with technology
• Page miss takes about 50-60 ns
– Fetch time depends on previous accesses (same bank or row?, read after write takes longer than write after write…)
– Data is transmitted in bursts, e.g., min 4 or 8 words
– Refresh takes some time
• Access times are unpredictable and efficiency way less than 100% (1 word/cycle)
– 1 word accesses scattered randomly are very inefficient
• DDR transfers data on both rising and falling edges
– Reduces the data transmission time but not the other overheads
How to Calculate the Phase Shift?
• Wrong clocking will cause problems either
1. in setup time or hold time
2. in memory write or read operation
3. in FPGA side or inside the DRAM
• SDRAM clock edge might be
– before FPGA clock
– simultaneous to FPGA clock
– after the FPGA clock
• Certain phase shift improves one thing and worsens the other
• We must check many cases and seek balance
Find the Critical FPGA Params
• Note that these calculations show an estimate and basic principle only
• The unaccounted (design-specific) parameters are
– Signal skew due to delays on the printed circuit board – These calculations assume zero skew
– Delay from the PLL clock output nodes to destinations – These calculations assume the delay from the PLL SDRAM-clock output-node to the pin is the same as the delay from the PLL controller-clock output-node to the clock inputs in the DRAM controller. If these clock delays are significantly different, you must account for this phase shift in your window calculations.
[Figure: critical FPGA parameters and the transfer each one affects: write (FPGA -> DRAM) or read (DRAM -> FPGA)]
How Early Can SDRAM Clock Be?
• How early can SDRAM clock be w.r.t. controller clock
• Select the lesser of Read Lag or Write Lag
Read Lag = tOH(SDRAM)– tH_MAX(FPGA)
Read Lag = 2.5ns –(–5.607ns)
Read Lag = 8.107ns
Write Lag = tCLK – tCO_MAX(FPGA)– tDS(SDRAM)
Write Lag = 20ns – 2.477ns
Write Lag = 17.523ns
• Read lag is smaller: 8.107 ns
• Remember that “lag” is negative with respect to controller clock edge (in Altera terminology)
How Late Can SDRAM Clock Be?
• How late can SDRAM clock be w.r.t. controller clock
• Select the lesser of Read Lead or Write Lead
Read Lead = tCO_MIN(FPGA)– tDH(SDRAM)
Read Lead = 2.399ns – 1.0ns
Read Lead = 1.399ns
Write Lead = tCLK – tHZ(3)(SDRAM)– tSU_MAX(FPGA)
Write Lead = 20ns – 5.5ns – 5.936ns
Write Lead = 8.564ns
• Read lead is smaller: 1.399 ns
• ”Lead” is positive with respect to controller clock (in Altera terminology)
Select the Phase Shift
• Read lag: -8.107 ns
• Read lead: 1.399 ns
• Data valid region is thus (read lag to read lead) = -8.107 ns to 1.399 ns
• Safest point is in the middle:
• (-8.107 + 1.399 )÷ 2 = –3.35ns
Phase shift the clock to SDRAM by -3.35 ns
• Clock edge is earlier in SDRAM than in controller
[Figure: SDRAM clock vs. controller clock; period 20 ns, SDRAM clock edge 3.35 ns earlier]
Green region highlights the legal phase shifts.
This example ought to work also without phase
shift, but shifting adds tolerance and enhances
dependability
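The window calculation of the last three slides can be reproduced with a short Python sketch. The timing values are the ones used in the slides. The function and key names are illustrative. The slide's numeric substitution for Write Lag shows no value for tDS(SDRAM), so it is taken as 0 here, which is an assumption. The conversion of the shift into degrees is an added convenience, not part of the slides.

```python
def sdram_clock_window(t):
    """Return (earliest, latest, shift_ns, shift_deg) for the SDRAM clock."""
    read_lag = t["tOH_sdram"] - t["tH_max_fpga"]
    write_lag = t["tCLK"] - t["tCO_max_fpga"] - t["tDS_sdram"]
    read_lead = t["tCO_min_fpga"] - t["tDH_sdram"]
    write_lead = t["tCLK"] - t["tHZ_sdram"] - t["tSU_max_fpga"]

    # Lag is negative w.r.t. the controller clock edge, lead is positive.
    earliest = -min(read_lag, write_lag)
    latest = min(read_lead, write_lead)

    shift_ns = (earliest + latest) / 2          # middle of the legal window
    shift_deg = 360.0 * shift_ns / t["tCLK"]    # same shift as a phase angle
    return earliest, latest, shift_ns, shift_deg


timing_ns = {
    "tCLK": 20.0,
    "tOH_sdram": 2.5,
    "tH_max_fpga": -5.607,
    "tCO_max_fpga": 2.477,
    "tDS_sdram": 0.0,       # assumption: no value is shown in the slide's substitution
    "tCO_min_fpga": 2.399,
    "tDH_sdram": 1.0,
    "tHZ_sdram": 5.5,
    "tSU_max_fpga": 5.936,
}

earliest, latest, shift_ns, shift_deg = sdram_clock_window(timing_ns)
print(f"window {earliest:.3f} ns to {latest:.3f} ns")
print(f"phase shift {shift_ns:.3f} ns ({shift_deg:.1f} degrees)")
```

Because the read terms limit both ends of the window here, a non-zero tDS would not change the selected shift unless it exceeded about 9.4 ns.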
Summary
• And now SDRAM works
– Of course, after this you must use a test block/program that just reads
and writes the memory
– Then you can try out different phase shifts to validate the calculations
• FPGAs
– Built from logic cells (LUT+DFF), hard macros, and routing network
– Excellent for prototypes and small volume products, especially when
many special IOs are needed
– 3 config types: SRAM, antifuse and EEPROM/Flash