New ISE 8.1i Software - Xilinx · Synthesis The flow continues with synthesis, which provides an...

ISSUE 55, FOURTH QUARTER 2005XCELL JOURNAL

XILINX, INC.

Issue 55Fourth Quarter 2005

T H E A U T H O R I T A T I V E J O U R N A L F O R P R O G R A M M A B L E L O G I C U S E R S

Xcell journalXcell journalT H E A U T H O R I T A T I V E J O U R N A L F O R P R O G R A M M A B L E L O G I C U S E R S

New ISE 8.1i Software – Faster Timing Closure

DESIGNPERFORMANCEPhysical Synthesis andOptimization with ISE Software

VERIFICATIONEarly Defect Discoverywith Assertion-BasedVerification AcceleratesDesign Closure

PRODUCTIVITYUsing the ISE FoundationArchitecture Wizards

PARTIALRECONFIGURATIONPlanAhead Software as a Platform for PartialReconfiguration

DESIGNPERFORMANCEPhysical Synthesis andOptimization with ISE Software

VERIFICATIONEarly Defect Discoverywith Assertion-BasedVerification AcceleratesDesign Closure

PRODUCTIVITYUsing the ISE FoundationArchitecture Wizards

PARTIALRECONFIGURATIONPlanAhead Software as a Platform for PartialReconfiguration

The Programmable Logic CompanySM

FFoorr mmoorree iinnffoorrmmaattiioonn vviissiittwwwwww..xxiilliinnxx..ccoomm//ssppaarrttaann33ee

Pb-free devicesavailable now

©2005 Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124. Europe +44-870-7350-600; Japan +81-3-5321-7711; Asia Pacific +852-2-424-5200; Xilinx is a registered trademark, and The Programmable Logic Company is a service mark of Xilinx, Inc.

IInnttrroodduucciinngg tthhee nneeww SSppaarrttaann--33EE ffaammiillyy —— tthhee wwoorrlldd’’ss lloowweesstt--ccoosstt FFPPGGAAss

PPrriicceedd ttoo ggoo..

TThhee iinndduussttrryy’’ss ffiirrsstt 110000,,000000--ggaattee FFPPGGAA ffoorr oonnllyy $$22..0000**Spartan-3E Platform FPGAs offer an amazing feature set for just $2.00! You get 100K gates, embedded

multipliers for high-performance/low-cost DSP, plenty of RAM, digital clock managers,

and all the I/O support you need. All this in a production-proven 90nm FPGA with a

density range up to 1.6 million gates.

PPeerrffeecctt ffoorr ddiiggiittaall ccoonnssuummeerr aappppss aanndd mmuucchh mmoorree!!With the Spartan-3E series, we’ve reduced the previous unit cost benchmark by over

30%. Optimized for gate-centric designs, and offering the lowest cost per logic cell in

the industry, Spartan-3E FPGAs make it easy to replace your ASIC with a more flexible, faster-to-market

solution. Compare the value for yourself . . . and get going on your latest design!

MMAAKKEE IITT YYOOUURR AASSIICC

* Pricing for 500K units, second half of 2006

TThere has never been a better time to be a design engineer. With today’s products from Xilinx, youcan accomplish more, in less time, with less risk, than ever before. With our Virtex™-4 FPGA family, you have the ideal silicon platform to tackle today’s most complex system-design challenges,and with our ISE™ software you can unleash that power. The Xilinx ISE tools continue to be thedesign community’s number-one choice; we hear this loud and clear from our customers.

We thank you for your support and we constantly strive to bring you more software innovationsthat maximize the benefits of our silicon solutions. For example, the 2005 EDA Survey conductedby EE Times indicates that the top three issues and challenges that you face are meeting timingbudgets, getting your design to work on the PCB, and completing functional verification. TheXilinx software team has been hard at work solving these challenges for you, to help reduce yourdesign time.

With the release of the ISE 8.1i software, we are introducing the new ISE Fmax technology, whichas the name implies is designed to improve design performance and reduce design bottlenecks.With the addition of advanced options such as the PlanAhead™ hierarchical floorplanner, theChipScope™ Pro analyzer, and wide industry support from leading EDA vendors, it is easy to seewhy our ISE software is the top choice.

We want to help you be more productive. With this in mind, this issue of the Xcell Journal includesa collection of articles highlighting tools and strategies to reduce the time you spend meeting yourtiming budgets, as well as articles on productivity enhancement tools and techniques in the areasof verification and board-level interfaces. We have also included case studies on how some of ourcustomers are successfully using Xilinx software in their designs, as well as a poster on timing closure technologies, which we hope will be a handy reference to help you maximize performancein the shortest possible time.

Thank you for reading Xcell !

L E T T E R F R O M T H E E D I T O R

Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

© 2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. PowerPC is atrademark of IBM, Inc. All other trademarks are theproperty of their respective owners.

The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

A Roadmap to Productivity

Forrest CouchExecutive Editor

EDITOR IN CHIEF Carlis Collinscarlis.collins@xilinx.com408-879-4519

EXECUTIVE EDITOR Forrest Couchforrest.couch@xilinx.com408-879-5270

MANAGING EDITOR Charmaine Cooper Hussain

ONLINE EDITOR Tom Pylestom.pyles@xilinx.com720-652-3883

ART DIRECTOR Scott Blair

ADVERTISING SALES Dan Teie1-800-493-5551

O N T H E C O V E R

1414 4848

D E S I G N P E R F O R M A N C E V E R I F I C A T I O N

6868P A R T I A L R E C O N F I G U R A T I O N

Physical Synthesis and OptimizationMeet your performance targets with these tips and strategies.

P R O D U C T I V I T Y

Using the ISE Foundation Architecture WizardsStreamline the process of configuring and instantiating the complex blocks found in Xilinx devices.

Early Defect Discovery with Assertion-Based VerificationThe convergence of design, synthesis, and verification.

PlanAhead Software as a Platform for Partial ReconfigurationPlanAhead software delivers a streamlined environment to reducespace, weight, power, and cost.

Timing closure is perhaps the single most important designissue facing designers today. This article explores the methodology to achieve timing closure for Xilinx designs.

Viewpoint

F O U R T H Q U A R T E R 2 0 0 5 , I S S U E 5 5 Xcell journalXcell journalVIEWPOINTImproving Time to Design Closure with ISE Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6

DESIGN PERFORNAMCEDesign Tools for Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10

Achieve Faster Timing Closure with Graph-Based Physical Synthesis . . . . . . . . . . . . . . . . . . . . .12

Physical Synthesis and Optimization with ISE Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14

Improve Design Performance Using PlanAhead Design Tools . . . . . . . . . . . . . . . . . . . . . . . . . .18

Synthesis Tool Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .22

Accelerate Design Performance Using Xplorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .25

Achieve Your Performance Goals with PlanAhead Software . . . . . . . . . . . . . . . . . . . . . . . . . .28

HDL Coding Practices to Accelerate Design Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31

Writing RTL Code for Virtex-4 DSP48 Blocks with XST 8.1i . . . . . . . . . . . . . . . . . . . . . . . . . .36

VERIFICATIONUn-Tethered Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .40

Shorter Verification Cycles at Lucent Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43

Verifying Your Logic Design for First-Time Success . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44

Early Defect Discovery with Assertion-Based Verification Accelerates Design Closure . . . . . . . . . .48

PRODUCTIVITYSimplifying FPGA Pin Assignment Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53

Power Considerations in 90 nm FPGA Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .56

Using the ISE Foundation Architecture Wizards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60

PARTIAL RECONFIGURATIONBenefits of Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .65

PlanAhead Software as a Platform for Partial Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . .68

GENERALSupporting Players . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72

Real-Time Analysis of DSP Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75

Configuration Choices – Platform Flash or Commodity Flash . . . . . . . . . . . . . . . . . . . . . . . . .79

Accelerating PowerPC Software Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .82

Nucleus Integration with Xilinx FPGA System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88

Programming FPGAs for High-Performance Computing Acceleration . . . . . . . . . . . . . . . . . . . . .92

Flexibility with EasyPath FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96

Xilinx Embedded Ethernet MACs Negotiate the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99

Program SPI Serial Flash from Xilinx FPGAs and CPLDs . . . . . . . . . . . . . . . . . . . . . . . . . . . .103

Add Valuable Software Modules with XPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .106

REFERENCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .110

by Steve LassDirector, Software Product MarketingXilinx, Inc.steve.lass@xilinx.com

Timing closure is perhaps the single mostimportant design issue facing designers

today. With FPGAs and other deepsubmicron ICs, routing delays usu-ally dominate logic delays.

Although there are hundreds of waysto improve timing, such as using the

embedded PowerPC™ processor oranother high-speed core, the focus here is

on improving performance on the logic androuting portion of the design. In this article,I will explore the methodology to achievetiming closure for Xilinx® designs.

Timing Closure Design FlowFigure 1 shows a typical flow that startswith well-written HDL optimized forFPGA architectures. Xilinx ISE™ softwareprovides Verilog and VHDL templates tobegin crafting good HDL code. FPGAshave an abundance of registers, so addingpipeline stages can greatly improve timingand have very little impact on area. Someof the frequently used best practicesinclude keeping critical paths in the sameentity or module, utilizing clock enablesinstead of gated clocks, and avoiding theuse of latches, nested for-loops, and if-then-else statements in the HDL code.

Improving Time to Design Closure with ISE SoftwareImproving Time to Design Closure with ISE Software

6 Xcell Journal Fourth Quarter 2005

Meet your design targets with these tips and strategies.Meet your design targets with these tips and strategies.

V I E W P O I N T

Xilinx also recommends using synchro-nous resets for modules such as DSP48,FIFO16, and block RAMs, and usingadder chains instead of adder trees to helpachieve 500 MHz DSP48 performance(illustrated in Figure 1). A comprehensivediscussion on coding styles can be found inthe article, “HDL Coding Practices toAccelerate Design Performance.”

SynthesisThe flow continues with synthesis, whichprovides an early indication of whether ornot your HDL has a chance of meeting tim-ing and area requirements. Make sure youconstrain timing in the synthesis tools toavoid minimizing area at the expense of tim-ing – something that will likely occur if thetools are given no other direction. At a min-imum, constrain your clocks and I/O paths.

You can also allow the synthesis toolsto try harder by indicating an optimiza-tion effort. Another frequently usedoption to meet timing requirements isregister balancing, which moves registersforwards or backwards through logic toincrease clock frequency. If the synthesistools indicate that timing cannot be met,or if timing is very tight, you may be ableto further optimize your HDL code byusing one or more of the coding tech-niques previously discussed.

ImplementationHaving obtained an acceptabletiming estimate from the syn-thesis tool, use the implementa-tion tools (map, place, route,timing analysis) to determinethe true timing of the design.Xilinx ISE tools, powered byFmax Technology, proactivelyattempt to achieve the best per-formance possible, but dorequire that constraints are com-plete. A recommended set oftiming constraints (as shown inFigure 2) should include clockperiod, I/O offset, multi-cyclepath specification, and timingignore (TIG) to ignore falsepaths. If you are missing timingby more than 20%, you may

to turn on retiming and global optimiza-tion in ISE Mapper.

Xplorer UtilityXilinx recently introduced a new utilitycalled Xplorer that delivers optimaldesign results by employing smart con-straining techniques and a variety ofphysical optimization strategies. You can

need to do further optimization of yourHDL, especially within the module con-taining the worst-case path.

If you still haven’t met your timingrequirements in the implementationphase, there are many tool options thatcan provide dramatic improvements. Agood starting place is the use of retimingin your synthesis tool. Another option is

Fourth Quarter 2005 Xcell Journal 7

Coding StylesUtilize VHDL and Verilog language templates in Project Navigator

Pipeline! Add banks of Registers in RTL

Keep critical paths in the same entry/module

Utilize Clock Enable in lieu of Gated Clocks

Don't use reset to infer performance/areaoptimized shift registers (SRL)

Avoid nested “if-then-else,” “for loops,” and latches

Use Synchronous Resets for DSP48, FIFO16, and block RAM

Achieve 500 MHz DSP48 performance

500 MHz Adder Chain-Based Filter

Adder Tree-Based Filter

Figure 1 – Coding styles

Multi-Cycle PathFalse Path

I/O Path I/O PathInternal Clock Domains

Upstream

Device

Downstream

Device

Detailed Constraining

Strategies

Implementation Constraints

Internal Clock DomainsNET clock TNM_NET = clock;

TIMESPEC TS_CLOCK = PERIOD clock 5 ns;

Input and Output (I/O) PathsOFFSET = IN 10 ns BEFORE clock;

OFFSET = OUT 10 ns AFTER clock;

Multi-Cycle PathsTIMESPEC TS_MC = FROM MC_FFS TO MC_FFS TS_CLOCK * 2;

False PathsTIMESPEC TS_TIG = FROM PADS THRU RESET TO FFS TIG;

Figure 2 – Implementation constraints

V I E W P O I N T

use the Xplorer utility to automaticallytry these (implementation tool) optionsand even try different clock frequencies tofind the maximum achievable speed ofthe design. Once Xplorer has found thebest tool options, you should use thoseoptions the next time you run the imple-mentation tools, avoiding the long run-times of Xplorer. You can learn more onhow to achieve the best performance withXplorer in the article, “Accelerate DesignPerformance Using Xplorer.”

PlanAhead Design ToolsIf you’ve tried everything and still can’treach that timing closure pinnacle, there isstill hope. The Xilinx PlanAhead™ toolcan be used to analyze, and if necessary,floorplan the design to achieve higher per-formance (15% average, but can be as highas 2X). PlanAhead design tools providebetter insight into the place and routeprocess. You can quickly examine “what if ”scenarios, enabling you to identify and fixpotential problems early. You can alsogroup critical paths and modules toincrease routability through connectivityanalysis and utilization control.

See the article, “Improve DesignPerformance Using PlanAhead DesignTools” to learn more about the capabilitiesof this incredible tool. There are manyother advanced techniques like detailedfloorplanning and creating RPMs (rela-tionally placed macros), but we recom-mend you try the ideas in this article first.

ConclusionXilinx provides a comprehensive suite ofsoftware tools, powered by ISE FmaxTechnology, that you can use to improvedesign performance. ISE software, togetherwith the tips and strategies in this article,can help you quickly achieve timing closure.

Additionally, we work with leadingthird-party synthesis vendors to optimizedesigns and improve design performancefor Xilinx devices in their leading synthesissoftware. An entire section of the XcellJournal has been devoted to achievingdesign performance using Xilinx softwaretools. Please read the related articles in thisissue to learn more.

Get the latest Virtex newsdelivered to your desktopGet the latest Virtex newsdelivered to your desktop

Subscribe Now! www.xilinx.com/virtex4

Reach over 70,000 programmable logic

professionals worldwide.Advertise in the...

Xcell journalXcell journalfor more informationCall: (800) 493-5551xcelladsales@aol.com

for more informationCall: (800) 493-5551xcelladsales@aol.com

Reach over 70,000 programmable logic

professionals worldwide.Advertise in the...

V I E W P O I N T

TPS75003IS1

SW1FB1IS2

FB2SW2

OUT3FB3

AGNDDGND

IN1IN2IN3EN1

EN2SS1

SS2EN3SS3DGND

5 V_INPUT

VCCAUX

100�F

1.5nF 1.5nF 10nF

0.033 0.033

15�H VCCINT1.2 V at 2A

3.3V at 2A

VCCAUX

2.5V at 300mA

100�F

10�F

3ABUCK1

3ABUCK2

300mALDO

The TPS75003 power management IC for Xilinx’s SpartanTM and VirtexTM series of FPGAsintegrates multiple functions to significantly reduce the number of external componentsrequired and simplify design. Combining increased design flexibility with cost-effectivevoltage conversion, the IC includes programmable soft-start for in-rush current controland independent enables for sequencing the three channels. The TPS75003 meets allXilinx startup profile requirements, including monotonic ramp and minimum ramp times.

TI Powers Xilinx FPGAs

Applications

– DSL modems

– Set-top boxes

– Plasma TV display panels

– DVD players

Features

– Two 95%-efficient, 3-A buck

controllers and one

300-mA LDO

– Adjustable output voltages

– from 1.2 V for bucks

– from 1.0 V for LDO

– Input voltage range of

2.2 V to 6.5 V

– Independent soft-start for

all three power supplies

– LDO stable with small

ceramic output capacitor

– Independent enable for

each supply for flexible

sequencing

– 4.5 mm x 3.5 mm x 0.9 mm

20-pin, QFN package

– $1.90: 1 K price

POWER MANAGEMENT

www.ti.com/xilinxfpga-u 800.477.8924, ext. fpga

NEW! Power Management

Selection Guide

power.ti.com

For more information on TI’s complete lineof power management solutions for Xilinx

FPGAs—including a library of referencedesigns, schematics and BOMs—visit

www.ti.com/xilinxfpga-u.

by Frédéric RivoallonSynthesis Methodology ManagerXilinx, Inc.frederic.rivoallon@xilinx.com

The increase in FPGA densities is givingrise to more than just substantially largerlogic arrays. Today’s FPGA designs incor-porate an increasing amount of tightlyintegrated hard IP blocks – a trend that cre-ates new challenges for design software.One of the most critical challenges is thedifficulty of traditional logic synthesis toolsto correctly predict the critical path of thedesign as geometries shrink and wire delaysbecome predominant.

The latest advances in logic synthesisnow make it possible to use the integratedDSP blocks of Xilinx® Virtex™-4 FPGAsat their full potential. Physical synthesishas emerged as the key new technology toreconcile the RTL optimization effortwith performance bottlenecks seen at theplacement stage.

In this article, I’ll consider these andother new software challenges posed bystate-of-the-art FPGAs, and how Xilinxand its software partners Synplicity andMentor Graphics are responding.

Design Tools for PerformanceDesign Tools for Performance

New tools and features can help you achieve timing closure.New tools and features can help you achieve timing closure.

D E S I G N P E R F O R M A N C E

A Two-Fold ChallengeMost large designs are compiled from the topdown. Given their size and complexity, it isnot uncommon for such designs to engenderseveral potentially critical paths. As a result,when operating predominantly from inaccu-rate estimations based on empirical wire-loadmodels, a synthesis tool might optimizepaths that are in fact not critical and not con-sider others that truly could be critical.

A second challenge for synthesis tools isinferring larger hard IP elements (memoryor DSP blocks) found in Virtex-4 FPGAs.When you need to keep the RTL code fora given project portable (as is often thecase), the burden is on the synthesis tool togenerate excellent results based on thatportable code. It must be able to accuratelymap the logic onto the hard IP blocks.

New Software Tools and OptimizationsPhysical synthesis offers a solution to recon-cile front-end optimizations with the actualresults derived from performing place androute. Physical synthesis tightly couples synthesis and place and route by making syn-thesis aware of actual timing bottlenecksearly in the design. It ensures that synthesisoptimizations are effectively applied to theappropriate places and interacts with place-ment to deliver superior results. This tech-nology does not require manual interventionand can be used in “push-button” flows.

Physical synthesis algorithms are nowavailable in Xilinx ISE™ 8.1i software.Precision Physical from Mentor Graphicsand Synplify Premier from Synplicity alsoprovide physical synthesis capabilities. Thelatter uses an innovative approach called“graph-based synthesis” in which placementtakes into account the available routingresources. Synplify Premier offers this tech-nology to Xilinx FPGAs only.

Synthesis also addresses the use of moresophisticated hard IP blocks by providinginference algorithms that understand thedetailed structure of the FPGA. This recentenhancement enables the tools to effort-lessly produce pure RTL descriptions,yielding 500 MHz predictable perform-ance without the need for instantiation.

To provide the best push-button results,Xilinx also provides a multi-compile script

This requires more than a simpleimprovement to the “push-button” synthesis solution.

Xilinx offers a tool called PlanAhead™software that provides just this type ofresource management. PlanAhead softwarehas an intuitive graphical interface that letsyou browse the logic hierarchy of a designto create an optimal connection to thephysical layout of the targeted device.

Additionally, PlanAhead design toolscan help generate IP blocks and make itsimple for you to export them to otherdesigns. PlanAhead software also makes iteasy to use advanced block-based flows suchas incremental design, modular design, oreven flows involving reconfigurability.

ConclusionThe evolution of larger, more complexFPGA designs with increasing amounts ofhard IP poses a substantial challenge to toolsuppliers. Advances in areas such as physi-cal synthesis and new inference algorithmsmake the full performance potential ofnext-generation FPGAs accessible at virtu-ally the push of a button, while PlanAheadsoftware places the full suite of FPGAresources at your fingertips – even for themost complex, block-based design flows.The Xplorer script provides the most effi-cient path to discover the maximum per-formance or timing closure.

Other articles in this edition of the XcellJournal will provide more details on thesetopics, including coding styles, the physicalsynthesis capabilities of ISE 8.1i software,Synplify Premier, the Xplorer tool, andPlanAhead design tools.

called Xplorer. Xplorer has two modes: thefirst one attempts to obtain the maximumperformance for the design, while the sec-ond one works within the designer’s con-straints to meet timing. In this secondtiming closure mode, Xplorer applies dif-ferent algorithms for logic packing andplace and route, and reports the best set-tings for future design iterations.

Think HardwareSuccessfully coding RTL for performancerequires silicon considerations. Once you“think hardware,” your RTL descriptiondirects synthesis to use specific siliconfunctions. Consider the integratedXtremeDSP™ blocks in Virtex-4 FPGAs.

They enable ASIC performance, but thatperformance can be severely impacted ifthe RTL coding style implies an asynchro-nous reset. That’s because the native resetof the block is synchronous. Using a syn-chronous reset enables registers to bemerged into the block, thus improvingperformance (and area) to a large extend.

Regardless of software tools, coding stylesare essential. Combined with synthesis toolconstraints, options and synthesis directivescan drastically affect performance.

PlanAheadFPGAs offer specialized resources thatdesigners must manage to create an opti-mal solution. For example, you may wantto align certain blocks to a given clockdomain using specific resources, or groupthe design critical path logic to ensure atight implementation of those resources.

Regardless of software tools, coding styles

are essential. Combined with synthesis tool

constraints, options and synthesis directives

can drastically affect performance.

by Jeff GarrisonDirector of Marketing, FPGA ProductsSynplicity, Inc.jeff@synplicity.com

Advances in FPGA technology haveopened the door wide open for use in alltypes of applications, including wirelesscommunications, computer, industrial,defense/aerospace, medical, automotive,and even consumer. Xilinx® Virtex™-4devices have the capacity, performance, andcost structure to lead a migration from tra-ditional cell-based ASICs to programmabledevices in all but the highest volume andbleeding-edge applications. Along with thiscapability, however, are new challengesfrom a designer’s perspective. In this article,I’ll discuss a solution to one of these mostimportant challenges – timing closure.

One of the biggest reasons to useFPGAs in the first place is their ability todeliver working silicon to an electronicssystem quickly and reliably. As the com-plexity of FPGAs and all integrated circuitshas increased, the time-to-market advan-tage offered by FPGAs could be dimin-ished if you do not make significantchanges to your core design technology.The primary issue is timing closure – theability to reach your design’s timing goals

in a fast, predictable way. Gone are thedays when logic delay and wire-load mod-els for interconnect delay are enough toestimate timing and give predictableresults. In 90 nm FPGAs, you must incor-porate actual routing delay into the syn-thesis process to achieve rapid timingclosure for high-performance designs.

Timing Accuracy is EverythingThe underlying problem that determines ifyou will be able to close on timing is esti-mation accuracy. Historically, synthesis andplacement tools have been based on theassumption that the proximity of logic andwire-load estimation determines the routingdelay. Although this used to work reason-ably well for ASIC design, it does not workat all for FPGAs. Unlike an ASIC, FPGArouting is pre-determined. In an ASIC therouting is customized for the placement ofthe logic. In other words, once the place-ment of an ASIC is done, it is relatively easyto get a good estimation of routing delay bymeasuring Manhattan distances from onepoint to another (see Figure 1).

Because FPGAs have fixed routingresources, the design tool needs to under-stand the different types of routing and itsimplication on timing. In an FPGA, thefastest routing between two points may very

well not be the shortest. Think of your com-mute to work – sometimes it’s faster to goslightly out of your way and get on a freewaythan to travel the shorter distance on sidestreets. The same concept applies to FPGAs:some direct routing resources (freeways) arefaster than those that have to go throughswitch matrices (side streets) (Figure 2).

Figure 3 illustrates how placement is dif-ferent for proximity- and graph-based tools.

Graph-Based Physical SynthesisSynplicity invented graph-based physicalsynthesis to improve timing closure bymeans of a single-pass physical synthesisflow for 90 nm FPGAs. The essence of thegraph-based approach is that the pre-exist-ing wires, switches, and placement sites usedfor routing an FPGA can be represented as adetailed routing resource graph. The notionof distance then changes from proximity toa measure of delay and wire availability.

Synplicity’s graph-based physical syn-thesis technology merges optimization,placement, and routing to generate a fullyplaced and physically optimized design,providing rapid timing closure and a 5% to20% timing improvement.

Graph-based physical synthesis does notrequire you to create a floorplan or provideother information to the physical synthesis

Achieve Faster Timing Closure with Graph-Based Physical SynthesisAchieve Faster Timing Closure with Graph-Based Physical Synthesis

Graph-based physical synthesis was invented to improve timing closure by means of a single-pass physical synthesis flow. Graph-based physical synthesis was invented to improve timing closure by means of a single-pass physical synthesis flow.

process (often only known by expert users)in order to get good results. It is a fullyautomated methodology that can be usedwithout special knowledge of the physicalFPGA device. In addition to this fully auto-mated mode, you do have the option toguide physical synthesis by providing designplanning information (such as a floorplan)used during the physical synthesis process.

Synplify PremierTo directly address the challenge of keepingtiming closure under control for advancedFPGA technologies, Synplicity has intro-duced Synplify Premier, its first FPGAdesign product based on graph-based physi-

FPGA directly in the RTL source code or inwaveform. An incremental place and routecapability saves time by allowing you toquickly update instrumented nodes anddebug. The debugging technology withinSynplify Premier software is closely integrat-ed with synthesis and Xilinx ISE™ softwarefor a seamless development environment.

A second technology important for ASICprototyping with FPGAs is the ability toconvert gated clocks to FPGA clock-enablestructures without modifying your RTLsource. Synplify Premier performs this taskautomatically, along with handling generat-ed clocks and instantiations of most com-mon Synopsys DesignWare components.

ConclusionBecause of the increased design complexi-ty enabled by new devices such as Virtex-4 FPGAs, designers need EDA tools thatcan handle the physical properties ofFPGA architecture to achieve acceptabletiming closure. Several physical designtools based on ASIC technologies havebeen used to address FPGA design, butthey have had little success. The ASICapproach does not work for FPGAsbecause the silicon fabric is completelydifferent and, unlike ASICs, proximitydoes not imply better timing.

Synplicity’s Synplify Premier productwith graph-based physical synthesis direct-ly addresses the challenges of FPGA physi-cal design and results in faster designs donein less time. For more information ongraph-based physical synthesis, ASIC pro-totyping, and Synplify Premier, visitwww.synplicity.com/products/index.html.

cal synthesis technology. Synplify Premierincludes all of the features in Synplify Proand adds graph-based physical synthesis forthe Virtex-4, Virtex-II Pro, and Spartan™-3families. In addition to the new graph-basedphysical synthesis, Synplify Premier alsooffers a new capability for debugging andprototyping ASICs using FPGAs. One suchtechnology is RTL instrumentation anddebugging of live, running FPGAs.

This technology is based on Synplicity’sIdentify product, which allows you to navi-gate your design graphically and mark sig-nals directly in your RTL code as probes orsample triggers. After synthesis, you canview the signal values of a live, running

I/O I/O I/O I/O I/O I/O

I/O I/O I/O I/O I/O I/O I/O

Driver and loadsconnected withdirect, orthogonalrouting is possiblewith flexiblerouting

I/O I/O I/O I/O I/O I/O

I/O I/O I/O I/O I/O I/O I/O

Short, but slowconnection

Longer, but fast direct connection

Optimal Graph-BasedPlacement for Best Performance

Traditional Proximity-BasedPlacement

CLB with 4 LUTs, 4 Flops, andNo Intra-CLB Direct Connect

Figure 1 – Proximity-based placement is best when you use flexible routing.

Figure 2 – A graph-based approach is best when routing is fixed and of differing performance levels.

Figure 3 – Graph-based placement for LUT driving four flops

by Kevin Bixler Manager, ISE Technical MarketingXilinx, Inc.kevin.bixler@xilinx.com

David Dye Senior Technical Marketing EngineerXilinx, Inc.david.dye@xilinx.com

Advances in process technology have leadto dramatic increases in FPGA devicedensities. Several Xilinx® Virtex™ fami-lies have devices exceeding 1 million sys-tem gates. This increase in device densityand the use of 300 mm wafers have madeFPGAs affordable for volume production.

Designs that were once exclusivelytargeted at ASICs are now being imple-mented in programmable devices. Thelargest 90 nm Virtex-4 device providesmore than 200,000 logic cells, 6 MB ofblock RAM, and nearly 100 DSPblocks. Creating a design to efficientlyutilize the available resources in thesedevices and meet performance require-ments can be challenging. Fortunately,today’s EDA software tools have evolvedto meet these challenges.

Physical Synthesis and Optimization with ISE Software

These tips can help you get the most out of your implementation tools.

Logic optimization, logic placement,and minimized interconnection delays areall important to achieve maximum per-formance. Timing-driven synthesis tech-nology has provided a significantimprovement in design performance. Thelimiting factor to the effectiveness of tim-ing-driven synthesis is the accuracy of esti-mating routing delays.

Physical synthesis – the use of physicalplacement and routing information dur-ing synthesis – has been at the forefront toeffectively address these issues. Physicalsynthesis and optimization furtherexpands on this technology by involvingsynthesis in implementation decisionsafter the netlist is generated. This allowsfor dynamic re-examination of synthesismapping and packing decisions based onactual placement and routing informationduring implementation.

Benefits of Physical Synthesis and OptimizationInterconnection delays between logic lev-els are affected by the proximity of place-ment for the logic elements, routingcongestion, and local competition betweennets for the fastest routing resources. Theanswer to this problem is to revisit synthe-sis decisions during mapping, placement,and routing. During the mapping phase,the netlist can be reoptimized, packed, andplaced based on the urgency of individualtiming paths. This approach reduces thenumber of implementation cycles requiredfor timing closure.

Physical Synthesis and Optimization Flows Xilinx ISE software provides several soft-ware options to enable physical synthesisand optimization. You can use theseoptions individually or together, depend-ing on the specific needs of your design.

Define Timing RequirementsThe most important step for effectivephysical synthesis is to set up accurate,comprehensive timing constraints. Withthese constraints in place, the implemen-tation tools can make more informeddecisions that will improve your overall

Enable Timing-Driven Packing and PlacementTiming-driven packing and placement isat the heart of the physical synthesis capa-bilities available within the implementa-tion flow. When you enable this option(map -timing), the placement phase ofplace and route is done within Map,allowing packing decisions to be revisitedwhen initial results are less than optimal.This iterative flow does away with unre-lated logic packing.

Different levels of optimization existin Xilinx physical synthesis and opti-mization. The first level was introducedin ISE 6.1i software and began with logictransformations, including fanout con-trol, logic replication, congestion con-trol, and improved delay estimation.These routines led to much more effi-cient packing and placement of designs,resulting in faster clock frequencies anddenser logic utilization.

The next level added logic and registeroptimization; Map can now rearrange ele-ments to improve critical path delays.These transformations give much greaterflexibility to meet the timing require-ments of the design. A number of differ-ent techniques (including pin swapping,basic element switching, and logic recom-bination) are used to massage the physicalelements into a different yet logicallyidentical structure that will meet thedesign requirements.

ISE 8.1i software introduces one morelevel of physical synthesis – combinatoriallogic optimization. The -logic_opt switchenables a flow that examines all of thecombinatorial logic in the design. Givenplacement and timing information, youcan make more informed decisions aboutoptimizing LUT structures to improve theoverall design.

results. Constrain the clocks and I/O pinsthat have firm requirements to allow therest of the design to be relaxed.

The easiest way to define these timingconstraints is to use the ConstraintsEditor. This graphical tool allows you toenter clock frequencies, multi-cycle andfalse path constraints, I/O timing require-ments, and a host of other clarifyingrequirements. Constraints are written to auser constraint file (UCF), which mayalso be edited in any text editor.

If user-defined timing constraints arenot provided, a new feature in ISE™ 8.1isoftware will automatically generate tim-ing constraints for each internal clock. InPerformance Evaluation Mode (PEM),you can get the high-performance resultsof physical synthesis and optimizationwithout having to provide timing targets.

Run Global OptimizationFor designs containing IP cores or othernetlists, the NGD file available after thetranslate (NGDBuild) phase of imple-mentation represents the first time thatthe entire design has been completelyassembled. Global optimization, a newfeature added to the 7.1.01i version ofMap, will take the fully assembled designand attempt to improve design perform-ance by re-optimizing the combinatorialand register logic. Global optimization(map -global_opt on the command line)has been shown to increase design clockfrequencies by an average of 7%.

Two other options let you further con-trol the optimization completed duringthis phase: retiming (map -retiming) willmove registers forward and back to bal-ance combinatorial logic delays, andequivalent register removal (map -equiva-lent_register_removal) will remove regis-ters with redundant functionality.

During the mapping phase, the netlist can be reoptimized, packed, and placed based on the

urgency of individual timing paths. This approachreduces the number of implementation cycles

required for timing closure.

Examples of Physical Synthesis and Optimization

• Logic Duplication: If a LUT or flip-flopdrives multiple loads, and the placement ofone or more of those loads is too far awayfrom the source to meet timing, the LUTor flip-flop can be replicated and placedclose to that group of loads, thus reducingrouting delays (Figure 1)

• Logic Recombination: If the critical pathtraverses through multiple LUTs throughmultiple slices, the logic can be reassem-bled utilizing fewer slices by using a moretiming efficient combination of LUTs andmuxes to reduce the routing resourcesneeded for that path (Figure 2)

• Basic Element Switching: If a function isbuilt with LUTs and muxes within a slice,physical synthesis and optimization canrearrange the function to give the fastestpath (usually through the mux select pin)to the most critical signal (Figure 3)

• Pin Swapping: Each input pin of a LUTmay have a different delay, so Map has theability to swap pins (and the associatedLUT equation) so that the most criticalsignal is placed on the fastest pin (Figure 4)

Conclusion The physical synthesis and optimizationcapabilities within the Xilinx toolset will con-tinue to mature and expand with each soft-ware release. Along with improved quality ofresults, you can expect to see greater controlover the types of optimizations. Otherplanned enhancements include the consider-ation of more design elements in the reopti-mization phase (such as registers allowingmovement into and out of the I/O blocks ordedicated functions like block RAM andDSP blocks) and the inclusion of the routingphase into the reiterative physical synthesisand optimization system.

The physical synthesis and optimizationtools in Xilinx ISE software have been createdto re-examine the structure of your FPGAdesign during the packing and placementphases of implementation. With the knowl-edge of timing constraints and physical lay-out, optimizing synthesis decisions duringmap and place and route can significantlyimprove your results.

Figure 1 – Logic Duplication. Path LUT A → LUT B →LUT C → Out is critical. LUT B is driving

two critical loads and can be duplicated to reduce path delay.

Figure 2 – Logic Recombination. Path LUT A → LUT B → LUT C → Out is critical. LUTs B and C can be combined and replaced by LUT a, LUT b, and F5Mux f.

Figure 3 – Basic Element Switching. Path LUT A → LUT C → MuxF5 → Out is critical. The path through

the mux select pin is faster than through LUT.

Figure 4 – Pin Swapping. Path LUT A → LUT C → MuxF5 → Out is critical. Pin 2 is faster

than Pin 0 for LUT C. Swap pins 0 and 2 for LUT C.

Before After

�� !��!!"� �#�$%&'(

�� !� ��"�� #�� $��%�� "��"�� $� �&�� '�� '��(�� )�� *�� *��

�� +� )�� (,��-...�/�#��(��*�� )��0

��+ 1��'�� '��0�2��$��3��4��%� �� '��5� ��&� ��6

+ ��

+ /�� "��"��

+ /�� 6��&� ��6

+ 5��

+� 4��5�1�/�� 1��

+ 4��' ��#��

1�� '�� (�� 0�7��$� �7��$��7��!��3��//�� -89��/5)�*� �� '�,��1��'�� *� �� /�� *�� '� ��'� ��*�� '��

1��3��:9(;�&�� *� �� ' �� 3� �� 6��;�&��

by Mark GoosmanMarketing Product ManagerXilinx, Inc.mark.goosman@xilinx.com

Design problems – especially those charac-teristic of large, high-performance designs– are most effectively addressed by firstinvestigating the problem and then break-ing the larger design issues into smaller,more manageable hurdles. Looking at theevolution of programmable devices inrecent years, it is apparent that FPGAs haveundergone tremendous growth in size andcomplexity, but the PLD EDA tool flowhas remained relatively unchanged.

With a traditional flat design flow, eachdesign change means re-synthesizing and re-implementing the entire design. With com-plex designs on multi-million-gate devices,even a minor change can lead to unaccept-ably long place and route (PAR) runtimes,which itself often leads to inconsistentresults, not to mention the time lost fromRTL to PAR iterations for a typical design.

Few design teams can tolerate unexpect-edly low performance for a design that tooklonger than expected to complete, not tomention the associated frustration andstress. In addition, it may mean low utiliza-tion of the FPGA and even missed time-to-market opportunities.

Improve Design Performance Using PlanAhead Design Tools

PlanAhead software helps you tackle the need for speed.

PlanAhead Software Offers a SolutionA growing number of customers are findinga solution in the hierarchical design method-ology offered by the Xilinx® PlanAhead™design analysis tool. PlanAhead softwareadds visibility and control to the FPGAdesign flow. By addressing problems on thephysical side (between logic synthesis and theimplementation process), you can realizeimproved performance in your results.

Although advanced FPGA synthesisproducts provide a tremendous level ofautomated optimization for multi-million-gate designs, many designers require moreheuristic techniques to achieve optimalperformance goals. Through early analysisand floorplanning, PlanAhead design toolscan apply physical constraints to help con-trol the initial implementation of thedesign. After implementation, PlanAheadsoftware can analyze placement and timingresults to improve the floorplan used tocomplete the design. You can use the phys-ical constraints derived from the importedresults to lock placement during subse-quent implementation attempts. Theseconstraints can be used to create reusableIP, complete with locked placement, forother designs.

The PlanAhead design methodologyprovides performance, productivity, andrepeatability of results. With its hierarchi-cal design flow, PlanAhead software allowsyou to reduce the number of iterationsspent running PAR and then returning toRTL and synthesis. Instead, you can ana-lyze your design and address issues on thephysical side before implementation.

Faster Results in Less TimePlanAhead users are consistently seeing a10-15% performance improvement, withsome achieving much higher results. Inaddition, designers are also finding thatthey can squeeze an additional 10% logicinto a tight device. This combination offaster performance and better utilizationcan translate into a smaller, less expensivedevice, or design goals achieved with aslower speed grade.

PlanAhead design tools help reduceoverall design time while adding a level ofconsistency in the results. You can perform

trouble spots and place heavily connectedPblocks close together, or merge them.

Clock regions are also displayed and canbe used during floorplanning to optimizevarious clocks or minimize power usagewithin the device. By isolating clocks tospecific clock regions, they can run fasterand eliminate the need to power up otherclock regions.

You can use the analysis and explorationenvironment of PlanAhead design tools atvarious stages in the design process.Initially, you can analyze the design beforeimplementation. PlanAhead software pro-vides a static timing engine, TimeAhead,that examines the design’s feasibility rela-tive to timing. You can also perform analy-sis using estimated routing delays by

factoring in pure logic delays with no inter-connect. This allows you to see how muchtiming tolerance is built into the design.

You can then edit and fine-tune timingconstraints within the PlanAhead envi-ronment. These same analysis results canhelp determine what logic should begrouped together and floorplanned. Pathscan be logically sorted, grouped, andselected for floorplanning. The sameTimeAhead environment can also be

design iterations in much less time – withrepeatable results – by leveraging previousfloorplans or incremental design tech-niques. You can also leverage successfulresults by locking them down or reusingthem in other designs.

Tackling really tough performanceissues requires more than just the additionof new menu items or scripting capabili-ties. PlanAhead software provides a com-plete environment to make this hierarchicalmethodology interactive and easy to use bypresenting design data through the use ofvarious views (see Figure 1). These inde-pendent views are designed to work in con-junction with each other, allowing you toquickly identify and navigate critical designobjects and information.

Visually Identify Performance Bottlenecks ...The PlanAhead environment providesinsight into the data flow of the design bydisplaying I/O interconnect as well as phys-ical block (or “Pblock”) net bundles. Youcan control color and line thickness of thebundles depending on the number of sig-nals. This makes it easy to identify heavilyconnected Pblocks in the overall data flowthrough the design. You can then take cor-rective action to avoid routing congestion

Figure 1 – PlanAhead software provides different views of the design to display physical hierarchy,properties, netlists and constraints, device package pins, schematics, and much more.

leveraged with imported timing resultsfrom TRCE, the timing evaluation toolwithin Xilinx ISE™ software.

The timing constraints assigned to thedesign can be viewed and modified. Youcan define all ISE timing constraints as newconstraints within the editor. This makesconstraint assignment easier because youno longer have to remember specific con-straint formats. You can use this withTimeAhead to validate and optimize theconstraint set before running any ISEimplementation tools.

PlanAhead design tools provide visualaides to help you comprehend the physicalimplementation results. Design rule checks(DRCs) are provided to catch errors early.It also flags designs that do not properlytake advantage of certain device resources,such as the dedicated registers of theXtremeDSP™ slice or RAM within theVirtex™-4 FPGA.

By visualizing problem areas, you canaddress problems quickly, either in the RTLor on the physical implementation side,without having to continue RTL and syn-thesis iterations. The various logic modulescan be selectively highlighted to better

understand where they were placed, andPblocks created where the logic is most con-centrated. You can highlight failing timingpaths to visualize and understand what isphysically happening within your design.

PlanAhead software has includes metricmaps to quickly identify problem areas ofthe design (Figure 2). These can be relatedto timing or utilization. This is helpfulwhen trying to identify areas of the designto focus on for logic compression or timingconnectivity.

PlanAhead design tools allow you toexplore connectivity within the design.After selecting a particular net, Pblock, orinstance within the design, you can high-light all of the nets connected to the select-ed elements with a single mouse click.

After an instance or Pblock is selected,all of the nets connecting to that elementwill be highlighted. This process can becontinued, selecting and expanding thelogic cone. Running “Show Connectivity”will highlight the next level of nets con-nected to the selected instances. This is aneasy way to select a cone of logic starting ata particular instance or I/O port, takingreal advantage of the design hierarchy.

... Then Address the Performance Problem The whole idea is to provide a compre-hensive environment to analyze timingissues and easily constrain that logic toavoid or correct it. You can use timingresults from either TimeAhead or TRCEto drive a floorplan that will produce bet-ter performing designs by helping deter-mine what logic should be groupedtogether and floorplanned.

Critical paths often traverse the logichierarchy. PlanAhead software enables aphysical hierarchy that is independent ofthe logic hierarchy, allowing logic fromanywhere in the design to be groupedtogether and efficiently floorplanned.

PlanAhead software also providesresource utilization estimates to help sizeand shape Pblocks. These same statisticsreport clock information, carry chain, andRPM sizes for fit and a variety of otheruseful information.

PlanAhead design tools provide auto-matic floorplanning capabilities such asautomatic partitioning based on the logichierarchy and automatic Pblock sizing andplacement. Because it is often difficult toencompass the required device resourceswithin a single Pblock rectangle, non-recta-linear shapes can be created withmultiple rectangles. PlanAhead softwarealso allows you to create Pblocks withinPblocks, or “child” Pblocks, to help bettermaintain design hierarchy.

Device capacity can be improved bycompressing the logic with Pblocks. Thiscan be achieved in one of two ways. One isto use the Xilinx AREA_GROUP attributecalled COMPRESSION. AREA_GROUPis a design implementation constraint thatenables partitioning of the design intophysical regions for mapping, packing,placement, and routing. Using the COM-PRESSION attribute will cause the ISEMapper to pack unrelated logic intounused CLB sites. Use this with care, as itcan have an adverse affect on timing.

The best strategy for improving per-formance is to compress non-timing criti-cal logic, thereby opening up more spacein the device for timing critical logic. Thesecond option is to use the PlanAheadcapability to run PAR on Pblocks individ-

Figure 2 – Metric Maps provides a thermal metric display of the various potential problem areas in the design. The current metrics include utilization and timing

checks at both the Pblock and placed design level.

ually. You can continue to shrink thePblock size until PAR fails. This will reduceand pack the logic as tightly as possiblewithin the blocks and free up device space.

A Virtex-4 Floorplanning ExamplePlanAhead design tools allow you to easilyimport placement and timing results. Withthis information, you can view and sortcritical paths from the timing report andvisualize paths using either the schematicor device views. Once you’ve identified fail-ing paths, you can highlight all pathinstances on the floorplan to identify allpath instances in the schematic view.

Figure 3 shows a floorplan of a designtargeting a Virtex-4 FX140 device. In thedisplay, we’ve highlighted the flip-flopsalong a particular path that was not able tomeet timing. Because they are so widelydistributed across the device, design imple-mentation results in an unacceptably longdelay. With the large number of clockdomains available in Virtex-4 FPGAs, thisis a common situation.

By selecting each of these flip-flops andrestricting them to a single Pblock, you canthen adjust and optimize the Pblock sizeand location to reduce delays on criticalpaths, as shown in Figure 4. If necessary,you can even create nested Pblocks, creatinga child/parent hierarchy to further constrainsub-modules for additional performancegains. Depending on the resource require-ments of the captured logic, you can lockdown critical logic to locations for optimalaccess to necessary resources.

ConclusionVisit www.xilinx.com/planahead to down-load a free evaluation of PlanAhead softwaretoday. This 30-day evaluation provides youwith full access to all of the PlanAhead fea-tures and functionality. This site also allowsyou to view product demonstrations, down-load white papers, or just learn more.

Xilinx also offers PlanAhead QuickStart!,an exceptional level of support during the

most critical phase of a project. With thisservice, you will receive a QuickStart! engi-neer at your site for one week who willtrain and empower your team to completeyour project on time and make best use ofthe Xilinx device you have selected. Thishighly customizable service allows you to

develop a training plan tailored specifical-ly to the needs of your design team. Thiswill help prevent schedule slips later in theproject by ensuring that the team is skilledin the needed disciplines. It will also helpyou maintain a more effective and highlymotivated team.

Figure 3 – Initial Virtex-4 FPGA floorplan with the path highlighted that did not initially meet timing

Figure 4 – After constraining all primitives associated with the path, you can optimize the Pblock to allow this path to achieve necessary timing.

If necessary, you can even create nested Pblocks, creating a child/parent hierarchy to further constrain sub-modules for additional performance gains.

by Steve PereiraTechnical MarketingSynplicity, Incstevep@synplicity.com

You can benefit greatly from a proper syn-thesis strategy. Such strategies includeknowing the final target architecture,knowing what coding problems couldarise, and understanding what performancethe periphery will require. You should alsounderstand how to use IP. Are modelsavailable? Is cost an issue? Your initial setupcan greatly affect productivity and help youachieve quicker peripheral timing closure.

In this article, I’ll describe a known goodstrategy while using Synplify Pro tools.

Best PracticesSetting up your design correctly can resultin huge performance increases or reduc-tions in area. The following checklistdescribes the best practices to use when set-ting up your design.

1. Include any CoreGen EDIFs or timingmodels for black boxes. If you useblack-box IP in the design, ensure thatall EDIF, NGC netlists, or timingmodels are provided. It is essential thatthe Synplify Pro tool knows the timing

requirements into and out of the boxso that surrounding logic can bealtered to reduce or remove criticality.

If the design has an ngc file, use thengc2edn (provided in the Xilinx®

ISE™ /bin/ directory) converter utilityto produce an edn file for synthesis.

2. Ensure that the device is correct andsize the design. Ensuring that thewireload models are correct for syn-thesis can greatly affect the resultinglogic. The selection of wireload mod-els is simply a matter of selecting thecorrect device and speed grade. Ifsynthesis is performed on a differentdevice to the final implementation,sub-optimal results are quite likely.

Selecting the correct device will alsoprovide Synplify Pro with the accuratenumber of resources. One example ofthe importance of this is with block-select RAM mapping. Synplify ordersall of the RAMs from biggest to small-est, and then starts to map the largestRAMs until there are no block RAMsleft. The rest are placed into distributedRAM. If the wrong device is selected,sub-optimal mapping will occur.

Sizing the design also has a significantimpact. For example, if the designuses 80% of the device, the wireloadmodels are correct for the design. Ifthe design consumes only a small per-centage of the total resources (forexample, synthesizing just a part ofthe design for verification), the wire-load models will be inaccurate. Toresolve this problem, you can assignthe logic to an area group. Please seethe Synplify Pro documentation forinstructions on how to do this.

3. Provide accurate clock constraints.Under- or over-constraining results inreduced performance. Do not over-constrain by more than 15%. For max-imum performance, ensure that thereis 10% negative slack on the criticalclock. This ensures that critical pathsare squeezed. The Fmax field on thefront panel is fine for a quick run, butdo not use it if you need maximumperformance. Put unrelated clocks inseparate clock groups in the SynplifyPro .sdc file. If your clocks are in thesame group, the Synplify Pro toolworks out the worst-case setup time for the clock-to-clock paths.

Synthesis Tool StrategiesSynthesis Tool Strategies

Set up your designs in Synplify for performanceimprovement and area savings.Set up your designs in Synplify for performanceimprovement and area savings.

Figure 1 shows a timing diagram fortwo clocks that are in the same clockgroup. Synplify rolls the clocks for-ward until they match up again. The tool then calculates the mini-mum setup time between the clocks,in this case 10 ns.

If the clocks are unrelated, there maybe several hundred clock periodsbefore the clocks match up again. Thismay result in the worst-case setup timebeing very small (100 ps). You cancheck the setup time in the clock rela-tionships table in the log file. If thesetup time is too short, it is best to re-constrain the clocks so that they aremore related.

4. Specify timing exceptions. Provide alltiming exceptions, such as false andmulticycle paths, to the Synplify Protool. With this information, the toolcan ignore these paths and concentrateon the real critical paths.

5. Constrain I/Os. If the design has I/Otiming constraints, it is likely that thecritical path is through the I/O block(IOB). The Synplify Pro tool seesthese paths as the most critical andtries to optimize them. Usually, I/Opaths physically cannot be optimizedany further; as they are the most criti-cal, the Synplify Pro tool stops opti-mizing the rest of the design.

improve your design performance by asmuch as 50%. Retiming attributessuch as syn_allow_retiming let yourefine your constraints by surgicallyapplying retiming to a single register.Synplicity recommends that you enableboth of these switches.

• Resource sharing. Always turn thisswitch on. The behavior of thisswitch was changed in Synplify 8.0.With the switch on, timing-drivenresource sharing occurs. Non-criticallogic will have resource sharing, butcritical logic will not share resources(for performance reasons).

• FSM Compiler extracts and optimizesFSMs based on the number of states:

• 2 to 4: sequential

• 5 to 40: one-hot

• More than 40: gray

Synplicity also recommends enablingthe FSM Compiler and setting defaultenumerated encoding to the “default”value for VHDL designs.

• FSM Explorer timing-driven stateencoding. The Synplify Pro tool auto-matically selects the best encoding forthe specified timing. This switch isdesign-dependent. If the critical pathstarts or ends at a state machine, turnthe switch on.

ConclusionBecause synthesis tools are operating athigher levels of abstraction, synthesis opti-mizations can have a dramatic impact ondesign performance. After parsing throughthe HDL behavioral source code andextracting known functions (arithmeticfunctions, multiplexers, and memories),synthesis tools then map these functions onthe target architecture features.

The tools trade off area and performancebased on design constraints and tools set-tings; these influence the use of optimiza-tions such as replication, merging, re-timing,and pipelining. As a result, the right tools set-tings in synthesis can greatly increase pro-ductivity and time to market.

A new switch has been added to theSynplify Pro 7.3 release called “useclock period for unconstrained I/O.”When enabled, the tool does notinclude any unconstrained I/O pathsin timing optimizations.

6. Keep code generic. Keeping codegeneric and not locked down to aparticular architecture can aid design,reuse, and portability. For example,with the DSP48 block in theVirtex™-4 architecture, you canspecify generic code to implement aDSP function. The tool will map tothat component(s) when possible andreduce uncertainty regarding how thefunction was mapped. If DSP blocksare generated, timing is better knownand can speed up debug – and timeto market. You can specify the regis-ter configuration and code in theopcode to drive the DSP block con-figuration, all with generic code.

If for some reason you wish to use a dif-ferent device family, such as movingfrom Virtex-4 FPGAs to Virtex-II ProFPGAs, the porting process itself shouldbe seamless, but you may feel uncertainabout implementation and logic levels.

Good Switch Settings

• Retiming and pipelining. Enablingretiming and pipelining options can

10ns 20ns 30ns

30 45 60 75 90 105 120 135

20 40 60 80 100 120 140

Figure 1 – Clock rolling for clocks in the same domain

by Hitesh PatelSenior Manager, Software Marketing Xilinx, Inc.hitesh.patel@xilinx.com

The 2005 EDA Branding Study shows that71% of FPGA projects have difficultymeeting their timing budgets. Severalstrategies exist to help you meet your tim-ing goals, such as HDL code changes andsynthesis and implementation tools set-tings. In this article, we’ll describe theXplorer implementation tools strategy tomaximize design performance, whetheryou are evaluating the best achievable per-formance for a specified clock domain orattempting to meet timing requirementsfor designs with user constraints.

Implementation with XplorerXplorer is a perl script that seeks the bestdesign performance using Xilinx® ISE™software. After synthesis generates an EDIF(*.edf ) file, the design is ready for imple-mentation. During this phase, you coulduse Project Navigator in ISE software toapply design constraints and explore differ-ent tools settings for best performance.

Accelerate Design Performance Using Xplorer

You can realize up to a 70% push-button performance improvement.

An alternative approach may be to useXplorer. Xplorer is designed to help achieveoptimal results by employing smart con-straining techniques and various physicaloptimization strategies. Because no uniqueset of ISE options or timing constraints

works best on all designs, Xplorer finds theright set of tools options to either meetdesign constraints or find the best perform-ance for the design. Hence, Xplorer has twomodes of operation: best performance modeand timing closure mode.

Best Performance ModeIn this mode of operation, Xplorer optimizesdesign performance for a user-specified clockdomain, allowing easy evaluation of the max-imum achievable performance. You specifythe design name and a single clock to opti-mize. Xplorer implements the design withdifferent architecture-specific optimizationstrategies in conjunction with timing-drivenplace and route (PAR). It tightens or relaxesthe timing constraints depending onwhether or not the frequency goal isachieved, as shown in Figure 1. Xplorer esti-mates the starting frequency based on pre-PAR timing data. Adjusting timingconstraints such that PAR is neither under-nor over-constrained enables Xplorer to

Because Xplorer runs approximately 10iterations, you will experience longer PARruntimes. However, Xplorer is somethingthat users typically run once during theirdesign cycle. After an Xplorer run, you cancapture the set of options that will give thebest result from the xplorer.rpt file and usethat set of options for future design runs.Typically, designers will run the toolsmany times in a design cycle, so a longerinitial runtime will likely reduce the num-ber of PAR iterations later.

All Xilinx FPGA architectures are sup-ported by Xplorer; optimizations are per-formed based on architecture features.

Using XplorerXplorer is run from the command promptby typing:

xplorer <design name> [-clk <clkname>] [-p <partname>]

<design name>: Name of the top level edif/ngc file.

-clk <clkname>: Name of the clock to beoptimized. If the -clk option is omitted,the script uses the timespecs defined in theUCF file.

-p <partname>: The device name (forexample, XC4VLX100-11FF1152). Thedefault value is the part specified in theinput design.

-uc <ucf file>: The UCF file name. Thedefault value is <design name>.ucf.

Here is an example command for BestPerformance Mode:

xplorer cordic -clk clk -p XC4VLX15-12FF668

Here is an example command forTiming Closure Mode:

xplorer <design name> -uc <ucf file> -p <partname>

Results, with the tools settings used, aresummarized in xplorer.rpt. The best run isidentified at the end of the report file.

deliver optimal design performance.In addition to timing constraints,

Xplorer also uses physical optimizationstrategies such as global optimization andtiming-driven packing and placement.Global optimization performs pre-place-

ment netlist optimizations on the criticalregion, while timing-driven packing andplacement provides closed-loop packingand placement such that the placer can rec-ommend logic packing techniques thatdeliver optimal placement. If the design hasa user constraint file (UCF), Xplorer opti-mizes for the user constraints in addition tothe specified clock domain.

Timing Closure ModeIf you have a design with timing constraintsand your intent is for the tools to meet thespecified constraints, use the timing closuremode. In this mode, you should not specifya clock using the -clk <clock name> switch.Xplorer looks at the UCF to examine thetiming constraints goals. Using these con-straints together with optimization strate-gies such as global optimization,timing-driven packing and placement, reg-ister duplication, and cost tables, Xplorerimplements the design in multiple ways todeliver optimal design performance.

User-Specified ClockSlack Violation

Reduce Fmax Goal

Slack Violation

Reduce Fmax Goal

Slack Violation

Reduce Fmax Goal

Slack Violation

Reduce Fmax GoalFmax Met

Increase Goal

Fmax Met,

Increase Goal

Fmax Met,

Increase Goal

Figure 1 – Xplorer Best Performance Mode flow

Performance Improvement ResultsTo highlight the performance impact ofthese optimization strategies, we com-pared baseline results (attainable usingtightly constrained, high-effort timing-driven PAR) with Xplorer. Figure 2 showsthe two flows.

For a high-density, high-performanceVirtex™-4 customer design suite of morethan 75 designs Xplorer provides up to70% – and on average 10% – performanceimprovement, as shown in Figure 3. Thedesigns range in density from LX15 toLX200, covering (but not limited to) mar-ket segments such as consumer, video, stor-age, telecom/datacom, DSP, and glue logic.

In addition, performance improvementsfor eight OpenCores designs forSpartan™-3 FPGAs are shown in Figure 4.These OpenCores designs are written insynthesizable RTL and synthesized to thetarget technology without any code modifi-cations. The designs can be downloadedfrom www.opencores.org. For the eightOpenCores designs, the average perform-ance improvement using Xplorer is 10%,with the AES design realizing a 38% per-formance improvement.

How to Achieve Additional Performance GainsAt times, the Xplorer implementation strat-egy still might not be enough to meet yourtarget timing goals. In these cases, adoptsynthesis and RTL coding strategies gearedtowards performance.

Conclusion Xplorer helps you optimize logic perform-ance and meet your timing goals by usingsmart constraining techniques andemploying the right set of implementationtools strategies. Xplorer provides an aver-age performance improvement of 10% forXilinx FPGAs.

To view advanced options and to down-load Xplorer, visit www.xilinx.com/xplorer.

0.0VCS-DCT

Average 10%

CFFT FM VGA VCS-Huffman

Decoder

VCS-Huffman

Encoder

Cordic AES

Performance Improvement Using Xplorer for Virtex-4 FPGAs

75 Customer Designs

Average 10%

Synplify Pro

Baseline

Xplorer

ComparePerformance

RTLCode

Tighten Constraint

Until Slack Violation

Tighten Constraint

For Best Netlist

Figure 2 – Baseline and Xplorer flows

Figure 4 – Percentage performance improvement for Spartan-3 FPGAs using Xplorer on eight OpenCores designs

Figure 3 – Performance improvement for Virtex-4 FPGAs using Xplorer

These OpenCores designs are written in synthesizable RTL and synthesized to the target technology without any code modifications.

by Bill SapersteinSenior Director of EngineeringAnchor Bay Technologies, Inc.ws@anchorbaytech.com

Sanjay ThatteProduct Marketing ManagerXilinx, Inc.sanjay.thatte@xilinx.com

The Xilinx® PlanAhead™ hierarchicaldesign and analysis environment can beused in conjunction with Xilinx ISE™tools to improve design performance andpossibly enable incremental design and IPreuse. Several customers have benefitedfrom the unique capabilities thatPlanAhead software provides. In this arti-cle, we’ll describe how one Xilinx customer,Anchor Bay Technologies of Campbell,California, was able to successfully utilizePlanAhead design tools.

A Custom Chip in Three MonthsAnchor Bay Technologies specializes indesigning and developing video processingsystem- and silicon-based solutions for scal-ing, de-interlacing, and noise reduction.

Recently, Denon Electronics Companyrequired a very high-performance scalingchip for their high-end DVD players to takestandard-definition 480P video from theMPEG decoder and scale it to 1080P reso-lution for large display applications. AnchorBay had developed several scaling chips, butDenon required a custom design to fit theirspecific application. In particular, theywanted multiple video output streams, mul-tiple video formats and resolutions, and acustom I2C interface to the chip.

Faced with a very short developmentcycle and unable to turn an ASIC in thistimeframe, Anchor Bay decided to developa custom solution based on XilinxSpartan™-3 FPGA technology. They had

only three months to get the chip designedand tested for initial sampling.

Anchor Bay used ISE Foundation™design tools to perform basic design andsimulation. Because Denon required thelowest cost solution, they tried to fit thedesign into the smallest Spartan-3 devicethat had enough resources. But because ofthe aggressive performance requirements,achieving timing closure was close toimpossible using conventional ISE floor-planning and place and route (PAR). Thedesign involved four different clockdomains – the highest frequency at 148MHz. The design was using more than80% of the Spartan-3 XC3S1000FT256-5 part, 100% of the multipliers and clockbuffers, and 60% of the RAM blocks.

This heavy utilization made timing clo-sure very difficult. There was no way toguide the tools adequately to close on thecritical paths. In addition, the tools did not

Achieve Your Performance Goals with PlanAhead SoftwareAchieve Your Performance Goals with PlanAhead Software

Obtaining the lowest cost FPGA solution through area and speed optimization.Obtaining the lowest cost FPGA solution through area and speed optimization.

clearly point out routing bottlenecksthat were hindering timing closure.

After several attempts, AnchorBay decided to explore thePlanAhead design tool from Xilinxto see if it could provide a solution.Their FAE support team was veryresponsive. They obtained an eval-uation copy of PlanAhead softwareand quickly studied the tutorialbefore the FAEs came to theiroffices and walked them throughthe methodologies.

Problem SolvedPlanAhead design tools allowedAnchor Bay to quickly pinpoint theresource bottlenecks and the rela-tionship between timing paths andplacement. They were able toattempt several “what-if” scenariosto better open routing channels andgroup critical timing paths. Takingthe results from the ISE timing ana-lyzer and feeding them back intothe floorplanning tool was invalu-able. Also, the ability to view theschematic allowed them to changethe logic where necessary to reducethe critical paths. They found thatproviding a simple floorplan wasenough of a seed to allow the PARtool to meet their timing needs.

They did not need to usePlanAhead software to heavily con-strain the PAR tool, but only con-centrated on the critical paths andcongested areas. After more thantwo weeks of trying without suc-cess, they accomplished what theyneeded in two days usingPlanAhead design tools.

Anchor Bay is continually push-ing the limits of the FPGA devicesin their systems. This allows themto not only sell ASIC solutions, butalso to develop custom, cost-effec-tive silicon solutions based onFPGA technologies.

Another Customer ExampleLike Anchor Bay, a number ofother customers have benefited

from PlanAhead’s advanced capabili-ties. For one such customer, theobjective was to reduce PAR runtimeand meet timing. To achieve this,they used PlanAhead software toquickly analyze the design, find thebottlenecks, and create the necessaryphysical constraints.

PlanAhead’s ability to place indi-vidual instances at specific locationsand to constrain multiple instancesto desired area groups was very use-ful in this process. Figure 1 shows aPlanAhead view showing the criti-cal paths not meeting timing.Figure 2 shows how PlanAheaddesign tools were used to create spe-cific LOC constraints. The finalfloorplan created using PlanAheadsoftware is shown in Figure 3, whilethe results achieved through use ofPlanAhead design tools are record-ed in Table 1.

ConclusionA growing number of customers areusing PlanAhead software to helpthem tackle tough design problems.In doing so, they have increased theirproductivity while achieving andmaintaining their design require-ments. These benefits include:

• Reaching and maintaining performance goals

• Quicker incremental design changes

• Faster PAR time

• Fewer design iterations

• Tighter utilization control

• IP reuse

Getting started with PlanAheadsoftware is easy. Visit www.xilinx.com/planahead to download a free,30-day evaluation version, as well asadditional information and an onlinedemonstration. Also, customersinterested in getting on-site designsupport can opt for the PlanAheadQuickStart! program.

Without PlanAhead With PlanAhead Software Software

Timing Errors 9 0

Timing Score 1746 0

Unmet Timing Constraints 3 0

PAR Runtime 234 min. 34 min.

Figure 1 – Analyzing critical paths using PlanAhead software

Figure 2 – Creating LOC constraints with PlanAhead design tools

Figure 3 – Design floorplan created in PlanAhead software

Table 1 – Design results with and without PlanAhead design tools

www.xilinx.com/paq PlanAhead™ QuickStart! provides a new level of personal, expert assistance

to ensure your success. The solution includes configuration of the Xilinx

ISE™ design environment, a comprehensive training plan, and a dedicated

QuickStart! engineer on-site for one week to provide all the training you

need to keep your project on time and on budget.

Your Complete Hierarchical Design Solution

PlanAhead QuickStart! offers a complete system floorplanner solution

to shorten development cycles and limit scheduling challenges. With

PlanAhead QuickStart! your team has immediate design expertise to

guarantee complete success…right from the start!

Contact your Xilinx representative or go to www.xilinx.com/paq for

more information.

©2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc.All other trademarks are the property of their respective owners.

Xilinx Productivity Advantage: PlanAhead QuickStart!

We’re With You

Right From The Start

by Philippe Garrault Technical Marketing EngineerXilinx, Inc.philippe.garrault@xilinx.com

Brian PhilofskyTechnical Marketing EngineerXilinx, Inc.brian.philofsky@xilinx.com

You can achieve increases in design per-formance by selecting the right hardwareplatform and silicon features, being famil-iar with the device architecture, or havingthe proper settings and features in yourimplementation tools. But one of the mostoverlooked ways to increase design per-formance is to write HDL code that is veryefficient for the targeted device. In this arti-cle, we’ll present coding style tips to accel-erate design performance.

Use of Resets and PerformanceFew system-wide choices have as much of aprofound effect on performance, area, andpower as reset choice. Some system archi-tects specify the use of a global asynchro-nous reset for the system. Whether it istruly needed or not, the ramifications ofthis choice are not always understood.With Xilinx® FPGA architecture, the useof and type of reset can have serious effectson the performance of your code.

HDL Coding Practices to Accelerate Design PerformanceHDL Coding Practices to Accelerate Design Performance

Small code changes can make a big difference.Small code changes can make a big difference.

SRLsIn all current Xilinx FPGA architectures,LUT (look-up table) elements are config-urable as either logic, ROM/RAM, or a shiftregister (SRL, or shift register LUT).Synthesis tools can infer the use of any oneof these structures from RTL code.However, in order to realize the use of theLUT as a shift register, a reset can not bedescribed in the code, as the SRL does nothave a reset. This means that shift registerscoded with resets results in suboptimalimplementation (requiring several flip-flopsand the associated routing between them),while code without resets results in fast andcompact implementation (using SRLs).

The effect on area and power is moreobvious for these two cases, but the effecton performance is a little less clear. In gen-eral, a shift register built out of flip-flops isnot going to be the critical path in a designbecause the timing path between registers isnot normally long enough to be the longestpath in the design. The added consump-tion of resources (flip-flops and routing)can have a negative influence on the place-ment and routing choices for other por-tions of the design, possibly resulting inlonger routing paths.

Dedicated Multipliers and RAM BlocksMultipliers are generally thought of forDSP designs. But because Xilinx FPGAarchitectures contain dedicated resourcesfor multiplication, multipliers can befound in many types of designs, perform-ing multiplication as well as other func-tions. Similarly, virtually every FPGAdesign uses RAMs of various sizes, regard-less of the application.

Xilinx FPGAs contain several blockRAM elements that can be used in a designas RAM, ROM, a large LUT, or even gen-eral logic. The use of both multipliers andRAM resources can result in more compactand higher performing designs, but resetchoice can have either a positive or negativeperformance impact, depending on thetype of reset used. Both RAM and multi-plier blocks contain only synchronousresets; thus, if an asynchronous reset iscoded for these functions, the registerswithin these blocks cannot be used. The

registers contain the ability to program theset/reset as either asynchronous or synchro-nous, you might think that there is no penal-ty to use asynchronous resets. Thatassumption is often wrong. If an asynchro-nous reset is not used, the set/reset logic canbe configured as synchronous logic; if so, thisfrees up added resources for logic optimiza-tion. To illustrate how asynchronous resetscan inhibit optimization, let’s look at the fol-lowing suboptimal code examples:

VHDL Example #1

process (CLK, RST)begin

if (RST = ‘1’) thenQ <= ‘0’;

elsif (CLK’event and CLK = ‘1’) thenQ <= A or (B and C and D and E);

end if;end process;

Verilog Example #1

always @(posedge CLK, posedge RST)if (RESET)Q <= 1’b0;

elseQ <= A | (B & C & D & E);

To implement this code, the synthesistool has no choice but to infer two LUTsfor the data path, because there are five sig-nals used to create this logic. A possibleimplementation of the above code wouldlook like Figure 1.

If, however, this same code is re-writtenfor a synchronous reset, as in the following

effect this has on performance can besevere. For example, using a fully pipelinedmultiplier targeting Virtex™-4 deviceswith an asynchronous reset can result in a200 MHz performance. Changing the codeto a synchronous reset can more than dou-ble design performance to 500 MHz.

The issues with RAMs are twofold.Similar to the multipliers, Virtex-4 blockRAMs have optional output registers which,when used, can reduce the clock-to-outtimes of the RAMs and increase overalldesign speed. These registers offer synchro-nous resets but not asynchronous resets, andthus cannot be used if the registers withinthe code describe an asynchronous reset.

A secondary issue comes to light whenusing the RAMs as a LUT or general logic.At times, it is advantageous for both areaand performance reasons to condense sev-eral LUTs configured as ROM or generallogic into a single block RAM. This can bedone either by manually specifying thesestructures, or (in automated ways) map-ping the portions of the logical design tounused block RAM resources. Because theblock RAM has a synchronous reset, themapping of general logic can occur withoutchanging the specified functionality of thedesign – if a synchronous reset (or no reset)is used. If an asynchronous reset isdescribed, this is not possible.

General LogicProbably the least-known effect asynchro-nous resets have is on general logic structures.Because all Xilinx FPGA general-purpose

Figure 1 – Synthesis infers two LUTs

examples of corrected code with reducedarea and improved performance:

VHDL Example #2

process (CLK)begin

if (CLK’event and CLK = ‘1’) thenif (RST = ‘1’) then

Q <= ‘0’;else

Q <= A or (B and C and D and E);end if;

end if;end process;

Verilog Example #2

always @(posedge CLK)if (RESET)Q <= 1’b0;

elseQ <= A | (B&C&D&E);

The synthesis tool now has more flexi-bility as to how this function can exist. Apossible implementation of the precedingcode would look like Figure 2.

In this implementation, the synthesistool can identify that any time A isactive high, Q is always a logic one (theOR function). With the register nowconfigured with the set/reset as a syn-chronous operation, the set is now freeto be used as part of the synchronousdata path. This reduces the amount oflogic necessary to implement the func-tion, as well as reducing the data pathdelays for the D and E signals from theprevious example. Logic could have also

been shifted to the reset side as well, if thecode was written in a way that was a morebeneficial implementation.

Consider the following addition tothese examples:

VHDL Example #3

process (CLK, RST)begin

if (RST = ‘1’) thenQ <= ‘0’;

elsif (CLK’event and CLK = ‘1’) thenQ <= (F or G or H) and (A or (B and Cand D and E));

end if;end process;

Verilog Example #3

always @(posedge CLK, posedge RST)if (RESET)Q <= 1’b0;

elseQ <= (F|G|H) & (A | (B&C&D&E));

Now that there are eight signals thatcontribute to the logic function, a mini-mum of three LUTs would be needed toimplement this function. A possible imple-mentation of the above code would looklike Figure 3.

If the same code is written with a syn-chronous reset:

VHDL Example #4

process (CLK)begin

if (CLK’event and CLK = ‘1’) thenif (RST = ‘1’) then

Q <= ‘0’;else

Q <= (F or G or H) and (A or (B and Cand D and E));

end if;end if;

end process;

Verilog Example #4

always @(posedge CLK)if (RESET)Q <= 1’b0;

elseQ <= (F|G|H) & (A | (B&C&D&E));

A possible implementation of the abovecode would look like Figure 4. Again, theresulting implementation not only usesfewer LUTs to implement the same logicfunction, but also could potentially resultin a faster design because of the reductionof logic levels for practically every signalthat creates this function.

These examples are simple, but they doillustrate our point of how asynchronousresets force all synchronous data signals onthe data input to the register, thus resultingin possibly more logic levels and less optimalimplementation. In general, the more signalsthat fan into a logic function, the more effec-tive the use of synchronous sets/resets (or noresets at all) in minimizing logic resources ormaximizing design performance.

Adder Chains Instead of Adder TreesMany signal processing algorithms performan arithmetic operation on an input streamof samples, followed by a summation of alloutputs of this arithmetic operation. The

Figure 2 – More flexible LUT inference Figure 3 – Synthesis infers three LUTs

adder tree structure is typically used toimplement the summation in parallelarchitectures such as FPGAs.

One difficulty with the adder tree con-cept is the varying nature of its size. Thenumber of adders is dependent on the num-ber of inputs in the adder tree. The moreinputs in the adder tree, the more addersyou need, which increases both the number

of logic resources and power con-sumption. Larger trees also meanlarger adders in the last stages ofthe tree, which further reducessystem performance.

To reduce power consump-tion and maintain high perform-ance, adder trees should beimplemented as dedicated siliconresources. But placing a numberof fixed-size adder tree compo-nents in silicon is not efficientbecause you would have to uselogic resources when the fixednumber of additions is exceededor even go to a larger FPGA,

thereby increasing the cost of the device.With its columns of DSP48 dedicated

silicon, the Virtex-4 device family takes adifferent approach in implementing sum-mations. It involves computing the sum-mation incrementally using chained addersinstead of adder trees. This approach is adeparture from any existing FPGA and is

key to maximizing performance and lower-ing power for DSP algorithms becauseboth logic and interconnect are containedentirely within the dedicated silicon.

When pipelined, performance of theDSP48 block is 500 MHz – independentof the number of adders. As illustrated inFigure 5, cascading ports combined withthe 48-bit resolution of the adder/accumu-lator allow computing of the current sam-ple calculation, along with the summationof all computed samples so far.

To take advantage of the Virtex-4 adderchain structure in the RTL, simply replacethe adder tree description with an adderchain description. This process of convert-ing a direct form filter to a transposed orsystolic form is detailed in the XtremeDSPDesign Considerations User Guide.

Once the conversion is complete, youmay find that the algorithm runs much fasterthan your application needs. In that case,you could further reduce device utilizationand power consumption by using either

LUT4RST

48 h7(n-7)

No wire shift

Y(n-10)

Slice 8

h6(n-6)

Slice 7

h5(n-5)

Slice 6

h4(n-4)

Slice 5

h3(n-3)

Slice 4

h2(n-2)

Slice 3

h1(n-1)

Slice 2

Slice 1

No wire shift

X(n-2)

Y(n-6)

The final stages of the postaddition in logic are theperformance bottlenecks that consume more power.

The post adders arecontained wholly indedicated silicon forhighest performanceand lowest power.

X(n-4)

Figure 4 – Inferred with synchronous reset

Figure 5 – Chaining adders provide predictable performance

folding or multi-channeling techniques.Both techniques help implement designs insmaller devices or allow you to add function-ality to a design using the freed resources.

Multi-channeling is a process that lever-ages very fast math elements across multi-ple input streams (channels) with muchlower sample rates. This technique increas-es silicon efficiency by a factor almost equalto the number of channels. Multi-channelfiltering can be looked at as time-multi-plexing single-channel filters. For example,in a typical multi-channel filtering sce-nario, multiple input channels are filteredusing a separate digital filter for each chan-nel. Taking advantage of the Virtex-4DSP48 block, you could use a single digi-tal filter to filter all eight input channels byclocking the single filter with an 8x clock.This reduces the number of FPGAresources needed by almost 8x.

Maximize Block RAM PerformanceWhen inferring memory elements, factorsaffecting performance include:

• using dedicated blocks or distributedRAMs

• using the output pipeline register

• not using asynchronous resets

There are also a couple of lesser knownareas – HDL coding style and synthesistool settings – that can substantiallyimpact memory performance.

HDL Coding StyleWhen inferring dual-port block memories,it is possible that both ports could try toaccess the same memory cell at the sametime. If both ports are simultaneously writ-ing different values at the same memorycell, this creates a collision and the memo-ry cell content cannot be guaranteed. Butwhat happens if one port reads while theother port is writing at the same address?Well, it depends on the target device. Thelatest Virtex and Spartan™ families havethree programmable operating modes togovern memory output while a write oper-ation is occurring. Additional informationabout these operating modes is provided inthe device user guides.

Note that the different modes affecthow the memory outputs behave and alsoaffect the performance of the memory. Asillustrated in the following example, yourcoding style determines in which mode thememory is operating:

// Inference of Virtex-4 memory blocks//// ‘write first’ or transparent modealways @(posedge clk) beginif(we) begindo <= data;mem[address] <= data;

end elsedo <= mem[address];

// ‘read first’ or read before write mode(slower)

always @(posedge clk) begin if (we) mem[address] <= data;

do <= mem[address]; end

// ‘no change’ mode always @(posedge clk) if (we) mem[address] <= data;

else do <= mem[address];

Add Pipeline LevelsAnother way to increase performance is torestructure long data paths made of sever-al levels of logic, breaking them up overmultiple clock cycles. This method allowsfor a faster clock cycle and increased datathroughput, at the expense of latency andpipeline management overhead logic.Because FPGAs are register-rich, the addi-tional registers and overhead logic are usu-ally not an issue.

Because the data is now on a multi-cycle path, you must use special consider-ations for the rest of the design to accountfor the added latency. The followingexample presents a coding style to add fivelevels of registers on the output of a 32 x32 multiplier. The synthesis tool willpipeline these registers to the registers

available in the Virtex-4 DSP48 block soas to maximize data throughput.

// 32x32 multiplier with 4 DSP48 (PIPE=5)always @(posedge clk) begin

prod[0] <= a * b;for (i=1; i<=PIPE-1; i=i+1)

prod[i] <= prod[i-1];end

Nests in the CodeTry not to make too many nests in thecode, such as nested if and case statements.If you have too many if statements inside ofother if statements, it can make the linelength too long, as well as inhibit synthesisoptimizations. By following this guideline,your code is generally more readable andmore portable.

When describing “for-loops” in HDL, itis preferable to place at least one register inthe data path, especially when there arearithmetics or other logic-intensive opera-tions. During compilation, the synthesistool will unroll the loops. Without thesesynchronous elements, it will concatenatelogic created at each iteration of the loop,resulting in a very long combinatorial paththat may limit design performance.

ConclusionRecent advances in synthesis and place androute algorithms have made achieving thebest performance out of a particular devicemuch more straightforward. Synthesistools are able to infer and map complexarithmetics and memory descriptions ontothe dedicated hardware blocks. They willalso perform optimizations such as retim-ing and logic and register replications.Based on timing constraints, the place androute tool can now restructure the netlistand perform timing-driven packing andplacement to minimize placement androuting congestions.

However, today (just as yesterday), thereis only so much the tools can do to maxi-mize performance. If you need more per-formance out of your design, then a veryefficient way to proceed is by learning moreabout the target device, the synthesis tool,and by using the coding guidelines illus-trated in this article.

by Edgard GarciaXilinx Consultant/DesignerMulti Video Designsedgard.garcia@mvd-fpga.com

The Xilinx® Virtex™-4 family introduceda new high-performance concept for fastand complex DSP algorithm implementa-tion. The XtremeDSP™ DesignConsiderations User Guide, available onthe Xilinx website (www.xilinx.com/bvdocs/userguides/ug073.pdf), describes how youcan take advantage of the DSP48 architec-ture and includes several examples.

When you have to develop a real DSPapplication, you can of course instantiateeach DSP48 block and assign their respec-tive attribute values to obtain the correctbehavior. But did you know you can alsoinfer most of the useful DSP48 configura-tions by writing very simple RTL code?

Developing DSP algorithms in VHDL(or Verilog) is a nice way to maintaindesigns over a long period of time, but the

synthesis results must meet your perform-ance requirements. In this article, I willshow you how to write RTL code to takefull advantage of Virtex-4 DSP48 blocks.

DSP48 ArchitectureThe Virtex-4 DSP48 architecture is exten-sively described in the XtremeDSP UserGuide. Let’s start, however, with anoverview of some very important aspects ofDSP48 blocks:

• DSP48 blocks have two 18-bit inputsto feed the multiplier. If you want towork with unsigned data, 17 bits isthe maximum width of the multiplierinputs. Don’t forget to expand theunsigned data/coefficients by concate-nating one or more ‘0’ to the most sig-nificant bit (MSB). Similarly, if usingthe adder/subtracter, its inputs andoutput will have to be 48 bits or lessfor signed arithmetic and 47 bits orless for unsigned.

For the examples described in this arti-cle, we will use signed data. You will haveto use the IEEE.STD_LOGIC_SIGNEDpackage.

• Another important parameter fordescribing DSP behavior for Virtex-4DSP48 blocks is that all DSP48 inter-nal registers have a synchronous reset(using asynchronous reset will preventthe synthesis tool from using theDSP48 internal registers). The resetfunctionality has priority, regardless ofOpCode or other control inputs.

• It is important to note that the laststage of the adder/subtracter can bedriven dynamically to take a 48-bitinput (from the output stage feedbackor from the DSP48 C or Pcin input)and to add or subtract another 48- or36-bit input (originating for mostcommon cases from the multiplieroutput).

Writing RTL Code for Virtex-4 DSP48 Blocks with XST 8.1i

Writing RTL code for your DSP applications is easy and efficient.

Basic Examples1. Multiplier_accumulator. This commonly used function is ourfirst example, useful for FIR filters and other DSP functions. Hereis the source code:

library IEEE; use IEEE.STD_LOGIC_1164.ALL;use IEEE.STD_LOGIC_SIGNED.ALL; — Signed arithmetic is used

entity MULT_ACC isPort ( CK : in std_logic;

RST : in std_logic; — Synchronous resetAin, Bin : in std_logic_vector(17 downto 0); — A and B inputs of the multiplierS : out std_logic_vector(47 downto 0) ); — Accumulator output

end MULT_ACC;

architecture Behavioral of MULT_ACC is

signal ACC : std_logic_vector(47 downto 0); — Accumulator output

process(CK) beginif CK’event and CK = ‘1’ then

if RST = ‘1’ then ACC <= (others => ‘0’);else ACC <= ACC + (AIN * BIN);end if;

end if;end process;

S <= ACC;

end Behavioral;

This example will be synthesized into a single DSP48 block – noother logic resource is necessary. The performance is about 180-200MHz, depending on placement and routing.

2. Fully pipelined Multiplier_accumulator. If you need more per-formance and less dependency on place and route tools, you canstill improve the performance of the Multiplier_accumulator. TheDSP48 blocks have internal input registers (zero, one, or two stagesfor A and B inputs), as well as one selectable multiplier output reg-ister. The following RTL code uses one level of registers at the A andB inputs, as well as the multiplier output register:

entity MULT_ACC isPort ( CK : in std_logic;

RST : in std_logic; — Synchronous resetAin, Bin : in std_logic_vector(17 downto 0); — A and B inputs of the multiplierS : out std_logic_vector(47 downto 0) ); — Accumulator output

end MULT_ACC;

architecture Behavioral of MULT_ACC is

signal AinR, BinR : std_logic_vector(17 downto 0); — Registered Ain and Binsignal MULTR : std_logic_vector(35 downto 0); — Registered multiplier outputsignal ACC : std_logic_vector(47 downto 0); — Accumulator output

if RST = ‘1’ then AinR <= (others => ‘0’);BinR <= (others => ‘0’);MULTR <= (others => ‘0’);ACC <= (others => ‘0’);

else AinR <= Ain;

BinR <= Bin;MULTR <= AinR * BinR;ACC <= ACC + MULTR;

end if;end if;

end process;

S <= ACC;

end Behavioral;

This example will be synthesized by using just a single DSPblock. You can take advantage of the internal registers to greatlyimprove performance to more than 400 MHz for the slowestVirtex-4 speed grade, independent of the implementation (placeand route) tools.

3. Fully pipelined Loadable_Multiplier_accumulator. You canimprove the design further by using a loadable multiplier accumu-lator. For more details, please refer to the class material of the Xilinxcourse, “DSP Implementation Techniques for Xilinx FPGAs”(www.xilinx.com/support/training/abstracts/dsp-implementation.htm).Let’s modify the previous code for the load functionality:

entity MULT_ACC_LD isPort ( CK : in std_logic;

RST : in std_logic; — Synchronous resetAin, Bin : in std_logic_vector(17 downto 0); — A and B inputs of the multiplierLOAD : in std_logic; — Active high LOAD commandS : out std_logic_vector(47 downto 0) ); — Accumulator output

end MULT_ACC_LD;

architecture Behavioral of MULT_ACC_LD is

signal AinR, BinR : std_logic_vector(17 downto 0); — Registered Ain and Binsignal MULTR : std_logic_vector(35 downto 0); — Registered multiplier outputsignal ACC : std_logic_vector(47 downto 0); — Accumulator output

— 48 bit “ZERO” constant used for MULTR sign extension to 48 bitsconstant ZERO : std_logic_vector(47 downto 0) := (others => ‘0’);

if RST = ‘1’ then AinR <= (others => ‘0’);BinR <= (others => ‘0’);MULTR <= (others => ‘0’);ACC <= (others => ‘0’);

else AinR <= Ain;BinR <= Bin;MULTR <= AinR * BinR;

if LOAD = ‘1’ thenACC <= ZERO + MULTR; — OpCode = x05

elseACC <= ACC + MULTR; — OpCode = x25

end if;end if;

end if;end process;

S <= ACC;

end Behavioral;

4. Multiplier_accumulator_or_adder. This is another useful ver-sion of the multiplier accumulator. It is useful for multiplications ofdata buses of more than 18 bits (see Figure 1-18 in the XtremeDSPUser Guide). Here is the RTL code:

entity MULT_ACC_ADD isPort ( CK : in std_logic;

RST : in std_logic;SEL : in std_logic;A_in, B_in : in std_logic_vector(17 downto 0);C_in : in std_logic_vector(47 downto 0);S : out std_logic_vector(47 downto 0));

end MULT_ACC_ADD;

architecture Behavioral of MULT_ACC_ADD is

constant ZERO : std_logic_vector(47 downto 0) := (others => ‘0’);

signal AR, BR : std_logic_vector(17 downto 0);signal MULT : std_logic_vector(35 downto 0);signal Pout : std_logic_vector(47 downto 0);

if RST = ‘1’ then AR <= (others => ‘0’);BR <= (others => ‘0’);MULT <= (others => ‘0’);Pout <= (others => ‘0’);

else AR <= A_in;BR <= B_in;MULT <= AR * BR;

if SEL = ‘0’ then Pout <= C_in + MULT; — Opcode = 0x35 for C input— 0x15 for PCIN input

— if SEL = ‘0’ then Pout <= ZERO + MULT; — Opcode = 0x05 for ZERO— constant as input (Note 1)

else Pout <= Pout + MULT; — Opcode = 0x25 (Notes 2, 3)end if;

end if;end if;

end process;

S <= Pout;

end Behavioral;

Note that the synthesis results are not currently as optimized aswe could expect with XST 8.1. Some combinatorial logic will beused to implement the multiplexer between C_in and Pout, whilethe same function was available inside the DSP48 block. The per-formance is still 220 MHz for the -10 speed grade, and 270+ MHzfor -12. However, Synplify Pro 8.2 provides the ideal implementa-tion with the same RTL code.

Note 1 : Adding ZERO to Pout is equivalent to the previouslydescribed load function.

Note 2 : You can also use the 17-bit right shift on Pout by chang-ing this line as follows (at this time, this feature is supported onlyby Synplicity Synplify Pro 8.2):

else Pout <= ZERO + Pout(47 downto 17) + MULT;

Note 3 : If for any reason you do not want to use the output reg-ister of the multiplier, you can write:

Pout <= Pout + (AR * BR);

instead of declaring a combinatorial multiplier output. The result-ing RTL code is also more compact.

5. Symmetric rounding. Another simple but useful example is amultiplier with symmetric rounding (see Table 1-9 in theXtremeDSP User Guide). Assuming that you want to round theresult of the multiplication Ain x Bin to 20 bits, the following RTLcode will be synthesized in just one DSP48 block and one slice:

library IEEE; use IEEE.STD_LOGIC_1164.ALL;use IEEE.STD_LOGIC_SIGNED.ALL;

entity ROUNDING isPort ( CK : in std_logic;

RST : in std_logic;Ain, Bin : in std_logic_vector(17 downto 0);P : out std_logic_vector(19 downto 0));

end ROUNDING;

architecture Behavioral of ROUNDING is

constant ZERO : std_logic_vector(47 downto 0) := (others => ‘0’);

signal AR, BR : std_logic_vector(17 downto 0);signal MULTR : std_logic_vector(35 downto 0);signal Pout : std_logic_vector(47 downto 0);

signal Carry_in, Carry_inR : std_logic;

Carry_in <= not(Ain(17) xor Bin(17));Carry_inR <= Carry_in;if RST = ‘1’ then AR <= (others => ‘0’);

BR <= (others => ‘0’);MULTR <= (others => ‘0’);Pout <= (others => ‘0’);

elseAR <= Ain;BR <= Bin;MULTR <= AR * BR;

— Note that the following 4 operands adder will be implemented as a 3 operand one : — ZERO is a constant that allows easy sign extension for the VHDL syntax

Pout <= ZERO + MULTR + x”7FFF” + Carry_inR;end if;

end if;end process;

P <= Pout(35 downto 16);

end Behavioral;

This example will also work at 400 MHz for the Virtex-4 -10speed grade device and 500 MHz for the -12 speed grade device.Only one LUT and its associated slice flip-flop is used, as the sec-ond flip-flop is pushed inside the DSP48 block for carry input.

All of these examples can be used in a wide range of applications.You can see that they are very efficiently synthesized, and all of thelogic is mapped into the DSP48 blocks. The performance for eachof these DSP functions is independent of the place and route tools.

To make it easier for synthesis tools to recognize the DSP48 struc-

ture, it is important to write the code in a simple way, giving yourtools the best option to pack your desired functions into each DSP48block. For this reason, each code has been written in a single process.

The more simple and compact your RTL code, the more effi-cient the synthesis result. Of course, depending on your synthesistool, other alternatives can also give you excellent results, but theywill be more dependent on the synthesis tools.

Higher Complexity DesignsWhat happens when you need more complex DSP functions? Youcan use a similar approach for many complex DSP algorithm imple-mentations by describing each block separately to ensure optimalsynthesis results.

You will find many other examples, most of them directly relat-ed to those explained in their algorithmic and schematic form, inthe XtremeDSP User Guide.

ConclusionThis article is excerpted from the application note, “Virtex-4 DSP48Inference,” which is available at www.mvd-fpga.com/en/publi_V4_DSP48.html.

The application note includes additional examples, such as:

• Single DSP slice 35 x 18 multiplier (Figure 1-18 in theXtremeDSP User Guide)

• Single DSP slice 35 x 35 multiplier (Figure 1-19 in theXtremeDSP User Guide)

• Fully pipelined complex 18 x 18 multiplier (Figure 1-22 in the XtremeDSP User Guide)

• High-speed FIR filter (Figure 1-17 in the XtremeDSP User Guide)

The application note also describes many of the important featuresof DSP48 blocks supported by XST 8.1i and Synplify Pro 8.2. Mapreports, Timing Analyzer reports, and a detailed view of the FPGAEditor show the efficiency of the synthesis and implementation tools.You can also see how the cascade chain between adjacent DSP48 slicesis used to improve both performance and power consumption. Almostall of these widely used configurations provide the best implementa-tion results – in terms of resources used as well as performance.However, some remaining limitations are also described. We expectthese few points to be resolved in future releases.

For more information, see the XtremeDSP DesignConsiderations User Guide at www.xilinx.com/bvdocs/userguides/ug073.pdf. The methodology is clearly explained and implementa-tion results analyzed in detail, with ISE™ software tools likeTiming Analyzer and FPGA Editor.

Multi Video Designs (MVD) is a training and design center spe-cializing in FPGA designs, PowerPC™ processors, RTOS for embed-ded/real-time applications, and high-speed buses like PCI Expressand RapidIO. MVD as an Approved Training Partner and a mem-ber of the Xilinx XPERTS program, with offices in France, Spain,and South America.

XST Support for DSP48 Inference

XST, the synthesis engine included with the Xilinx®

ISE™ toolset, contains extensive support for inference of

DSP48 macros. A number of macro functions are recog-

nized and mapped to these dedicated resources, including

adders, subtracters, multipliers, and accumulators, as well as

combinations like multiply-add and multiply-accumulate

(MAC). Register stages can be absorbed into the DSP48

blocks, and direct connect resources are used to cascade

large or multiple functions.

Macro implementation on DSP48 blocks is controlled

by the USE_DSP48 constraint with a default value of

auto. In auto mode, XST attempts to implement all afore-

mentioned macros except adders or subtracters on DSP48

resources. To push adders or subtracters into a DSP48, set

the USE_DSP48 constraint value to yes.

XST performs automatic resource control in auto mode

for all macros except adders and subtracters. In this mode

you can control the number of available DSP48 resources

for synthesis using the DSP_UTILIZATION_RATIO

constraint, specifying either a percentage or absolute num-

ber. By default, XST tries to utilize, as much as possible, all

available DSP48 resources within a given device.

With the 8.1i release of ISE software, XST has intro-

duced further enhancements to its DSP support. XST can

now infer loadable accumulators and MACs, which are

critical for filter applications. XST can recognize chains of

complex filters or multipliers – even across hierarchical

boundaries – and will use dedicated fast connections to

build these DSP48 chains. The Register Balancing opti-

mization feature will consider the registers with DSP48

blocks when optimizing clock frequencies. Consult the

XST User Guide (http://toolbox.xilinx.com/docsan/xilinx7/

books/docs/xst/xst.pdf) for details about coding styles, and

watch the synthesis reports for specific implementation

results for your Virtex™-4 designs.

– David Dye

Senior Technical Marketing Engineer

Xilinx, Inc.

by Brent Przybus ChipScope Pro Product Marketing Manager Xilinx, Inc.brent.przybus@xilinx.com

For years I have looked at other careerchoices with envy. Pilots have an office inthe sky, traveling routes like New York toLondon and Los Angeles to Tokyo.Archeologists go on digs to such exoticlocales as Egypt, Africa, and SouthAmerica. Even cable television installers getto visit exotic homes in the Bay Area hills.

Engineers, on the other hand, toil awayin cold, sterile labs within the bowels of acorporate complex, trying to solve prob-lems and bring their next engineering mar-vel to life. As FPGA designers, most of thattime is spent running simulations, verify-ing, and debugging designs implementedwith the latest Xilinx® FPGA technology.

I am happy to say that the days of pastywhite skin and wearing sweaters in summerare over. Remote debugging is finally here.

Un-Tethered Debugging

Remote debugging enables designers to configure, debug, and verify Xilinx FPGA systems remotely.

V E R I F I C A T I O N

Preparing for Remote DebuggingIf you have used ChipScope™ Protools, then you understand the value ofplacing ILA, VIO, IBA, and ATC2cores within your FPGA designs to gainaccess and insight into what is happen-ing. If you have not used the ChipScopePro analyzer before, do not despair. TheChipScope Pro debugging and verifica-tion design tool allows you to debugand verify your Xilinx FPGA designson-chip by following these steps:

• Instrument your design by usingeither the ChipScope Pro core insert-er or core generator. These toolsallow you to add ChipScope Prodebugging cores to your new or exist-ing designs. The debugging corescome in several different flavors: inte-grated logic analysis (ILA), integratedbus analysis (IBA), virtual input out-put (VIO), and the Agilent TraceCore 2 (ATC2).

• Configure, debug, and verify yourdesign using the ChipScope Proanalyzer. The ChipScope Pro ana-lyzer runs on a standard PC run-ning either Windows or Linux andinterfaces to the FPGA through theJTAG port. Through the analyzer,you can interact with the debuggingcores you have placed in yourdesign, setting trigger conditions,capturing data, and displaying thatdata in a convenient waveform orlist view interface similar to abench-top logic or bus analyzer.

Figure 1 shows how ChipScope Protools are integrated into the standardXilinx FPGA design flow.

The latest version of the ChipScopePro analyzer includes remote debug-ging capability. To use this new feature,you need to specify a ChipScope Proserver – a lab machine with networkaccess – connected to the board andthe FPGA through the JTAG portusing a Parallel IV or USB configura-tion cable. In most cases, this is thestandard setup that has kept designersin the lab all these years (Figure 2).

ChipScope Pro CoreGenerator

DesignEntry

ChipScope Pro CoreInserter

ChipScope Pro CoreOn-Chip Debugging

ChipScope Pro CoreOn-Chip Verification

DeviceConfiguration

DesignImplementation

Design Verification

FunctionalSimulation

TimingSimulation

Static TimingAnalysis

BackAnnotation

Synthesis

ChipScopePro

Host Computer withChipScope Pro Software

Target Device Under Test

UserFunction

JTAGConnections

Board-Under-Test

ParallelCable

ILA Pro

ICON Pro

ILA Pro

Figure 1 - ChipScope Pro tool integration into the Xilinx FPGA design flow

Figure 2 – Typical ChipScope Pro lab setup

Setting up and starting the ChipScopePro analyzer server is easy using version7.1i or later of the ChipScope Pro tools:

• Start the server on a Windows machineby executing:

$CHIPSCOPE/cs_server.bat

or start the server on a Linux machineby executing:

$CHIPSCOPE/bin/lin/cs_server.sh

($CHIPSCOPE refers to the directorypath where ChipScope Pro tools areinstalled on the server machine.)

• Use the following command lineoption, as required:

– -port <portnumber>. Specify theTCP/IP port number to be used bythe client and server to establish aconnection. The default port num-ber is 50001.

– -password <password>. Protect theserver from unauthorized access. Nopassword is set by default.

– -l <logfile>. Specify a location for thelog file. The default location is$HOME/.chipscope/cs_analyzer_<portnumber>.log.

Moving Out of the LabWith the ChipScope Pro server runningback in the lab, the next step is to run theChipScope Pro client software on yourremote system with a network connection,such as a laptop with wireless access.Setting up the client is easy from within theChipScope Pro analyzer software:

• Launch the ChipScope Pro analyzerfrom the start menu or from the ISEProject Navigator.

• Select the JTAG Chain > Server HostSettings menu option. This willlaunch a server settings dialog box.

• For remote operation, set the host set-ting to the IP address of the server orthe appropriate network system namefor the server. Set the port and pass-word setting to the same values usedon the server. If default settings wereused on the server, these settings arealready set correctly. Click OK.

• Open a connection to the JTAGdownload cable you are using and startdevice configuration and debugging.The remote connection to the serverwill not be established until you open aconnection to a JTAG download cable.

Note that you must use the same ver-sion of ChipScope Pro tools on both theclient and server machines.

By running the Xilinx ISE™ tools onthe client machine, you can create andmodify your design and add ChipScopePro cores from the comfort of your remotelocation. When you have finished imple-menting your design and creating the con-figuration file, simply launch theChipScope Pro analyzer, connect to theserver in the lab, reconfigure the FPGA,and debug just as if you were in the lab.

Practical ApplicationsBeyond debugging from anywhere, thereare some very real, practical applicationsfor remote debug.

• Sharing resources. Xilinx FPGAdesigns utilize embedded processors:the PowerPC™ 405 hard-core orMicroBlaze™ soft-core processor. Thedesigns may consume as many as100,000 logic cells, more than 10 Mbof blockRAM, and leverage as many as512 XtremeDSP™ slices. Often anentire design team works on a singleFPGA design, and at some point eachmember of the design team requiresaccess to the FPGA. Remote debug-

ging allows the design team to share asingle board in one lab location, withdesign team members accessing the labfrom offices throughout the world.

• Support fielded systems. Your producthas shipped, but there is a problem.The field application resource is on-sitewith a laptop but needs factory assis-tance. ChipScope Pro remote debug-ging allows the factory to access thefielding system through the field appli-cation engineer’s laptop connected tothe board through JTAG. The factorycan quickly uncover the design prob-lem using ChipScope Pro cores left inthe original design; determine, imple-ment, and test a fix; and upgrade thesystem remotely from the factory.

• Move to hardware faster. The availabilityof hardware sooner in the design processcan greatly accelerate the overall FPGAdesign flow. Rather than spending hourssimulating a complex sequence ofevents, run this portion of the design inhardware. By using an evaluation devel-opment board or a prototype systemwith a target FPGA, you can run simu-lations in hardware – with real systemdata, at the system clock rate – andcomplete analysis in a fraction of timesoftware simulation takes. The remotedebugging and verification solutionenabled by ChipScope Pro tools allowsyou to do this from your designmachine, in the comfort of your office.

Conclusion Remote debugging can increase productivi-ty and the quality of the work environ-ment. For more information about how touse ChipScope Pro tools, visit www.xilinx.com/chipscopepro, where you will findChipScope Pro documentation, links todemo-on-demand sessions, and answers tofrequently asked questions.

By running the Xilinx ISE tools on the client machine, you can create and modify your design and add ChipScope Pro

cores from the comfort of your remote location.

by Arun ThakkarMember of the Technical Staff-1Lucent Technologiesthakkara@lucent.com

The Optical Networking Products Divisionat Lucent Technologies is involved in pro-ducing next-generation SONET opticalnetworking equipment for Internet serviceproviders. Our current project adheres tothe OC48 standard, with serial I/O as highas 2.5 GHz. Internal logic clock speeds canrun anywhere from 155 MHz on down,with heavy system interface and bus trafficon our system boards.

Our past projects have relied heavily onHDL simulation, using VHDL models forcomponent modeling and bus functionalmodels to simulate the interface and back-plane traffic. To accurately model the entiresystem, we also added external memoryinterface device models and modeled theeffects of interface trace delays. Together,this system simulation would let us checknew design concepts in our existing andfuture projects. The downside was that wehad to create HDL test benches and main-tain them in parallel. System simulationtimes could be quite long and tedious, andthere remained the unanswered question,“Have we modeled enough to accuratelypredict system behavior”?

ChipScope Pro Analyzer – Real-Time DebuggingWe began using Xilinx® ChipScope™Pro in our systems with the 6.1i softwarerelease. In our current project, we areusing two to three ChipScope Pro ILAcores per clock domain inserted into ourtarget Virtex™-II Pro XC2VP30 device.The soft debug cores are inserted afterthe synthesis stage using the ChipScopePro netlist inserter, and we debug prima-rily by using the ChipScope Pro logicanalyzer. We are capturing roughly 100transitions in any one debugging cycle,depending on the particular problem weare researching, and we use trigger portsto save onboard block RAM memory.

ChipScope Pro tools let us capturedata at any point in the FPGA, while thechip is interacting with the rest of the sys-tem and running at operating speed. Wehave been able to reduce the number ofpins on the FPGA and on the board thatwere previously dedicated to verification,since we debug directly through theJTAG programming cable.

Uncovering ProblemsOne example, in which by using theChipScope Pro analyzer we debugged aproblem that we wouldn’t have otherwise

uncovered, was in a new revision of one ofour in-production products. At the ven-dor’s recommendation, our manufacturerhad recently upgraded one of the system’sexternal memories. Suddenly we were see-ing degraded performance and memoryparity errors. Nothing else had changed, yetthe system stopped working. Simulationdidn’t reveal the error, but we started trac-ing through with the ChipScope Pro tools,in real time. By triggering on the paritybyte, we were able to discover a handshakeproblem when writing data to and fromthe controller. We were issuing an auto-refresh before bank pre-charge before awrite cycle was complete. This was fine inthe older part, but the tolerances hadchanged slightly in the new part.

ConclusionWe still use HDL simulation in our proj-ects here at Lucent, but ChipScope Protools have now become a vital part of ourdesign and verification cycle, for all ofour projects. By catching problems inreal time in the lab and by taking advan-tage of the reprogrammability of FPGAs,we are able to turnaround design prob-lems in a matter of hours and get moreout of our project time.

Shorter Verification Cycles at Lucent TechnologiesShorter Verification Cycles at Lucent Technologies

ChipScope Pro real-time debugging software has reduced project verification times for Lucent’s high-speed Optical Networking Products Division.ChipScope Pro real-time debugging software has reduced project verification times for Lucent’s high-speed Optical Networking Products Division.

by Hamid AgahSenior Technical Marketing Manager, Design Software DivisionXilinx, Inc.hamid.agah@xilinx.com

Howard Walker Technical Marketing Engineer, Design Software DivisionXilinx, Inc.howard.walker@xilinx.com

Scott CampbellTechnical Marketing Engineer, Design Software DivisionXilinx, Inc.scott.campbell@xilinx.com

When Xilinx invented the FPGA in the mid1980s, the preferred way to verify a designwas to program the FPGA in the actual sys-tem and see if it operated properly fromboth a timing and functional standpoint.

Those days are long gone.According to a 2005 EDA study by

CMP Media, verification is one of the topthree considerations for FPGA designers. Afast and successful verification experience isessential to get your product to market ontime. But how do you know if your currentflow is the best choice, especially for today’shigh-density FPGAs? At a minimum, asound verification strategy should includestatic/dynamic timing and dynamic simu-lation. Optional advanced methodologiessuch as equivalency checking and assertion-based verification are now also available toXilinx FPGA users (Figure 1).

Verifying Your Logic Design for First-Time SuccessVerifying Your Logic Design for First-Time Success

Xilinx and its Alliance Members have the latest tools andmethodologies to support your verification requirements.Xilinx and its Alliance Members have the latest tools andmethodologies to support your verification requirements.

In this article, we’ll discuss improve-ments and additions to the verificationsolutions available from Xilinx and itsAlliance Program Members.

Improved Timing Analysis in ISE 8.1i SoftwareA complete timing verification mustinclude checking the FPGA design underboth the best- and worst-case operatingconditions. The worst-case conditionsoccur when the voltage supply to theFPGA is at a minimum and the tempera-ture of the FPGA is at a maximum. Theseconditions increase the internal delays ofthe device, and thus increase the potentialfor setup time violations. The best-caseconditions occur when the voltage supply

this methodology was sufficient for slowersystem-level interface standards, theincreasing speed of today’s dual-data-rate(DDR) source-synchronous interface stan-dards demands a more complete verifica-tion solution.

STA Analysis for Real-World ConditionsTo provide the most accurate timing verifi-cation, Xilinx® static timing analysis soft-ware automatically analyzes the designunder the best- and worst-case operatingconditions. This analysis is performedsimultaneously for the both the system I/Ointerface and the internal logic of thedesign. The methodology uses a combina-tion of minimum and maximum delays forsetup and hold-time analysis. In setup timeanalysis, the worst-case condition occurswhen the data path delay is at a maximumand the clock path delay is at a minimum.Conversely, the worst-case hold-timeanalysis condition occurs when the datapath delay is at a minimum and the clockpath delay is at a maximum.

To account for clock uncertainty in thedesign, the Xilinx software system allowsthe specification of the input jitter for eachclock. In addition to the incoming clockjitter, the clock uncertainty because of sys-tem jitter and the clocking system designautomatically takes into account all timinganalysis checks. Clock uncertainty increas-es the potential for a setup time violationby effectively decreasing the clock pathdelay. In the same manner, clock uncer-tainty increases the chance of a hold-timeviolation by increasing the clock path delay.

By providing the ability to simultane-ously model both the minimum and maxi-mum delays of the clock and data paths,Xilinx timing analysis software ensures thegreatest reliability of your system designacross all operating conditions (Figure 2).

ISE 8.1i Software Breaks New GroundAlthough static timing analysis verifies thatthe physical delays of data and clock pathsin the FPGA design will not cause setup orhold-time violations, you must also verifythe design’s functional operation. Becauseof the dynamic nature of the design, thefunctional timing operation must be tested

to the FPGA is at a maximum and the tem-perature of the FPGA is at a minimum.These conditions decrease the internaldelays of the device and increase the poten-tial for hold-time violations.

In addition to voltage and temperaturevariations, the clock system is subject touncertainty because of various sources ofjitter throughout the system. Jitter cancause the early or late arrival of a clock edgeand thus increase the chance of a setup orhold-time error.

Traditionally, FPGA-based static timinganalysis software has only been able to ana-lyze device operation under worst-casetemperature and voltage conditions with-out regard to clock uncertainty. Although

Functional Simulation

Synthesis

Static Timing

Place and Route

Static Timing

Silicon

Timing Simulation

Gate-Level Simulation

Setup Slack

Increased timingaccuracy usingminimum clock

path delays

Setup Slack

Valid Data

TCLK (Min)

TCLK (Max)

Figure 2 – Improved accuracy of static timing analysis (STA)

Figure 1 -Verification solution available to Xilinx users

at system-level speeds.In a manner similar tostatic timing analysis,for an accurate timingsimulation you musttake into account thebest- and worst-caseconditions due toprocess, voltage, tem-perature, and clockuncertainty. XilinxISE™ 8.1i softwarebreaks new ground indynamic timing simu-lation accuracy byallowing both mini-mum and maximum clock and path delaysto be simulated simultaneously. This uniqueability ensures that setup and hold-time vio-lations will be accurately accounted for, andworks automatically with all simulators.

Easy-to-Use Simulators from Xilinx ModelSim Xilinx Edition IIIWorking with the Model Technology divi-sion of Mentor Graphics, Xilinx has devel-oped a customized, lower cost version of thepopular ModelSim PE simulator calledModelSim Xilinx Edition III (MXE III)(Figure 3). MXE III is ideal for medium-density FPGAs with capacities as high as 2

million system gates, such as the Spartan™-3E FPGA family. It enables you to verify thefunctional and timing models of your designand your HDL source code (for more infor-mation, see www.xilinx.com/ise). A lowerperformance version of MXE III calledModelSim Starter is a no-charge feature ofthe ISE Foundation™ toolset.

MXE III’s features and capabilitiesinclude:

• Seamless integration with ISE software,delivering better dynamic verificationthrough automated graphical test benchgeneration and easy viewing in theProject Navigator processes window

• More capacity and faster performancethan MXE-II

• Support for system Verilog and VerilogPLI/VPI

• Excellent debug environment

• Waveform management tools

• Customizable user interface

• Batch-mode simulation

• HDL editor

• Source code debugging

• Verilog-2001 or VHDL-93 support(single language product)

• The MXE-III Starter version offers 50%faster HDL simulation and 20 timesmore design capacity than MXE-II

• Upgrade path to more powerful simu-lators such as ModelSim PE and SE

ISE Software SIM 8.1i Xilinx provides an integrated full-featuredHDL simulator as an optional designproduct for ISE Foundation users (Figure4). Also included at no charge in ISEFoundation is ISE Simulator Lite. Thisstarter version of ISE Simulator is ideal forsmaller devices.

ISE SIM 8.1i features include:

• Mixed-language Verilog 2001 andVHDL-93 design support

• Simple user interface

• Export to XPower for easier devicepower estimation

• Integrated wave editor for test benchcreation

• Design hierarchy, waveform, and con-sole views

• Source-level debugging

• Command-line console with TCLinterface

• Does not require FlexLM licensing

• “Generate Expected Results” processgenerates expected design outputbehavior based on input stimulus

Figure 4 – Xilinx ISE simulator provides streamlined integration.

Figure 3 – Easy-to-use Xilinx ModelSim Edition (MXE III)

Easier Verification of Hierarchical Designs A capability in the ISE Foundation toolsetcalled KEEP_HIERARCHY makes itmuch easier and faster to debug hierarchi-cal designs. This design flow not onlydecreases the time it takes to run timingsimulation, but also addresses the biggerproblem of finding and resolving problemsduring the debugging stage.

The idea is to maintain the hierarchyof selected sub-modules when the designgoes through the synthesis and implemen-tation flow and then verify these sub-modules in timing simulation before theentire design is assembled and verified.Each sub-module can be written out as aseparate netlist and verified both in RTLsimulation as well as in timing simulationwith a separate associated SDF file.Because each sub-module for a timingnetlist looks the same as the RTL version(with the same top-level port names), thesame test benches can be used in timingsimulation that were used in RTL simula-tion with little or no additional work.

Analysis of this methodology on typicalVirtex™-4 designs has shown that boththe simulation run times as well as thememory requirements required for simula-tion were considerably reduced. See the“Design Hierarchy and Simulation” sectionin the ISE 8.1i Synthesis and VerificationDesign Guide for more information.

Faster Simulator Performance for Xilinx DevicesXilinx and its partners understand thatshorter simulation runtime is a never-end-ing goal. Cadence NC-Sim v5.5, with itshard-coded Xilinx library primitives, andMentor ModelSim SE 6.1, with ‘vopt’,have improved simulation runtime.

Advanced Verification Methodologies Assertion-Based VerificationAssertion-based verification (ABV) is a blendof assertions, functional coverage, and formalmodel checking technologies applicable toboth ASIC and Xilinx FPGA designs.Assertions are explicit expressions of designintent, capturing what a circuit structureshould or should not do. By embeddingassertions in a design and having them mon-

itor design activities, assertions improve theobservability of the design. For more infor-mation on ABV for Xilinx FPGAs, see “EarlyDefect Discovery with Assertion-BasedVerification Accelerates Design Closure,”also in this issue of the Xcell Journal.

Equivalency CheckingEC is a static verification technology thatuses formal techniques to determine iftwo versions of the same design, at differ-ent stages of development, are functional-ly equivalent. This offers 10 to 100 timesfaster verification time and 100% func-tional coverage, with no test vectorsrequired. It is also able to check for anytool-induced bugs. EC can eliminate longsimulation runtimes when doing func-tional checks, as code is modified to meetyour timing goals.

To use the EC flow for Xilinx devices:

1. Download the EC libraries atwww.xilinx.com/ise/partner_libraries.

2. Set up the synthesis and implementa-tion tools to generate “formally verifi-able” netlists by turning offoptimizations such as constant regis-ters removal, register duplication, reg-ister merging, and disable re-timing.

a. In Synplify Pro, use the followingvariables to enable EC and writeout the automated setup file,<deign>.vif (text): set_option -verification_mode 1set_option -write_vif 1

b. In DC FPGA, includeset_fpga_default -formality toenable EC and write out the auto-mated setup file, .svf (encrypted).

c. For ISE enable ‘netgen -ecb’ toensure that the <design>.svf fileincludes a listing of ISE optimizedconstant registers.

3. Output an EC-friendly Verilog netlistfrom CoreGen. The netlist is used as afunctional model for each instantiatedCoreGen IP to successfully verifyCoreGen blocks for the post synthesisto post-PAR check.

Here are some tips for having a successfulEC experience:

• Use CoreGen to instantiate large blockRAMs in the RTL (Figure 5)

• Turn off re-timing in the synthesis tool

• Isolate wide multipliers to a separatehierarchy and then black box it

• Replace old instantiated componentswhen targeting the RTL to a newarchitecture

Synplicity (Synplify), Cadence DesignSystems (Conformal), and Synopsys (DC-FPGA and Formality) have been workingwith Xilinx to enable this for you. Contactthese companies to learn more about:

• Synplicity Synplify Pro V8.0+ synthesiswith Conformal V5.0+

• Synplicity Synplify Pro V8.0+ synthesiswith eCheck V4.3+

• Synopsys DC FPGA V2005.03+ withFormality V2005.03+

ConclusionVerification is now one of the top threeconsiderations for designers. Xilinx and ourAlliance Program Partners are continuallyinvesting to improve your verificationexperience through:

• Improved timing accuracy

• Better integrated and easier-to-use tools

• Hierarchical design support

• Faster simulator performance

• New methodologies

For more information, visit support.xilinx.com.

Figure 5 – Xilinx CoreGen netlist for equivalency checking

by Ping Yeung Principal Engineer Mentor Graphics Corporationping_yeung@mentor.com

Darren ZacherTechnical Marketing EngineerMentor Graphics Corporationdarren_zacher@mentor.com

When designing a system-on-chip (SoC)using a platform FPGA for mobile, wireless,networking, or media applications, design-ers typically integrate many IP componentswithin a short period of time. By offeringadvanced Xilinx® MicroBlaze™ and on-chip PowerPC™ processors, platformFPGAs such as the Virtex™-II Pro andVirtex-4 families give you a head start onintegrating processor IP.

Regardless of your IP choice, marketrequirements are driving the need for morefeatures, such that processor-based plat-form designs are becoming more complex– incorporating multiple processor coresand multi-layered bus architectures. Thisspiraling increase in complexity drives theneed for new approaches to platformFPGA design and verification. The simple,push-button flow of synthesis logic map-ping, with verification as an afterthought,just doesn’t cut it.

Just as design flows for platform FPGAshave come to more closely resemble thoseadopted by ASIC SoC designers – wheresynthesis is closely linked to verification atevery step – so too have FPGA developersbegun to see the benefits of leveraging theconvergent synthesis and verification strate-gies developed for complex ASIC SoCs.

Assertion-based verification (ABV) hasemerged as a major player within this con-vergent design and verification flow.Assertions have been used successfully formore than a decade in microprocessordesign. But only recently has their fullpotential been realized. The payoff is bigfor large, complex designs, including plat-form-based designs that contain a growingamount of third-party and in-house IP. AsFPGAs approach the complexity and sizeof larger ASICs, assertions have becomeparticularly useful to FPGA designers.

With the goal of examining the valueof ABV within a convergent flow thatencourages early discovery of design defi-ciencies (hereafter referred to as defects),we’ll recommended a few strategies forsuccessful design closure and explore ABVat greater length.

Early Defect Discovery with Assertion-Based Verification Accelerates Design Closure

The convergence of design, synthesis, and verification.The convergence of design, synthesis, and verification.

Essential Strategies for Success Defect discovery must be consciously pur-sued earlier in the design cycle, where theoverall pain and cost of fixing errors is muchless. Here are some recommended best prac-tices for accelerating design closure:

• Do more timing and performanceanalysis up-front. Are you willing towait until after your design is function-ally verified, synthesized, and placedand routed to find out whether yourchosen arbitration scheme can keep upwith incoming traffic? Early discoveryof throughput issues requires more in-depth analysis of performance and tim-ing issues throughout the synthesisprocess. Before burning cycles trying tomeet timing constraints in place androute, you must ensure that constraintsare complete. Early discovery of timingissues also requires constraint coverageanalysis during synthesis.

• Use interactive synthesis techniques forgreater predictability. An indispensableweapon in your FPGA design toolkit isa capable, interactive synthesis andanalysis environment that goes fromRTL to physical implementation.Interactive synthesis techniques allowyou to perform “what-if ” explorationsearlier in the design cycle. A robustsynthesis environment also providesseveral design representations such ashigh-level operators and architecture-specific technology cells. Interactivesynthesis capabilities give you an earlierunderstanding of the nature of thedesign and indicate whether it willmeet specifications.

• Check HDL code early and often.Today’s design-management tools assistin checking code against an establishedset of agreed-upon (by the design teamor silicon vendor) coding style rules.Before burning simulation cycles, youcan use these coding style rule checkersto catch actual defects and flag poten-tial defects, thus bringing defect dis-covery up-front in the design cycle.

• Implement a more effective functionalverification strategy. Synthesis tools for

tions are critical. These assertions originatefrom the following sources:

• Using proprietary or standard assertionlanguages, such as SystemVerilogAssertions (SVA) and the PropertySpecification Language (PSL), designengineers can write assertions as youimplement various structures. They maysometimes make an inadvertent mistakein their implementations; assertionsprovide critical cross-checks betweenintended and actual behaviors. Forexample, suppose the intended maxi-mum value of a pointer is 48. An asser-tion might detect that a coding mistakeexists that allows the pointer to gobeyond 48 in otherwise legal situations.

• Verification engineers can write asser-tions from a specification of the design,a design document, or from their per-sonal understanding of the design.Normally these assertions address con-cerns at a high level and cover criticalbehaviors of the design. For example,suppose a particular DMA controllerhas a channel configured to do 10memory transfers. Before the channel isde-allocated, the total number of trans-fers must be exactly 10, or a seriousproblem will occur. The verificationengineer can add an assertion to checkfor such an event.

• Both design and verification engineersinsert protocol assertion monitors tocheck for violations of the correspon-ding interface protocols. These moni-tors ensure that transactions betweencomponents are properly generatedand handled correctly. EDA vendorshave developed off-the-shelf protocolassertion monitors for common stan-dard interfaces (such as MentorGraphics CheckerWare protocol mon-itors). These monitors come with allof their protocols’ rules encapsulated,and are kept up-to-date to account forprotocol standards revisions. Usedwith simulation, protocol assertionmonitors collect useful statisticalinformation. For formal verification,they act as constraints.

platform FPGAs do more than simplygenerate a technology-mapped netlist.Best-of-breed synthesis tools containimportant analysis capabilities thatprovide more insight into your designat every stage in the cycle. These capa-bilities can identify potential problemareas, such as clock domain crossingpoints, where you need to handle func-tional verification delicately. Moreover,when applying a traditional functionalverification approach (using VHDL orVerilog test benches) to platformFPGAs, modeling random or pseudo-random stimuli and checking circuitresponse against designer intentbecomes increasingly tedious. This iswhere ABV can play a significant role.

Assertion-Based VerificationTo go beyond traditional simulation-basedverification, you must improve the observ-ability of the verification process and con-trol of the design. This is best done usingABV: a blend of assertions, functional cov-erage, and formal model-checking tech-nologies. Assertions are explicit expressionsof design intent, capturing what a circuitstructure should or should not do. Byembedding assertions in a design and hav-ing them monitor design activities, asser-tions improve observability. Formal modelchecking (FMC) analyzes the RTL struc-ture, characterizing the internal nature of adesign to augment control.

Because ABV creates an unlimited num-ber of observation points (assertions)throughout the design, it speeds debuggingby identifying errors close to or at theirsource. Assertions also identify problems thatdo not propagate to a primary output; thusABV improves design quality by exposingbugs that might otherwise go undetected.

Assertions provide a set of properties astargets for FMC. They also enable simula-tion test vectors to be used by dynamic for-mal verification (which we will explainlater) to find potential violations of a mod-ule’s assertions.

The basic ABV methodology comparesthe implementation of a design against itsspecified assertions. Therefore, the qualityand comprehensiveness of a design’s asser-

• Some verification tools insert assertionsinto the design by analyzing the RTLsource-code structure. Solutions formany common design problems pro-vide the basis for this analysis, allowingautomation of the instrumentationprocess. Many problems are discoveredstatically, either by the tool itself (usingnetlist analysis) or with formal analysis.Other problems are detected only withdynamic verification methods, such asfunctional simulation or dynamic for-mal verification. Examples of problemsanalyzed by automatic assertion inser-tion include synthesis-to-simulationmismatch problems, unreachable codeor FSM states, clock domain crossing(CDC) errors, X-semantic problems,and CDC clock jitter.

ABV Essentials ABV accelerates debug and improvesdesign quality by finding and fixing bugsearly, at or near their source.

Simulation-Based Verification with AssertionsDuring functional simulation, a design’sassertions monitor activities both inside thedesign and on its interfaces (Figure 1).Assertion violations are reported immediate-ly; the problem need not propagate to thedesign’s output ports to be detected by thetest bench. This high observability simplifiesbug triage and increases the understandingof potential causes of discovered problems.

Formal Model Checking with AssertionsFMC is a complementary technology tosimulation and an integral part of an ABVmethodology. FMC uses mathematicaltechniques to prove some assertions true(given a set of assumptions provided by theassertion library) and other assertions false(by discovering counterexamples). A proofmeans that FMC has explored all possiblebehavior with respect to the assertion andhas determined it cannot be violated. Acounterexample shows the circumstancesunder which the assertion is not satisfied.FMC examines all possible states of adesign to determine if any of them violatea specified set of properties.

Dynamic formal verification, a newFMC technique pioneered by the 0-In busi-ness unit of Mentor Graphics, overcomesthe memory and computational limitationsof static formal verification by amplifying agiven simulation trace. This enhances ABVby enabling FMC at both the sub-block andchip levels, using static and dynamic formaltechniques, respectively (Figure 2).

Given a set of target behaviors, expressedas assertions or coverage points, dynamicformal verification uses formal analysis of“interesting states” along the simulationtrace to determine if it is possible to reachany of the coverage points or violate anyassertion. For functional verification, an

“interesting state” is one where a new or raredesign behavior has occurred.

Consider a bug that occurs many cyclesdeep into a simulation. This particular bugmight occur only if a particular FIFO in thedesign is full and a particular FSM is in stateFOO. With simulation, it is possible that atest will manage to fill up the FIFO, whilethe same or another test can also get the FSMinto state FOO. To get exactly the right com-bination of random stimuli for both events tooccur may be a low-probability event.

With dynamic formal verification, from asimulation state where the FIFO is full, for-mal analysis examines all possible states lead-ing up to and beyond that point in time,

PCIBridge

SDRAMController

EthernetController

EncryptionEngine

Checker

RAM RAM

Exercise

Bug Bug

PCIBridge

Static FormalVerification

Exhaustive

EthernetController

EncryptionEngine

Checker

RAM RAM

Confirm

Controller

Dynamic FormalVerification

Locally Exhaustive

Simulation Test Case

Figure 1 – Simulation-based verification with assertions

Figure 2 - Formal model checking with assertions

identifying and reporting the particularsequence that leads to the state that manifeststhe bug. Thus, dynamic formal verificationcoupled with constrained random stimuliuncovers a bug that would have otherwiserequired additional test bench code to detect.

Without dynamic formal verification,you would have to know that a potentialbug is lurking and modify the stimulusconstraints in an attempt to guide the testto the right set of states. In actuality,dynamic formal verification can amplifyany single scenario into 10,000 or moreeffective tests. With this in mind, the actu-al set of constraints for the test bench canbe much simpler because the constraints donot need to accommodate the fine-grainedtweaking discussed above, saving the sub-stantial time it takes to conceive and codesuch elaborate constraint environments.

Interface Protocol MonitorsInterface protocol monitors allow devices inplatform-based designs to be isolated andverified. In any good processor-based design,every component must communicate cor-rectly with the available interfaces. Protocolmonitors check for violations of the corre-sponding interfaces at the block or chip level.

A protocol monitor includes assertionsto enforce the protocol rules. It ensures thattransactions between components areproperly generated and handled correctly.During simulation or formal verification,incorrect protocol transactions will beidentified at the source.

For hardware-related issues, an ABVmethodology using the CheckerWare asser-tion library can improve defect discovery inprocessor-based platform FPGA designs.Besides checking for violations during sim-ulation, CheckerWare protocol monitorscollect functional coverage and statisticalinformation and act as constraints for for-mal functional verification. Monitorsaffirm the transactions performed, high-light “holes” in the regression suite, andaudit the quality of pseudo-random stimu-lus generation environments.

Hard-to-Verify StructuresAs processor-based designs become morecomplex, components become buried so

deeply in designs that it is difficult to con-trol and test them effectively. For example,in any multi-layered bus architecture, mul-tiple masters can talk to multiple slavesconcurrently. This capability increasesoverall system bandwidth, but it alsoincreases verification complexity.

To tackle these hard-to-verify structures,you can capture design intent with assertionsand checkers and embed them in the RTLcode. The embedded assertions automatical-ly check for incorrect behavior resulting fromerrors in the RTL code itself, catching designproblems locally without manual interven-tion by the designer or verification engineer.

Embedded assertions also allow you tolocate the source of problems faster. In atraditional verification flow, errors detectedduring simulation are time-consuming totrace to the originating problem becausethere are many possible causes for each fail-ure symptom. For instance, if multipleDMA channels are set up to transfer datafrom several interfaces to memory, you maydetect an error in the memory data at theend of the test, but it is difficult to trackdown the exact transaction that producedthe data error. Using assertions, you canquickly identify the exact time of the datacorruption and the hardware resourcesinvolved with the problem.

ConclusionThe higher capacity and complexity ofmodern platform FPGAs has opened upunprecedented opportunities for compa-nies, but it has also created increasinglytough challenges for designers. Adopting –and adapting to – strategies that leverage aconvergent design, synthesis, and verifica-tion flow for earlier defect discovery canprove beneficial for FPGA engineeringteams in terms of reduced iterations andfaster design closure.

ABV accelerates debugging by findingand fixing bugs early, at or near theirsource, and improves design quality whilereducing development costs by discoveringerrors that would otherwise go undetecteduntil after tape-out.

For more information on advanced ver-ification methodologies from MentorGraphics, visit www.mentor.com/fv.

Xilinx Events and TradeshowsXilinx participates in numerous trade shows and events throughout the year.

This is a perfect opportunity to meet our silicon and software experts, ask questions,

see demonstrations of new products, and hear other customer success stories.

For more information and the current schedule, visit www.xilinx.com/events/.

Worldwide Events Schedule

North America

Oct. 12-13 Denali MemconSanta Clara, CA

Oct. 17-20 MILCOM 2005Atlantic City, NJ

Oct. 18-20 NSDC 2005San Jose, CA

Oct. 24-27 GSPx 2005Santa Clara, CA

Oct. 24-27 Storage Networking WorldOrlando, FL

Oct. 25 FPF 2005San Jose, CA

Nov. 15-16 SDRF 2005Anaheim, CA

Jan. 5-8 CES 2006Las Vegas, NV

by Philippe GarraultTechnical Marketing EngineerXilinx, Inc.philippe.garrault@xilinx.com

Apart from the devices we proudly displayin our cubicles here at the factory (becausethey represent the fruits of a lot of labor),FPGAs usually end up on printed circuitboards (PCBs).

Closing on a pin assignment that willmeet requirements from both the PCB andFPGA environments is becoming morechallenging. On one side of the interface,ever-increasing FPGA performance, densi-ty, and I/O count are placing tighter boardconstraints on the layout of the signal toand from the FPGA. On the other side,timing, congestion, and signal integrity ofever-faster signals on the PCB are placingconstraints on FPGA pin assignment.

The latest EDA survey conducted byEE Times tends to reflect this challenge.Two-thirds of respondents said that theirlatest design uses two or more program-mable devices. They also selected “gettingthe FPGA to work on the PCB” as the sec-ond most challenging part of FPGAdesign projects.

Until recently, very few tools or process-es existed to assist with FPGA pin assign-

ments, but this is changing. In this article,I’ll look at the causes of pin assignmentchanges in today’s design environments,describe the implications of such changes,and review the different tools available tosimplify and automate FPGA pin assign-ment closure.

The Need for FPGA Pinout ChangesThere are many good reasons to modify theFPGA pinout throughout the system designprocess. The flowchart in Figure 1 presentsa typical design flow, with an emphasis onFPGA pinout closure and the sources forthese changes. It clearly shows many areaswhere changes may occur; however, theseare introduced by three specific sources:

1. Pinout changes because of design flowconstraints. With today’s highly compet-itive and constantly evolving electronicmarkets, it has become critical for com-panies to shorten design cycles so as toreact faster to market demand changes.Therefore, product development is par-allelized wherever possible. This goes forFPGA and PCB design processes too.The system architect defines an initiallist of interface characteristics, whichPCB and FPGA designers use to starttheir design. This assignment is laterrefined (that is, changed) as both the

PCB and FPGA teams progress withproduct development. Additionally, mar-ket changes throughout the developmentcycle may require changes to the pinout,such as adding support for a new proto-col or adding a feature at the last minute.

2. Pinout changes because of PCB con-straints. Because of form-factor restric-tions or board cost control, the availablereal estate on the board may be limited.In such cases, using the programmabili-ty and flexibility of the FPGA pinoutcan help solve PCB congestion or rout-ing problems. Some of the FPGA fea-tures you can use include:

• Swapping pins to untangle nets on theboard. This diminishes the number ofvias needed and may reduce the num-ber of layers. Reducing the number oflayer changes a signal encompasses willalso improve its signal integrity andelectromagnetic emissions.

• Adjusting I/O properties to augmentboard signal integrity by lowering thesignal drive strength or slew rate.

• Using the programmable internal ter-minations to save on discrete compo-nent costs or to save board space incongested areas.

Simplifying FPGA Pin Assignment ClosureSimplifying FPGA Pin Assignment Closure

When integrating FPGAs on printed circuit boards, you have a myriad of tools from which to choose.When integrating FPGAs on printed circuit boards, you have a myriad of tools from which to choose.

3. Pinout changes because of FPGA con-straints. In turn, FPGAs impose con-straints on the design of the PCB, whichmay require pinout changes as the imple-mentation progresses. These pinoutchanges are because of:

• Timing. Margins on some signalsgoing into or out of the device may betight enough that only a limited set ofpackage pins will work.

• Dedicated/special-purpose pins.Because only a subset of package pinscan be programmed to function as spe-cial-purpose pins (such as global orregional clocks or programming pins),this places constraints on the board toroute signals to these capable I/Os.

• Voltage/termination compatibility.FPGA I/Os are grouped in banks,with all pins in a particular bank shar-ing power and reference voltages. Thismeans that once a particular voltage is

Implications of Pin Assignment ChangeGiven the reasons I have discussed, it isalmost certain that the FPGA pinout willchange during the system design process.But what implications – in terms of timeand effort – do these changes have on boththe FPGA and PCB environments? Whatsteps are necessary to verify that both envi-ronments are in sync and that constraintson either side are met?

From the FPGA side, after each pinoutchange you will need to update the user con-straint file (.UCF) and verify that the inter-nal timing constraints are met. You can alsorun PACE and SSO calculations to verify theassignment’s I/O banking or SSO guidelines.

From the PCB side, you will need toupdate the schematic and layout symbols fol-lowing a pinout change. You may also wantto run the DRC to ensure that electrical andphysical properties of the changed signal(s)pass the constraints. You can also run signalintegrity analysis to verify that timing andamplitude margins are still adequate.

The important step is to communicatepin assignment changes to the other envi-ronment. Historically, PCB and FPGAdesign environments have been pretty iso-lated, with only limited communicationchannels between them. As I will explain inthe next section, there are now tools andoptions available that make synchroniza-tion and data transfer far less cumbersomeand time-consuming.

Available Tools and ProcessesNow that you know that FPGA pin map-ping will change, and you understand thefrequency and implication of such changesin both FPGA and PCB designs, what areyour options to minimize the burden ofincorporating these changes? How can thepin mapping be propagated between envi-ronments in an automated fashion so thatno data gets lost in translation? How canthis be done quickly so that you can iteratepinout changes until a solution satisfactoryto both the PCB and FPGA teams is found?

Within the FPGA EnvironmentThe PACE (Pin and Area ConstraintEditor) program allows you to graphicallyassign design signal names to package pins.

used in this bank, only I/Os withcompatible voltage can be assigned inthe same bank. This too may force youto select pins on the FPGA packagethat are not optimal from a PCB routing standpoint.

• Simultaneous switching outputs (SSO).To ensure that signals driven by theFPGA will meet I/O standard electricalcharacteristics, Xilinx publishes a rec-ommended maximum number of SSOsper area of the device. You might haveto scatter pins across different areas ifyou exceed these recommendations.

• Device decoupling circuits. As withany other IC on the board, FPGAsrequire certain characteristics for thepower delivery system, which includesadding local decoupling networks (dis-crete components) to supply power tothe circuit (the voltage regulator is typ-ically unable to respond to rapid devicedemand changes).

change

Design Team

Design

Creation

To accommodate design changes

• I/O locations edits

• I/O properties edits

• Signals add/remove

Design

Implementation

Initial pinout

Incremental changes

Incremental change

Final Pinout

System Architect

closure

InputsInputs

For place and route, timing and

floorplan closure

Design

To accommodate design changes

Layout

Netlist

For place and route, timing or signal

integrity closure

Specification

change

Constraint File

Schematic

and Layout

Symbol

Constraint File

Schematic

and Layout

Symbol

Schematic

and Layout

Symbol

Constraint File

Figure 1 – Typical pin assignment closure process

PACE also presents a die view with a graph-ical representation of the FPGA internallogic. Thus, you can assign pins close to thedriving logic while also being mindful ofthe location recommendations from thePCB designer. You can enter pin propertiesfor I/O standard, drive, and slew rate. Thegraphical view easily identifies special-pur-pose pins such as global or region-al clocks. As illustrated in Figure2, PACE also performs voltagecompliance and SSO checks sothat the pin assignment is correctby construction.

This tool proves valuable at dif-ferent stages in the design process.You can create a pin assignmentfrom scratch or read in the signalnames from a synthesized netlist.After place and route, you caninteractively adjust the pin assign-ment to satisfy the implementationor PCB tool constraints. PACEgenerates a Verilog module orVHDL entity, along with a pinoutfile that can be loaded in mostPCB tools to generate and updateschematic and layout symbols.PACE also reads pin files from thePCB tool and generates a UCFconstraint file for ISE™ software.

Another tool worth mention-ing here is DesignF/X fromProduct Acceleration. With thisapplication, you define the I/Oproperties (standards, drivestrength) and general area of thepackage on which your interfaceswill be placed. The tool will thenplace every I/O so as to facilitatePCB routing, while making surethe FPGA banking, clocking,and SSO rules are obeyed.

Within the PCB EnvironmentYour favorite PCB design tool likely has awizard or some documented process toassist with the creation and maintenance ofyour FPGA schematic and layout symbols.There is probably an assisted way to updatethese symbols after a pinout change. Finally,to assist with high-pin-count devices, thereis also an option to fracture symbols onto

multiple sheets to simplify documentationand the connectivity process.

FPGA/PCB Co-Design ToolsTools designed to bridge the FPGA andPCB environments have recently debuted,with their feature lists continuouslyexpanding. Shown in Figure 3, programs

such as Mentor I/O Designer (althoughnot the panacea, because they do not yetunderstand all of the FPGA and PCB con-straints related to pin assignment) alreadygo a long way toward simplifying andaccelerating FPGA pinout closure. From asingle environment, you can:

• Automate the synchronization of bothenvironments after a pinout change.

The program will update the XilinxUCF file and schematic and layoutsymbols. This greatly reduces manualand error-prone data entry and theensuing verification process to makesure that both the FPGA and PCBenvironments are in sync.

• Assign or swap pins from agraphical view of the FPGApackage. Like PACE, it willperform DRCs to ensure thatthe pin assignment does notviolate any device rules.

• Assign or swap pins from agraphical view of the PCB.Visualizing the physicallocation of the other com-ponents the FPGA is inter-facing provides you withinformation you can use toassign pins in a manner thatwill reduce wire crossover onthe board and simplify theFPGA breakout. This can inturn lead to saved designtime (PCB router iterations)and money (better use ofboard real estate, reducedlayer, or via count).

ConclusionUsing traditional spreadsheet-based approaches, it is becomingmore cumbersome to close onprogrammable device pinout,because both FPGA and PCBenvironments impose ever-tight-ening or sometimes conflictingconstraints. Manual and time-consuming data entry and verifi-cation to ensure synchronizationof both environments some-

times results in designers creating subop-timal PCBs, or not taking advantage of allof the power and flexibility of the FPGA.Thankfully, there is a growing set of toolsand techniques that FPGA and EDAcompanies are putting in place to assistyou with pinout data creation, manage-ment, and synchronization. Check outthese tools, as they may help make betteruse of your time.

Figure 2 – PACE pinout export to PCB tools

Figure 3 – PCB wire uncrossing within Mentor I/O Designer

by Matt KleinSr. Staff Engineer, Applications Engineering,Advanced Products DivisionXilinx, Inc.matt.klein@xilinx.com

Managing your system power within budg-et is essential to maintain reliability in yoursystem. Failure to do so may cause compo-nents to break down and reduce reliability.

The semiconductor industry’s rapidmove toward 90 nm silicon processes bene-fits the performance and cost aspects, butplaces enormous pressure on power budg-ets. As transistor sizes decrease, leakage cur-rent (and hence static power) increasesexponentially. Dynamic power also increas-es, with increasing system speed and largerdesign density, but in a more linear fashion.

Today, many designs have 50/50 staticand dynamic power dissipation. Accordingto International Technology Roadmap forSemiconductors (ITRS) projections, staticpower is increasing exponentially at everyprocess node; thus, innovative process tech-nologies are imperative.

With the adoption of FPGAs in moremarkets and systems every year, driven byincreasing performance/density anddecreasing price, FPGA power consump-tion within the entire system is becomingcritical. Leading FPGA vendors are alreadyadopting new techniques to mitigate staticand dynamic power consumption.

Power Consumption ConsiderationsThe amount of power being consumed inthe system is very important, and FPGAs(often used for system integration func-tions) make up the majority of power con-sumed in these systems. A given system orindividual components usually have apower budget, which falls into two majorareas. The first area is a simple practicalone, which is to accommodate the powercapacity of the supplies used in a design tomeet system power needs. The second areais about thermal concerns, which you needto understand to keep the system workingwithin its various components’ tempera-ture specifications. To this end, it is impor-

tant to know where power consumptioncomes from in the FPGAs you choose andhow you can optimize it.

Working Within a Power BudgetAs mentioned, the power budget arisesbecause of power consumption and ther-mal concerns. Here is a typical example: aboard has a power budget of 20W with anormal operating environment of 10°C to40°C. Under conditions of a failed fan(s),ambient air above certain components mayrise above 70°C. Many components’ man-ufacturers have operating conditions thatrange as high as 85°C in junction tempera-ture (for commercial grade) and 100°C forindustrial grade parts. Using Xilinx®

ISE™ XPower or Web-based power esti-mation tools, you can see where powerconsumption will fall and if you will needto optimize your FPGA’s power consump-tion. It is also important to learn what con-sumes power in the FPGA and whatmethods of design optimization may beavailable to help reduce consumed power.

Power Considerations in 90 nm FPGA DesignsPower Considerations in 90 nm FPGA Designs

Learn how to design for reduced power with Virtex-4 FPGAs.Learn how to design for reduced power with Virtex-4 FPGAs.

Where is Power Consumed in the FPGA?There are two primary areas of power con-sumption in FPGAs. Static power comesfrom transistor leakage; dynamic powercomes from voltage swing, toggle rate, andcapacitance. Both are important factors inmeeting a power budget and power opti-mization. It is therefore important to knowwhat each factor is and how it varies withdifferent operating conditions.

Static PowerStatic power is power consumed by transis-tors due to leakage. This leakage is now sig-nificant for 90 nm devices. To get higherperformance from the transistor, you needto lower the voltage threshold, (VT) of thetransistor, which also increases leakage.

Leakage of the 90 nm transistors variesstrongly with process: the VT of the transis-tors varies because of doping, and the gatelength varies because of lithography. Thiscan produce strong changes in transistorspeed and leakage. Reduced VT or gatelength both increase leakage and speed,while the converse is also true. The varia-

present, the absolute leakage in Virtex-4FPGAs is about one-third that of compet-ing high-performance 90 nm FPGAs.

Dynamic PowerDynamic power is power consumed bytransistors and traces that are toggling. Theeffect, simply put, is from changing aninternal voltage from a logic “0” to a logic“1” (or vice versa) and charging a capaci-tance to that voltage. The more often this isdone, the more power consumed. In theFPGA, transistors are used for logic andprogrammable interconnects betweenmetal traces. The capacitance that we aretalking about is transistor parasitic capaci-tance and metal interconnect capacitance.

The formula for dynamic power is:

PDYNAMIC = nCV2fwhere n = number of toggling nodes, C = capacitance, V = voltage swing, and f = frequency.

All nodes in the FPGA consume powerthrough a combination of charging tran-sistor parasitic capacitance and metalinterconnect capacitance. The latterdepends on the length of routes in theFPGA, while net node capacitance isdetermined by the number of transistorsthat are switching. Tighter logic packingwill reduce the number of switching tran-sistors and minimize routing lengths,which will reduce dynamic power.

Figure 2 shows the variation of dynam-ic power with voltage swing and core volt-age, VCCINT. Process and temperature causelittle variation in dynamic power. Takentogether, their effect is less than 5-10%.

Lowering Design Power by Changing the FPGA EnvironmentTo optimize the power consumption of agiven design, there are certain things thatyou can do independent of the design con-tained in the FPGA. Knowing the environ-ment is therefore important.

TemperatureControlling temperature can help reducestatic power. A reduction in junction tem-perature from 100 °C to 85 °C will reducestatic power by ~ 20% (as shown earlier in

tion in leakage and static power is about 2to 1 between worst-case and typical process.

Leakage and static power are also influ-enced strongly by core voltage, VCCINT,with variations that go approximately asthe square and cube, respectively, ofVCCINT. Static power shows a ~15%increase, with only a 5% increase inVCCINT. Leakage is very strongly influ-enced by junction (or die) temperature, TJ.

Because each of these factors – process,voltage, and temperature – have a strongeffect on leakage and static power of theFPGA, it is important for you to under-stand them and how they might influencetotal power consumption of the FPGA orASIC. Gate-to-substrate leakage is alsopart of total leakage, but is not highlytemperature-dependent.

Figure 1 shows the variation in transistorleakage and static power in 90 nm FPGAsdue to process, voltage, and temperature.

Seeing the increasing transistor leakagewhen moving toward a high-performance90 nm FPGA, Xilinx IC designers chose toadopt the use of a third gate-oxide thick-ness in the transistors of the newest XilinxVirtex™-4 FPGAs. In previous FPGAsand ASICs, only two oxide (dual-oxide)thicknesses exist: a thin oxide for core tran-sistors and a thick oxide for I/O transistors.The use of a third middle thickness ofoxide (triple-oxide) and higher VT in a por-tion of the transistors of Virtex-4 FPGAsallows for a dramatic reduction in overallleakage and static power compared to othercompetitive FPGAs, which do not usetriple oxide. So although variations withprocess, voltage, and temperature are still

0Faster Slower

0.6Faster -5 0 +5 Slower

0-40 -20 20 40 60 80 100 120 140

Transistor Speed Variationwith Process

% Change in VCCINT Junction Temp oC

Typical

Leakage CurrentStatic

ICCINTQ = IS-->D + IGATE

Source Drain

I S-->

VCCINT Variation Dynamic Power from Nominal Change

-10.0% -19.0%

-5.0% -9.8%

0.0% 0.0%

5.0% 10.3%

10.0% 21.0%

Figure 1 – Leakage and static power variations with process, voltage, and temperature

Figure 2 – Variation in dynamic power due to core voltage variation

Figure 2). Even though the static power ofthe Virtex-4 FPGA is already low comparedto other 90 nm FPGAs, reducing it byanother 20% is valuable, since in somedesigns static power of the FPGA representsa sizeable portion (30-40%) of the totalpower budget. The reduction in junctiontemperature can be achieved by increasedairflow and larger heat sinks, which willtransfer heat away from the FPGA, reduc-ing junction temperature. The reduction injunction temperature also has the addedbenefit of increasing reliability.

VoltageKeeping core voltage at or below nominalwill reduce static and dynamic power. Staticand dynamic power consumed at the corevoltage, VCCINT, is often the largest powerconsumer in the FPGA. FPGAs are usuallyspecified to be able to run and meet per-formance with power supply voltage within+/- 5% of nominal. Figure 2 shows that a +/-5% variation in VCCINT causes a ±15% and

±10% variation in static and dynamic power,respectively. To the extent that the VCCINT

power supply can be specified more tightly,you can also set it to be at or even slightlybelow nominal rather than being able to havea worst case that is 5% above nominal.

Lowering Design Power by Changing the DesignTo make FPGA design-related tradeoffs inpower consumption, it is important toknow where you should start. Xilinx hasseveral design tools that can help you getan early or detailed estimate of designpower consumption.

Based on your estimates of design size(logic and flip-flops), operating frequency,toggle rates, embedded block utilization,and environment conditions such as tem-

perature, the Xilinx Web Powertool shown in Figure 3 allowsinitial estimates to be made onthe power consumption of agiven design. It does not rely ondetailed information about thedesign such as exact routing,placement, and utilization.

The XPower tool includedwith all configurations of ISEsoftware is a detailed poweranalysis program that allowsyou to input stimulus vectorsfor the design. Along with theinformation from the actualrouted and placed design, thetool calculates power con-sumption much more accu-rately, as shown in Figure 4.

FPGA Design Techniques to Reduce PowerSeveral design-specific tech-niques can also reduce powerconsumption. These includeconstraining logic to a smallarea where possible, settingsynthesis flags to minimizearea, and minimizing layers of

logic. Pipelining is also a good techniquebecause it allows a higher timing constraintto be set, which allows reduction of capaci-tance and thus dynamic power reduction.

Setting Placement and Timing ConstraintsFloorplanning the design properly reducesdynamic power. ISE Floorplanner allowsyou to create placement constraints. Amore sophisticated floorplanning tool fromXilinx called the PlanAhead™ design toolallows you to observe hierarchy in a designand group sets of related logic into smallareas. Using the placement and groupingconstraints reduces the physical area, allow-ing you to achieve higher performancewhile minimizing routing capacitance andreducing dynamic power consumption.

Other techniques include constrainingtiming in a design. Synthesis tools as well asISE routing and placement tools allow youto input timing constraints. If you raise thetarget timing constraint – especially theclock target – the router will try harder tomeet it through more aggressive placementand routing efforts. The net effect is to min-imize routes, which reduces routing power.

Other Partial Shutdown orReprogramming MethodsOther commonly used design techniquescan help reduce dynamic power consump-tion. One of these techniques is to useclock multiplexing. This entails turning offsections of the FPGA. A hardware featurein the Virtex-4 FPGA and its predecessorsis a clock gating block, which allows asmooth way to turn off or on a global clocknet. Better than clock enables on flip-flops,this method allows the entire large togglingclock net to be gated off, which saves poweron the net and on the flip-flops.

Some types of designs, especially thoseused in battery-type applications, only needto consume power at certain times. In theseapplications, you can turn off clocks andlower the core voltage to the minimum

Figure 3 – Xilinx Web-based power estimation tools

Figure 4 – ISE XPower power estimation tool

Xilinx has several design tools that can help you get an early or detailed estimate of design power consumption.

level that still allows FPGA data retention.In the Virtex-4 device, this is 0.9V. At thislevel, static power is reduced by greaterthan 60% when compared with the nomi-nal 1.2V level.

A way to shrink the FPGA size requiredfor a given set of tasks is to use the dynam-ic reconfigurability port, or DRP, which isavailable in Virtex-4 and other XilinxFPGAs. If there are several functions thatdon’t need to coexist, this port allowsreloading only a portion of the FPGA. Indoing so, you can choose a much smallerFPGA, which reduces static power.

Use of Embedded BlocksAnother method of reducing power con-sumption is through the use of embeddedblocks. Although it is more work in somecases to instantiate special blocks, Virtex-4FPGAs have a number of pieces of hard IP,which are essentially ASIC gates. Some ofthese functions are in fact automaticallysynthesized by some of the modern synthe-sis tools. These new blocks have between5x and 20x lower power than programma-ble logic and programmable interconnectimplementations. The embedded blocksreduce static power by not having extratransistors (as in programmable logic) andby not using programmable interconnecttransistors. They reduce dynamic power byusing only metal interconnects versus metaland programmable interconnects; reducingtrace lengths; reducing extra node capaci-tance because of lack of pass transistors;and minimizing layers of logic.

Some of these blocks include the following:

• PowerPC™ – embedded high-per-formance processor

• DSP – XtremeDSP™ slice, a high-performance sophisticated multi-function arithmetic and logic block

• SSIO – New ChipSync™ block avail-able in every I/O pin reduces logic cellcounts for source-synchronous I/Odesigns

• Embedded Ethernet MAC(s) – noneed to use logic and interconnect forMAC functions

• FIFO – SmartRAM memory includesbuilt-in FIFO controllers

• SRL16 – Allows multiple cascade flip-flops to be used without programmableinterconnects

Power-Based Routing OptimizationAnother exciting tool available in the 8.1irelease of ISE software allows the optimiza-tion of a design for power consumption.

This initial release allows automatic capac-itance minimization without the need foryou to enter faster timing constraints.

Figure 5 shows testing results of thisnew interconnect capacitance reduction. Itis estimated that interconnect capacitanceis approximately two-thirds of the dynamicpower, so this improvement can be valuableto reduce dynamic power.

Future releases of ISE software will offermore power-optimization enhancements,including power-optimized synthesis andpower-optimized placement.

ConclusionIt is very important to know the systempower budget and operating environment.Understanding where various forms of powerconsumption come from allows you to adjustthe FPGA environment and design character-istics to minimize power consumption andsuccessfully meet a given power budget.

FPGA Family Average Maximum % Reduction Capacitance

Capacitance Reduction

Spartan-3 11% 21 %

Virtex-4 6 % 9.5 %

Virtex-II Pro 13 % 18 %

Figure 5 – Interconnect capacitance reductionusing power-based routing optimization

in ISE 8.1i software

by David W. BlevinsStaff Software Marketing EngineerXilinx, Inc.david.blevins@xilinx.com

Xilinx® device architectures include severalconfigurable functional blocks – includingclocking, digital signal processing, andhigh-speed I/O blocks – that provide youwith advanced functionality. Typically,these blocks are fairly complex in theiroperation and yet very flexible, so parame-terizing them for the desired behavior canbe a daunting and time-consuming task ifdone by hand.

The architecture wizards found in theISE™ Foundation™ design environmentcan streamline the process of customizingsuch blocks. There are several wizards avail-able; each one leads you through asequence of screens that allows you to pre-cisely define the behavior of the block athand. Each screen has built-in design rulechecking that ensures that the result will be“correct by construction” – that is, that thecombination of selections made is a legalconfiguration of the target block.

After selecting “Finish” on the finalscreen, the wizard creates a customizedblock definition file (filename.xaw). ISEFoundation software then translates thatdefinition into a HDL description of theblock and creates a corresponding HDLinstantiation template that you can thenuse in your design.

Using the ISE Foundation Architecture WizardsUsing the ISE Foundation Architecture Wizards

Streamline the process of configuring and instantiating the complex blocks found in Xilinx devices.Streamline the process of configuring and instantiating the complex blocks found in Xilinx devices.

The Clocking WizardAs an example, let’s use the clocking wizardto create a digital clocking manager modulefor a Virtex™-4 FPGA, which will be driv-en by a clock source external to the device. Itwill generate a main clock signal, as well as afrequency-doubled clock that drives anenabled buffer. A LOCKED signal will indi-cate when the DCM clock signal has stabi-lized after the FPGA is powered up or reset.

Examining the resulting HDL code andinstantiation template generated by ISEFoundation software after the wizard fin-ishes clearly shows the benefit of using anarchitecture wizard for this type of task.

Starting the Clocking WizardFrom within ISE’s Project Navigator, we startthe clocking wizard by selecting “Add NewSource” and selecting the “IP (CoreGen andArchitecture Wizard)” source type. Theresulting dialog will let us select the “SingleDCM ADV v8.1i” variant (see Figure 1).

Pressing “Next” and “Finish” on thesubsequent dialog boxes invokes the clock-ing wizard.

General Setup ScreenOn the first clocking wizard screen, weselect the desired inputs and outputs of theblock, the clock source, phase shift type,frequency, and feedback configuration(Figure 2).

Using the “Add Buffer” button on thenext screen, we place an enabled buffer atthe block’s clock input (Figure 4).

Summary ScreenAfter creating the buffer by pressing the“OK” button, we click the “Next” button

Clock Buffers ScreensSelecting “Next” takes us to the clockbuffers screen. By default, all clock outputsdrive global buffers, but for our design, weneed to place an enabled buffer after thefrequency-doubled clock output. To dothis, we select “Customize Buffers” andthen click the “Global Buffer” button nextto CLK2X to change its global buffer to anenabled buffer (Figure 3).

Figure 1 -Select IP dialog box

Figure 2 – General Setup screen

Figure 3 – Clock Buffers screen

Figure 4 – View/Edit Buffer screen

Figure 5 – Summary screen

Examining the resulting HDL code and instantiation template generated by ISE Foundation software after the wizard finishes clearly shows the

benefit of using an architecture wizard for this type of task.

on the clock buffers screen to advance tothe summary page, which provides adetailed report on the block that will begenerated (Figure 5).

Selecting the “Finish” button creates abinary filename.xaw source file in the ISEFoundation project directory. The .xawwill automatically be converted by ISEFoundation software to the correspondingHDL description required for synthesiswhen you implement your design, butyou can view that HDL at any time afterthe .xaw file has been created. EitherVHDL or Verilog is available; VHDL isshown in our example.

Here is where you can really see the truevalue of the architecture wizard – imaginehaving to write this code by hand:

— Module dcm1

— Generated by Xilinx Architecture Wizard

— Written for synthesis tool: XST

library ieee;

use ieee.std_logic_1164.ALL;

use ieee.numeric_std.ALL;

— synopsys translate_off

library UNISIM;

use UNISIM.Vcomponents.ALL;

— synopsys translate_on

entity dcm1 is

port ( CLKIN_ENABLE_IN : in std_logic;

CLKIN_IN : in std_logic;

RST_IN : in std_logic;

CLKIN_IBUFG_OUT : out std_logic;

CLKIN_OUT : out std_logic;

CLK0_OUT : out std_logic;

LOCKED_OUT : out std_logic);

end dcm1;

architecture BEHAVIORAL of dcm1 is

signal CLKFB_IN : std_logic;

signal CLKIN_IBUFG : std_logic;

signal CLK0_BUF : std_logic;

signal GND1 : std_logic_vector (6 downto 0);

signal GND2 : std_logic_vector (15 downto 0);

signal GND3 : std_logic;

component BUFGCE

port ( I : in std_logic;

CE : in std_logic;

O : out std_logic);

end component;

component IBUFG

O : out std_logic);

end component;

component BUFG

O : out std_logic);

end component;

component DCM_ADV

generic( CLK_FEEDBACK : string := “1X”;

CLKDV_DIVIDE : real := 2.000000;

CLKFX_DIVIDE : integer := 1;

CLKFX_MULTIPLY : integer := 4;

CLKIN_DIVIDE_BY_2 : boolean := FALSE;

CLKIN_PERIOD : real := 10.000000;

CLKOUT_PHASE_SHIFT : string := “NONE”;

DCM_AUTOCALIBRATION : boolean := TRUE;

DCM_PERFORMANCE_MODE : string :=

“MAX_SPEED”;

DESKEW_ADJUST : string := “SYSTEM_SYNCHRO-

NOUS”;

DFS_FREQUENCY_MODE : string := “LOW”;

DLL_FREQUENCY_MODE : string := “LOW”;

DUTY_CYCLE_CORRECTION : boolean := TRUE;

FACTORY_JF : bit_vector := x”F0F0”;

PHASE_SHIFT : integer := 0;

STARTUP_WAIT : boolean := FALSE);

port ( CLKIN : in std_logic;

CLKFB : in std_logic;

DADDR : in std_logic_vector (6 downto 0);

DI : in std_logic_vector (15 downto 0);

DWE : in std_logic;

DEN : in std_logic;

DCLK : in std_logic;

RST : in std_logic;

PSEN : in std_logic;

PSINCDEC : in std_logic;

PSCLK : in std_logic;

CLK0 : out std_logic;

CLKDV : out std_logic;

CLK2X : out std_logic;

CLK2X180 : out std_logic;

CLKFX : out std_logic;

CLKFX180 : out std_logic;

DRDY : out std_logic;

DO : out std_logic_vector (15 downto 0);

LOCKED : out std_logic;

PSDONE : out std_logic);

end component;

GND1(6 downto 0) <= “0000000”;

GND2(15 downto 0) <= “0000000000000000”;

GND3 <= ‘0’;

CLK0_OUT <= CLKFB_IN;

CLKIN_BUFGCE_INST : BUFGCE

port map (CE=>CLKIN_ENABLE_IN,

I=>CLKIN_IBUFG,

O=>CLKIN_OUT);

CLKIN_IBUFG_INST : IBUFG

port map (I=>CLKIN_IN,

O=>CLKIN_IBUFG);

CLK0_BUFG_INST : BUFG

port map (I=>CLK0_BUF,

O=>CLKFB_IN);

DCM_ADV_INST : DCM_ADV

generic map( CLK_FEEDBACK => “1X”,

CLKDV_DIVIDE => 2.000000,

CLKFX_DIVIDE => 1,

CLKFX_MULTIPLY => 4,

CLKIN_DIVIDE_BY_2 => FALSE,

CLKIN_PERIOD => 10.000000,

CLKOUT_PHASE_SHIFT => “NONE”,

DCM_AUTOCALIBRATION => TRUE,

DCM_PERFORMANCE_MODE => “MAX_SPEED”,

DESKEW_ADJUST => “SYSTEM_SYNCHRONOUS”,

DFS_FREQUENCY_MODE => “LOW”,

DLL_FREQUENCY_MODE => “LOW”,

DUTY_CYCLE_CORRECTION => TRUE,

FACTORY_JF => x”F0F0”,

PHASE_SHIFT => 0,

STARTUP_WAIT => FALSE)

port map (CLKFB=>CLKFB_IN,

CLKIN=>CLKIN_IBUFG,

DADDR(6 downto 0)=>GND1(6 downto 0),

DCLK=>GND3,

DEN=>GND3,

DI(15 downto 0)=>GND2(15 downto 0),

DWE=>GND3,

PSCLK=>GND3,

PSEN=>GND3,

PSINCDEC=>GND3,

RST=>RST_IN,

CLKDV=>open,

CLKFX=>open,

CLKFX180=>open,

CLK0=>CLK0_BUF,

CLK2X=>open,

CLK2X180=>open,

CLK90=>open,

CLK180=>open,

CLK270=>open,

DO=>open,

DRDY=>open,

LOCKED=>LOCKED_OUT,

PSDONE=>open);

end BEHAVIORAL;

The preceding HDL source codedescribes the DCM module to the synthe-sis tool that we are using, but we will stillneed to create an instance of the module inour design. This is achieved with an“instantiation template,” which is createdby using the “View HDL InstantiationTemplate” process in Project Navigator.The resulting code snippet can be insertedinto our design to create an instance of theDCM module:

COMPONENT dcm1

CLKIN_ENABLE_IN : IN std_logic;

CLKIN_IN : IN std_logic;

RST_IN : IN std_logic;

CLKIN_IBUFG_OUT : OUT std_logic;

CLKIN_OUT : OUT std_logic;

CLK0_OUT : OUT std_logic;

LOCKED_OUT : OUT std_logic

END COMPONENT;

Inst_dcm1: dcm1 PORT MAP(

CLKIN_ENABLE_IN => ,

CLKIN_IN => ,

RST_IN => ,

CLKIN_IBUFG_OUT => ,

CLKIN_OUT => ,

CLK0_OUT => ,

LOCKED_OUT =>

Supported Block ConfigurationsThe above example illustrates only a sin-gle configuration of one type of blockthat the architecture wizards support.Other supported blocks and configura-tions include:

Clocking WizardThe digital clock management moduleprovides you with extensive control overthe global clocking configuration(s) in yourdesign and features the choice of either anexternal or internal clock source, clockdeskew, phase shift control, and frequencysynthesis.

The clocking wizard can create severalDCM configurations using one or moreDCM modules:

• Single DCM

• Single DCM_ADV

• Clock forwarding/board deskew

• Board deskew with an internal deskew

• Clock switching with two DCMs

• Cascading in series with two DCMs

• PMCD (Virtex-4 FPGAs)

Rocket IO WizardThe Xilinx Rocket IO™ wizard configuresthe multi-gigabit transceiver block inVirtex-II Pro and Virtex-II Pro X devices(Virtex-4 Rocket IO support is found inCORE Generator™ software). For certainprotocols, it also allows the configurationof multiple channels of transceivers.

For Virtex-II Pro devices, the followingRocket IO protocols are available throughthe wizard:

• Fibre Channel

• Gigabit Ethernet

• XAUI

• Infiniband

• Aurora

• PCI Express

• SONET OC-48 (Virtex-II Pro Xdevices)

• SONET OC-192 (Virtex-II Pro Xdevices)

• Custom user-defined configurations

Xtreme DSP Wizard (Virtex-4 Devices)This wizard can create several differenttypes of DSP blocks using the Virtex-4Xtreme DSP™ slice, including:

• Multiplier

• Accumulator

• Multiplier-accumulator (MAC)

• Adder/subtractor

ChipSync Wizard (Virtex-4 Devices)Two basic types of ChipSync configura-tions can be created:

• The memory applications mode config-ures a block of I/O for memory applica-tion usage. The wizard allows you to setup the data bus and clocks/strobes,including specifying delay information.You can also configure the address bus,reference clocks, and control signals.

• The non-memory applications modeconfigures a block of I/O for non-memory application usage, such as fornetworking cases. The wizard allowsyou to set up the data bus and clocks,including specifying delay information.You can also configure reference clocksand control signals.

ConclusionAs shown in this article, the architecturewizards act as intelligent assistants thatfacilitate easy creation of customizedinstances of the various built-in complexblocks found in Xilinx FPGAs.

The architecture wizards augment otherXilinx IP block creation tools such asPlatform Studio, CORE Generator soft-ware, and System Generator for DSP – allof which help you to get your product tomarket as quickly as possible.

The architecture wizards augment other

Xilinx IP block creation tools such as

Platform Studio, CORE Generator software,

and System Generator for DSP – all of

which help you to get your product to

market as quickly as possible.

A series of compelling, highly technical product demonstrations, presented

by Xilinx experts, is now available on-line. These comprehensive videos

provide excellent, step-by-step tutorials and quick refreshers on a wide array

of key topics. The videos are segmented into short chapters to respect your

time and make for easy viewing.

Ready for viewing, anytime you are

Offering live demonstrations of powerful tools, the videos enable you to

achieve complex design requirements and save time. A complete on-line

archive is easily accessible at your fingertips. Also, a free DVD containing all

the video demos is available at www.xilinx.com/dod. Order yours today!

©2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

FREE on-line training Demos On Demand

Pb-free devicesavailable now

www.xilinx.com/dod

by Cindy KaoMarketing SpecialistXilinx, Inc.cindy.kao@xilinx.com

Partial reconfiguration offers countless bene-fits across multiple industries. It can be animportant component to any design orapplication – allowing designers more capa-bilities and resources than meets the eye.

Partial reconfiguration is the ability toreconfigure select areas of an FPGA any-time after its initial configuration. You cando this while the design is operational andthe device is active (known as active partialreconfiguration) or when the device is inac-tive in shutdown mode (known as staticpartial reconfiguration).

By taking advantage of partial reconfig-uration, you gain the ability to:

• Adapt hardware algorithms

• Share hardware between various applications

• Increase resource utilization

• Provide continuous hardware servicing

• Upgrade hardware remotely

Xilinx has supported partial reconfigura-tion for many generations of devices. AllXilinx® FPGAs support partial reconfigura-tion, from Virtex™-4 devices to our lowestcost FPGAs, the Spartan™-3/E family.

Benefits of Partial ReconfigurationBenefits of Partial Reconfiguration

Take advantage of even more capabilities in your FPGA.Take advantage of even more capabilities in your FPGA.

PART IAL RECONFIGURATION

How Partial Reconfiguration Works Xilinx supports two basic styles of partialreconfiguration: module-based and differ-ence-based. Module-based partial reconfigu-ration uses modular design concepts toreconfigure large blocks of logic. The dis-tinct portions of the design to be reconfig-ured are known as reconfigurable modules.Because specific properties and specific lay-out criteria must be met with respect to areconfigurable module, any FPGA designintending to use partial reconfigurationmust be planned and laid out with thatin mind. These properties and layout cri-teria are outlined in the XilinxDevelopment System Reference Guide,which can be found at http://toolbox.xilinx.com/docsan/xilinx7/books/docs/dev/dev.pdf.

Difference-based partial reconfigura-tion is a method of making smallchanges in an FPGA design, such aschanging I/O standards, LUT equa-tions, and block RAM content. Thereare two supported ways to make suchdesign changes: at the front end or theback end. Front-end changes can beHDL or schematic. You must re-syn-thesize and re-implement the design,creating a new placed and routed nativecircuit description (NCD) file. Back-end modifications are made in FPGAEditor, a GUI tool within ISE™ soft-ware used to view/edit device layoutand routing.

You can also perform partial recon-figuration by implementing a basic con-troller to manage the reconfiguration ofan FPGA. This could be in the form ofan embedded or external processor.

Xilinx offers a suite of processorsolutions. The PicoBlaze™ andMicroBlaze™ soft-core processors bothsupport the Spartan and Virtex families.The Virtex-II Pro FPGA embodies thehard-processor solution with the inte-gration of an IBM PowerPC™ 405 32-bit RISC processor into the FPGA.Now, with the introduction of theVirtex-4 FX platform FPGA, Xilinx hasincreased processing power by intro-ducing two PowerPC 405 processorcores in a single device (see Table 1).

advantage of its capabilities. Partiallyreconfigurable devices have benefitedthe Joint Tactical Radio System (JTRS)Program, which I’ll discuss more in thefollowing section.

• Increased system performance. Althougha portion of the design is being reconfig-ured, the rest of the system can continueto operate. There is no loss of perform-ance or functionality with unaffectedportions of a design – no down time. It

also allows for multiple applica-tions on a single FPGA.

• The ability to change hard-ware. Xilinx FPGAs can beupdated at any time, locally or remotely. Partial reconfigu-ration allows you to easilysupport, service, and updatehardware in the field.

• Hardware sharing. Becausepartial reconfiguration allowsyou to run multiple applica-tions on a single FPGA, hard-ware sharing is realized.Benefits include reduceddevice count, reduced powerconsumption, smaller boards,and overall lower costs.

• Shorter reconfiguration times.Configuration time is directlyproportional to the size of theconfiguration bitstream.Partial reconfiguration allowsyou to make small modifica-tions without having toreconfigure the entire device.By changing only portions ofthe bitstream – as opposed toreconfiguring the entiredevice – the total reconfigura-tion time is shorter.

ApplicationsPartial reconfiguration is the cor-nerstone for power-efficient, cost-effective software-defined radios(SDRs). Through the JTRSProgram, SDRs are becoming areality for the defense industries asan effective and necessary tool for

For more information, visit ProcessorCentral at www.xilinx.com/products/design_resources/proc_central/index.htm.

BenefitsThere are many benefits that come withpartially reconfigurable devices:

• Applications. Partial reconfiguration isuseful in a variety of applications acrossmany industries. The aerospace anddefense industries have certainly taken

DSP(Option

al)GPPFPGA

A/DD/A

Waveform Control

DSP(Option

al)GPPFPGA

A/DD/A

Waveform Control

DSPDSP GPPGPPFPGAFPGAA/DD/AA/DD/A

GPP Functions• SCA CF• CORBA• RTOS• DSP

FPGA Functions• Channelization• DSP

DSP Functions• Mod/Demod• FEC

Waveform Control

Waveform C

Waveform AWaveform B

Dedicated Resources Model

Processor Type of Processor Supported Devices

PicoBlaze Soft-Processor Core Virtex FamilySpartan Family

CoolRunner-II CPLDs

MicroBlaze Soft-Processor Core Virtex FamilySpartan Family

PowerPC Hard Processor Virtex-II ProVirtex-4 FX

FPGAFPGA

GPPGPPA/DsD/AsA/DsD/As

GPP Functions• SCA CF• CORBA/RTOS• DSP

FPGA Functions• Channelization• DSP

Waveform

aveform C

Waveform Control

Shared Resources Model

Table 1 – Xilinx processor solutions and supported devices

Figure 1 – Dedicated resources model: three-channel SDR modem

Figure 2 – Shared resources model: three-channel SDR modem

communication. SDRs satisfy the JTRSstandard by having both a software-repro-grammable operating environment and theability to support multiple channels and net-works simultaneously.

Figure 1 shows a three-channel SDRmodem supporting a SoftwareCommunications Architecture CoreFramework (SCA CF), as mandated forJTRS. Current implementations of SCA-enabled SDR modems with multiple chan-nels require multiple sets of processingresources and a dedicated set of hardwarefor each channel. The more channels SDRmust support, the more dedicated resourcesneeded. This aversely affects space, weight,power consumption, and cost.

With partial reconfiguration, the ability

to implement an SDR modem usingshared resources is realized, as shown inFigure 2. A shared resources model enabledby partial reconfiguration of an FPGA tosupport multiple waveforms can be sup-ported by the SCA as mandated by JTRS.FPGA implementations of SDR, with par-tial reconfiguration, results in effective useof resources, lower power consumption,and extensive cost savings.

Partial reconfiguration can also be usedin many other applications. Another exam-ple is in mitigation and recovery from sin-gle-event upsets (SEU). In-orbit,space-based, and extra-terrestrial applica-tions have a high probability of experiencingSEUs. By performing partial reconfigura-tion, in conjunction with readback, a system

can detect and repair SEUs in the configura-tion memory without disrupting its opera-tions or completely reconfiguring the FPGA.(Readback is the process of reading the inter-nal configuration memory data to verify thatcurrent configuration data is correct.)

ConclusionThe capabilities and benefits offered by par-tial reconfiguration reach across manyindustries and applications. Leverage partialreconfiguration by using Xilinx FPGAs inyour next design, the only truly partiallyreconfigurable devices. You can take advan-tage of any of its benefits, from remote hard-ware upgrading to on-chip hardwaresharing, and give your designs the reconfig-urability advantage.

As a developer, you are concerned with manyissues—project deadlines, code quality and integrated tool support, to name a few. And nowthat you’ve selected an FPGA-based processor topower your next application, what do you do?Where can you find a rich development environment that supports both the XilinxMicroBlaze™ and PowerPC™ processors, individually and simultaneously? The Nucleus®

EDGE software, based on Eclipse, answers thisneed by providing a comprehensive, fully developed tool suite for both MicroBlaze andimmersed PowerPC developers alike.

The Nucleus EDGE software consists of an IDE,compiler, debugger and system profiler — allseamlessly integrated so that FPGA developerscan create their product from conception todeployment in one complete environment.

NucleusEDGERich Development Suite for FPGA

www.AcceleratedTechnology.com/xilinx

Nucleus. Embedded made easy.

by Nij DorairajSenior Staff EngineerXilinx, Inc.nij.dorairaj@xilinx.com

Eric ShifletTechnical Marketing EngineerXilinx, Inc.eric.shiflet@xilinx.com

Mark GoosmanProduct Marketing ManagerXilinx, Inc.mark.goosman@xilinx.com

The architecture of the Xilinx® Virtex™platform family of FPGAs allows designmodules to be swapped on-the-fly using apartial reconfiguration (PR) methodology.This powerful capability allows multipledesign modules to time-share resources on asingle device, even while the base designoperates uninterrupted.

Partial reconfiguration is a process ofdevice configuration that allows a limited,predefined portion of an FPGA to bereconfigured while the remainder of thedevice continues to operate. This is espe-cially valuable where devices operate in amission-critical environment that can notbe disrupted while some subsystems arebeing redefined.

Using partial reconfiguration, you candramatically increase the functionality of asingle FPGA, allowing for fewer, smallerdevices than would otherwise be needed.Important applications for this technologyinclude reconfigurable communicationand cryptographic systems.

Virtex Configuration BackgroundPartial reconfiguration is supported inboth the Virtex and Spartan™ families.In this article, we will focus on imple-menting the partial reconfiguration

methodology as it applies to Virtex-II andVirtex-II Pro FPGAs.

All user-programmable features inside aVirtex FPGA are controlled by memory cellsthat are volatile and must be configured onpower-up. These memory cells are known asthe configuration memory, and define thelook-up table (LUT) equations, signal rout-ing, input/output block (IOB) voltage stan-dards, and all other aspects of the design.

To program configuration memory,instructions for the configuration controllogic and data for the configuration memoryare provided in the form of a bitstream,which is delivered to the device through theJTAG, SelectMAP, serial, or ICAP configura-tion interface.

A programmed Virtex FPGA can be par-tially reconfigured using a partial bitstream.You can use partial reconfiguration to changethe structure of one part of an FPGA designas the rest of the device continues to operate.

PlanAhead Software as a Platform for Partial Reconfiguration

PlanAhead software delivers a streamlined environment to reduce space, weight, power, and cost.

Use CasesPartial reconfiguration is useful for systemswith multiple functions that can time-sharethe same FPGA device resources. In suchsystems, one section of the FPGA continuesto operate, while other sections of the FPGAare disabled and partially reconfigured toprovide new functionality. This is analogousto the situation shown in Figure 1, where amicroprocessor manages context switchingbetween software processes. Except in thecase of partial reconfiguration of an FPGA,it is the hardware – not the software – that isbeing switched.

Partial reconfiguration provides an advan-tage over multiple full bitstreams in applica-tions that require continuous operation nototherwise accessible during full reconfigura-tion. One example, illustrated in Figure 2, isa graphics display that utilizes horizontal andvertical synchronization. Because of the envi-ronment in which this application operates,

• Place bus macros

• Follow PR-specific design rules

• Run the partial reconfiguration imple-mentation flow

PlanAhead™ software provides a singleenvironment (or platform) to manage thepreceding guidelines.

Here is a list of the steps in which youcan use PlanAhead design tools to imple-ment a partial reconfiguration design:

• Netlist import

• Floorplanning for partial reconfiguration

• Design rule checks

• Netlist export

• Implementation flow management

• Bitstream size estimation

Netlist ImportPlanAhead software works with any syn-thesized netlist, such as XST or Synplify.You can import any hierarchical netlist(single edf/ngc or multiple edf/ngc files).Follow the regular guidelines to import thedesign into PlanAhead design tools andcreate a floorplan for the design.

Floorplanning for Partial ReconfigurationThis is an important step in the partialreconfiguration flow. Floorplanning withinPlanAhead software is based on design par-titions referred to as physical blocks, orPblocks. A Pblock can have an area (such asa rectangle) defined on the FPGA device toconstrain the logic. You can define Pblockswithout rectangles and ISE™ software willattempt to group the logic during place-ment. Netlist logic placed inside of Pblockswill receive AREA_GROUP constraints.

The key floorplanning tasks for partialreconfiguration are:

1. Setting up a PRM:

• Assign an area for the PRM by creatinga Pblock with an area defined withinthe fabric

• Assign RANGES for the Pblock

signals from radio and video links need to bepreserved – but the format and data process-ing format require updates and changes dur-ing operation. With partial reconfiguration,the system can maintain these real-time linkswhile other modules within the FPGA arechanged on the fly.

MethodologyTo implement a partial reconfigurationdesign successfully, you have to follow astrict design methodology. Here are someof the guidelines to follow:

• Insert bus macros between modules thatneed to be swapped out (called partialreconfiguration modules, or PRMs) andthe rest of the design (static logic)

• Follow synthesis guidelines to generatea partially reconfigurable netlist

• Floorplan the PRMs and cluster allstatic modules

Processor Context Switch

FPGA Configuration Switch

Process 1 Partial Bitstream A

Process 2 Partial Bitstream B

Process N Partial Bitstream N

Radio Link

Bus LinkFPGA

Monitor

Figure 1 – The analogy between microprocessor context switching and FPGA partial reconfiguration regions

Figure 2 – Maintaining real-time links during partial reconfiguration

• Define the MODE=RECONFIGattribute on the PRM

(Pblock property->attribute section)

2. Setting up static logic:

• Every top-level module, other thenPRMs, should be grouped together ina single Pblock. This is called a staticlogic block. This block should not have

a RANGE defined; this will cluster thestatic logic together in a single Pblock.Select all top-level modules (except thePRM) and assign it to a Pblock. Figure3 shows the static logic grouped in aPblock named AG_base.

• When you have completed the floor-planning in PlanAhead design tools,the resulting physical hierarchy will beorganized as shown in Figure 4.

3. Bus macro placement:

• Bus macros are physical ports that con-nect a PRM to static logic. Any con-nection from a PRM to static logicshould always go through a bus macro.Bus macros are instantiated as blackboxes in RTL and are filled with a pre-defined routing macro in the form ofan .nmc file. Bus macros are placed onthe PRM boundary. Static logic con-nected to PRMs will migrate towardsthe bus macro during placement.

Design Rule ChecksGiven the complexity of the flow, it iscommon for mistakes to be introduced inthe original RTL and during the floor-planning process. PlanAhead-PR designtools check for design violations. Alsointegrated into this feature is the PR-Advisor, which provides feedback on howto improve your design.

The PR design rule checks are:

1. Bus macro DRC. This provides verifica-tion for all design rules related to busmacro connectivity and placement. Oneexample of a bus macro DRC is thePRBP check. This DRC checks for allrules that should be followed for busmacro placement. Figure 5 shows anexample of a design that failed thePRBP DRC. In this case, the inter-leaved/nested macro should be placed atSLICE_X41Y*.

2. Floorplanning DRC. This covers floor-planning rules. Clock objects (globalclock buffers, DCM) and I/Os should beplaced and static logic clustered.

3. Glitching logic DRC. This verifies glitch-ing logic elements (SRL and distributedRAM) above and below PRM regions.

4. Timing Advisor/DRC. This provides acheck for timing-related issues. Oneexample of timing DRC is the PRTPcheck. The static module is implement-ed before the PRM during the imple-mentation phase. Regular timingconstraints do not cover the paths thatcross between the static and a PRM.This does not present a problem, pro-vided that the bus macro is synchro-nous. However, if it is an asynchronousbus macro, the static module does notknow about the propagating of asyn-chronous paths, as shown in Figure 6.This could be important if these pathsare timing-critical. One way to pass thisinformation to the static module is tospecify a TPSYNC constraint on the busmacro output net. PlanAhead softwarewill recommend a TPSYNC constraintthat can be added to the .UCF file.

PRM1.... Static I/O Buffers, Clock

Logic, and Bus

Figure 3 – All static logic is grouped in a single Pblock AG_base.

Figure 4 – The hierarchy for a partial reconfiguration design

Figure 5 – The PRBP DRC verifies all rules that should be followed for bus macro placement.

Netlist ExportOnce the design is floorplanned and passesthe DRC checker, it is ready to be export-ed. PlanAhead design tools take care ofexporting the original hierarchical netlist

into a PR-style netlist that has a specificformat (static and PRM in separate direc-tories). The export directory will appear asshown in Figure 7.

Implementation Flow ManagementThe partial reconfiguration flow wizard,shown in Figure 8, runs the partial recon-figuration implementation on the export-ed design. It will produce a full bitstreamfor the complete design and a partial bit-stream for each of the PRMs. The imple-mentation steps are:

• Initial budgeting

• Static module implementation

• PRM implementation (one implemen-tation for each version of every PRM)

• Assembly and bitstream generation(results are stored in the merge directory)

To run the tools, start the flow wizardby selecting “Tools > Run PartialReconfig.” This wizard will guide youthrough each of the implementation steps.You can either run the implementation orjust generate the scripts, which can be usedfrom a command line outside of thePlanAhead user interface.

Bitstream Size EstimationThe Pblock statistics report includes a sec-tion that reports PRM bitstream size (Figure

9). This information can be used for estimat-ing the size of configuration memory storagesuch as external flash and DDR. This infor-mation can also be used to calculate howlong it will take to swap the module based onyour bitstream memory interface.

ConclusionPlanAhead software is the first graphicalenvironment for partial reconfiguration.Using PlanAhead design tools as a platformfor partial reconfiguration applications cangreatly simplify the complexities of jug-gling the dynamic operating environmentof these cutting-edge applications, allowinga single device to operate in applicationsthat previously required multiple FPGAs.

The methodology offered by PlanAheadsoftware can dramatically increase produc-tivity and decrease time-to-solution fordesigners using partial reconfiguration.

The capability of designs to leverage par-tial reconfiguration opens doors to a wholehost of applications. By providing a plat-form that leverages the advantages of partialreconfiguration for designs targeting Virtex-II and Virtex-II Pro FPGAs, PlanAheaddesign tools allow you to dramaticallyextend the functionality of your design.With similar support for designs targetingthe Virtex-4 multi-platform FPGA in thenear future, the application space for partialreconfiguration is practically limitless.

All modules are black boxed

PRM, all staticmodules areblack boxed

Static Module

Figure 6 – Special consideration should be given to timing-critical paths, which

include an asynchronous path.

Figure 7 – Upon successful completion of thefloorplanning DRC processes, all modules will be

displayed in black text within the ExportFloorplan dialog box.

Figure 8 – The PlanAhead partial reconfiguration flow wizard provides

an easy-to-follow flow.Figure 9 – The display of Pblock properties includes the estimated size of the bitstream.

by Jonathan TrotterTitanium Business Development ManagerXilinx, Inc.jonathan.trotter@xilinx.com

Once a specification has been defined andauthorized, several critical stages remain inevery design cycle. The first stage is designentry. HDL coding is the most prominentform of entry when designing platformFPGAs. The code must be written proper-ly for the design to pass through synthesis.Xilinx recommends that designers simulatetheir HDL before moving on.

The next stage is implementation, withthe most important aspect being place androute. Implementation maps the HDL-entered design into FPGA building blocks,creating a bitstream. Analysis is requiredbefore downloading the bitstream into anFPGA. Downloading the FPGA withoutthis analysis can damage the device, or eventhe system.

Designs must also meet system timingparameters. If timing is not met after animplementation, then the code, timingconstraints, implementation options, orinternal logic placement must be changed.Yet adjusting any one of these areas canyield your desired results or further removeyou from them.

Determining which area to adjust firstcan be challenging to design teams, espe-cially those feeling the heat of a deadline.The best design teams are those that canmake the necessary changes or correctionsthe fastest. Once the changes have beenmade, a final verification or timing simula-tion is required. If the design does not pass,more modifications are needed.

Supporting PlayersSupporting Players

Titanium Dedicated Engineering provides assistance in overcoming technical obstacles.

To the RescueXilinx has the expertise to solve your prob-lem and meet your deadline regardless ofyour application or specification require-ments. Under our Titanium DedicatedEngineering program, Xilinx engineers cansuccessfully:

• Aid designers with embedded process-ing, timing closure, signal integrity,and DSP challenges

• Assist teams starting their first FPGAdesign with design flow and funda-mentals

• Help with adding a feature to a designthat is already at capacity

• Meet a deadline with timing closureassistance

Titanium Dedicated Engineering canimprove your design productivity andaccelerate your time to market by provid-ing you with a dedicated application engi-neer on a contract basis. This expert canoffer the technical assistance that yourteam and design require, either remotelyor at your site. Not only will your designteam have access to the engineer’s expert-ise, they will also have access to the knowl-edge possessed by Xilinx product anddevelopment groups.

Design ChallengesTo create a complete and powerful plat-form FPGA design requires knowledge,skill, time, resources, and patience. Tofully utilize and take advantage of all of thenew features of a Xilinx device requiresramp-up time for any design team, regard-less of their experience level. A good exam-ple of this is Xilinx embedded processingtechnology. Designing with embedded fea-tures requires extra skills and knowledgeabove and beyond those needed for a suc-cessful FPGA design. With time, mostcapable teams will learn how take advan-tage of these features, but the competitionmay learn faster with assistance.

Each design team operates and functionsdifferently, but they all seem to have similarpractices and styles. The greater familiarity

lize the device features most applicable totheir design. In this case, the applicationengineer quickly identified and recom-mended that the team take advantage ofthe DCMs as well as IOB registers.

After an architecture-specific re-design, the next issue was to apply propertiming and area constraints based on theirtiming report. The constraints forceXilinx place and route tools to concen-trate on the most critical areas of thedesign. The application engineer alsofocused on educating the design team sothat they could apply these fundamentalson their own during their next design.

The main culprit of the configurationissue turned out to be a faulty crystal oscil-lator. The frequency was a little lower thanwhat its datasheet specified. Once it wasreplaced, the FPGAs could be configuredvia System ACE technology.

ConclusionXilinx has encountered and solved almostevery presentable difficulty at every stageof the design cycle. Most of the issues andsolutions have been documented in adatabase. In a critical design situation, thisknowledge can be the difference betweensuccess and failure.

Xilinx will give you personalizedaccess to one of our application engineersfor the duration of your choosing. ThisTitanium Dedicated Engineer will be anexpert in the area of your design needsand can provide an understanding ofXilinx design methodologies and tech-niques. The engineer will provide a solu-tion to your problem, at any stage of thedesign cycle, and also educate yourdesign team on the steps needed to pre-vent those issues from recurring.

A Titanium Dedicated Engineer canwork at Xilinx, on-site with your designteam, or a mix of both. This flexibilityallows our engineers to fully understandthe needs and requirements of our clients,as well as leverage Xilinx factory resourcesto resolve problems and accelerate produc-tion. For more information aboutTitanium Dedicated Engineering, call(800) 888-FPGA (3742) or visit www.xilinx.com/titanium.

they have with FPGAs, the easier it is to uti-lize advanced features like the embeddedprocessor or multi-gigabit transceiver, or toreach desired performance levels. Problemscan occur if the team is relatively new orinexperienced. The quicker designers adoptour fundamental design style, the quickerthey will complete successful designs. Forexample, timing verification must be per-formed before transferring the design insoftware to silicon. The step is not mandato-ry, but precautionary.

Success StoryMany ASIC design teams prototype theirideas in an FPGA. This allows them tomake changes or modifications without thetime and expense of re-spinning an ASIC.One Xilinx customer wanted to implementan ARM processor core and additional cus-tom logic inside a Xilinx device. The designwas so large that it had to be split upbetween two Virtex™-II devices. It can bechallenging enough to successfully imple-ment a function in one platform FPGA, letalone splitting it across two of them.

The customer’s design was not meetingtiming, nor could they configure their sys-tem utilizing our System ACE™ technolo-gy. Because two devices were required,board connectivity techniques were critical;the board timing and FPGA timing had tobe in harmony for the two devices to func-tion in unison.

There were other several crucial designchallenges for this team:

• They were one month behind schedule.

• They had a customer demonstration insix weeks

• System frequency was 15 Mhz belowthe application’s minimum requirement

• Configuring via JTAG was successful,but not utilizing System ACE technology

Xilinx sent a Titanium DedicatedEngineer to work onsite with the team toidentify their timing, integration, and con-figuration issues. The engineer first helpedthem understand the device architecture toensure that they had the knowledge to uti-

Now you can see inside your FPGA designs in a way that

will save days of development time.

The FPGA dynamic probe, when combined with an Agilent

16900 Series logic analysis system, allows you to access

different groups of signals to debug inside your FPGA—

without requiring design changes. You’ll increase visibility

into internal FPGA activity by gaining access up to 64

internal signals with each debug pin.

You’ll also be able to speed up system analysis with the

16900’s hosted power mode—which enables you and your

team to remotely access and operate the 16900 over the

network from your fastest PCs.

The intuitive user interface makes the 16900 easy to get up

and running. The touch-screen or mouse makes it simple to

use, with prices to fit your budget. Optional soft touch

connectorless probing solutions provide unprecedented

reliability, convenience and the smallest probing footprint

available. Contact Agilent Direct today to learn more.

U.S. 1-800-829-4444, Ad# 7909Canada 1-877-894-4414, Ad# 7910www.agilent.com/find/new16900www.agilent.com/find/new16903quickquote

©Agilent Technologies, Inc. 2004 Windows is a U.S. registered trademark of Microsoft Corporation

• Increased visibility with FPGA dynamic probe• Intuitive Windows

®XP Pro user interface

• Accurate and reliable probing with soft touch connectorless probes • 16900 Series logic analysis system prices starting at $21,000

Get a quick quote and/or FREE CD-ROM with video demos showing how you can reduce your development time.

X-ray vision for your designsAgilent 16900 Series logic analysis system with FPGA dynamic probe

X-ray vision for your designs

by Scott FergusonFactory Application Engineer, Logic AnalyzersAgilent Technologies, Inc.sferguson@agilent.com

As FPGAs become a viable option for high-performance signal processing in the digitalcommunications design space (cellular basestations, satellite communications, andradar), analysis and debug tools mustinclude new techniques to help you get themost optimal performance in your circuitsin the least amount of time.

Although signal analysis tools that con-nect to simulation and RF analog signalsare available, it’s important to be able tomeasure signal quality (frequency spec-trum, I-Q constellation, and error vectormagnitude [EVM]) in the sub-circuits ofyour FPGA. Thus, Agilent has linked its89601A Vector Signal Analysis (VSA) soft-ware with its line of logic analyzer products(1680, 1690, and 16900 families) to createa digital VSA tool. This tool, when com-bined with the Xilinx® ChipScope™ ProAgilent Trace Core, allows you to performsignal analysis anywhere inside your FPGAdesign quickly and easily.

In this article, we’ll show how this com-bination of tools works – and how it canhelp you get the most from your Xilinx-based DSP circuits.

Real-Time Analysis of DSP DesignsReal-Time Analysis of DSP Designs

Agilent combines the FPGA Dynamic Probe and digital VSA.Agilent combines the FPGA Dynamic Probe and digital VSA.

Digital VSAVSA uses Fast Fourier Transform (FFT)-based data processing to provide a combi-nation of time- and frequency-domaindisplays and measurements. Figure 1shows a typical VSA display. Althoughthe display is extremely flexible and con-figurable, the main components includethe I-Q constellation plot (upper left),magnitude spectrum (lower left), errorvector (upper right), and measurements(lower right). The EVM is displayed inthe measurements section. This singlevalue is a key indicator of the quality ofthe modulated signal.

EVM is computed by extracting I-Qsymbols from the captured data; the sym-bols are the grid points in the constella-tion defined by the QPSK, QAM, orother modulation scheme. Once extractedfrom the measured signal, the symbolsequence is used to create an ideal (theo-retically perfect) signal known as the “ref-erence” signal. Each measured signal iscompared to the reference signal, and thedifference is known as an error vector.(The error can contain both I and Q, ormagnitude and phase components). Theindividual error vectors for a single cap-ture are combined to make a single EVMmeasurement.

Although this analysis software wasoriginally created to analyze analog RFsignals, it was developed in a hardware-independent, PC-based software package.Because Agilent logic analyzers are alsoPC-based, it was easy to extend the VSAsoftware to link to the logic analyzers.

Digital baseband and IF signals are rep-resentations of analog signals. Rather thanusing an instrument that digitizes a signalto enable FFT analysis (like an RF signalanalyzer), the signal is digital from thestart. These digital versions of analog sig-nals can be displayed in a logic analyzer ina chart-style waveform, which resemblesan oscilloscope display (as in Figure 2).

As you can see, when the bus is syn-chronously sampled and the sample ratemeets the Nyquist requirements, the logicanalyzer captures a sufficiently accurateversion of the “once-was” or “will-soon-be” analog signal.

core is a switching MUX incorporatedinto the design using the ChipScope ProCore Inserter, typically post-synthesis.During core insertion, you select theinternal nets to connect to the trace core,and the physical pads to which you willconnect the MUX output. These pads are

then routed on the circuit board toa logic analyzer probe.

The logic analyzer controls theFPGA through JTAG (downloadingthe bit file and selecting banks).When you select a new bank, thelogic analyzer automatically reconfig-ures itself to match the names of thenets now connected to the probe.

Design Example – QAM16 ModulatorWith help from our local XilinxDSP specialist FAE, we created ademo that fits into a small Virtex™-II part (XC2V250-FG256) usingXilinx System Generator for DSP.This tool makes creating DSPdesigns quick and easy. The design(shown in the block diagram inFigure 3) contains a 25 MHz symbolencoder; a root-raised cosine filterwith 24 taps and 4X interpolation(the output running at 100 MHz);and an IF modulation stage with a25 MHz local oscillator.

FPGA Dynamic ProbeThe FPGA Dynamic Probe, working withthe ChipScope Pro analyzer, can provideaccess to any part of a DSP design withoutrecompiling. In Figure 3, a simplified dig-ital radio transmitter design is connectedto the Agilent Trace Core 2 (ATC2). This

SymbolEncoder

BB Filter

90 deg

Bank 0 Bank 1 Bank 2 Bank 3

Logic Analyzer Probe

Select Signal Bankvia JTAG

Figure 3 – FPGA Dynamic Probe

Figure 2 – Chart display of digital bus

Figure 1 – VSA display

Integrating the ATC2 Core in a System Generator DesignAfter compiling this design intoVHDL, we inserted the ATC2 core.To make the signal names more logi-cal on the logic analyzer display, wedid some hand-editing of the VHDL.(You could avoid this step by carefullychoosing net names in the SystemGenerator.) We then connected mostof the interesting nets as output portsfrom the top-level object to make thenet names short enough to fit on thelogic analyzer screen.

When connecting nets to outputports solely for use with the FPGADynamic Probe, a good trick is to usethe “keep” attribute in the VHDL.Because you don’t add the ATC2 coreto the design until after synthesis,many nets would otherwise be opti-mized out because they’re not con-nected to anything. In VHDL, thesyntax to use the “keep” attributelooks like this:

attribute keep : string;

attribute keep of i_symbol: signal is “true”;

attribute keep of q_symbol: signal is “true”;

We created an ATC2 core withfour banks, each with 48 signals.Using the ATC2 core’s 2X TDMoption (time-slicing two signals at atime on each pad), this requires only25 package pads on the FPGA (onefor a clock and 24 for data). This givesus access to 192 signals. Actually, weonly need to view about 92 signals:

• I-Q symbols, 8 bits each (16)

• I-Q filter output, 24 bits each (48)

• IF local oscillator sine and cosine,2 bits each (4)

• Combined IF signal (24)

The output of the RRC filter with24-bit I and Q signals was the largestrequirement, defining the number ofpins required. If 24 pins were notavailable, you could drop the least sig-

nificant bits, losing some dynamic rangebut still being able to view the signals.

Time-Domain, Logic, and VSA MeasurementsThe logic analyzer uses synchronoussampling (or “state mode”) to capturethe output of the ATC2 core. Thismeans that data is sampled on eachedge of the ATC2’s output clock. Ourdesign has two clock rates in the circuit– 25 MHz for the symbol data beforethe RRC filter and 100 MHz for allparts after the filter. Because the ATC2core supports only one clock per core,two options exist for debug:

• Using two cores, one for each clock rate

• Using one core with the fasterclock rate and over-sampling the25 MHz bus

Because the two clocks are correlated– and one is an integer multiple of theother – you can just over-sample theslower bus. If over-sampling is not desir-able, the logic analyzer can use a setupthat stores every fourth sample, therebycapturing the 25 MHz bus accuratelywith one sample per 25 MHz clock.

With the extra signals available inthe MUX, we were able to double-probe some of the interesting signals.For example, in bank 0 we have the Iand Q symbols before the filter, andalso the I component after the RRC fil-ter. This means we can do some time-domain analysis in the logic analyzer tomeasure group delay in the filter, as inFigure 4. Two markers indicate a com-mon signal feature: a wide, flat top andthe marker measurement display show-ing an interval of 250 ns.

After probing the interesting parts ofthe circuit, we performed vector signalanalysis on the signals and measured thequality of our RRC filter and IF modu-lation stages.

Looking at the QAM16 I-Q symbolsbefore they were filtered (as shown inFigure 5), you can see the 16-pointQAM constellation (upper left graph).

Figure 7 – Digital IF signal

Figure 6 – Filtered IQ baseband data

Figure 5 – Unfiltered QAM16 symbols

Figure 4 – Filter group delay measurement

With one point per symbol, the linesbetween the constellation points arestraight. The frequency spectrum (in thelower left graph) is centered at 0 Hz andhas a 25 MHz pass-band with power inadjacent channels. Adjacent channel poweris undesirable in the RF signal, of course,which is the reason for the baseband filter.

By selecting a different bank in theATC2 core (controlled by the logic analyz-er), you can perform analysis on the IQ sig-nal after the baseband filter, as seen inFigure 6. Now the spectrum has sidebandsremoved, and the measurement display (inthe lower right quadrant) shows an EVMof 0.5%. The next time your RF team com-plains of errors in the baseband design, youcan point to this measurement (with whichthey are quite familiar), and prove that it’snot your filter’s fault.

In many digital radio designs, this IQsignal would now be converted to analog.However, we performed the IF modulationdigitally inside the same FPGA. Switchingbanks in the FPGA Dynamic Probe gives usaccess to the digital IF (again, withoutanother synthesis and place and route step),as shown in Figure 7. Note that the spec-trum and I-Q constellation are roughly thesame, only now centered about 25 MHz.The EVM is a little bit higher, indicatingthat you may want to use a higher qualitylocal oscillator or another filter stage.

ConclusionXilinx System Generator and ChipScopePro analyzer, combined with the Agilentlogic analyzer and Agilent VSA software,allow you to perform real-time in-depthanalysis on digital baseband and IF signalsinside your Xilinx FPGA. This will saveyou time and eliminate doubts about thedifference between simulation and realhardware. It can also help you communi-cate with your colleagues on the RF designteam, enabling you to speak their languageand use the same analysis software regard-less of signal format (analog, digital, base-band, or RF).

For more information about theseapplications, visit www.agilent.com/find/logic-sw-apps, or contact your Agilent representative.

Would you like to write for Xcell Publications?

It’s easier than you think.

Would you like to write for Xcell Publications?

It’s easier than you think.We recently launched the Xcell Publishing Alliance

to help you publish your technical ideas. We can help you – from concept research and development, through planning and

implementation, all the way to publication and marketing.

Submit articles for our Web-based Xcell Online or our printed Xcell Journal and we will assign an editor and a graphics artist to work

with you to make your work look as good as possible. Submit yourbook concepts and we will bring our partnership with Elsevier,

the largest English language publisher in the world, and our broad industry resources to assist you in planning, research,

writing, editing, and marketing.

We recently launched the Xcell Publishing Alliance to help you publish your technical ideas. We can help you –

from concept research and development, through planning and implementation, all the way to publication and marketing.

Submit articles for our Web-based Xcell Online or our printed Xcell Journal and we will assign an editor and a graphics artist to work

with you to make your work look as good as possible. Submit yourbook concepts and we will bring our partnership with Elsevier,

the largest English language publisher in the world, and our broad industry resources to assist you in planning, research,

writing, editing, and marketing.

For more information on this exciting and highly rewarding program, please contact:

Forrest CouchExecutive Editor, Xcell Publications

xcell@xilinx.com

by Anthony LeProduct Marketing Manager, Configuration Memory SolutionsXilinx, Inc.anthony.le@xilinx.com

FPGA configuration is often a last-minutedesign decision, because engineers viewFPGA configuration as an easy, no-brainerstep in the design cycle. That is true whencustomers use the Xilinx “recipe” – a Xilinx®

FPGA, Platform Flash PROM, ISE™ soft-ware, and platform cable USB. However,FPGA configuration becomes increasinglycomplex if you use a non-Xilinx solution. Inthis article, I’ll discuss the differencesbetween Platform Flash and commodityFlash (see Table 1 for a summary).

Added Flexibility in Configuration SolutionsBefore the introduction of the Spartan™-3E FPGA family, customers who config-ured Xilinx FPGAs with commodity Flashwould use a three-chip solution: an FPGA,a commodity Flash PROM, and a CPLD.Spartan-3E FPGAs eliminate the need for aCPLD (or other controller) by providing adirect interface for leading commodityFlash devices, thus reducing the chip countto two (FPGA and PROM).

Because Xilinx gives you complete flexi-bility to use multiple memory sources forFPGA configuration, you should considerthe following factors at the start of the designprocess: total cost of ownership, board space,configuration speed, source of supply, value-added features, and ease of use. If these fac-tors are not thoroughly thought out from thebeginning, then the final design can incuradditional costs and possible board redesign.

Configuration Choices – Platform Flash or Commodity Flash

Xilinx provides flexibility for configuration memory so that you can make the best decision for your design.

Total Cost of OwnershipOn a per-unit basis, commodity Flashmight appear to be attractively priced;however, you need to consider the totalcost of ownership. Total cost of ownershipis the summation of per-unit cost, designand prototyping cost, and manufacturingand test cost.

1. The cost difference between a commod-ity Flash PROM versus a Platform FlashPROM is negligible when compared tothe overall board cost. In fact, XilinxPlatform Flash PROMs are competitive-ly priced with all non-volatile memoriesin the market (commodity Flash PROMand competing PROMs).

2. The prototyping phase of a design sig-nificantly favors Platform Flash overcommodity Flash because Xilinx offersone of the lowest cost in-system pro-gramming (ISP) solutions:

• Platform cable USB: $150

• Programming software – iMPACT: $0

• Xilinx award-winning support: $0(included)

3. Once in production, you can significant-ly reduce costs by utilizing the BoundaryScan (JTAG) capability of Xilinx FPGAsand Platform Flash PROMs (along withother JTAG devices on the board) forlow-cost Boundary Scan testing and pro-gramming. Commodity Flash devices donot offer JTAG interfaces; therefore, cus-tomers cannot take advantage of the lowcost of Boundary Scan testing. In mostcases, expensive Automatic TestEquipment (ATE) is required to test andperform in-system programming of com-modity Flash memories.

Board SpaceIf board space is critical to your design,then consider the following:

• Standard SPI PROMs are typicallyoffered in the smallest form factor. The1 Mb to 4 Mb SPI PROMs are usuallyoffered in a SOIC-8L (5 x 6 mm)package and 8 Mb (and larger) devicesare usually offered in a SOIC-16L (10x 6 mm) package.

a Xilinx FPGA with a commodity FlashPROM would require using a CPLDdevice to translate memory into theFPGA bitstream (refer to XAPP058,“Xilinx In-System Programming Usingan Embedded Microcontroller”).Resulting data transfer rates candegrade based on the translation logic.

3. If using a Spartan-3E device with a par-allel commodity Flash, the configurationmode is limited to 6 MHz. PlatformFlash features a maximum transfer rateof 40 MHz x 8 I/O (or 320 bps) withSpartan-3E devices.

In theory, Parallel commodity Flash isfaster, but given the limitations listedabove, the practical transfer rate is signifi-cantly less than that of Platform Flash.

Source SupplyAlthough there are many commodity Flashvendors, you should be aware of twopotential pitfalls. First, every vendor pro-vides similar commodity Flash PROMs,but there are nuances with each vendorthat can limit their interoperability. For

• Platform Flash PROMs are a close sec-ond with 1, 2, and 4 Mb PROMsoffered in a TSOP-20L (6.5 x 6.4 mm)package and 8, 16, and 32 Mb PROMsin a TFBGA-48 (8 x 9 mm) package.

• Parallel commodity Flash devices havelarge packages to address the additionalcontrol, address, and I/O pins.

SPI PROM has the advantage here, butPlatform Flash is a close second consider-ing that there is only a 12 mm2 differencein area. The difference in area is minutecompared to the overall board space.

Configuration SpeedParallel commodity Flash devices are typi-cally the fastest memory on the market.They are offered in either x8 or x16 I/Oconfigurations. The theoretical data trans-fer rate can be as fast as 50 MHz x 16 I/O,but there are limitations when configuringa Xilinx FPGA.

1. At this time, Xilinx FPGAs can only beconfigured in x8 mode.

2. Before Spartan-3E devices, configuring

Platform Flash SPI Flash Parallel Flash

Minimum Config Pins 3 4 37

Configuration Serial & Parallel* Serial Only Parallel Only

Storage Density 1 – 32 Mb 512K – 128 Mb 512K – 256 Mb

Sourcing Xilinx only Multiple Multiple

Supply Management Yes No No

Storage Cost Low Lowest Low

FPGA Read/Write Yes** Yes Yes

JTAG Interface Yes No No

Configuration Time Fast Slow Fastest

Compression Yes*** No No

Design Revisioning Yes*** No MultiBoot

Power Reliability Excellent Fair Fair

Xilinx Support Excellent Limited Very Limited

ISE 7.1i Support Excellent None None

* XCF00P supports both Serial & Parallel, XCF00S only suports Serial** Read-Only using XAPP694 *** Only with XCF08P, XCF16P, and XCF32P

Table 1 – Memory summary chart

example, a STMicro SPI is not fully com-patible with an Atmel SPI PROM.Second, during a period of tight sourcesupply, customers might find themselvespaying more for expedites or end up withvery long lead times. Xilinx answers thesource supply conundrum by holding alarge inventory of Platform Flash at fin-ished goods, which allows Xilinx to quick-ly react to increased demand.

Ease of UsePlatform Flash was designed to work seam-lessly with all Xilinx FPGAs and is also sup-ported by an award-winning support team.Xilinx provides a total configuration solu-tion that includes software and hardware.No other configuration memory solutionoffers this type of support.

Value-Added FeaturesFinally, Platform Flash offers the followingvalue-added features that are not found incommodity Flash:

1. Compression. The higher densityPlatform Flash PROM devices havebuilt-in de-compressors, which, on aver-age, can fit 50% more configurationdata into the same memory space. Xilinxpatented compression technology canhelp you reduce costs in two ways:

a. Reduce component costs by fitting alarge bitstream into a lower densityPlatform Flash PROM device. Forexample, a Virtex™-4 LX60 designrequiring more than 17 Mb of config-uration bits can fit into a XCF16Pinstead of a 32 Mb PROM.

b. Reduce component count by fitting adesign into one PROM as opposed totwo or more. For example, a Virtex-4LX160 device requires more than 40Mb of configuration bits, which nor-mally would require a 32 Mb and 8Mb PROM, but compression enablesthe design to fit into a single XCF32P.

2. JTAG. Allows low-cost board-levelBoundary Scan testing for opens andshorts, as well as programmability dur-ing prototyping and in the productionenvironment.

3. Design Revisioning. Allows one board tohave many functions. Platform FlashPROMs (XCF08P, XCF16P, andXCF32P) have blocks of memories thatcan be written and read independently ofone another. The logic to switch betweeneach block is already built into PlatformFlash, thus reducing design time andcost. Although commodity Flash deviceshave a similar feature called “sectors,”you would need glue logic and softwareto access the various sectors.

4. Access to unused memory. Most FPGAbitstreams will not use all of the memoryof a PROM. Thus, any unused memorycan be used for processor “scratch pad” or“boot code.” You can access unused mem-ory within a Platform Flash PROMthrough JTAG (refer to XAPP544, “UsingXilinx XCF02S/XCF04S JTAG PROMsfor Data Storage Applications”). Note thatyou can still access unused memory with-in commodity Flash PROM, but youmight need to design additional logic andsoftware to access the unused memory.

ConclusionWhen planning your next board design,use Platform Flash for your FPGA con-figuration and you can beat your com-petitors to market and lowerdevelopment cost. Platform Flash is aninnovative configuration memory withvalue-added features that enable greaterflexibility and performance for Virtexand Spartan FPGAs.

Platform Flash PROMs provide youwith a system-level drop-in solutionthat allows you to maximize the flexi-bility of Virtex and Spartan FPGA-based systems to significantly reduceyour design effort and accelerate timeto market. Platform Flash PROMs arecompetitively priced, reduce theamount of board space required forconfiguration, and offer a complete 1 to 32 Mb PROM density solution(Table 2).

For more information, visitwww.x i l inx . c om/produc t s / s i l i c on_solutions/proms/pfp/index.htm.

XCF01S XCF02S XCF04S XCF08P XCF16P XCF32P

Density 1 Mb 2 Mb 4 Mb 8 Mb 16 Mb 32 Mb

JTAG Prog ✔ ✔ ✔ ✔ ✔ ✔

Serial Config ✔ ✔ ✔ ✔ ✔ ✔

SelectMap Config ✔ ✔ ✔

Compression ✔ ✔ ✔

Design Revisions ✔ ✔ ✔

VCC (V) 3.3 3.3 3.3 1.8 1.8 1.8

VCCO (V) 1.8 – 3.3 1.8 – 3.3 1.8 – 3.3 1.5 – 3.3 1.5 – 3.3 1.5 – 3.3

VCCJ (V) 2.5 – 3.3 2.5 – 3.3 2.5 – 3.3 2.5 – 3.3 2.5 – 3.3 2.5 – 3.3

Clock(MHz) 33 33 33 40 40 40

Packages VO20 VO20 VO20 FS48 FS48 FS48

VO48 VO48 VO48

Pb-Free Pkg VOG20 VOG20 VOG20 FSG48 FSG48 FSG48

VOG48 VOG48 VOG48

Availability Now Now Now Now Now Now

Table 2 – Platform Flash features

David PellerinChief Technology OfficerImpulse Accelerated Technologies, Inc.david.pellerin@impulsec.com

Greg EdvensonSenior Software EngineerPico Computing, Inc.greg@picocomputing.com

Kunal ShenoyDesign EngineerXilinx, Inc.kunal.shenoy@xilinx.com

Dan IsaacsDirector, Embedded PowerPC Marketing, APDXilinx, Inc.dan.isaacs@xilinx.com

The Xilinx® Virtex™-4 FX family ofFPGA devices provides embedded systemsdevelopers with new alternatives for creat-ing high-performance, hardware-accelerat-ed applications. With an integratedindustry-standard PowerPC™ processorand innovative Auxiliary Processor Unit(APU) interface, the Virtex-4 FX deviceallows system designers to efficiently con-nect custom hardware accelerators to theintegrated processor, yielding unprece-dented performance.

In the past, software programmers whowanted to take advantage of FPGAs foralgorithm acceleration have experiencedsignificant technical barriers because of thecomplexity of writing low-level hardwaredescriptions to represent higher level soft-ware functions.

Accelerating PowerPC Software Applications

Using custom APU peripherals, C-to-hardware tools enable fast creation of Virtex-4 hardware accelerators.

In this article, we’ll show how thepower of the Virtex-4 FX FPGA can bemade readily available to embedded sys-tems designers and software programmersthrough the use of software-to-hardwaretools. The emergence of such tools bringthe performance benefits of FPGAs toanyone who can program in C.Accelerated FPGA-based designs are noweasier and more practical for a wide rangeof application domains, including imageprocessing, DSP, and data encryption.

From Software to FPGA HardwareBy virtue of their massively parallel struc-tures, FPGAs have the potential to dramati-cally accelerate embedded softwareapplications. But because these devicesrequire different (hardware-oriented) skillsthan traditional processors, the creation andprogramming of a system based on an FPGAhas remained challenging for all but the mosthardware-savvy software programmers. Thisis changing, however. With the introductionof simplified FPGA-based computing plat-forms and streamlined tools for platformbuilding, most of the barriers to FPGA adop-tion have been removed. In addition, theintroduction of software-to-hardware toolsfor FPGAs has dramatically improved thepracticality of these devices as software-pro-grammable computing platforms.

The tools that make this shift possible –enabling FPGA-based platforms to be con-sidered viable alternatives to traditionalprocessors for embedded systems – servetwo basic needs. At the front end, software-to-hardware compiler tools accept high-level descriptions written in a languagefamiliar to embedded software program-mers. The de facto standard for embeddedsystems design is standard C, with C++ andJava beginning to make inroads as well.

At the back end, existing synthesis andplace and route technologies are combinedwith system-level platform building tools,allowing designers to develop and targetcomplete systems on programmable logicto specific development boards. Both ofthese needs are being met today, by toolscurrently available.

In the area of software-to-hardware com-pilation, compiler tools such as Impulse

rience with low-level FPGA design – toassemble a complete system within a singleFPGA, including one or more processorsand related peripherals. When these toolsare combined with a platform-aware soft-ware-to-hardware compiler, the completesystem can include custom acceleratorsoriginally written in C.

Accelerating Embedded ApplicationsThe Virtex-4 FX family of devices providesan ideal platform for hardware accelerationof embedded applications. The Virtex-4FX12 FPGA, for example, includes morethan 12,000 logic cells; an integratedPowerPC 405 core, which can operate atspeeds as fast as 450 MHz; and dual10/100/1000 Ethernet MACs, configuredby the processor through the device controlregister (DCR) interface or through theFPGA fabric.

Looking inside the embedded processorblock (see Figure 2), the PowerPC 405CPU is directly coupled to the unique andinnovative APU controller, which providesdirect access to hardware acceleratorsimplemented in the FPGA logic. The APUcontroller supports three classes of instruc-tions: PowerPC floating-point instructions,APU load and store instructions, and user-defined instructions (UDI). UDIs are pro-

CoDeveloper (Figure 1) can simplify thegeneration of FPGA hardware from higherlevel C-language descriptions of softwarealgorithms. These tools provide the necessarybridge between the domains of software pro-gramming and lower level hardware design.

Serving the need for platform buildingtools is Xilinx Platform Studio, which sup-ports a wide variety of Xilinx FPGA-basedboards and systems. Platform Studio makesit practical for an embedded systemsdesigner – who may have little or no expe-

HDLFiles

C SoftwareLibraries

C LanguageApplications

GenerateFPGA

Hardware

GenerateSoftwareInterfaces

GenerateHardwareInterfaces

DCRI/F

ISOCMControl

PowerPC405 Core

DCREMAC

APUControl

Statistics

Client PHY

Ext DCR

Statistics

DCRControl

ResetDebug

Figure 1 – Impulse CoDeveloper tools simplify theconversion of C subroutines to lower level FPGAlogic and provide the necessary software-to-hard-ware communications on the Virtex-4 platform.

Figure 2 – The Virtex-4 FX12 embedded processing block includes the embedded PowerPC 405processor and a high-performance APU interface, along with dual 10/100/1000 EMACs.

grammed into the APU controller eitherdynamically through the PowerPC 405 viathe DCR bus or statically during FPGAconfiguration via the bitstream. The APUsupported instructions are executed byhardware acceleration co-processingengines implemented in the FPGA logic.

When packaged in a highly integrated

compact device such as the PicoComputing E-12 card (see sidebar, “AWide Range of Development Platforms”),the FX12 device becomes a completeembedded development platform thatrequires little or no hardware design expert-ise. For embedded application developersrequiring a wider range of hardware

peripherals (such as direct access to videoand audio signals), the Xilinx ML403board, also based on the FX12 device, pro-vides an excellent embedded systemsdevelopment platform.

On-chip, the Virtex-4 FX APU con-troller provides a flexible high-bandwidthinterface between the FPGA fabric and the

Taking Advantage of Parallelism in FPGAsA key aspect of any software-to-hardware design flow is the use of parallelism to increase performance. When accelerating C appli-cations using FPGAs, parallelism can be exploited at two distinct levels: at the application system level and at the level of statements(or blocks of statements) within a specific subroutine or loop.

Although there are ongoing attempts to create compiler technologies that can exploit both levels of parallelism with a high degreeof automation, the best approach today is to focus automation efforts (represented by the software-to-hardware compiler) on thelower level aspects of the problem, while at the same time providing software programmers an appropriate and easy-to-use pro-gramming model that allows higher level, coarse-grained parallelism to be expressed. In this way programmers can make hard-ware/software partitioning decisions and experiment with alternative algorithmic approaches, leaving the task of low-leveloptimization to automated compiler tools. This approach is particularly useful for platforms such as the Virtex-4 device that includeembedded processors.

A number of programming models can be applied to FPGA-based programmable platforms, but the most successful of thesemodels share a common attribute: they support modularity and parallelism through a dataflow-like method of design partitioningand abstraction. Communicating sequential processes, or CSP, is one such programming model. CSP has proven to be highly effec-tive in expressing application-level parallelism for FPGA targets. This programming model is directly supported in the Impulse Ctools provided by Impulse Accelerated Technologies, Inc.

At the heart of the Impulse C programming model are processes and streams (Figure 4). Processes are independently synchro-nized, concurrently operating portions of an application that are written in a standard language (in this case C language). Processesperform the work of the application by accepting data, performing computations, and generating relevant outputs.

Unlike traditional C subroutines, processes are considered persistent; they are normally called once (whether in hardware orsoftware) and continue as long as there is streaming data to be processed. The data processed by such an application flows fromprocess to process by means of streams, or in some cases by means of messages or shared memories, which are also supported inthe programming model. Streams represent one-waychannels of communication between concurrent processesand are self-synchronizing with respect to the processesby virtue of buffering. The primary method of synchro-nization between processes is therefore the data beingpassed on the streams.

The key to allocating processing power within such asystem – and using such a programming model – is toimplement one or more processes in the FPGA to handlethe heavy computation, and implement other processeson embedded or external microprocessors to handle fileI/O, memory management, system setup, and other non-performance-critical tasks. Using tools such as thoseincluded with Impulse C, an application comprising mul-tiple parallel C processes can be modeled entirely in soft-ware, verified using a standard desktop C debuggingenvironment, and then, after the application is function-ally complete, incrementally moved into the FPGA forfurther optimization and acceleration.

Software processesset up data andperform non-time-critical functions

Hardware processes are independentlysynchronized and perform most of the work

Shared MemoryBlock Reads/Writes

StreamInputs

SignalInputs

RegisterInputs

StreamOutputs

SignalOutputs

RegisterOutputs

ApplicationMonitoring

Outputs

Figure 4 – The Impulse C programming model emphasizes the use ofprocesses, streams, and shared memories for hardware/software partitioning.

pipeline of the on-chip PowerPC. Fabricco-processor modules (FCMs) implement-ed in the FPGA fabric are connected to theembedded PowerPC processor through theAPU interface, allowing the creation ofcustom hardware accelerators. These hard-ware accelerators operate as extensions tothe PowerPC, thereby offloading the CPUfrom demanding computational tasks.

Software engineers can access the FCMfrom within assembler or C code. Assemblermnemonics are available for user-definedinstructions and pre-defined load/storeinstructions, enabling programmers toinvoke hardware-accelerated functions intothe regular program flow. Programmers canalso define custom instructions designedspecifically for the hardware functionality ofthe FCM. When combined with C-to-hard-ware compiler tools, the APU controllerallows software programmers to create hard-ware-accelerated software applications withlittle or no FPGA design expertise.

C-to-Hardware Tools Increase Design ProductivityTo make productive use of any computingplatform, software programmers needappropriate compiler and debugging tools.Impulse C, from Impulse AcceleratedTechnologies, gives software pro-grammers access to FPGAs byallowing hardware accelerators tobe compiled directly from softwaredescriptions.

These accelerators, which aretypically represented by one ormore software subroutines, areautomatically compiled into effi-cient, high-performance hardwarethat can be mapped directly intoFPGA gates. In the case of theVirtex-4 FPGA, Impulse C is alsocapable of automatically generatingsoftware/hardware interfaces usingthe APU. This is particularly useful forapplications that combine both traditionaland FPGA-based processing.

Because it is based on standard C,Impulse C allows FPGA algorithms to bedeveloped and debugged using popular Cand C++ development environments,including Microsoft Visual Studio and

GCC-based tools. The CoDeveloper soft-ware-to-hardware compiler translates spe-cific C-language subroutines to low-levelFPGA-hardware (see Figure 3) while opti-mizing the generated logic and identifyingopportunities for parallelism. The compileris also capable of unrolling loops and gener-ating loop pipelines to exploit the extremelevels of parallelism possible in an FPGA.Instrumentation and monitoring functionsgenerate debugging visualizations for highlyparallel multi-process applications, helpingsystem designers identify dataflow bottle-necks and other areas for acceleration.

For applications involving the embed-ded PowerPC and MicroBlaze™ proces-sors, the Impulse C compiler automates thecreation of hardware/software interfacesand generates outputs compatible withXilinx Platform Studio. This makes it pos-sible to create high-performance, mixedhardware/software applications for FPGA-based platforms without the need to writelow-level VHDL or Verilog.

For large applications comprising multi-ple hardware and software elements,Impulse C includes interface libraries (seesidebar, “Taking Advantage of Parallelismin FPGAs”) and related compiler features,allowing parallelism to be expressed at the

level of multiple and independently syn-chronized processes. These processes can bemapped either to software running on anembedded PowerPC or MicroBlaze proces-sor, or to FPGA hardware.

For all such applications, integration offront-end compiler tools and back-endplatform building tools is important. The

design process is highly iterative, reflectingthe fact that decisions made up-front (suchas C coding styles and system-level parti-tioning decisions) may have a dramaticimpact on the results obtained after Ccompilation, synthesis, place and route,and final bitmap generation. At each pointin the process, the tools provide feedback,allowing you to evaluate and estimate per-formance before moving to subsequent(and perhaps more time-consuming) phas-es of the platform generation process.

Let’s summarize the steps required for atypical PowerPC-based application usingImpulse and Xilinx tools:

1. The application is initially written in standard C, using common Cdevelopment tools. These toolsinclude readily available tools such asVisual Studio, Eclipse, or GCC andGDB, and may also involve morecomprehensive cross-developmenttools. During this phase, a baselinefor validation (a software test bench,also written in C) is established,allowing you to quickly test laterdesign iterations.

2. A C profiler such as gprof may beinvoked, or other, less sophisticatedmethods are used to identify compu-tational hotspots. Often thesehotspots can be isolated to a few Csubroutines or inner code loopsrequiring acceleration. Applicationmonitoring (made possible by instru-menting the C code during softwaretesting) can help characterize thesehotspots and analyze data movement.

3. Using software-to-hardware interfacefunctions provided in the Impulse C library, data streams or sharedmemories create abstract connectionsbetween the main algorithm runningon the PowerPC and hardware-accelerated subroutines running inthe FPGA. The modified softwarealgorithm, which now includes oneor more independently synchronizedprocesses, is simulated again in astandard C environment to ensureits correct behavior.

Figure 3 – The C-language-to-FPGA-accelerator design flow

4. C-language subroutines (or processes,as they may now be referred to) repre-senting hardware accelerators are ana-lyzed and optimized by the Impulse Ccompiler, resulting in hardwaredescription files compatible withFPGA synthesis tools. Optimizationreports generated in this phase helpyou understand the impact of variouscoding styles, and make appropriaterevisions in the original C code forimproved performance. During thiscompilation process, additional com-piler outputs are generated that repre-sent hardware-to-software interfaces,including (in the case of the Virtex-4FPGA) the necessary APU interfacelogic. Software run-time libraries arealso generated at this point, corre-sponding to the abstract stream andshared memory interfaces specified onthe processor side of the application.

5. The generated hardware and softwarefiles are exported from the Impulsetools (as a PCORE peripheral) andimported directly into the XilinxPlatform Studio environment.

The stream and shared memory inter-faces defined in the C application aremapped to APU, PLB, or other interfaceswhere appropriate, along with other com-ponents (such as standard processorperipherals or non-standard IP blocks) tocreate the complete system. From withinthe Platform Studio interface, the entireapplication (both hardware and software) isbuilt, resulting in a downloadable bitmap.

Evaluating FPGA AccelerationUsing the Pico E-12 card and the XilinxML403 development kit, in conjunctionwith Impulse C and Platform Studio, we setout to compare the relative performance ofthe embedded PowerPC 405 processor bothwith and without APU hardware accelera-tion – and using only C programming tech-niques. To investigate a range of potentialapplication domains, we selected the fol-lowing three representative algorithms:

1. An image filter. This algorithm allowedus to evaluate two pipelined hardware

routines for processing a stream ofimage data. The algorithm chosen forthis experiment is a relatively simple 3 x 3 edge-detection function operat-ing on a 512 x 512 image buffer. Thisalgorithm allowed us to quickly evaluate the performance of datastreaming through the Virtex-4 APUinterface, as well as the potentialspeedups of using multiple, pipelinedhardware processes.

2. A triple-DES encryption engine. Thisalgorithm allowed us to evaluate theimpact of various C-level optimizationstrategies, as well as the practicality of adapting and optimizing legacy Ccode for a streaming programmingmodel. One million character blocks(of eight characters each) wereprocessed to obtain performancenumbers for this test.

3. A fractal image generator. This algo-rithm is computationally intensiveand can be characterized in manyways to explore the size/performancespace. For this experiment, we createda single hardware process in theFPGA as an APU peripheral. Thishardware process communicates witha single controlling software processrunning on the embedded PowerPC.The design of this algorithm, whichgenerates a 1024 x 768 pixel imagewith a selectable level of image accura-cy, is scalable such that additionalhardware accelerator processes can beeasily added, up to the limit of thetarget FPGA.

For each of these algorithms, variouscombinations of compiler loop unrolling,pipelining, and maximum stage delays

were selected in the Impulse C compiler. Inthis way the applications could be opti-mized (in most cases without modifyingthe original C code) to obtain a desirablebalance of size, cycle delays, and maximumclock speed in the generated hardware. Inmost cases, we determined that using a rel-atively low clock rate (50 MHz) in theFPGA fabric – in combination withincreased cycle-by-cycle throughput(through the use of automated pipelining)– produced the best overall results given thenominal overhead of software-to-hardwaredata communication.

Using these algorithms as a baseline,numerous tests were performed in whichthe same C code was compiled both to theFPGA (as an APU accelerator) and to theembedded PowerPC processor as a soft-ware-only application. The results of thesetests are summarized in Table 1.

As the chart shows, the hardware-accel-erated algorithms show an impressiveincrease in performance, even at reducedFPGA clock rates, compared to thePowerPC software-only version.

ConclusionIn this article, we have demonstrated how itis possible, using an FPGA-based platformand C-to-hardware tools, to create highlyaccelerated systems without low-level hard-ware design skills. The Virtex-4 FX device,when implemented in a card such as the PicoE-12 or on a prototyping board such as theXilinx ML403, promises to revolutionize theway that FPGA devices are applied for high-performance embedded computing.

Software-to-hardware tools such asImpulse C, when combined with the plat-form building capabilities of PlatformStudio, make programming for suchdevices practical and efficient.

ApplicationPowerPC Only PowerPC/APU

Acceleration(300 MHz) (300/50 MHz)

Image Filter (512 x 512 Image) 0.1414 sec 0.0124 sec 11.4 X

Encryption (8M Characters) 2.257 sec 0.0667 sec 33.84 X

Fractal Image (10K Max Iterations) 660 sec 31 sec 21.29 X

Table 1 – Virtex-4 APU acceleration results

A Wide Range of Development PlatformsProviding embedded application developers – software programmers – with an easy-to-use hardware platform is a critical first stepin making FPGAs viable as embedded development platforms. A growing number of vendors are offering FPGA-based proto-typing and high-performance computing platforms ranging from low-cost, single-FPGA systems to larger FPGA grids intended

for hardware-accelerated computing.The Pico E-12 card mentioned in the main article (which is avail-

able from Pico Computing, www.picocomputing.com) is aCompactFlash form-factor package that draws well under 2W. Itfeatures 10/100/1000 Ethernet, 64 MB of Flash, 128 MB of RAM,and a wide host of peripheral adapters such as A/D, D/A, asynchro-nous serial, synchronous serial, CAN bus, relay control, and JTAG.The card is supported by platform development and programmingtools appropriate for software developers. The Pico E-12 platformadvances desktop and portable computing by providing massivelyparallel hardware computing resources in a low-power, self-con-tained package (Figure 5).

There are two versions of the Pico E-12. The LogicOptimized (LO) version is based on the Virtex-4 LX-25 device,while the Embedded Processor (EP) version is based on theVirtex-4 FX12 device, with its integrated PowerPC processor.

In either case, the FPGA on the E-12 card is configuredfrom the 64 MB of on-board flash memory using an on-boardloader. The unique design of this loader allows new FPGAimages to be swapped into the FPGA on demand. A large num-ber of FPGA images can be stored in flash memory, and anyimage can be loaded at any time through on-board software orexternal software communicating with the E-12 card throughits external interfaces. The contents of the 128 MB of externalRAM remain intact through the swapping sequence, allowingsubsequent FPGA images to operate on existing RAM data.

The Xilinx ML403 (shown in Figure 6), the first of sever-al Virtex-4 FX embedded processing development boards,combines the Virtex-4 FX12 with a wide variety of software-configurable interfaces, including network interfaces; serial,parallel, and USB ports; LVDS and D/A and A/D interfaces;and a VGA driver. As such, the ML403 is ideal for embeddedsystems designers requiring direct FPGA access to externalhardware devices.

Figure 7 shows a comparison between the APU withImpulse-generated hardware accelerators and a processor/soft-ware-only implementation. Both systems are utilizing theML403 in this example. You can see that the APU-acceleratedversion is significantly faster than the processor-only version.

Figure 5 – The Pico Computing EP-12 card packages the FX12 or LX-25 device with a CompactFlash

interface in an extremely compact form-factor.

Figure 6 – Xilinx ML403 development system

Figure 7 – Xilinx ML403 development system showing APU hardware accelerated implementation versus software

only executing on the PowerPC

by Aaron Spear Lead Architect, Tools Solution Accelerated Technology, A Mentor Graphics Divisionaaron_spear@mentor.com

Phillip WalkerTechnical Marketing Engineer Accelerated Technology, A Mentor Graphics Divisionphillip_walker@mentor.com

The FPGA-based embedded system designflow provides many benefits to systemdevelopers, as well as new challenges. Chiefamong the benefits are accelerated designflows that allow you to move quickly fromthe design and testing cycles to marketingand selling. With this accelerated flow, it ismore important than ever that the hardwareand software designs are in sync throughoutthe entire engineering design cycle.

Accelerated Technology, A MentorGraphics Division, has developed a versionof its Nucleus embedded software suite thatintegrates with the Xilinx® EmbeddedDevelopment Kit (EDK). This provides atight integration of software systems in theFPGA embedded systems design flow.EDK is based on a data-driven code basethat makes it extensible and open. By lever-aging this functionality, Nucleus software isable to achieve a level of integration intothe FPGA-based embedded system designflow that was previously not possible.

Nucleus Integration with Xilinx FPGA System DesignNucleus Integration with Xilinx FPGA System Design

Accelerated Technology’s Nucleus, integrated with EDK, provides the most extensive solution for embedded system architects.Accelerated Technology’s Nucleus, integrated with EDK, provides the most extensive solution for embedded system architects.

The Nucleus embedded software suitefor Xilinx FPGA system design includes acomplete tool offering and target softwareplatform, including high-level modelingwith xtUML and advanced target softwaredebugging with the Eclipse-based NucleusEDGE environment.

Auto Configuration with MLDThe underlying technology of the Xilinxapproach is microprocessor library defini-tion (MLD). This technology allows forautomatic kernel configurations and thegeneration of board support packages(BSPs). This unique and straightforwardapproach allows the Nucleus embeddedsoftware suite to configure to FPGA systemdesigns created in EDK. This eliminates theneed to re-port the software system to thenew memory map and peripherals for everyhardware design cycle.

The Data-Driven ApproachEDK uses two main data repositories tostore information on hardware- and soft-ware-related settings. All hardware-relatedsettings are stored in the MHS (micro-processor hardware specification) file, whilesoftware-related settings are stored in theMSS (microprocessor software specifica-tion) file. These files provide a databasethat other tools in the Xilinx system canaccess for any given project.

The data-generation component, asdefined in a TCL file, takes hardwaredetails from the MHS data file and custominformation from the MLD file anddecides which files to generate and whatparameters to customize. The Nucleus-

Introduction to Nucleus EDGEAccelerated Technology offers theEclipse-based Nucleus EDGE develop-ment environment, a complete develop-ment environment that supports JTAGdebugging of both PowerPC andMicroBlaze processors. Nucleus EDGEextends the Eclipse platform for multi-core, multi-process, multi-thread debug-ging. It can be installed as a stand-aloneor into Platform Studio SDK to provideadvanced debugging and project manage-ment capabilities.

Nucleus EDGE is state-of-the-artdebugging technology that includes fea-tures such as:

• Full multi-core/multi-process/multi-thread debugging, with support forsynchronous operation on multiplecores simultaneously

• Advanced C-like scripting language(codelets)

• Data-driven target/core/peripheraldescriptions (XML)

• Support for freeze-mode/run-modedebugging (OS- and hardware-dependent)

• Pluggable kernel awareness (data-driven)

• Pluggable connection devices

• Pluggable core support

• Pluggable real-time trace support

• Pluggable profiling engine

specific functions of the TCL file encapsu-late the entire algorithm, generating con-sistent information used by the Nucleuskernel to support a wide range of hardwareIP configurations (see Figure 1).

The elements of the Nucleus PLUS real-time kernel modified by the data genera-tion file include:

• The number and type of peripherals used

• Memory map information

• Locations of memory-mapped device registers

• Timer configurations

• Interrupt controller configurations

Core GenerationXilinx currently offers two processor choic-es for implementation in their FPGAs: thePowerPC™ 405 hard-core processor andthe MicroBlaze™ soft-core processor. TheNucleus embedded software suite currentlyworks with both options. Once you haveselected your processor and configuredyour system inside EDK, enabling Nucleusis as simple as a drop-down menu selection.

Configuring Nucleus and BSP GenerationAfter you have generated your basic systemdesign and core selection in EDK, you areready to implement the Nucleus embeddedsoftware suite. As Figure 2 illustrates, weleveraged Xilinx MLD technology to addthis functionality to EDK.

Once you have configured Nucleus tofit your system requirements, generatingthe corresponding BSP is straightforward.Simply choose the “Generate Libraries andBSPs” from the Tools menu option inEDK. This will build the correct librariesand any associated applications that theNucleus embedded suite requires in orderto run on your newly designed system.

System Debugging with Nucleus EDGEYou now have your hardware defined andthe Nucleus PLUS kernel configured torun on your new platform. But what aboutsoftware development? Where do we gofrom here?

MHS MSS

User Parameters

Figure 1 – Dataflow in Xilinx EDK

Figure 2 – Components of the Nucleus system arenow configurable from within the EDK through

the Library/OS Parameters dialog.

Creating and Building ApplicationsNucleus EDGE provides a powerful buildand project management environment.The Nucleus EDGE builder is a front endfor any tool that transforms one or morefiles from one format into another format;examples include a compiler that trans-forms a C file into an object module or alinker that transforms N object modulesinto an executable. You can plug tools intothe Nucleus EDGE builder by writing asimple XML description for that tool. Wecurrently have built-in support for 32 dif-ferent tool sets, including Xilinx GNU forboth PowerPC and MicroBlaze processors.

BSPsWhen targeting traditional processors withNucleus EDGE, you are responsible for cre-ating and maintaining BSPs that the debug-ger and project manager use. One advantageof Nucleus software integration with EDK isthat BSPs are generated automatically, mak-ing maintenance painless. When you finishyour hardware design and generate the BSP,the Nucleus EDGE BSP is also generated.This BSP is then used to determine appro-priate tool defaults when creating applica-tions, or knowing the layout of memory andperipherals when debugging.

Getting Started with Project ManagementNucleus EDGE provides a powerful userinterface to change compiler settings for agiven project, or optionally override themfor a particular file. You are free to type inthe command-line arguments if you knowthem, or you can peruse the options using atree, which contains information about thecommand and allowable settings for it. It isnice not to have to wade through obscurecompiler documentation to find the settingyou need and its syntax (Figure 3).

Editing and BuildingNucleus EDGE provides a full-featuredcontext-sensitive editor for C/C++ as wellas assembly. The editor provides the fol-lowing features:

• Configurable syntax highlighting (youcan change the colors)

• Outliner that aids in navigation foryour active source file

• Right-click navigation for declara-tion/definition of function calls

• During debugging, hovering over vari-ables displays their current value; addi-tionally, you can define your ownscript functions to render tool tips foryour application data types

• Code completion for both functionsand macros

Any errors in your source during build-ing are displayed in the build console. Youcan click on errors; the editor synchronizesto the location for you. The editor decoratesall warning or error source locations withspecial icons (Figure 4). Figure 4 also showsthe outliner at the right side of the file.

Debugging PowerPC and MicroBlaze ProcessorsFor Nucleus PLUS kernel applications,Nucleus EDGE can support run-modedebugging – the debug of individual taskswhile the rest of the system continues torun. To accomplish this, it can use a serialport, Ethernet connection, or even theXilinx JTAG UART.

For MicroBlaze processors, NucleusEDGE currently supports debuggingthrough XMD (Xilinx MicroprocessorDebugger). For PowerPC, connectionoptions include XMD, or, for PowerPCdesigns in which you have instantiated adedicated JTAG scan chain, third-partyJTAG devices such as Abatron’s BDI2000and MacCraigor Systems On Chip Demonfamily of connections.

Platform Debugging (Hardware/Software Co-Debugging)One useful feature gained from connectingthrough XMD is that you can leverageChipScope™ Pro hardware debugging fea-tures simultaneously while you debug yoursoftware using Nucleus EDGE. In theChipScope Pro GUI, you get a logic ana-lyzer view of signals inside your core. Youcan then configure the ChipScope Pro ana-lyzer to halt the processor when the state ofa certain peripheral changes, for example.

When this occurs, Nucleus EDGE syn-chronizes, and you see the exact state ofyour software when the event occurred.

DebuggingNucleus EDGE contains many special fea-tures for embedded debugging. The regis-ter view, for example, shows groups ofnative processor registers as well as memo-ry-mapped peripherals. Those bits in theregister that are set are highlighted in anoptional graphical control. Bit-mappedregisters show the bits set, and allow youto control them individually.

In Figure 5, you can see that “ExceptionEnable” is bit 6 in the MSR (the bold boxaround the bit), and that the bit is not cur-rently set (the blue background). Gone arethe days of getting out your calculator todo binary conversions and counting bits to

Figure 3 – Nucleus EDGE build settings for MicroBlaze GNU

Figure 4 - Building with errors

Figure 5 - MicroBlaze registers

figure out if a bit is set. Figure 5 also showshow values that changed from the last stepare color-coded (red).

BreakpointsOne other compelling feature that NucleusEDGE offers above and beyond the capabil-ities of Platform Studio SDK is built-in inte-gration with hardware breakpoints. As youmay know, you can locate as many as eightprogram counter hardware breakpoints(used for stepping), as well as four readwatch points and four write watch points ina given MicroBlaze design. Nucleus EDGEoffers a completely integrated and graphicalmethod to set both types of breakpoints.Also, the Nucleus EDGE debug engine isable to use the hardware breakpoints seam-lessly to enable stepping in ROM.

Multi-Core DebuggingThe number of MicroBlaze cores that canbe placed in a design is only limited by thesize of the FPGA. However, the MicroBlazedebug module can support debugging of asmany as eight MicroBlaze cores simultane-ously. The Nucleus EDGE user interfaceand debug engine have the ability to create“synchronization groups” of different cores.When one of these cores stops, all of thecores in the group are stopped.

Although Xilinx does not currently shipan IP block that supports configuration ofsynchronous control of multiple cores, it isa relatively trivial matter to implement ityourself – after all, you have an FPGA.Simply tie together a memory-mapped reg-ister with some MUX logic on theMB_HALT pin (that indicates that a corehas gone into debugging state) as well as theDBG_STOP pin (that can force a core intodebugging state). This way, when one corein a group either hits a breakpoint or has anexception, all of the cores stop. Then, inNucleus EDGE, you can provide a codeletscript that sets this register appropriately.

Codelets/ScriptingNucleus EDGE contains support for ascripting language that we call “codelets.”The syntax is standard ISO/ANSI C, witha few extensions. Simply put, codelets arescripts that run in the debugger but have

full visibility and control over the target.You can access target registers, memory,

and variables, as well as call target functionsfrom within a codelet. You can read andwrite host files as well as sockets. You canopen “channel viewers” in the debugger GUIand execute them through any differentexpression evaluation. You can call themfrom the command line, or when hitting abreakpoint, or by typing an expression in thewatch window. Codelets are meant to be anenabling technology. They allow you to getinside your hardware in a way that is nototherwise possible. Some things that cus-tomers have done with codelets include:

• Board initialization during debug

• Complex conditional breakpoints

• Custom hardware validation/regression testing

• Virtual console I/O

• “Poor man’s” kernel awareness

• SmartWatch – the ability to define a codelet that is used to “render” agiven data type to a string, giving you nice tool tips for your datastructures when debugging

ChannelsNucleus EDGE contains an abstraction forcommunications that we call channels. Anybyte stream can be a channel. Files, sockets,and serial ports can all be channels. Codeletscan also be used to create channels. On top ofthat, the GUI provides the ability to write“channel viewer plug-ins,” a way to render thedata that comes from these channels. Usingthis infrastructure offers all kinds of interestingcapabilities. Nucleus EDGE currently shipswith the following built-in channel viewers:

• Generic text console I/O (standard I/Owith the app)

• VT-100 compatible console I/O (sup-ports escape sequences)

• Strip chart recorder that allows you toplot any value over time in real time

• Windows Media Player streamingplug-in (plays MP3s, MPEG video)(on Windows hosts only)

• Binary data viewer (like the memoryview, in effect a “protocol analyzer”)

These viewers are just the beginning.Channels also allow us to abstract the mech-anism used to connect to a profiling agent orrun-mode debugging, for instance. Whencoupled with the Xilinx JTAG UART, thisyields a powerful infrastructure for gettinginside your application.

Kernel AwarenessNucleus EDGE kernel awareness gives youthe ability to see a snapshot of the state ofyour system, as well as providing the ability toset thread-dependent breakpoints. We cur-rently provide out-of-the-box kernel aware-ness for the Nucleus PLUS kernel. However,Nucleus EDGE also gives you the ability toconfigure your own kernel awareness. Thiscan be done for a third-party RTOS, an in-house kernel, or no RTOS at all.

Nucleus EDGE provides a data-drivenmechanism to describe how it should iterateobjects of a given type and display theirattributes. They do not even have to be soft-ware objects – they could be anything thatis memory-mapped (Figure 6).

ConclusionConfigurable cores are the future of embed-ded development. With the combination ofauto configuration of Nucleus target softwareand advanced debugging with NucleusEDGE, Accelerated Technology has bridgedlong-standing gaps in integrated systemdesign. By supporting both PowerPC- andMicroBlaze-based FPGA systems, AcceleratedTechnology distinguishes itself from the com-petition and provides unparalleled softwaretools and support for FPGA system designers.

For more information, evaluations, andupdates to these exciting technologies, visitwww.acceleratedtechnology.com/xilinx.

Figure 6 - Kernel awareness

by Anders DellsonPresident and CEOMitrionics Inc.anders.dellson@mitrionics.com

An entirely new and exciting marketand technology segment is emergingwith the growth of FPGA-based high-performance computing (HPC). Todate, technical obstacles have madepractical success in this area more of achallenge than an opportunity. Thisyear, the FPGA-based HPC markettakes off, with system vendors (such asCray and Silicon Graphics) and FPGAboard suppliers (such as Nallatech)spearheading the hardware side of theequation. Now Mitrionics is providingits Mitrion Platform for fast and easysoftware programming of FPGAs.

FPGAs have the potential to begreat application accelerators by work-ing as co-processors, off-loading com-putationally intensive tasks fromconventional CPUs. The promise ofachieving orders of magnitude acceler-ation in processing has fueled a num-ber of projects and enterprise venturesover the last decade. Some wereproven very successful, while otherswere not. Until now, the critical com-ponent of converting applications torun on FPGAs has depended on theservices of highly skilled hardwaredesigners. This has put the benefits ofFPGA performance boosts out ofreach for the majority of softwaredevelopers, who do not have the abili-ty nor inclination to go into the exten-sive details of designing hardware.

At least that was the case until now.With the Mitrion Platform, softwaredevelopers have the ability to convertapplications to be accelerated on anFPGA without knowing the first thingabout the complexities of hardwaredesign. The Mitrion Platform allowsFPGA-based HPC applications to bewritten in days or weeks with just a fewhundred lines of code – versus takingmonths or years and tens of thousandsof lines of code to write with otherFPGA programming tools.

Programming FPGAs for High-Performance Computing Acceleration

You can make your applications run faster by using FPGAs as co-processors.

Putting FPGAs to WorkFPGAs are potentially small supercomput-ers. One of the main reasons FPGAs havenot been prevalent in supercomputing istheir lack of programmability. The usersand application developers for supercom-puters do not know hardwaredesign, and have no wish tolearn it. These users are oftenresearchers with their own com-plex field of work.

The Mitrion Platform givesthese users access to the per-formance of FPGAs, while stillallowing them to write theiralgorithms in software. It isapplicable to fields such as genesequencing, weather prediction,image analyses, industrialautomation, and geosciences.You will find FPGA-basedapplication acceleration highlyattractive in two main areas:from a raw processing perform-ance perspective and because ofthe significant reduction ofpower consumption comparedto clusters of regular micro-processors. Many problemswithin these areas have been dif-ficult to address with traditionalFPGA design methods, mainlybecause of their complexity.

The Mitrion Virtual ProcessorCompiling software into a hard-ware design is not a trivial task.The program code that makesup the software and the transis-tors, wires, and gates that makeup the hardware are very differ-ent things. We say that the bestsolution is simply not to do it.

Traditional processors solvethis problem by using the vonNeumann architecture, amachine (designed in hardware)that reads the program code insequence and executes theinstructions. The problem withthe von Neumann architecture isthat it really does not lend itselfto the high level of parallelism

is the key to bridging the software-to-hard-ware gap. Software developed for theMitrion Platform is compiled into instruc-tions for the Mitrion Virtual Processor,which is then adapted accordingly andinstantiated in programmable logic. It

delivers massive parallelism andhigh silicon utilization, and isaimed at the acceleration of cal-culation-intensive programs.

The Mitrion VirtualProcessor performs thousands ofoperations simultaneously byallocating computational unitsfor each instruction. The fine-grained nature of the processingelements permits every individ-ual operation of the program torun in parallel.

To assure sufficient memorybandwidth, the Mitrion proces-sor operates with simultaneousaccess to multiple shared exter-nal memories as well as allinternal memory banks in theFPGA. Using heavy pipelining,data is communicated directlybetween the thousands of pro-cessing units, and we reducethe number of memory access-es and I/O performance bottle-necks.

With the Mitrion VirtualProcessor, we are able to elimi-nate the direct translation ofsoftware to hardware.

The Mitrion-C ProgrammingLanguageTo exploit the parallel process-ing capabilities of the MitrionVirtual Processor, traditional,sequential programming lan-guages are not sufficient. Forthat reason we have developedMitrion-C, an implicitly paral-lel C-family programming lan-guage. Mitrion-C helps you (asa programmer) reveal and uti-lize the parallelism inherent inyour algorithm. It gives youaccess to implicit parallelism.This means that parallelism is

that is required to extract the performancebenefits that you can get from an FPGA.

Pontus Borg and Stefan Möhl, two“software guys” and the founders ofMitrionics, realized that the concept of anabstract machine that executes the software

{ a1 = _memwrite(a0, index, value); } a1; a3 = _wait(a2);} a3;

#define RANGE <0 .. 0xffff> //Defines how many values to read from each memory

(mem int:64[0x100000], mem int:64[0x100000], mem int:64[0x100000], mem int:64[0x100000]) main(mem int:64[0x100000] extA, mem int:64[0x100000] extB, mem int:64[0x100000] extC, mem int:64[0x100000] extD){ (a, extAr) = mem2collection(extA, RANGE); //Read values into vector a (b, extBr) = mem2collection(extB, RANGE); //Read values into vector b (c, extCr) = mem2collection(extC, RANGE); //Read values into vector c

d = foreach(e0, e1, e2 in a, b,c) e0 * e1 + e2; //Calculate each element in vector d

extDr = collection2mem(d, extD); //Write vector d back to memory} (extAr, extBr, extCr, extDr);

Mitrion-C 1.0;

// Options: -cpp

mem2collection(a0, range){ (values, a2) = foreach(e in range) { (value, a1) = _memread(a0, e); } (value, a1); a3 = _wait(a2);} (values, a3);

collection2mem(data, a0){ a2 = foreach(value in data by index)

Figure 1 – The Mitrion debugger and simulator running the program shown in Figure 2.

Figure 2 – A Mitrion-C program that multiplies two vectors and adds a third vector element by element.

found automatically, without you having tomanually specify which parts of the algo-rithm are to be executed simultaneously.With Mitrion-C, many hundreds of dissim-ilar parallel interacting operations are creat-ed without the risk of deadlocks or raceconditions. Mitrion-C is a C-family lan-guage, which is easy to learn and has a syn-tax familiar to C programmers.

Developing Software for the Mitrion Virtual ProcessorFor the great majority of applications, onlya small amount of code limits the perform-

ance. Typically, when you develop an appli-cation to be run on the Mitrion VirtualProcessor, you would identify the criticalcode and write that in Mitrion-C. Youwould then write the rest of the applicationin the language of your choice to executeon a host CPU, calling the critical code tobe run on the Mitrion processor.

The Mitrion Software Development Kitcomprises the Mitrion-C compiler, agraphical debugger, a code simulator, and aprocessor configurator. A C/C++ library isincluded that gives easy access to theMitrion processor from the host applica-

tion. Several integration modules interfacethe Mitrion Virtual Processor to a numberof system platforms.

The graphical debugger and code simu-lator give you a hierarchical overview of allthe parallel operations and data dependen-cies in the program. Figure 1 illustrates theresults from running the debugger andsimulator on the sample Mitrion-C codeshown in Figure 2; Figure 3 shows a non-trivial example. Through the debugger, youwill find it easy to locate programmingerrors and identify performance bottle-necks and inefficient code. This lets youefficiently refine your Mitrion-C programand then recompile.

The final output from the MitrionSoftware Development Kit is a hardwaredesign in VHDL for a Mitrion VirtualProcessor, programmed and optimized toexecute the software algorithm. You willthen use third-party tools to completesynthesis and place and route. Because thealgorithm you are developing can be test-ed and simulated inside the MitrionSoftware Development Kit, there is noneed for tedious and time-consuming iter-ations through the synthesis and place androute steps to confirm the functionality ofyour finished Mitrion Virtual Processor.

ConclusionThe Mitrion Platform allows rapid soft-ware development for FPGA-based systemsto be completed by regular programmerswho have no hardware design experience.The resulting applications are typicallyaccelerated 10x to 30x compared to execu-tion on a sequential CPU, with a fractionof the power consumption.

The development of FPGA technolo-gy is set to outpace the performance oftraditional CPUs. Solutions created onthe Mitrion Platform leverage thisthrough the platform’s portability. Whena new, more powerful FPGA is available,all you have to do is reconfigure theMitrion Virtual Processor to take advan-tage of the capabilities of the new device.

For more information, contactMitrionics Inc. at (310) 558-9495, visitwww.mitrionics.com, or e-mail info@mitrionics.com.

Figure 3 – A Bayesian Confidence Propagating Neural Network running in the Mitrion simulator and debugger.

by Gokul KrishnanSr. Marketing Manager, Market Specific ProductsXilinx, Inc.gokul.krishnan@xilinx.com

Xilinx® EasyPath™ FPGAs are theindustry’s only customer-specific solu-tion that gets you to volume productionwith very little risk and in as few as eightweeks. EasyPath FPGAs use the same sil-icon as standard FPGAs. The key differ-ence between the two is that while theformer are tested for specific customerdesigns, the latter are tested for all possi-ble customer designs. By testing to a spe-cific design, EasyPath FPGAs provide asmuch as an 80% reduction in unit price(due to improved yields) compared tothe equivalent standard FPGAs.

One of the major advantages ofEasyPath devices is that because they areidentical (in all aspects except testing) tostandard FPGAs, all features supportedin standard FPGAs are in turn support-ed in the analogous EasyPath FPGA.From a customer perspective, this meansthat very little engineering resources arerequired to interface with Xilinx. Themigration process itself is essentiallyrisk-free. Once the design files are hand-ed off, Xilinx creates custom test pat-terns based on the design to get to aguaranteed 99.9% stuck-at fault cover-age. With EasyPath FPGAs, you get thesame extensive Xilinx IP portfolio with-out any additional licensing fees for theEasyPath conversion.

Flexibility with EasyPath FPGAsFlexibility with EasyPath FPGAs

You can now seamlessly convert to production and still retain some unique flexibility features.

Just as with structured ASICs (or stan-dard cell ASICs), you would typically movefrom a standard FPGA to an EasyPathFPGA when your design has been frozenand the volumes justify a cost-reductionpath. However, unlike structured ASICsolutions, EasyPath FPGAs provide youwith some unique flexibility features,including:

* Dual bitstream and in-system engi-neering change order (ECO) capability

* Flexibility in production and lifecyclemanagement

* Ability to use all standard FPGA fea-tures without any constraints

Dual Bitstream and In-System ECOsXilinx Virtex™-4 and Spartan™-3EasyPath FPGAs allow you to retain someof the design flexibility of standard FPGAseven after the devices are in production.Specifically, the dual bitstream optionallows you to target two different designs ina single EasyPath device – so long as theirpinouts remain the same. This allows you tocombine, for example, two different modesof operation in the same socket. One designcould provide a diagnostic check that isactive only at or soon after power up;another design could be active during thenormal functional mode of the device.

A second way you can use the dual bit-stream option is to try and address two dif-ferent industry standards at the same time.Product specifications and market viabilityin many wired and wireless segments, forexample, depend heavily on evolving stan-dards. Given long product developmentcycles and the importance of being first tomarket, you can now go to market withpotentially two different versions of a prod-uct to hedge your market position. Whenyou exercise the dual bitstream option,Xilinx ensures that all of the resources cor-responding to both designs are tested.

Another important flexibility feature inVirtex-4 and Spartan-3 families is the abil-

rerun synthesis and implementation. Tolearn more about making these changesusing FPGA Editor, see XAPP803,“Leveraging ‘In-System ECO’ Capabilityof Spartan-3 and Virtex-4 EasyPathFPGAs” at www.xilinx.com/bvdocs/appnotes/xapp803.pdf.

Figure 1 shows a screenshot of theFPGA Editor user interface that you would

use to drill down into the functional blocksyou want to change. The change processitself requires no intervention from Xilinx.You can make changes on your own, gener-ate a new bitstream, download it into theEasyPath devices already in production,and implement a bug fix in the field in avery short time.

Production FlexibilityAnother big advantage of EasyPath FPGAsis that because they are very similar to stan-dard FPGAs, they can be used inter-

ity to make in-system ECOs. Specifically,the ECO feature allows you to make mod-ifications to the combinatorial logic in anEasyPath FPGA (contained in look-uptables [LUTs]) and I/O block (IOB)parameters (such as drive strength and slewrate) even after volume shipments havebegun and devices are deployed. Xilinxensures that 100% of the LUTs used in the

design and all combinations of drivestrengths and slew rates for IOBs are com-pletely tested to allow for any potentialchanges later.

This feature allows you to make simplebug fixes (within a LUT) such as adding orremoving an AND gate or tweaking thedrive strength of an I/O based on systemrequirements with minimal disruption toongoing production. You can open theconfiguration equation of a LUT or the rel-evant IOB within the FPGA Editor tooland make modifications without having to

Figure 1 – Main window in FPGA Editor used to implement in-system ECOs in Virtex-4 and Spartan-3 EasyPath FPGAs(Courtesy: Elizabeth Janney, Application Engineer, Xilinx, Inc.)

Xilinx Virtex-4 and Spartan-3 EasyPath FPGAs allow you to retain some of the design flexibility of standard FPGAs even after the devices are in production.

changeably with a standard FPGA on agiven board. (All EasyPath devices areoffered in the full range of packages asthe corresponding standard FPGAs.)The advantage of this interchangeabilityis that should a design modification berequired that cannot be accommodatedby the ECO feature, you can quicklymake modifications in a standard FPGAand continue shipping in productionwith standard FPGAs for the eight weeksor so that it takes to do the EasyPathmigration of the modified design.

Because of the identical nature ofEasyPath FPGAs and standard FPGAs,no additional prototyping phase or re-qualification is required. Instead, you godirectly from design freeze to full pro-duction in as little as eight weeks. Thisquick turnaround allows you to postponeyour design freeze milestone in the prod-uct development cycle and adapt to anylast-minute changes in market condi-tions, demand, or design specifications.In today’s age of just-in-time supplychain management, EasyPath devicesenable you to get to market quickly with-out the constraints of large inventories.

No Constraints on FPGA DesignsWith alternative cost-reduction solutionssuch as structured ASICs (or standard cellASICs), you typically have to plan aheadto take advantage of the lowered cost.Some of the migration issues you mayface when converting from FPGAs tostructured ASICs are well documented(see the November 2004 article in theFPGA Journal, “Customer-specificFPGAs: Low-cost solution for volumeproduction”). To get around some ofthese issues, some vendors impose signifi-cant constraints on the up-front FPGAdesign, thus reducing the flexibility thatFPGAs are designed to provide.

EasyPath FPGAs, on the contrary, donot impose any design constraints. Youcan take full advantage of the flexibilityand embedded features (such as multi-pliers, PowerPC™, Ethernet MACs, andhigh-speed transceivers) in standardFPGAs to cost-reduce if the design/mar-ket conditions are appropriate.

ConclusionWith the introduction of Spartan-3 andVirtex-4 EasyPath FPGAs, you can nowprototype with standard FPGAs andthen move to the corresponding lowercost EasyPath FPGA in a seamless fash-ion. The unique in-system ECO anddual bitstream capabilities in theseEasyPath FPGAs allow you to makechanges in your design and target differ-ent functional modes even after you havemoved into high-volume production.

The quick time to market and inter-changeability of standard FPGAs withEasyPath FPGAs allows you to imple-ment bug fixes and efficiently manageinventory without loss of revenue or dis-ruption of production. In addition,EasyPath FPGAs do not impose con-straints on the FPGA design, thus leav-ing you with the flexibility to choose ifand when you want to freeze your designand cost-reduce.

The EasyPath “Migration-Free Advantage”

Since their introduction in March2002, EasyPath FPGAs, which offer aninnovative approach to high-volumecost reduction, have received broadmarket acceptance. This has been driv-en by the higher level of flexibility andease of migration offered by EasyPathFPGAs when compared to traditionalASIC solutions.

In the past year, EasyPath FPGAusage has grown by more than 600% byenabling customers in applicationsfrom communications equipment tostorage solutions to achieve a total costof ownership that is lower than anyASIC. With the introduction of Virtex-4 EasyPath and Spartan-3 EasyPathFPGAs, customers will continue tobenefit from the EasyPath “migration-free” advantage.

by Timothy Campbell FPGA Programmer University of Vermont, Burlington, VT tcampbel@uvm.edu

Tian XiaUVM Assistant Professor of EEUniversity of Vermont, Burlington, VTxiat@cems.uvm.edu

DSP algorithmic realization in FPGAfabric can be easily controlled anddebugged through high-speed Ethernetcommunication. This interoperable solu-tion allows for portability and usage inother designs. Because of the widespreaduse of the standard, the FPGA is capableof transmitting data at high speeds to avast array of external devices.

In the example we’ll present in thisarticle, a broadcast HTML page is used asa debugging interface to the FPGA. Thisbroadcast page can be easily modifiedthrough software to tailor to your debug-ging needs. The design is built on a sam-ple design packaged with the Xilinx®

ML403 Virtex™-4 evaluation board, inwhich the HTML code is stored locally tothe FPGA. Connecting to the IP addressof the Ethernet MAC (media access con-troller) through a Web browser loads theHTML file.

Ethernet BasicsThe first Ethernet standard was producedin 1985. Ethernet fits into the OpenStandards Interface model of theInternational Standards Organization, asillustrated in Figure 1. Several LAN tech-nologies are in use today, but Ethernet isby far the most popular technology fordepartmental networks. The vast majorityof computer vendors provide equipmentwith Ethernet attachments, making it pos-sible to link all manner of computers withan Ethernet LAN.

Transmitted data is encapsulated in aso-called Ethernet frame, which hasdefined fields for data and other informa-tion such that the data gets to its destina-

tion and the destination computer is ableto discern whether the data it receives isvalid. The frame format defines Ethernetand is illustrated in Figure 2. Frame sizesvary from 64 to 1518 bytes, and can be upto 1522 bytes when VLAN (virtual bridgedlocal area network) is tagged.

As noted in Figure 2, the source anddestination address of the transmitted datais encapsulated within the frame. The otherfields consist of:

• The preamble, which is a repeatingpattern of 1010 needed for some PHYs

• Start-of-frame delimiter (SFD), whichmarks the byte boundary for the MAC

• Type/length of frame

• Data

• Pad, which is only necessaryto extend the frame to 64bytes

• Checksum, which imple-ments a cyclic-redundancycheck (CRC) to determine ifthe frame is sent in error

• Idle, which occurs betweenframes and must be at least96 bit times

Xilinx Embedded Ethernet MACs Negotiate the DataXilinx Embedded Ethernet MACs Negotiate the Data

You can use the Xilinx Embedded Ethernet MAC as an interoperable standard for data communication between the FPGA and external devices. You can use the Xilinx Embedded Ethernet MAC as an interoperable standard for data communication between the FPGA and external devices.

Application

Transport(TCP/UDP) IETF

IEEE802 Scope

TIA/EIA/Others(Media)

Data Link(LLC/MAC)

Physical(PHYs/Media)

Network(IP)

ISO OSI StackReference Model

Presentation

ession

Transport

Network

Data Link

Physical

Figure 1 – Ethernet’s fit in the Open Standards Interface model of the International Standards Organization

Ethernet frame transmission is con-trolled by the MAC layer. The MAC han-dles data encapsulation from the upperlayers, frame transmission and reception,data decapsulation, and delivery to upperlayers. The MAC operates independentlyof the physical layer employed and thusdoes not need to know the speed of thephysical layer (as shown in Figure 3).

The Virtex-4 series offers two embed-ded Ethernet MACs. In this way, C-codesoftware implementation through thePowerPC™ processor can be used to tailorthe MAC to the particular application. Youare thus encapsulated from the lower layersof the MAC.

The Xilinx Ethernet Register Interface The Ethernet capability of a Xilinx FPGAallows for possibilities such as broadcastingcaptured data from the FPGA to anHTML-based website. In this manner, datacan be sent to and retrieved from theFPGA through user I/O to the HTMLGUI. The HTML page we used in build-ing our project is shown in Figure 4.

A register interface is critical in thedebugging of a design. It allows you to trig-ger events, obtain the status of intermediateresults, and dump the values stored in

read/write procedure. To read, we select anSRAM block by writing the block numberin Register 1, and write the SRAM addressto Registers 2 and 3. After the completionof register writing for Register 6, theSRAM contents corresponding to theSRAM address written in the SRAM blockselected are available for read out throughRegisters 4, 5, and 6.

To write a register, we select an SRAMblock by writing the block number inRegister 1, and write the SRAM address toRegisters 2 and 3. Then we fill the SRAMcontents we want to write into Registers 4,5, and 6. After the completion of registerwriting for Register 6, SRAM contents willchanged to new value.

The schematic shown in Figure 5 showsa single 8-bit register and the signals need-ed to control it. A select bus allows formultiple registers based on the size of theselect bus. The “write_d0” signal allowsthe register to be written to. When readinga register, its contents appear on theREGOUT(7:0) output, and this output isfed back to the HTML page during a reg-ister read. The alternative register output isused to act as control signals, sample inputto block RAMs, or whatever the situationmay call for. We’ll provide a detaileddescription of the transmission of datafrom the register to the HTML debuggingpage in the next section.

block RAM. The latter is especially usefulfor processing and verifying a partial pieceof an algorithm. As an example, an FPGAwith an ADC feeding into it can serve as asampler through a register dump of storedADC sampled data in block RAM.

The contents of block RAM data canbe accessed through a simple register

Preamble

Bytes 7 1 6 6 2 0-1500 0-46 4

Data Pad ChecksumDestinationAddress

Start of Frame Delimiter (SFD) Type/Length

SourceAddress

MAC Client(Upper Layers)

PhysicalLayer

write_d0

ethernet_reg_interface

write_d0

sel(23:0)

sel(21)

AND2OUT_EN

REG_EN

BUSIN(7:0)

BUSOUT(7:0)

REGOUT(7:0)

control_signals(7:0)

reg_out(7:0)

reg_in(7:0)

reg_in(7:0)sys_clk

system_reset

reg_out(7:0)

sys_clk

system_reset

q_strobe

write_d0

sel(23:0)

reg_in(7:0)

q_strobe

Figure 2 – Ethernet data frame

Figure 3 – Ethernet MAC layer

Figure 4 – Example HTML page connectionthrough communication to the Xilinx

embedded MAC IP address

Figure 5 – Ethernet register interface schematic

Implementation of the Ethernet Register InterfaceThe Xilinx EMAC interface we used sup-ports the IEEE Std. 802.3 media independ-ent interface (MII) to industry-standardphysical layer (PHY) devices and communi-cates to the PowerPC proces-sor through an IBM on-chipperipheral bus (OPB) inter-face. The EMAC comprisestwo IP blocks as shown inFigure 6. The IP interface(IPIF) block is a subset of theOPB bus interface featureschosen from the full set ofIPIF features.

The proposed EDK designto implement the register inter-face through Ethernet uses ageneral-purpose input/output(GPIO) core for the processorlocal bus (PLB) bus. TheGPIO is a 32-bit peripheralthat attaches to the PLB. Weused this GPIO capability tocommunicate between thePowerPC C-software imple-mentation and the registerinterface linked to the DSPalgorithmic implementationusing dedicated logic. Figure 7illustrates the block diagram forthe Ethernet register interface.

The Xilinx EDKPowerPC design was easilyintegrated into ISE™ designtools to access DSP algorith-mic implementation (thenon-software portion of thedesign implemented in dedi-cated logic). This wasachieved by exporting thedesign to ISE software andencapsulating the logic tocommunicate to the PowerPC inside aschematic block, as shown in Figure 6.The GPIO bus was used to feed inputand output to the processor, as differentbus lines served different purposes. At ahigher level, this schematic block waspackaged with a state-machine interfaceto achieve the register communication tothe FPGA, as illustrated in Figure 5.

Xilinx ML403 Board: Test Platform for the InterfaceAt first, the task of employing Ethernet as ameans of communication between theFPGA and the “outside world” seems prettydaunting. This is not the case, as Xilinx pro-

vides a reference design from which to buildyour own custom design. Xilinx also offers atutorial for those not familiar with imple-menting processor designs using EDK.

The ML403 Embedded ProcessorReference System contains a combinationof known working hardware and softwareelements. The reference system demon-strates a system utilizing the PLB, OPB,

device control register (DCR) bus, andthe PowerPC 405 or MicroBlaze™processor core. The design operates underthe EDK suite of tools, which yields agraphical tool framework for designingembedded hardware and software. Each

of the pieces of the system canbe separately activated as astand-alone project.

The proposed Ethernet regis-ter interface was built from theWebserver project, which comesas part of the ML403 referencesystem. The Webserver projectimplements an Ethernet MACthrough an IBM OPB, built-infor interfacing with the PowerPC405. This core supports10BASE-T- and 100BASE-TX/FX-compliant PHYs in full-or half-duplex mode.

The reference design servesas a good starting point to getyou up and running with adesign. The groundwork is laidout for you, allowing for addi-tional modifications andenhancements.

ConclusionWhat better way to transmitdata from the FPGA to an aux-iliary device, be it a web pageor server, than Ethernet?Regarded as one of the mostinteroperable communicationstandards, Ethernet allows youto achieve high-speed datatransmission with a widelyemployed standard.

Virtex-4 devices offer twoembedded Ethernet MACcores, and when combinedwith the design environment

ease of EDK, an application for datatransmission can be achieved with relativeease through C-code software implemen-tation. The capabilities of such a systemallow for FPGA algorithmic debuggingand trigger, as well as SRAM fill and readat speeds as high as 1 Gbps, providingcontrol and debugging of dedicated logicDSP algorithmic realizations.

ReadPacketFIFO

(2K, 4K,8K, 16K,

32K bytes)

MasterAttachment

SlaveAttachment

Register

OPBBus

WritePacketFIFO

(2K, 4K,8K, 16K,

32K bytes)

EthernetIP

ReceiveData

IPIF Local Bus

Ethernet IPIF (simplified)

TransmitData

EthernetMAC

RegisterInterface

.HTMLFile

StoredLocally to

PowerPC

DSP ApplicationDedicated Logic

End UserPC/Server

Figure 6 – IPIF and EMAC modules

Figure 7 – Block diagram of proposed application

www.xilinx.com/epq With Embedded Processing QuickStart!, Xilinx offers on-site training

and support right from the start. Our solution includes a dedicated expert

application engineer for one week, to train your hardware team on creating

embedded systems, instruct software engineers how to best use supporting

FPGA features, and help your team finish on time and on budget.

The Quicker Solution to Embedded Design

QuickStart! features include configuration of the Xilinx ISE™ design envi-

ronment, an embedded processing and EDK (Embedded Development Kit)

training class, and design architecture and implementation consultations

with Xilinx experts. Get your team started today!

Contact your Xilinx representative or go to www.xilinx.com/epq for

more information.

©2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc.All other trademarks are the property of their respective owners.

Xilinx Productivity Advantage: Embedded Processing QuickStart!

Getting You Started With

On-Site Embedded Training

by Rick FoleaCTORicreations, Inc.rfolea@UniversalScan.com

The good news: starting with the Xilinx®

Spartan™-3E device, you can now useindustry-standard SPI Serial Flash to config-ure your Xilinx FPGAs. Even better news:this will be the trend for all new Xilinxdevices going forward, meaning that you willhave multiple vendors from which to chooseinexpensive and readily available memories.

The bad news: Xilinx won’t be support-ing these devices with programming tools –you are left on your own to program them.In this article, I’ll take a look at SPI Flash,explain your options, and help you get upand running quickly.

SPI Flash – What’s the Big Deal?Until now, memories used to configureXilinx devices had to be specific Xilinx-supported memories. This meant poten-tially higher prices per byte of storage andbeing subject to limited deliveries ofdevices through limited sources.

By making the switch to SPI Flash, younow have many vendors from which tochoose, a wider variety of memory densitiesand types, and most importantly, lower cost

and better availability. Table 1 shows a list ofSPI Flash vendors and devices compatiblewith Xilinx FPGAs supporting SPI Flash.

SPI Flash VarietiesSPI Flash can be lumped into three gener-al categories: SPI Serial Flash for code stor-age, SPI Serial Flash for data storage, andAtmel’s DataFlash.

DataFlash devices are Atmel’s olderSerial Flash AT45DB parts. They aremature, plentiful, and have densities up to128 MB. They are slightly more difficultto use because the programming and erasealgorithms don’t conform to the SPI Flashalgorithms, and there aren’t many vendorsfrom which to choose. So unless youabsolutely have to have 128 MB of storage,you will probably want to look at the morerecent SPI Flash memories.

Figure 1 shows two types of SPI mem-ories: code and data. These are really thesame thing, except that data memorieshave a few extra instructions that allow youto do things like write to memory withouterasing first; this comes in handy whenyou are trying to quickly read/write datafrom your application. Unless you have areal need for the data memories’ specialfeatures, you will probably want to stickwith using the programming features of

the code memories for now – just becausethere are a lot more vendors and devicesfrom which to choose. You can still use thedata memories, as they are a superset of thecode memories, but don’t rely on the datamemories’ special command sets if youwant to keep your options open.

SPI Flash Memory CaveatsSPI Flash memories from the various ven-dors are very similar, but there are a fewthings to watch out for. Most of the newerdevices support a “ReadID” instruction sothat you can identify the device by pollingit. Make sure that you read the fine print inthe datasheet. For example, although theST datasheet for some of the M25P seriesdevices have the ReadID instruction listed,you have to read the fine print, which saysthat only devices with an “X” at the end ofthe part number support this feature. I wasunable to find these devices anywhere.

Although all vendors’ devices supportread/erase/program instructions, they alluse different command sets. The com-mands do the same things, but use differ-ent bits for the commands (a page programfor an ST device is a 0xD8, but a page pro-gram for an ATMEL device is 0x52, forexample). See the individual instructionsbelow for more caveats.

Program SPI Serial Flash from Xilinx FPGAs and CPLDs

It’s easy to do it yourself by using tools available today.

Programming SPI Flash MemoryAs long as you stick with the basic instruc-tions and keep an eye out for the caveatsI’ve discussed, it is really pretty simple towrite a small, dedicated SPI flash program-mer yourself. Note that all data and com-mands are shifted in msb first.

To erase the device, simply:

• Lower chip enable

• Serially shift in the 8-bit erase command

• Raise chip enable

You can then check the status register tosee when the erase cycle is done or wait themaximum expected erase time (one to sev-eral hundred seconds, depending on man-ufacturer and device density).

To read data from the memory array:

• Serially shift in the 8-bit ReadDatacommand

• Serially shift in the address (24 bits onmost devices, msb first)

• Continue issuing clocks, capturingdata from the memory serial output

• Raise chip enable when you have shift-ed out as many bytes as you want

You can continue issuing clocks right upto the end of the device, at which point the

Tools Available TodayCreating your own programmer works wellon a microprocessor where you have an SPIport and room in your memory for a fewextra functions. But what if you want toprogram a memory connected to a smallCPLD or FPGA? You may be surprised tofind that there are not a lot of low-cost,simple, general-purpose programmersavailable. The options we have today fornon-microprocessor applications fall intothe categories I’ll discuss next.

Dedicated FPGA Configuration Because the command set is small and sim-ple, some designers have resorted to creat-ing a dedicated FPGA configuration thatcan be used to program the SPI Flash.

The designers at Memec (recently pur-chased by Avnet) have taken this one stepfurther – their configuration uses theMicroBlaze™ soft-core processor and on-chip peripheral bus (OPB) SPI core. Giventhe processor in the FPGA, they simplywrote an application to read and write theSPI Memory. They ran a great demo of thisat the XFEST seminars last Spring thatshowed how the SPI Flash could be used to:

• Configure the FPGA (Spartan-3Edevice)

• Boot the MicroBlaze processor

• Be used for non-volatile data storage

Call your local Memec/Avnet FAE for ademo or more information.

These methods provide very fast SPIFlash access, but require a device withenough space to accommodate the config-uration. You could not use it on a CPLD orother non-configurable devices.

X-SPI UtilityThere is a little-known utility on the Xilinxwebsite called “X_SPI.” If you put an extraheader on your board and can 3-state alldevices connected to the SPI bus, then youcan use this command-line utility to direct-ly program your SPI Flash with a XilinxParallel-III or Parallel-IV cable. This is doc-umented in XAPP800, “ConfiguringXilinx FPGAs with SPI Flash MemoriesUsing CoolRunner-II CPLDs,” along with

address will wrap and start back at the begin-ning (on most devices).

To write data to the memory array:

• Lower the chip enable

• Serially shift in the WriteEnable command

• Serially shift in the 8-bit PageProgramcommand

• Serially shift in the address (24 bits onmost devices, msb first)

• Shift 8-bit data in, msb first (you can clockup to 256 bytes at a time on most devices)

• Monitor status register or just wait themaximum time (a few msec, typically) tosee when write is complete

• Repeat for next page (typically 256 bytes)

Note that if you go past a page boundary,the data will wrap and overwrite the data atthe bottom of the page – it will not continueon to the next page.

That’s all you really need to get started. Asyou can see, it is really pretty simple to pro-gram an SPI Flash memory, especially if youare creating a small programmer for a dedi-cated application and you stick with the basiccommand set.

Vendor Series Type 1M 2M 4M 8M 16M 32M 64M 128M

ST M25P CodeST M2/45PE DataATMEL AT45DB DataATMEL AT25F CodeNexFlash NX25P CodeSST SST25VF CodeSST SST25LF CodeAMD/Fujitsu S25FL CodePMC Pm25LV CodeSAIFUN SA25F CodeSAIFUN SA25C CodeSanyo LE25FW CodeMXIC MX25L Code

Figure 1 – SPI Flash is available from many vendors in a wide variety of memory densities (green =available, yellow = planned, gray = not available, and boldface series have been tested in Xilinx Labs).

a method to use a CoolRunner™ CPLDto transform the SPI signals into the signalsrequired by Xilinx FPGAs (pre-Spartan-3Edevices). The downside to this approach isthat you have to add an extra header toyour board, be able to 3-state all devices onthe SPI bus, and hope that the utility sup-ports the SPI Flash device you want to use.

A Better Way – JTAGXAPP800 also mentions JTAG, but at thetime there were no simple, inexpensive gen-eral-purpose tools available to program SPIFlash through JTAG. Now, the latest releaseof Universal Scan supports programming ofSPI Flash. We offer a free trial and the SPIprogramming feature has been added to thishardware debugging and memory program-ming tool without increasing the price.

Using the Boundary Scan chain in yourXilinx device to program SPI Flash is a snap

– you just specify which pins are connectedto the SPI device, choose a data file, and hit“program.” You don’t need any test vectors,test executives, CAD data, or anything elsenormally associated with Boundary Scantest and programming operations. Thepins on the Xilinx device are placed intoEXTEST; vectors are automatically for-matted and shifted in to drive the SPIFlash memory pins, and the next thing youknow the memory is programmed. Itincludes the usual erase, program, and ver-ify functions along with a memory viewer,utilities, sector erase capability, and auto-matic device ID interrogator.

The beauty of this method is that you canuse it on any FPGA, CPLD, microprocessor,

Ethernet switch, or DSP – any device thathas a JTAG port. The device does not needto be configured or programmed (or evennecessarily working entirely correctly) forthis to work. Plus, it requires no additionalprecious device resources in your FPGA orCPLD, no special code or configurations,and does not require any special headers orother support devices on your circuit card.

There is a downside to this – it can bemuch slower than the other methodsbecause you have to shift a full-scan vectorinto the JTAG chain for each and everyclock edge, data setup, and chip enable. Asan example, suppose you have an FPGAwith a scan chain of 2,000 bits. It takes onescan to setup the serial data into the SPIFlash, another to raise the clock, and yetanother to lower the clock. We need eightsets to shift a single byte into the SPI Flashmemory, and that’s 48,000 clocks just to

get 8 bits into the Flash device. Multiplythat by the number of bytes you have toprogram and you can see why it takes awhile, especially if you use a parallel portthat is fundamentally limited to a few hun-dred kilohertz TCK rate.

If you need to speed up the program-ming time, take a look at the newerUSB2.0 versions of these products; theyincrease the TCK rate by an order of mag-nitude. You can also make sure the SPIdevice is connected to a device with a shortscan chain and put all the other devicesinto bypass mode to help speed things up.

There are also traditional JTAG toolsavailable that will program SPI Flash, and ifyou are lucky enough to have access, defi-

nitely take advantage of them. These tools arevery powerful and can program memoriesrapidly. The only drawback is that they canrequire more setup than the method outlinedpreviously and also tend to be more expen-sive. But if you need speed, the traditionalBoundary Scan tools are a great option. (Seethe sidebar for a list of JTAG tool vendors.)

Other MethodsMost SPI Flash vendors have some kind ofutility for programming their devices, butthey are usually limited to their devices,require direct connection to the device, orare a cost adder to another set of utilities.

ConclusionFigure 2 summarizes all of these options I’vedescribed. If you need speed – design yourown. If you can afford some board modifi-cations and an extra header, consider using

the X-SPI utility, which isfree. Otherwise, take a lookat the new JTAG tools avail-able for your SPI Flash pro-gramming needs; they areeasy to use and inexpensive.

The proliferation of SPIFlash devices, memorydensities, and ease of usemake SPI Flash an idealconfiguration or data stor-age solution for your nextdesign. And because Xilinxis leading the way forfuture devices, there is no

time like the present to get started.For more information about Universal

Scan, visit www.universalscan.com.

MethodUse on Use on Use on Any General Extra Multiple ProgFPGAS CPLDs JTAG Device Purpose HW Reqd. Vendors Time

Special Configuration

Some No No No No Yes Fast

X-SPI N.A. N.A. N.A. Yes Yes Some Medium

JTAG Yes Yes Yes Yes No YesSlow or Medium

Miscel No No N.A. Yes Yes No Medium

Figure 2 – Options for quick, easy, and inexpensive general-purpose in-circuit SPI Flash programming

JTAG Tool Vendors

JTAG Technologies www.JTAG.com

Acculogic www.acculogic.com

Corelis www.corelis.com

Goepel www.goepel.com

Assett-Intertech www.assett-intertech.com

Intellitech www.intellitech.com

Flynn www.flynn.com

Universal Scan www.universalscan.com

by Matt GordonSoftware EngineerMicriummatt.gordon@micrium.com

The Xilinx® Embedded Development Kit(EDK) enables you to create powerfulembedded processor systems, as well as thecomplex applications running on them.These applications may need to performnumerous tasks and access a wide range ofperipherals, such as memory controllers,display interfaces, and network interfacecards. It is difficult, expensive, and ineffi-cient to design such large applications fromscratch, especially when existing softwaremodules are capable of meeting some or allof your system’s goals.

Integrating these modules with yourapplication, however, can sometimes proveto be more difficult than writing equivalentmodules yourself. Luckily, Xilinx PlatformStudio (XPS) provides a standard interface,known as the microprocessor library defini-tion (MLD) interface, to facilitate addingsuch modules to your projects. Operatingsystems and other software modules adher-ing to this interface can conveniently beconfigured for your application from with-in XPS. Micrium has adapted many of itssoftware modules to the MLD interface,allowing you to quickly and easily addhigh-quality, well-documented modules toyour embedded applications.

Add Valuable Software Modules with XPSAdd Valuable Software Modules with XPS

XPS, with its MLD interface, simplifies the process of configuring and compiling software modules.XPS, with its MLD interface, simplifies the process of configuring and compiling software modules.

The MLD FormatEach Micrium software module that can beconfigured using XPS has an MLD filedefining that module’s configurable parame-ters. Generally, configurable parameters cor-respond to features and services that willcomprise the compiled module. Theseparameters can also be used to identify thehardware components that the module willutilize. For example, the MLD file for agraphical user interface (GUI) modulemight define parameters stipulating the col-ors and resolutions supported by the com-piled module, as well as a parameterindicating the memory-mapped location ofthe display controller used by the module.You would be able to configure these param-eters from the Software Platform Settingsdialog box in XPS, tailoring modules asneeded to create a suitable software platformfor nearly any application.

Once you have configured a softwareplatform, you can build it with LibraryGenerator (LibGen), a utility that is auto-matically invoked when a project is com-piled in XPS. LibGen is responsible forbuilding software modules according to theparameter values set forth in XPS, and ituses Tool Command Language (Tcl) filesto meet this objective. Tcl files, which areprovided with each Micrium softwaremodule that complies with the MLD inter-face, contain functions that are able toaccess the parameter values you specify.These functions are called by LibGen, andthey normally generate source and headerfiles that allow the module to be built asper the requirements of the application.

The specific role that Tcl and MLD filesplay in the process of configuring andbuilding software modules can be betterunderstood by examining one of Micrium’ssoftware modules, µC/OS-II, The Real-Time Kernel. µC/OS-II is a popular real-time operating system (RTOS) that hasbeen proven effective in hundreds of prod-ucts. The module’s clean and consistentsource code, which is certifiable for use insystems deemed safety-critical by the FAAand FDA, is described in “MicroC/OS-II,The Real-Time Kernel,” by Jean Labrosse.This book describes µC/OS-II’s efficientimplementation of semaphores, message

II’s Tcl file has access to the values specifiedfor each of these configurable parameters,it can generate os_cfg.h whenever LibGenis run. This automatic generation of theconfiguration file gives you a custom ver-sion of µC/OS-II each time you build yourproject, making the operating systeminstantly amendable to the changing needsof an application.

An Example ApplicationAlthough the benefits of the MLD inter-face are apparent even when considering asingle module like µC/OS-II, they are

more pronounced in a large,multi-faceted application such asa Web server. A Web server hasmany responsibilities, and from adesign perspective, it is conven-ient to view each of these respon-sibilities as a separate component,as depicted in Figure 2. Once anapplication has been logicallydivided in this manner, you canchoose software modules toimplement each component.

A Web server might, for exam-ple, use modules to handle thevarious protocols, such as HTTPand TCP, involved in serving Webpages. It might also take advan-tage of a module capable of inter-

queues, timers, and other standard operat-ing system services.

As Labrosse’s book explains, µC/OS-II’sservices are normally configured by manip-ulating constants in a header file, os_cfg.h.The MLD interface eliminates the need tomanually edit this file, instead letting youconfigure the operating system from XPS’sSoftware Platform Settings dialog box, asshown in Figure 1. The contents of thisdialog box are determined by µC/OS-II’sMLD file, which defines configurableparameters representing the constants nor-mally found in os_cfg.h. Because µC/OS-

Web Pages

Web Server

Hardware

Real-TimeOperating System

µC/OS-II

HTTP ServerµC/HTTP

TCP/IP Protocol StackµC/TCP-IP

File SystemµC/FS

CPU Network Storage Device

Figure 2 – You can efficiently implement each of the components of a Web server with software modules from Micrium.

Figure 1 – You can configure µC/OS-II in XPS.

facing with a storage device, allowing Webpages to be saved as files. The Web servercould rely on an RTOS to coordinate thetasks performed by each of these modules.

The use of these types of softwaremodules as “building blocks” for con-structing large applications, such as Webservers, significantly reduces the time andeffort needed to develop such applica-tions, provided that you can easily incor-porate the modules in your design. Manysoftware modules, though, are unwieldy,and they introduce unnecessary complex-ities into projects that use them. Ourexample Web server could encompasshundreds of source files, requiring theinclusion of a comparable amount ofheader files. Adding these files to anapplication in XPS would result in a cum-bersome project, teeming with files thatshouldn’t be of concern to applicationdevelopers. To avoid such a confusingmuddle of files, you could create anobject code library, but this would forceyou to learn the compilation proceduresfor several software modules, and theresultant library would have to be rebuiltfor different configurations and versionsof the modules.

The MLD interface obscures thesecompilation details, resulting in organ-ized and efficient applications running onhighly configurable software platforms. AWeb server designed on such a platformwould be outwardly simple, with themultitude of files composing the server’smodules invisible to designers. The Webserver would appear to consist only of thefiles directly implicated in the applica-tion’s primary task of serving Web pages,so it could be debugged quickly, even bydevelopers unfamiliar with the projectand its modules. Updating or revising theWeb server would be similarly painless,because well-documented and depend-able software modules, such as those pro-

vided by Micrium, could be added to theapplication without needing to under-stand the mechanics of configuring andcompiling the modules.

The Micrium Software ModulesMicrium’s software modules are readilyadaptable to most applications. Theprocess of adding such high-quality soft-ware to a project is expedited in XPS, asMicrium offers an assortment of modulesthat comply with the MLD interface. Youcan add any of these modules to a projectby using the Software Platform Settingsdialog box, as shown in Figure 3. This

dialog box gives you the ability to rapidlydeploy practical software platforms tosupport complex applications, includingWeb servers. In fact, you can construct aWeb server entirely of software fromMicrium simply by specifying the appro-priate modules in XPS.

The centerpiece of such a Web serverwould be µC/TCP-IP, Micrium’s TCP/IPprotocol stack. µC/TCP-IP was meticu-lously developed using stringent codingstandards. The module’s completely orig-inal design, which was not derived fromany existing protocol stacks, supports anextensive array of network options thatare presented to XPS users, along with aselection of useful add-on modules. Youcan use these modules in conjunctionwith µC/TCP-IP to accommodate theneeds of a variety of applications. At leastone such module, µC/HTTPs (whichoffers a consistent means of serving Web

pages), would be an integral part of aWeb server based on Micrium’s modules.

Another key component of a MicriumWeb server would be µC/FS, Micrium’sembedded file system. µC/FS is a com-pact and versatile module, with an APIsimilar to that available from the stan-dard C library. The files accessed throughthis API can reside on storage devicesimplementing any of the numerous for-mats that µC/FS recognizes, includingCompactFlash, Secure Digital (SD), andSmartMedia. The files themselves canalso be of varying formats, becauseµC/FS is compatible with both the

Microsoft FAT file system and aproprietary file system. Supportfor either of these formats is oneof the many features reflected inthe module’s MLD file, allowingµC/FS, like Micrium’s othermodules, to be configured in XPSand seamlessly included in a Webserver or any other application.

ConclusionBy offering a convenient way of config-uring useful software modules, such asµC/FS, µC/TCP-IP, and µC/OS-II, theMLD interface accelerates the process ofconstructing complex applications fromthese embedded building blocks. Thebenefits afforded by the interface may beunintentionally relinquished, however, ifpoorly documented, unreliable modulesare selected.

To avoid the headaches that can resultfrom the use of such modules, you shouldchoose dependable and easy-to-use mod-ules. Micrium’s software modules com-plement the valuable tools provided withthe EDK, enabling the efficient develop-ment of powerful applications.

For more information aboutMicrium’s modules, please visitwww.micrium.com/microblaze/.

Figure 3 – The Software Platform Settings dialog box lets you quickly add Micrium’s modules to your project.

By offering a convenient way of configuring useful software modules, such asµC/FS, µC/TCP-IP, and µC/OS-II, the MLD interface accelerates the process of

constructing complex applications from these embedded building blocks.

Partitioning your design onto the 3.7 million ASIC gates (LSI meas-ure) of our new board is a lot easier. With three of the biggest,fastest, new Xilinx Virtex-4 FPGAs this PCI hosted logic prototypingsystem takes full advantage of the integrated ISERDES/OSERDES.400MHz LVDS differential communication with 10X multiplexingmeans more than 1800 signals between FPGA A & B. SynplicityCertify™ models are provided for partitioning assistance.

A dedicated PCI Bridge (533mb/s data transfer) means that all FPGAresources are available for logic emulation. Other features aredesigned to ease your prototyping job:

• 5 programmable clock synthesizers

• 2 DDR2 SODIMMs (custom DIMMs for SSRAM, QDR, Flash …)

• Two, 200 pin expansion connectors for daughter cards

• 10 GB/s serial I/O interfaces with SMA connectors and XFP or SFP Modules

• Configured by PCI, USB 2.0, or SmartMedia with partial reconfiguration support on all FPGAs

Various stuffing options and a wide selection of daughter cards letyou meet your exact design requirements. Prices start at less than$10,000. Call The Dini Group today for the latest ASIC prototypingsolution.

1010 Pearl Street, Suite 6 • La Jolla, CA 92037 • (858) 454-3419Email: sales@dinigroup.com

www.dinigroup.com

VIRTEX-4

: flip

r of a

——

I/O™

ixPack

System Gates (see note 1)

CLB Array (Row x Col)

Number of Slices

Equivalent Logic Cells

CLB Flip-Flops

Max. Distributed RAM Bits

# Block RAM

Block RAM (bits)

Dedicated Multipliers

DCM Frequency (min/max)

# DCMs

Digitally Controlled Impedance

Number of Differential I/O Pairs

Maximum I/O

I/O Standards

Commercial Speed Grades(slowest to fastest)

– 32

I, III

Industrial Speed Grades(slowest to fastest)

Configuration Memory (Bits)

EasyPath

fferin

rix fo

40 68 92 124

-4 -4 -4 -4 -4

✔ ✔✔ ✔

icatio

l Xilin

r of u

XC3S100E

XC3S250E

XC3S500E

XC3S1200E

XC3S1600E

’s10

Q) –

T) –

G) –

XC3S50

XC3S200

XC3S400

XC3S1000

XC3S1500

XC3S2000

XC3S4000

I/O’s

784XC3S5000

97 8992

– 3.

s/sili

48 48 48 48 48

40 40 40 40 40 40

System Gates

Macrocells

Product Terms per Macrocell

Input Voltage Compatible

Output Voltage Compatible

Maximum I/O

I/O Banking

Min. Pin-to-pin Logic Delay (ns)

Commercial Speed Grades(fastest to slowest)

Industrial Speed Grades(fastest to slowest)

3Global Clocks

317Product Term Clocks per

Function Block

ily –

IQ Speed Grade

-7 -7 -7 -10

ix –

ner™

r of t

XC2C512

XCR3032XL

XCR3064XL

XCR3128XL

XCR3256XL

XCR3384XL

XCR3512XL

XC2C32A

XC2C64A

XC2C128

XC2C256

XC2C384

XC2C512

Q) –

T) –

) – w

FG) –

C) –

ix –

9500 S

XC9536XV

XC9572XV

XC95144XV

XC95288XV

XC9536XL

XC9572XL

XC95144XL

XC95288XL

XC9536

XC9572

XC95108

XC95144

3434 72

XC95216

XC95288

) – w

G) –

3434XC

Q) –

C) –

System Gates

Macrocells

Product Terms per Macrocell

Input Voltage Compatible

Output Voltage Compatible

Maximum I/O

I/O Banking

Min. Pin-to-pin Logic Delay (ns)

Commercial Speed Grades(fastest to slowest)

Industrial Speed Grades(fastest to slowest)

Global Clocks

Product Term Clocks perFunction Block

IQ Speed Grade

90 90 90 90

ily –

90 90 90 90

1 1 2 4

ily –

-II: X

n™ S

IIE: A

-II™

ead™

™/ C

r with

r at w

ISE™

Enabling success from the center of technology™

1 800 332 8638

www.em.avnet.com

Avnet Electronics Marketing has collaborated with National

Semiconductor® and Xilinx® to create a design guide that

matches National Semiconductor’s broad portfolio of power

solutions to the latest releases of FPGAs from Xilinx.

Featuring parametric tables, sample designs and step-by-step

directions, this guide is your fast, accurate source for choosing the

best National Semiconductor Power Supply Solution for your design.

It also provides an overview of the available design tools, including

application notes, development software and evaluation kits.

Go to em.avnet.com/powermgtguide

to request your copy today.

Support Across The Board.™

Power Management Solutions for FPGAs

National Devices supported:

• Voltage Regulators

• Voltage Supervisors

• Voltage References

Xilinx Devices supported:

• Virtex™

• Virtex-E

• Virtex-II

• Virtex-II Pro

• Virtex-4FX, 4LX, 4SX

• Spartan™-II

• Spartan™-IIE

• Spartan-3, 3E, 3L

Best Signal Integrity:7x Less SSO Noise

Virtex-4 FPGAs deliver the industry’s best signal integrity, allowing you topre-empt board issues at the chip level, for high-speed designs such as memoryinterfaces. Featuring a unique SparseChevron™ pin out pattern, the Virtex-4 familyprovides the highest ratio of VCCO/GND pin pairs to user I/O pins available in any FPGA. By strategically positioning one hard power pin and one hard ground pin adjacent to every user I/O on the device, we’ve reduced signal path inductance and SSO noise to levels far below what you can attain with a virtual ground or soft ground architecture.

The Industry’s Highest Signal Integrity, Proven By Industry ExpertsIncorporating continuous power and ground planes, plus integrated bypass capacitors, we’re eliminating power-supply noise at its source. In addition, weprovide on-chip termination resistors to control signal ringing. The lab tests speak for themselves. As measured by signal integrity expert Dr. Howard Johnson,no competing FPGA comes close to achieving the low-noise benchmarks ofVirtex-4 devices.

Visit www.xilinx.com/virtex4/sipi today, and choose the right high-performance FPGA before things get noisy.

The Programmable Logic CompanySM

Design Example: 1.5 volt LVCMOS 4mA, I/O, 100 aggressors shown.

Dr. Howard Johnson, author of High-Speed Digital Design,

frequently conducts technical workshops for digital engineers

at Oxford University and other sites worldwide.

Visit www.sigcon.com to register.

© 2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

PN 0010885

New ISE 8.1i Software - Xilinx · Synthesis The flow continues with synthesis, which provides an...

Documents

EEL 4783: HDL in Digital System Designmingjie/EEL4783/synthesis.pdf · EEL 4783: HDL in Digital System Design Lecture 15: Logic Synthesis with Verilog Prof. Mingjie Lin. 2 Synthesis

Verilog hdl-synthesis-a-practical-primer-j-bhasker

Hdl Synthesis Coding Guidance for Fpga

HDL-Based Layout Synthesis Methodologies

Tutorial #1 for ISE 8.1i Project Navigator Using Schematic ... · Tutorial #1 for ISE 8.1i Project Navigator Using Schematic Example Written By Tariq Naqvi, CEng, ... 1 for ISE Project

LeonardoSpectrum HDL Synthesis

LGraph: Live Graph infrastructure for synthesis and simulationmasc.cse.ucsc.edu/docs/lgraph19_latchup.pdf · 2021. 2. 3. · 26: HDL Elaboration Synthesis Initial Synthesis: spec0

HDL Synthesis for FPGAs Design GuideVHDL or Verilog Hierarchical design.xnf Place and Route. Getting Started HDL Synthesis for FPGAs Design Guide 1-3 Verifying Your Design You can

Precision RTL Synthesis Style Guide - fpgacpu.cafpgacpu.ca/fpga/hdl/precisionRTL_style.pdf · Precision RTL Synthesis Style Guide, ... Predefined Procedures ... procedures and functions

Verilog HDL Synthesis A Practical Primer

Automated Synthesis from HDL models · Automated Synthesis from HDL models Design Compiler (Synopsys) Leonardo (Mentor Graphics) Front-End Design & Verification. Create Behavioral/RTL

Introduction to Logic Synthesis Using Verilog HDL

Design Of High Performance IEEE- 754 Single Precision (32 bit) … · 2019-07-01 · VHDL code for floating point adder is written in Xilinx 8.1i and its synthesis report isshown

Verilog HDL Synthesis A Practical Primer...Title: Verilog HDL Synthesis A Practical Primer Author: J. Bhasker Subject: Electrical Engineering Electronic and Digital Design Keywords:

Verilog HDL Synthesis : Bhaskar

Tutorial #2 for ISE 8.1i Project Navigator Using ... · Tutorial #2 for ISE 8.1i Project Navigator Using HDLVerilog Example Written By Tariq Naqvi, CEng, MIET Step No.1: Open the

Verilog HDL: A Guide to Digital Design and Synthesis · Popularity of Verilog HDL Verilog HDL has evolved as a standard hardware description language. Verilog HDL offers many useful

Verilog HDL Synthesis a Practical Primer Bhasker

Automated Synthesis from HDL models - eng. · PDF fileAutomated Synthesis from HDL models ... Sdf, sdc, pow files.do files, simulation. results. ... -format vhdl (or verilog)

AN 238: Using Quartus II Verilog HDL & VHDL Integrated Synthesis