
Transforming Reconfigurable Systems: A Festschrift Celebrating the 60th Birthday of Professor Peter Cheung



TRANSFORMING RECONFIGURABLE SYSTEMS: A Festschrift Celebrating the 60th Birthday of Professor Peter Cheung

Editors

Wayne Luk • George A. Constantinides
Imperial College London, UK


Preface

Over the last three decades, Professor Peter Cheung has made significant contributions to a variety of areas, such as analogue and digital computer-aided design tools, high-level synthesis and hardware/software codesign, low-power and high-performance circuit architectures for signal and image processing, and mixed-signal integrated-circuit design.

The area that has attracted his greatest attention, however, is reconfigurable systems and their design. His work has contributed to the transformation of this important and exciting discipline. For example, he developed the first dedicated multiplier unit for reconfigurable architectures; he pioneered a reconfigurable computer customized for professional video applications; and he is still making seminal contributions in addressing reliability challenges to field-programmable technology and reconfigurable systems.

His intellectual progeny include the better part of fifty research students and research associates, including one of the editors of this volume. Many of them have now become leaders in their field. Peter's most enduring impact will include not only those who have been fortunate enough to be directly inspired or mentored by him, but also his grand-students and great-grand-students who indirectly benefit from his heritage.

Except for three years at Hewlett Packard, Peter has devoted his professional career to Imperial College, and has served with distinction as the Head of the Department of Electrical and Electronic Engineering for several years. His outstanding capability and his loyalty to Imperial College in general, and to the Department of Electrical and Electronic Engineering in particular, are legendary. For his department, Peter has made tremendous strides in ensuring excellence in both research and teaching, and in establishing sound governance and a strong financial endowment; but above all, he has made his department a wonderful place to work and to study. His efforts have been rewarded by the warmth with which he is regarded by his colleagues and students.

Most of the papers in this festschrift are based on the presentations at a workshop on 3 May 2013 celebrating Peter's 60th birthday. We thank the contributors to this volume; while the topics covered vary significantly, all were able to relate their work to the contributions made by Peter. Many more, especially Peter's former students, wished to contribute than the limited space can accommodate, and we apologise for the lack of space. The effort of Thomas Chau, Eddie Hung and Tim Todman in assisting the production of this volume is much appreciated.

Last but by no means least: Happy Birthday, Peter! We look forward to working with you for many years to come!

Wayne Luk and George Constantinides


List of Contributors

Norbert Abel – Goethe-University Frankfurt
Samuel Bayliss – Imperial College London
David Boland – Imperial College London
Srinivas Boppu – University of Erlangen-Nuremberg
Andrew Brown – University of Southampton
Jason Cong – University of California, Los Angeles
George A. Constantinides – Imperial College London
Heiko Engel – Goethe-University Frankfurt
Michael J. Flynn – Maxeler Technologies and Stanford University
Paul J. Fox – University of Cambridge
Michael Frechtling – The University of Sydney
Steve Furber – University of Manchester
Jano Gebelein – Goethe-University Frankfurt
Frank Hannig – University of Erlangen-Nuremberg
Udo Kebschull – Goethe-University Frankfurt
Vahid Lari – University of Erlangen-Nuremberg
Philip H.W. Leong – The University of Sydney
Wayne Luk – Imperial College London
Sebastian Manz – Goethe-University Frankfurt
A. Theodore Markettos – University of Cambridge
Oskar Mencer – Maxeler Technologies and Imperial College London
Simon W. Moore – University of Cambridge
Michael Munday – Maxeler Technologies
Matthew Naylor – University of Cambridge
Oliver Pell – Maxeler Technologies
Lesley Shannon – Simon Fraser University
Jürgen Teich – University of Erlangen-Nuremberg
David B. Thomas – Imperial College London
Steve Wilton – University of British Columbia
Alex Yakovlev – Newcastle University


Table of Contents

Preface

List of Contributors

1. Accelerator-Rich Architectures—Computing Beyond Processors (J. Cong)

2. Whither Reconfigurable Computing? (G.A. Constantinides, S. Bayliss and D. Boland)

3. An FPGA-Based Floating Point Unit for Rounding Error Analysis (M. Frechtling and P.H.W. Leong)

4. The Shroud of Turing (S. Furber and A. Brown)

5. Smart Module Redundancy—Approaching Cost Efficient Radiation Tolerance (J. Gebelein, S. Manz, H. Engel, N. Abel and U. Kebschull)

6. Analysing Reconfigurable Computing Systems (W. Luk)

7. Custom Computing or Vector Processing? (S.W. Moore, P.J. Fox, A.T. Markettos and M. Naylor)

8. Maximum Performance Computing with Dataflow Technology (M. Munday, O. Pell, O. Mencer and M.J. Flynn)

9. Future DREAMS: Dynamically Reconfigurable Extensible Architectures for Manycore Systems (L. Shannon)

10. Compact Code Generation and Throughput Optimization for Coarse-Grained Reconfigurable Arrays (J. Teich, S. Boppu, F. Hannig and V. Lari)

11. Some Statistical Experiments with Spatially Correlated Variation Maps (D.B. Thomas)

12. On-Chip FPGA Debugging and Validation: From Academia to Industry, and Back Again (S. Wilton)

13. Enabling Survival Instincts in Electronic Systems: An Energy Perspective (A. Yakovlev)

Index


Chapter 1

Accelerator-Rich Architectures—Computing Beyond Processors

Jason Cong
Computer Science Department, University of California, Los Angeles

In order to drastically improve energy efficiency, we believe that future computer processors need to go beyond parallelization and provide architectural support for customization and specialization, enabling processor architectures to be adapted and optimized for different application domains. In particular, we believe that future processor architectures will make extensive use of accelerators to further increase energy efficiency. Such architectures present many new challenges and opportunities, such as accelerator synthesis, scheduling, sharing, virtualization, memory hierarchy optimization, and efficient compilation and runtime support. In this paper, I shall highlight some of our ongoing research in these areas that has taken place in the Center for Domain-Specific Computing (supported by the NSF Expeditions in Computing award). The material here is based on a talk that I presented in July 2012 at Imperial College London, hosted by Professor Peter Cheung.

1.1. Introduction

In order to meet today's ever-increasing computing needs and overcome power density limitations, the computing industry has halted simple processor frequency scaling and entered the era of parallelization, with tens to hundreds of computing cores integrated in a single processor, and hundreds to thousands of computing servers connected in a warehouse-scale data center. However, such highly parallel, general-purpose computing systems still face serious challenges in terms of performance, power, heat dissipation, space, and cost. We believe that we need to look beyond parallelization and focus on domain-specific customization to provide capabilities that adapt architecture to application, in order to achieve significant power-performance efficiency improvement.

In fact, the performance gap between a totally customized solution (using an application-specific integrated circuit (ASIC)) and a general-purpose solution can be very large. A case study of the 128-bit key AES encryption algorithm was presented in [Ref. 1]. An ASIC implementation in 0.18µm CMOS achieves 3.86 Gbit/s at 350 mW, while the same algorithm coded in Java and executed on an embedded SPARC processor yields 450 bit/s at 120 mW. This difference implies a performance/energy efficiency (measured in Gbit/s/W) gap of roughly 3 million! Therefore, one way to significantly improve the performance/energy efficiency is to have as much computation as possible done in accelerators designed in ASIC, instead of using general-purpose cores.
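The quoted gap follows directly from the two data points (our arithmetic, shown for clarity):

$$\frac{3.86 \times 10^{9}\ \text{bit/s} \,/\, 0.35\ \text{W}}{450\ \text{bit/s} \,/\, 0.12\ \text{W}} = \frac{1.10 \times 10^{10}\ \text{bit/s/W}}{3.75 \times 10^{3}\ \text{bit/s/W}} \approx 2.9 \times 10^{6}.$$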

One argument against using accelerators is their low utilization and narrow workload coverage; however, low utilization is no longer a serious problem. Given the utilization wall [Ref. 2] and the dark silicon problem [Ref. 3] revealed in recent studies, we can activate only a fraction of the computing elements on-chip at one time in future technologies, given the tight power and thermal budget. The problem of narrow workload coverage is addressed by using the composable accelerators or reconfigurable accelerators discussed below. Therefore, we believe that the future of processor architecture should be rich in accelerators, as opposed to having many cores.

In this paper, I shall highlight the progress made in the Center for Domain-Specific Computing (CDSC) [Ref. 4] on developing energy-efficient accelerator-rich architectures. I had the pleasure of a month-long visit to Imperial College London in July 2012, hosted by Professor Wayne Luk and supported by the Distinguished Visiting Fellow Program of the Royal Academy of Engineering. Professor Peter Cheung was very kind to let me use his office during my visit. The material covered here is largely based on an invited talk that I gave during my visit to the EEE Department. The talk was hosted by Professor Cheung, who has made many fundamental contributions to the technology on which accelerator architectures can be based. I highlighted our progress in four areas: accelerator sharing and management (Sec. 1.2), memory support for accelerator-rich architectures (Sec. 1.3), on-chip communication for accelerator-rich architectures (Sec. 1.4), and software support for accelerator-rich architectures (Sec. 1.6). I also included some of the latest results on using fine-grain programmable fabrics to support composable accelerators (Sec. 1.2) and a recent effort on prototyping of accelerator-rich architectures (Sec. 1.5).

1.2. Accelerator Sharing and Management in Accelerator-Rich CMPs

We began our investigation of accelerator-rich architectures in 2010 and developed three generations of architecture templates. The first generation of the architecture focuses on hardware support for accelerator-rich CMPs (ARC) [Ref. 5]. Figure 1.1 shows the overall architecture of ARC, which is composed of cores, accelerators, the global accelerator manager (GAM), shared L2 cache banks and shared network-on-chip (NoC) routers between multiple accelerators. All of the mentioned components are connected by the NoC. Accelerator nodes include a dedicated DMA-controller (DMA-C) and scratch-pad memory (SPM) for local storage, and a small translation look-aside buffer (TLB) for virtual-to-physical address translation. The GAM is introduced to handle accelerator sharing and arbitration. In this architecture we first propose a hardware resource management scheme, facilitated by the GAM, for accelerator sharing. This scheme supports sharing and arbitration of multiple cores for a common set of accelerators, and it uses a hardware-based arbitration mechanism to provide feedback to cores to indicate the wait time before a particular resource becomes available. Second, we propose a lightweight interrupt system to reduce the OS overhead of handling interrupts, which occur frequently in an accelerator-rich platform. Third, we propose architectural support that allows us to compose a larger virtual accelerator out of multiple smaller accelerators. We also implemented a complete simulation tool-chain to verify our ARC architecture. On a set of medical imaging applications (our initial application domain), ARC shows significant performance improvement (on average 50×) and energy improvement (on average 379×) compared to an Intel Core i7 L5640 server running at 2.27 GHz.
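To make the sharing scheme concrete, here is a software model of the arbitration and wait-time feedback described above (a hypothetical sketch of ours; the names and interface are invented, not from the ARC paper): a core asks for a free accelerator, and if none is available the manager reports an estimated wait so the core can decide whether to queue or fall back to software.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical software model of GAM-style arbitration (not the ARC RTL).
struct Gam {
    std::vector<bool> busy;       // one flag per physical accelerator unit
    std::queue<int>   waiters;    // cores queued for the next free unit
    uint32_t avg_task_cycles;     // assumed average service time per task

    Gam(int units, uint32_t cycles) : busy(units, false), avg_task_cycles(cycles) {}

    // Grant a free unit, or return -1 plus an estimated wait time so the
    // requesting core can choose between waiting and a software fallback.
    int request(int core_id, uint32_t* est_wait) {
        for (std::size_t u = 0; u < busy.size(); ++u) {
            if (!busy[u]) { busy[u] = true; return static_cast<int>(u); }
        }
        waiters.push(core_id);
        *est_wait = static_cast<uint32_t>(waiters.size()) * avg_task_cycles;
        return -1;
    }

    void release(int unit) { busy[static_cast<std::size_t>(unit)] = false; }
};
```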

Fig. 1.1. Overall architecture of ARC (from [Ref. 5]).


Although ARC produces impressive performance and energy improvements, it has two limitations. One is that it has narrow workload coverage. For example, the highly specialized accelerator for denoise cannot be used for registration. The second limitation is that each accelerator has repeated resources, such as the DMA engine and SPM, which are underutilized when the accelerator is idle.

To overcome these limitations of ARC, we introduced CHARM: a composable heterogeneous accelerator-rich microprocessor design that provides scalability, flexibility, and design reuse in the space of accelerator-rich CMPs [Ref. 6]. We noticed that all the accelerators in ARC for the medical imaging domain can be decomposed into a small set of computing blocks, such as 16-input polynomial, floating-point divide, inverse, and square root functions. These blocks are called the accelerator building blocks (ABBs). CHARM (shown in Fig. 1.2) features a hardware structure called the accelerator block composer (ABC), which can dynamically compose a set of ABBs into a dedicated accelerator to provide orders-of-magnitude improvement in performance and power efficiency. Our compiler decomposes each compute-intensive kernel (a candidate for acceleration) into a set of ABBs at compilation time, and stores the dataflow graph describing the composition. The ABC uses this graph at runtime to compose as many accelerators as needed and available (based on free ABBs) for that kernel. Therefore, although each composed accelerator is somewhat slower than a dedicated accelerator, we can potentially get many more copies of the same accelerator, leading to better acceleration results. Our ABC is also capable of providing load balancing among available compute resources to increase accelerator utilization. On the same set of medical imaging benchmarks, the experimental results on CHARM show better performance (almost 2× better than ARC) and similar energy efficiency. Details are shown in Table 1.1.
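The compile-time artefact the ABC consumes can be pictured as a small dataflow graph over ABB types. The following sketch (ours, with invented names; the real data structures are not published in this form) shows the run-time feasibility check for composing one more copy of an accelerator from the pool of free ABBs.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a kernel's ABB dataflow graph and the ABC's check:
// count the ABBs of each type the kernel needs and compare against the
// currently free ABBs.
struct AbbNode {
    std::string type;          // e.g. "poly16", "fdiv", "inv", "sqrt"
    std::vector<int> inputs;   // indices of producer nodes in the graph
};
using DataflowGraph = std::vector<AbbNode>;

bool can_compose(const DataflowGraph& kernel,
                 const std::map<std::string, int>& free_abbs) {
    std::map<std::string, int> needed;
    for (const AbbNode& n : kernel) ++needed[n.type];
    for (const auto& [type, count] : needed) {
        auto it = free_abbs.find(type);
        if (it == free_abbs.end() || it->second < count) return false;
    }
    return true;  // enough free ABBs: the ABC can instantiate another copy
}
```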


Fig. 1.2. Overview of the CHARM architecture.

Table 1.1. Performance and energy-efficiency comparison between multi-core CPU, GPU, FPGA, ARC and CHARM. Results are normalized to an Intel Core i7 (L5640 @ 2.27 GHz).

More importantly, the CHARM architecture provides better flexibility and wider workload coverage. By using the same set of ABBs designed for the medical imaging domain [Ref. 6], one can compose accelerators in other domains, such as navigation and vision, and still achieve impressive speedup and energy reduction.

The latest extension that we made to the CHARM architecture was finalized shortly after my visit to Imperial College London. Although CHARM has much better flexibility, it is possible that it misses some ABBs needed to compose some functions in a new application domain. To address this issue, we proposed CAMEL: a composable accelerator-rich microprocessor enhanced for longevity [Ref. 7]. CAMEL features a programmable fabric (PF) to extend the use of ASIC composable accelerators to support algorithms that are beyond the scope of the baseline platform. Figure 1.3 shows the overall architecture diagram of CAMEL. Using a combination of hardware extensions and compiler support, we demonstrate an 11.6× average performance improvement and 13.9× average energy savings across benchmarks that deviate from the original domain of our baseline platform. More detail is available from [Ref. 7].

Fig. 1.3. (a) Overall architecture of CAMEL; (b) programmable logic used for ABB composition (from [Ref. 7]).

1.3. Memory Support for Accelerator-Rich Architectures

In most cases, the data access patterns of accelerators are known in advance. This is especially true with many image processing applications, where accelerators typically process one tile of the image at a time. In this case, one uses buffers or SPM instead of cache to supply data to accelerators. However, since accelerators may not be used all the time, dedicated buffers can be wasteful. Therefore, we developed techniques to implement SPM in L1 and/or L2 caches. Here, I highlight two efforts in this direction.

The first effort is on an adaptive hybrid L1 cache. The basic idea is to allow part of an L1 cache to be reconfigured as an SPM, as shown in Fig. 1.4. We add an extra bit to each cache block to indicate whether it belongs to cache or SPM, and add a mapping table that maps the consecutive SPM blocks to a collection of cache blocks in different sets, depending on the utilization of the sets (intuitively, underutilized sets contribute blocks for the SPM). In fact, by reconfiguring part of the cache as software-managed SPM, hybrid caches can help handle both unknown and predictable memory access patterns to improve the cache efficiency for processor designs in general. To demonstrate this point, we improve the previous work on hybrid caches by introducing dynamic adaptation to the runtime cache behavior (dynamically remapping SPM blocks from high-demand cache sets to low-demand cache sets) [Ref. 8]. This approach achieves 19%, 25%, 18% and 18% energy-runtime-product reductions over four previous representative cache set balancing techniques on a wide range of benchmarks (previous cache set balancing techniques are either energy-inefficient or require serial tag and data array access). We leveraged domain-specific knowledge to allocate blocks in the SPM, examining data re-use patterns in applications to maximize the efficiency of the SPM.
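A minimal sketch of the bookkeeping this implies (ours, with invented names; the real design is in hardware and is driven by runtime set-utilisation counters not modelled here): one bit per block marks it as SPM, and a table maps consecutive SPM block numbers onto (set, way) locations drawn from underutilised sets.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the adaptive hybrid L1 cache's metadata.
struct BlockState {
    bool     is_spm = false;  // the extra bit: block serves as SPM, not cache
    uint64_t tag    = 0;      // meaningful only while the block acts as cache
};

struct HybridL1 {
    int sets, ways;
    std::vector<BlockState> blocks;   // sets * ways block states
    struct Loc { int set, way; };
    std::vector<Loc> spm_map;         // SPM block number -> cache location

    HybridL1(int s, int w, int spm_blocks)
        : sets(s), ways(w), blocks(s * w), spm_map(spm_blocks, {-1, -1}) {}

    // Remap one SPM block into a chosen (low-demand) set; the selection
    // policy itself would come from runtime utilisation statistics.
    void map_spm_block(int spm_block, int set, int way) {
        blocks[static_cast<std::size_t>(set * ways + way)].is_spm = true;
        spm_map[static_cast<std::size_t>(spm_block)] = {set, way};
    }
};
```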

Fig. 1.4. Customizable mapping of SPM in AH-cache (from [Ref. 8]).

The hybrid L1 cache is helpful for the case with one (or a few) accelerators tightly coupled with a processor core, so that it may share its L1 cache as a buffer. In the case of an accelerator-rich CMP, we need an efficient way to share the buffers among different accelerators, and to share the accelerator buffers with the processor cache. Existing solutions allow the accelerators to share a common pool of buffers (accelerator store [Ref. 9]) and/or allocate buffers in cache (BiC [Ref. 10]). In order to achieve higher flexibility and better efficiency, we introduced a buffer-in-NUCA (BiN) [Ref. 11] scheme with the following features: (1) a dynamic interval-based global buffer allocation method to assign shared buffer spaces to accelerators that can best utilize the additional buffer space, and (2) a flexible and low-overhead paged buffer allocation method to limit the impact of buffer fragmentation in a shared buffer, especially when allocating buffers in a non-uniform cache architecture (NUCA) with distributed cache banks. BiN was implemented on top of the accelerator-rich CMP ARC. It has a global accelerator buffer manager (ABM), which works with the global accelerator manager to allocate buffers (including both the buffer sizes and locations) at the time of accelerator allocation. The overall ARC architecture with BiN and the communication flow are shown in Fig. 1.5. Experimental results show that, when compared to accelerator store and BiC, BiN improves performance by 32% and 35% and reduces energy by 12% and 29%, respectively. More detail is available in [Ref. 11].

Fig. 1.5. (a) Overall architecture of ARC with BiN; (b) communication between core, ABM, and accelerators (from [Ref. 11]).

1.4. Network-on-Chip (NoC) Support for Accelerator-Rich Architectures

In accelerator-rich architectures, on-chip interconnects need to provide high bandwidth between the accelerators and buffers (e.g. allocated in the L2 cache by BiN in the preceding section), and also between buffers and memory controllers (for streaming data from off-chip), for extended periods of time. It is very difficult for existing packet-switching based NoCs to deliver such high bandwidth unless they are significantly over-designed to cover the worst case, but this results in large area and power overhead. We think that such requirements can be addressed by our early work that provided a combination of packet switching and circuit switching via the integration of radio frequency interconnect (RF-I), through on-chip transmission lines overlaid on top of a traditional NoC implemented with RC wires [Ref. 12]. The RF-I reaches all the tiles on-chip, and each tile has an RF-I transmitter and receiver. By tuning the transmitter and receiver of two different tiles to the same carrier frequency, one can create a dedicated link (or shortcut) between these two tiles, i.e. a customized interconnect. Research in [Ref. 12] showed that, in addition to the latency advantage of RF-I (signals are transmitted at the speed of light and can go across the chip in a single clock cycle), there are three additional advantages: (1) RF-I bandwidth can be flexibly allocated to provide an adaptive NoC; (2) RF-I can enable a dramatic power and area reduction by simplification of the baseline NoC (e.g. in terms of its link bandwidth), as the RF-I shortcuts can be customized to meet the peak communication demand; and (3) RF-I provides natural and efficient support for multicast. Based on these observations, a novel NoC design was proposed in [Ref. 12], exploiting dynamic RF-I bandwidth allocation to realize a reconfigurable NoC architecture. For example, the work in [Ref. 12] shows that using the adaptive RF-I architecture on top of a mesh with only 4B links can match or even outperform the baseline NoC with 16B mesh links, while reducing NoC power by approximately 65%, including the overhead incurred for supporting RF-I. Figure 1.6 shows a mesh NoC with RF-I overlaid on top of it, and the customized shortcuts (based on a given communication pattern). We are interested in using this kind of hybrid NoC, with a mix of packet switching and circuit switching (implemented by the RF-I shortcuts), to provide high bandwidth between the accelerators and buffers, and between buffers and memory controllers. This is an area of ongoing research.


Fig. 1.6. (a) RF-I overlaid with a mesh NoC; (b) RF-I shortcuts for customized communication patterns (from [Ref. 12]).

1.5. Prototyping of an Accelerator-Rich Architecture

During my talk at Imperial College London, I mentioned our ongoing effort to prototype an accelerator-rich architecture in a large FPGA. I am glad to report that we made good progress in this area. We prototyped a general framework of an accelerator-rich architecture, named PARC, on a multi-million gate FPGA (Xilinx Virtex-6 FPGA) [Ref. 13]. We developed a set of system IPs that serve a sea of accelerators, including an IO memory management unit (IOMMU), a GAM, and a dedicated interconnect between accelerators and memories to enable memory sharing [Ref. 14]. Figures 1.7 and 1.8 show the system-level diagram and the floorplan implementation of PARC with four accelerators (acc gradient, acc Rician, acc Gaussian, and acc seg). Table 1.2 shows the performance and energy gains over the embedded MicroBlaze core on the Xilinx Virtex-6 FPGA and the embedded ARM Cortex A9 core in the Xilinx Zynq FPGA. It also shows the projected performance and energy of PARC if it is implemented in a 45nm ASIC, and the comparison with an 8-core Intel Xeon server.


Fig. 1.7. Overview of our PARC implemented in the Xilinx Virtex-6 XC6VLX240T FPGA. GAM = global accelerator manager (from [Ref. 13]).

Fig. 1.8. Floorplan of PARC implemented in the Xilinx Virtex-6 FPGA (from [Ref. 13]).

Table 1.2. Speedup and energy savings of the accelerators in PARC (from [Ref. 13]).


1.6. Software Support

Software support is crucial for the successful design and deployment of accelerator-rich architectures. We can classify the software support into two categories: one for the creation of accelerator-rich architectures, and another for mapping and programming on an accelerator-rich architecture.

Software support for accelerator-rich architecture creation:

• Accelerator synthesis: We synthesize all accelerators automatically from high-level C/C++ specifications to Verilog or VHDL designs using AutoPilot [Ref. 15] from AutoESL (based on xPilot [Ref. 16] from UCLA); this supported C-to-RTL design for both ASICs and FPGAs. After AutoESL was acquired by Xilinx, AutoPilot was integrated into the Xilinx Vivado design tool suite and renamed Vivado HLS; however, it supports Xilinx FPGA designs only. Other C-to-RTL tools, such as C-to-Silicon from Cadence [Ref. 17], Catapult C from Calypto [Ref. 18], and Cynthesizer from Forte [Ref. 19], can also be used for accelerator synthesis and generation on ASICs. (A small sketch of the kind of C input such tools accept follows this list.)

• Accelerator virtualization: Furthermore, we need tools to provide accelerator virtualization — that is, using available physical accelerators to implement more complex accelerators of the same or similar types (e.g. implementing a 1024-point FFT function using a 128-point FFT accelerator).

• Accelerator identification: Currently, we identify accelerators manually through profiling. In the future, we plan to develop tools to automatically identify accelerator candidates. One direction is to use pattern-mining techniques [Refs. 20 and 21] on the control and data-flow graph (CDFG), in a way similar to how we generate the customized vector instructions [Ref. 22].

• Platform synthesis: Together with the IPs we developed in the PARC prototype, we also developed an automated flow, with a number of IP templates and customizable interfaces to a C-based synthesis flow, to enable automated generation of an accelerator-rich platform on a Xilinx FPGA, so that one can rapidly design and update accelerator-rich architectures. We achieve significant productivity improvement with such an automated flow. For example, to add ten new accelerators of a certain type, we only need to add three lines of code in our configuration file, which will generate over 3,000 lines of C and RTL code in PARC [Ref. 12].
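As promised above, here is a small sketch (ours, not taken from the CDSC flow) of the kind of C/C++ input such C-to-RTL tools accept. The pragma shown is a Vivado HLS directive asking for a fully pipelined loop; other tools express the same intent with their own annotations.

```cpp
// A toy HLS kernel: element-wise vector addition. An HLS tool schedules
// the loop body into RTL; the directive below requests one new loop
// iteration per clock cycle (initiation interval 1).
void vadd(const float a[1024], const float b[1024], float out[1024]) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```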

Mapping and programming on an accelerator-rich architecture:

There are many demands for efficient mapping and optimization tools to map an application to an accelerator-rich platform with all the customization capabilities we described earlier in this paper. Here are some examples.

• We need efficient ways to map a large computation kernel into a set of ABBs in a composable accelerator-rich architecture, as in CHARM, with consideration of area, communication, and load-balancing optimization.

• We need compilation support to efficiently use the hybrid L1 cache with SPMs. We made some progress in this area [Ref. 23], but more studies are needed.

• We need tools to synthesize customizable NoCs with RF-I, to decide how to dynamically add shortcuts and construct the routes. For example, our work in [Ref. 24] shows that routing in an irregular NoC (a result of adding an RF-bus to the underlying mesh-based NoC) may deadlock, and we developed efficient NoC construction and routing algorithms to avoid such deadlock.

1.7. Concluding Remarks


Driven by the need for higher energy efficiency, we believe that future computing platforms will be heterogeneous, customizable, and rich in accelerators, including both dedicated accelerators and composable accelerators built from either accelerator building blocks or fine-grain programmable fabrics. In fact, such accelerator-rich platforms will appear at different levels: chip level, rack level, and data-center level. This offers many research opportunities for architecture, compiler, and runtime system support. I hope that what I have presented here will engender an increased interest from the research community, and encourage a focus on accelerator-rich computing platforms.

Acknowledgements

The CDSC is funded by the NSF Expeditions in Computing program (award CCF-0926127). Most of the advancements summarized in this paper are the joint work of CDSC faculty and students — especially those who worked on the design, prototyping, and software support for the accelerator-rich architectures: Yu-ting Chen, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Hui Huang, Chunyue Liu, Glenn Reinman, Bingjun Xiao, Bo Yuan, and Yi Zou. The complete list of all CDSC faculty and students is available at www.cdsc.ucla.edu. The author would also like to thank the Distinguished Visiting Fellow Program of the Royal Academy of Engineering (UK) for its support, and Professors Wayne Luk and Peter Cheung for the warm hospitality the author received during his visit to Imperial College London in July 2012.

References

1. P. Schaumont and I. Verbauwhede. Domain-specific Codesign for Embedded Security, Computer, 36(4), 68–74, 2003.

2. G. Venkatesh et al. Conservation Cores: Reducing the Energy of Mature Computations, ACM SIGARCH Computer Architecture News, 38(1), 205–218, 2010.

3. H. Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling, in Proc. International Symposium on Computer Architecture, pp. 365–376, 2011.

4. J. Cong et al. Customizable Domain-specific Computing, IEEE Design & Test of Computers, 28(2), 6–15, 2011.

5. J. Cong et al. Architecture Support for Accelerator-rich CMPs, in Proc. Design Automation Conference, pp. 843–849, 2012.

6. J. Cong et al. CHARM: A Composable Heterogeneous Accelerator-rich Microprocessor, in Proc. International Symposium on Low Power Electronics and Design, pp. 379–384, 2012.

7. J. Cong et al. Composable Accelerator-rich Microprocessor Enhanced for Adaptivity and Longevity, in Proc. International Symposium on Low Power Electronics and Design, pp. 305–310, 2013.

8. J. Cong et al. An Energy-efficient Adaptive Hybrid Cache, in Proc. International Symposium on Low Power Electronics and Design, pp. 67–72, 2011.

9. M. Lyons et al. The Accelerator Store Framework for High-performance, Low-power Accelerator-based Systems, Computer Architecture Letters, 9(2), 53–56, 2010.

10. C.F. Fajardo et al. Buffer-Integrated-Cache: A Cost-effective SRAM Architecture for Handheld and Embedded Platforms, in Proc. Design Automation Conference, pp. 966–971, 2011.

11. J. Cong et al. BiN: A Buffer-in-NUCA Scheme for Accelerator-rich CMPs, in Proc. International Symposium on Low Power Electronics and Design, pp. 225–230, 2012.

12. M.C.F. Chang et al. Power Reduction of CMP Communication Networks via RF-Interconnects, in Proc. International Symposium on Microarchitecture, pp. 376–387, 2008.

13. Y.-T. Chen et al. Accelerator-Rich CMPs: From Concept to Real Hardware, in Proc. International Conference on Computer Design, pp. 169–176, 2013.

14. J. Cong and B. Xiao. Optimization of Interconnects Between Accelerators and Shared Memories in the Dark Silicon Age, in Proc. International Conference on Computer-Aided Design, pp. 630–637, 2013.

15. J. Cong et al. High-level Synthesis for FPGAs: From Prototyping to Deployment, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(4), 473–491, 2011.

16. J. Cong et al. Platform-based Behavior-level and System-level Synthesis, in Proc. International SOC Conference, pp. 199–202, 2006.

17. Cadence C-to-Silicon Compiler. [Online] Available at: http://www.cadence.com/products/sd/silicon_compiler/ [Accessed 20 April 2014].

18. T. Bollaert. "Catapult Synthesis: A Practical Introduction to Interactive C Synthesis", in eds. P. Coussy and A. Morawiec, High-Level Synthesis, pp. 29–52. Springer, Netherlands, 2008.

19. M. Meredith. "High-level SystemC Synthesis with Forte's Cynthesizer", in eds. P. Coussy and A. Morawiec, High-Level Synthesis, pp. 75–97. Springer, Netherlands, 2008.

20. J. Cong and W. Jiang. Pattern-based Behavior Synthesis for FPGA Resource Reduction, in Proc. International Symposium on Field Programmable Gate Arrays, pp. 107–116, 2008.

21. J. Cong, H. Huang, and W. Jiang. A Generalized Control-flow Aware Pattern Recognition Algorithm for Behavioral Synthesis, in Proc. Design, Automation & Test in Europe Conference & Exhibition, pp. 1255–1260, 2010.

22. J. Cong et al. Compilation and Architecture Support for Customized Vector Instruction Extension, in Proc. Asia and South Pacific Design Automation Conference, pp. 652–657, 2012.

23. J. Cong et al. A Reuse-aware Prefetching Scheme for Scratchpad Memory, in Proc. Design Automation Conference, pp. 960–965, 2011.

24. J. Cong, C. Liu, and G. Reinman. ACES: Application-specific Cycle Elimination and Splitting for Deadlock-free Routing on Irregular Network-on-chip, in Proc. Design Automation Conference, pp. 443–448, 2010.


Chapter 2

Whither Reconfigurable Computing?

George A. Constantinides, Samuel Bayliss and David Boland
EEE Department, Imperial College London

We argue that FPGAs, more than two decades after they began to be used for computational purposes, have become one of the key hopes for extending the performance of computational systems in the era characterised by the end of Dennard scaling. We believe that programmability of future heterogeneous computing platforms has brought a new urgency to bear on several old problems in high-level synthesis for FPGAs. Our focus is on the two areas we believe are most underdeveloped in today's high-level synthesis software: effective utilisation of the numerical flexibility afforded by high-level correctness specifications, and application-specific memory subsystem synthesis. We conclude with our perspective on the likely future evolution of the field.

2.1. A Selective Context

We present below a necessarily rather narrow view of the evolution of the FPGA and the microprocessor, highlighting the interaction between the two and the major external drivers.

The field-programmable gate array (FPGA) was invented in the 1980s, but it developed and matured in the 1990s. Already in the early 1990s, academic conferences started to appear that were largely dedicated to the potential these devices had to implement computation, such as the first FPL, held in 1991 in Oxford. However, the nature of such devices has transformed over recent years. Initially FPGAs were largely homogeneous architectures, consisting of a large number of very fine grain logic cells. Responding to the nature of the new application areas, manufacturers evolved the FPGA architecture, first incorporating larger RAM blocks [Ref. 1] and then dedicated multiplication logic [Ref. 2]. Today, modern FPGA architectures are highly heterogeneous devices, containing logic cells, embedded RAM and DSP functionality, high-speed transceiver circuitry and microprocessors. It is worth noting that the majority of these components, present as hard IP within an FPGA, could be implemented using lookup table functionality. However, to do so would be either too large, too slow, or consume too much power to be worthwhile [Ref. 3]. Thus through one lens we can see the evolution of the FPGA in recent years as a conscious decision to move away from having all area devoted to simple fine-grain units and their fine-grain interconnect, towards the specialisation of circuitry to perform certain common tasks or classes of tasks.

In a sense, the evolution of the general purpose processor has mirrored the evolution of the FPGA over the same timeframe. Traditional, latency-driven, computer architecture largely consisted of utilising all available silicon in order to keep a single (or small number of) computational unit(s) as busy as possible. This resulted in a very large amount of silicon and power consumption devoted to caching in particular, as well as various micro-architectural innovations to avoid latency-consuming pipeline stalls [Ref. 4]. These processors have formed the core of general purpose computer design for several decades. The most significant innovation to arise as a result has been the GPGPU, which has delivered major performance improvements in certain domains by explicitly abandoning some of the received wisdom of computer architecture, a process referred to by Bill Dally as 'the end of denial architecture' [Ref. 5]. GPGPU computing achieves its performance by using an explicitly software-managed memory hierarchy, returning hardware to computation, and using an abundance of threads to hide pipeline stalls. We may therefore view the evolution of the microprocessor as a conscious decision to move away from one complex unit towards dedicating area to a large number of much simpler units and their interconnect; in a certain sense this is a mirror of the evolution of the FPGA.

It is no accident that the co-evolution of FPGAs and microprocessors is now reaching the point of blurred boundaries. On the microprocessor side of the picture, this has largely been driven by the recent failure of Dennard scaling. Dennard scaling [Ref. 6] provided a road map of how to scale various parameters under manufacturer control, such as supply voltage, in response to the geometric scaling of VLSI given by each process generation. The recent deviation from Dennard scaling, largely driven by power consumption concerns [Ref. 7], has forced the general purpose processor industry to look beyond clock frequency as the driver for performance. While high performance for throughput-dominated applications with embarrassing levels of parallelism can be achieved in a direct way using the GPGPU approach, latency constraints and algorithm bottlenecks mandate a more heterogeneous approach [Ref. 8]. On the FPGA side of the picture, silicon and power inefficiencies combined with new market opportunities have driven the evolution of FPGA architecture to its present heterogeneous state.


The future of manycore computing using traditional — but simple — microprocessor cores is not rosy. In a landmark paper, Berger et al. [Ref. 9] make a detailed study of the power/area and power/performance tradeoffs available across a spectrum of processor designs. Using predictions of future technologies, even with extremely parallel workloads, the performance to be gained by using more, simpler, traditional processors is bounded from above by a factor of only about 4–8× over the next decade, far slower than historical trends. This is largely due to power consumption limitations, resulting in the spectre of dark silicon: transistors that may be present on a device but cannot be powered on simultaneously without overloading the power limitations. The conclusion is clear: to move beyond such limits, it becomes necessary to improve the Pareto tradeoff itself, rather than simply move towards more, simpler, processors on the Pareto front. The only clear way to do this is through circuit specialisation: creating parts of a processor that are specialised to particular commonly occurring tasks, and avoiding the energy inefficiency of using general purpose architectures for these tasks. This is exactly the area where the reconfigurable computing community has a head start, and can provide direction to the general purpose architecture community.

It seems inevitable that future computer architecture will therefore be programmable, contain elements of application-specific or domain-specific architecture, and be highly heterogeneous in nature. The major challenge is how to efficiently and effectively compile applications onto such platforms. This is a challenge that must be overcome, but one that is no longer faced by the FPGA community alone, as was often the case in the early efforts of high-level synthesis for reconfigurable computing. The coming industrial turn towards heterogeneous parallel computing opens many doors.

2.2. The Promise and the Challenges

High-level synthesis for reconfigurable computing has made great strides recently. The AutoPilot tool [Ref. 10], now included within the Vivado design suite, is a high-quality high-level synthesis environment, using C as the input language. Academic efforts such as LegUp [Ref. 11] also point to a promising future for FPGA-based high-level design. However, existing solutions for high-level synthesis do not — in our opinion — adequately address memory systems. It should not be up to the programmer to explicitly manage the transfer of data between external memory of various types (SDRAM, SRAM, etc.) and on-chip memory. Equally, we should not squander the potential of FPGA architectures by applying general purpose microprocessor cache schemes within an FPGA. In our view, it is high time that tools for customisation of computational circuitry were matched by aggressive tools for customisation of memory subsystem design. By pushing the complexity into the synthesis tool, we believe that significant performance advantages can be obtained without the area or energy overhead of caching schemes. This is the topic we consider in Sec. 2.4. We note that the degree of predictability of memory accesses largely defines the potential improvement possible by a customised memory system, and that often memory accesses are very predictable in nature, especially for embedded applications, which we believe form the key driver for next-generation computer architecture.

The other area that is poorly covered by existing high-level design flows is the automation of the selection of numerical representation and precision. The designer of any hardware accelerator for a numerically intensive algorithm knows that this is one of the areas where customised logic can result in huge performance gains, and will naturally ask 'should I use floating-point, fixed-point, or some more esoteric number system to perform this task?', 'how precise do my internal results really need to be?', etc. These questions remain largely unautomated. As a result, designers will again often ape the systems used in general purpose processor designs, such as IEEE standard floating-point arithmetic as the 'gold standard' of real number representation. There are two problems with this approach. Firstly, it does not work: typically a designer will want to perform operations in a different order to that expressed in the original code, in order to improve hardware efficiency, for example by applying the associative law to regroup addition into a tree structure: a + (b + (c + d)) = (a + b) + (c + d), a law that holds for real numbers but does not hold for floating-point, thus raising questions of correctness. Usually, whether formalised or not, there will be some notion of an acceptable numerical result, which can be used to drive such decisions; indeed, without such a notion, it becomes impossible to demonstrate that the behaviour of even the original source code is acceptable. We strongly advocate the formalisation of such specifications. This leads us to the second problem: the designer operates with 'one hand tied behind her back' by being forced to replicate the hardware structures present in general purpose processors, which may be grossly inefficient for the problem at hand. Once a formal specification of numerical correctness is available, the designer, and the synthesis tool, should be free to produce any hardware structure meeting that specification, playing to the advantages of the underlying architecture. Thus the same algorithmic specification may map automatically to a mixed-precision implementation in a GPU, a double precision implementation in a CPU, and a fixed-point implementation in an FPGA. No two of these implementations may produce the same bit pattern at their outputs, but all should be verifiable with respect to the formal correctness criteria. This is the topic we consider in Sec. 2.3. We note that, while such freedom can be exploited in all numerical applications, the degree of freedom is particularly great in embedded applications, where specifications of correctness tend to be expressible at very high levels of abstraction, leaving lots of freedom for an advanced design tool to explore; e.g. a controller for an aircraft might mandate stability and minimisation of fuel consumption of the aircraft [Ref. 12] — a much higher level of abstraction than bit-level equivalence to a golden C model!
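The failure of associativity is easy to demonstrate. In the following sketch (our illustration), the regrouped tree evaluation returns the exact real-number result, while the serial order loses it entirely to rounding.

```cpp
#include <cstdio>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f, d = 1.0f;
    // Serial order: c + d = 2, but -1e8f + 2.0f rounds back to -1e8f
    // (2 is below half an ulp of 1e8, which is 4), so the sum collapses to 0.
    float serial = a + (b + (c + d));
    // Tree order: (a + b) cancels exactly to 0, leaving the true result 2.
    float tree = (a + b) + (c + d);
    std::printf("serial = %g, tree = %g\n", serial, tree);  // prints 0 vs 2
    return 0;
}
```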

2.3. Numerical Behaviour

When creating digital hardware architectures, one must first select a finite precision number system to represent numerical data. Since this number system can only represent a subset of the real numbers, rounding will often occur after an arithmetic operation so as to represent values using the chosen number system. Whilst the error introduced by the rounding of any single value may be small, over the course of an algorithm the accumulation of these errors can cause a significant deviation from the desired result.

A simple tactic to minimise this error would be to err on the side of safety and select a number system that has much greater precision than necessary to obtain the desired quality of output, if such a precision can be determined. However, this will come at a substantial cost in terms of performance. As an example, recent figures for the difference in performance, in terms of peak theoretical FLOPs, between single and double precision are approximately a factor of 2 to 3 for a CPU [Ref. 13], or 24 for a GPU [Ref. 14].

Since arithmetic computation forms the heart of many high-performance digital systems, if we are to create efficient hardware accelerators, then we first need to select number systems with the minimum precision necessary to guarantee that our design criteria are met. Unlike CPUs and GPUs, FPGAs offer the freedom to fully customise the precision used throughout an accelerator. As a result, the development of techniques to select an optimised number system has been an extensive research topic for the FPGA community over the past decade [Refs. 15, 16]. In this section, we first describe the state-of-the-art techniques that help us guarantee that a given number system for a hardware accelerator satisfies a numerical correctness criterion. We further discuss how these techniques can be enhanced so that they are applicable to a wide range of algorithms. Finally, we outline some of the future challenges for research in this field.


2.3.1. Bounding numerical errors

The most straightforward way to estimate the error of any hardware accelerator is through simulation; indeed, this is the main technique used by industry. Unfortunately, the size of the search space for the inputs will generally be too large to explore exhaustively; this means simulation may miss corner cases and under-allocate the number of bits for an accelerator. This is unacceptable in any safety-critical system, and in any case only works when there is a trusted, 'golden' reference model or method of certification available.

In contrast, analytical approaches provide guarantees that a design criterion will not be violated. Early analytical approaches were based on interval arithmetic (IA) [Ref. 17], affine arithmetic (AA) [Ref. 18] and LTI theory [Ref. 15]. Unfortunately, because IA and AA cannot find tight bounds on the worst-case error, they will typically over-allocate bits for any nontrivial example. LTI theory is powerful enough to compute tight bounds, but it is restricted to the LTI domain, which does not include general multiplication, for example.
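To see why IA over-allocates, consider this minimal implementation (our illustration): interval operations treat their operands as independent, so a correlated expression such as x − x, which is identically zero, is bounded as [−2, 2].

```cpp
#include <algorithm>
#include <cstdio>

// A minimal interval-arithmetic sketch illustrating the dependency problem.
struct Interval { double lo, hi; };

Interval operator+(Interval a, Interval b) { return {a.lo + b.lo, a.hi + b.hi}; }
Interval operator-(Interval a, Interval b) { return {a.lo - b.hi, a.hi - b.lo}; }
Interval operator*(Interval a, Interval b) {
    double p[4] = {a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi};
    return {*std::min_element(p, p + 4), *std::max_element(p, p + 4)};
}

int main() {
    Interval x{-1.0, 1.0};
    Interval r = x - x;                       // truly 0, but IA cannot see it
    std::printf("[%g, %g]\n", r.lo, r.hi);    // prints [-2, 2]
    return 0;
}
```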

More recently, new approaches have been created which involve constructing polynomials to represent the worst-case range of intermediate variables throughout an algorithm. Through computing the lower ($\gamma_{\mathrm{lower}}$) and upper ($\gamma_{\mathrm{upper}}$) bounds of these polynomials, we can select a number system which prevents overflow. Furthermore, if we first construct a polynomial $\tilde{p}$ representing the range of every intermediate variable in the presence of finite precision errors, and a second polynomial $p$ representing the range in infinite precision, then the extrema of the function $(\tilde{p} - p)/p$ represent the worst-case relative error introduced by the use of finite precision arithmetic.

Table 2.1. Construction of polynomials. Here x, y are inputs, and Δ is the error bound determined by the precision, so that |δi| ≤ Δ.

Code        | Polynomial representation of variable value (floating-point)
a = x * y;  | $a = xy(1 + \delta_1)$
b = a * a;  | $b = (xy(1 + \delta_1))^2 (1 + \delta_2)$
c = b - a;  | $c = [(xy(1 + \delta_1))^2 (1 + \delta_2) - xy(1 + \delta_1)](1 + \delta_3)$

To create these polynomials, we use standard models to represent finite precision errors. When using fixed-point, provided there is no overflow, numerical errors are limited to one unit in the last place. If we choose an η-bit number system whose maximum value is $2^X$, the worst-case rounding error for any fixed-point number x is given by Eq. (2.1):

$$|\mathrm{fix}(x) - x| \le \Delta, \quad \Delta = 2^{X-\eta}. \qquad (2.1)$$

It follows that the result of any scalar operation (⊙ ∈ {+, −, ∗, /}) is bounded as in Eq. (2.2):

$$\mathrm{fix}(x \odot y) = (x \odot y) + \delta, \quad |\delta| \le \Delta. \qquad (2.2)$$

Similarly, for floating-point, provided there is no overflow or underflow, for any real value x the closest floating-point approximation fl(x) of x can be expressed as in Eq. (2.3), where η is the number of mantissa bits used:

$$\mathrm{fl}(x) = x(1 + \delta), \quad |\delta| \le 2^{-\eta}. \qquad (2.3)$$

Once again, it follows that the floating-point result of any scalar operation (⊙ ∈ {+, −, ∗, /}) is bounded as in Eq. (2.4):

$$\mathrm{fl}(x \odot y) = (x \odot y)(1 + \delta), \quad |\delta| \le 2^{-\eta}. \qquad (2.4)$$

Through applying these models of error to every computation in an algorithm, we can construct polynomials that represent the potential range of every intermediate variable. This is shown for a simple example in Table 2.1.

While constructing these polynomials is straightforward, finding their extrema is computationally intractable. Instead, algorithms focus on finding a computationally tractable lower bound $\underline{\gamma} \le \gamma_{\mathrm{lower}}$ and upper bound $\overline{\gamma} \ge \gamma_{\mathrm{upper}}$. Ideally, we wish to find bounds such that $\gamma_{\mathrm{lower}} - \underline{\gamma}$ and $\overline{\gamma} - \gamma_{\mathrm{upper}}$ are as small as possible.

One of the latest and most powerful techniques to achieve this is based upon a result from real algebra discovered by Handelman [Ref. 19]. This states that a polynomial p is non-negative over the compact set defined by the positive inequalities $g_i \ge 0$ if and only if p has a Handelman representation of the form of Eq. (2.5):

$$p = \sum_{\alpha \in \mathbb{N}^n} c_\alpha \, g_1^{\alpha_1} g_2^{\alpha_2} \cdots g_n^{\alpha_n}, \qquad (2.5)$$

where each $c_\alpha$ is a positive constant, each $g_i$ is a positive inequality, and ℕ is the set of natural numbers.

Using this result, we first rewrite the bounds of a polynomial, $\underline{\gamma} \le p \le \overline{\gamma}$, as two separate inequalities, $\overline{\gamma} - p \ge 0$ and $p - \underline{\gamma} \ge 0$. If we can find a Handelman representation proving each inequality is non-negative, then we have established lower and upper bounds on the polynomial. Heuristics which search for these representations have been shown to be able to compute much tighter bounds than IA or AA, and enable us to create substantially smaller hardware [Ref. 20].
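As a tiny worked instance (our own, not from the chapter): on the box $0 \le x \le 1$, encoded by the positive inequalities $g_1 = x \ge 0$ and $g_2 = 1 - x \ge 0$, the polynomial $p(x) = x - x^2$ has the Handelman representation

$$p = 1 \cdot g_1^{1} g_2^{1} = x(1 - x),$$

with the single positive coefficient $c_{(1,1)} = 1$, which certifies $p \ge 0$ everywhere on the box.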

2.3.2. Can we apply these techniques to general code?

The techniques described in the previous section are powerful and have been shown to result in substantial performance improvements for some simple benchmarks. Unfortunately, the size of the polynomials can grow exponentially in the number of operations, meaning it would become too time-consuming to apply them to real benchmarks.

However, we can simplify large polynomials by replacing all terms that contribute little to the final result with a single term. Table 2.2 analyses a polynomial representing the range of a floating-point multiplication of two variables; it calculates the worst-case range of every individual term in this polynomial. Clearly, several terms, such as 100y1δ1, 100x1δ1 and 100x1y1δ1, will have little impact on the final bounds. As such, if we replace them with a single new bounded variable, we shrink our polynomial with little impact on the final bounds. This simplification technique enables the earlier bounding procedure to be applied to much larger algorithms consisting of straight-line code [Ref. 21].

Table 2.2. Potential contribution of each monomial in (1 + x1)(1 + y1)(1 + δ1).

Compute a = x·y, where x ∈ [8, 12] and y ∈ [9, 11], in 6-bit floating-point. Let |x1| ≤ 0.2, |y1| ≤ 0.1 and |δi| ≤ $2^{-6}$, so that x = 10(1 + x1) and y = 10(1 + y1). Then

a = 10(1 + x1) · 10(1 + y1)(1 + δ1) = 100 + 100x1 + 100y1 + 100δ1 + 100x1y1 + 100x1δ1 + 100y1δ1 + 100x1y1δ1.

However, many algorithms cannot be converted into straight-line code; algorithms often contain complex control structures such as 'while' loops. The challenge with these structures is that finite precision errors may cause a 'while' loop to fail to terminate [Ref. 16]. Interestingly, these polynomial bounding procedures can also be useful in choosing sufficient precision to ensure that 'while' loops terminate.

One technique to prove program termination is based on the following steps:

(1) Construct a ranking function [Ref. 22], f(x1, . . . , xn), that maps every potential state within the loop to a positive real number.

(2) Prove that, for all potential values of the variables x1, . . . , xn within the loop body, when the ranking function is applied to the loop variables before and after the loop transition statements, it always decreases by more than some fixed amount ∊ > 0, i.e. f(x′1, . . . , x′n) ≤ f(x1, . . . , xn) − ∊.

If we note that proving a ranking function decreases (f(x′1, . . . , x′n) ≤ f(x1, . . . , xn) − ∊) can be re-written as a question of non-negativity (0 ≤ f(x1, . . . , xn) − f(x′1, . . . , x′n) − ∊), then we can apply the same techniques that prove non-negativity to prove termination in finite precision arithmetic [Ref. 16].
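A small illustration of how finite precision breaks such an argument (our example): over the reals, f(x) = x is a valid ranking function for the loop below with ∊ = 0.1, but in 32-bit floating point the update stops making progress once 0.1 falls below half an ulp of x, so the loop never terminates.

```cpp
#include <cstdio>

int main() {
    // Terminates over the reals: x decreases by 0.1 each iteration.
    // In float, ulp(1e8f) = 8, so x - 0.1f rounds back to x and the
    // loop runs forever; the progress check below makes the demo halt.
    float x = 1e8f;
    while (x > 0.0f) {
        float next = x - 0.1f;
        if (next == x) {
            std::printf("no progress at x = %g: loop would never exit\n", x);
            break;
        }
        x = next;
    }
    return 0;
}
```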

2.3.3. Next steps

The techniques described in this section only scratch the surface of research into automatically selecting the minimum precision necessary to meet design criteria. Crucially, however, they offer substantial progress in answering the following question: given a hardware architecture and a word-length specification, will my design satisfy the specification? This will enable further research in this field, including delving deeper into techniques to assign the word-length of each individual operator in a large datapath so as to minimise the total area consumption [Refs. 23, 24], studying how the order of operations in a hardware datapath can affect the error seen at the output, and exploring the relationship between numerical precision and the termination of iterative algorithms. Research into numerical behaviour has entered exciting times.

2.4. Memory Systems

In the preceding sections, we described how high-level specification ofnumerical accuracy can enable us to makemore efficient use of silicon area.Thisinturnenablesbetterperformancewherethenumberofparallelprocessing

Page 34: Transforming Reconfigurable Systems: A Festschrift Celebrating the 60th Birthday of Professor Peter

unitscanbeincreased,providedthoseunitscanbeefficientlyfedwithdata.The most area-efficient technologies in common use for implementing

commoditymemorytoday(DRAMandFlash)haveoptimalprocessparametersthatconflictwiththoseneededtobuildfastlogic.Whereverapplicationsrequirelargeamountsofmemory,thatmemoryisimplementedusingaseparatememorydie. However, because device pin-density and off-chip switching frequencieshavenotscaledasrapidlyas theexponentialgrowthof transistorsdedicatedtologic datapath implementation, external memory bandwidth has increasinglybecomeaperformancebottleneck.

So it is critical to ensure that limited off-chip memory bandwidth is used efficiently. Herein lies a second challenge. DRAM memory is arranged in banks, rows and columns. Each row must be 'activated' before data held in columns within that row can be read or written. The row must then be 'precharged' before data in another row can be accessed. Timing parameters determined by the physical DRAM memory array architecture constrain the minimum time between successive row activations. Over time, increasing memory clock frequencies mean that there is an ever larger penalty paid for random access to DRAM memory. In practical terms, this means there is a greater than 10× performance difference between the worst-case and best-case memory bandwidth obtained through different memory address sequences.

This has made it essential to develop memory subsystems which exploit the locality of memory accesses to provide the illusion of fast access to large amounts of memory. In a CPU, caches and dynamic memory controllers buffer and reorder memory requests to help ensure this happens. They typically must assume no prior knowledge of the sequence of memory requests from the datapath. Furthermore, CPUs implement non-deterministic bus interfaces which make memory performance difficult to analyse. Where a memory system is implemented in reconfigurable hardware, it can be customised for a specific application. Three key benefits can then be realised:

(1) fine-grained on-chip memories provide a very large on-chip memory bandwidth to the customised datapath,

(2) data buffered in those memories can be reused, reducing off-chip memory bandwidth requirements, and

(3) off-chip memory requests can be reordered to make the most efficient use of limited bandwidth.

The most memory-intensive parts of a program tend to be in loops, so we target nested loops in our work [Ref. 25]. Static analysis to model the sequence of memory accesses which occur in nested loops can be done using the Polyhedral Model [Refs. 26, 27]. From this analysis, automated tools allow us to synthesise a high-performance application-specific memory system. In Sec. 2.4.1, we provide a brief overview of the Polyhedral Model. Section 2.4.2 describes a way in which this model can be used to build high-performance application-specific memory systems.

2.4.1. What is the Polyhedral Model?

The Polyhedral Model represents a set of loop iterations as those integer vectors which satisfy a finite set of affine inequalities. The code in Fig. 2.1 shows a two-level nested loop. The set of loop iterations is described by upper and lower bounds which are affine expressions of the surrounding loop variables (x1 and x2). The iterations of an n-level loop nest can be described implicitly as an integer set {x ∈ ℤⁿ | Ax ≤ b}, where A is a 2n × n integer matrix, b is a 2n-element integer column vector, and the vector inequality is interpreted as x ≤ y iff xi ≤ yi for all i.

These iterations can be scheduled according to a linear mapping function which determines a partial ordering of those iterations. For the example given in Fig. 2.1, a mapping function describes the iteration ordering. Each iteration x may access memory via memory accessing function(s) of the form gj(x) = fj x + hj.

Code that fits into this form is common in video processing and dense linear-algebra applications. Exact dependence analysis for code which can be described in this way is often tractable using integer linear programming techniques [Ref. 28]. The Polyhedral Model gives us a formal mathematical representation of the sequence of memory addresses accessed in the program. In Sec. 2.4.2, we show how transformations applied to that formal representation can help build a high performance memory system.

Fig.2.1.Examplecodeforatwo-levelnestedloop.


2.4.2. Building high performance memory systems

In the preceding section, we showed how we could formally characterise the memory access requirements within a nested-loop structure. We can use this information to decouple the off-chip memory accesses from datapath logic using on-chip memory buffers. If we can transform code so that data is reused from the on-chip memory buffer, we can reduce the number of accesses to off-chip memory.

We can represent the specific 'row' and 'burst' accessed in each memory request by adding new dimensions to the loop-nest representation. If the size of each DRAM row is R words, the row accessed by memory address fx + h is given by r = (fx + h) div R = ⌊(fx + h)/R⌋, where ⌊·⌋ represents the floor function. The columns within each row can be represented as non-overlapping bursts to take advantage of the multi-word burst accesses supported by modern memory devices. These can be represented by u = ⌊(fx + h − rR)/B⌋, where a burst is B words long.
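For non-negative addresses both quantities map directly onto integer division, as the following sketch (our own illustration; the struct and function names are hypothetical) shows:

/* Row/burst decomposition of one scalar address a = fx + h, assuming
   a >= 0 so that C's integer division coincides with the floor function. */
typedef struct { int r, u; } rowburst;

rowburst decompose(int a, int R, int B)
{
    rowburst d;
    d.r = a / R;               /* r = floor((fx + h) / R)       */
    d.u = (a - d.r * R) / B;   /* u = floor((fx + h - rR) / B)  */
    return d;
}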

While neither of these is directly amenable to linear algebraic representation, we may note that from the properties of the floor function:

r ≤ (fx + h)/R < r + 1    (2.6)

and

u ≤ (fx + h − rR)/B < u + 1.    (2.7)

We can rewrite Eq. (2.6) and Eq. (2.7) as linear inequalities as shown below in Eq. (2.8) and Eq. (2.9), without loss of information:

0 ≤ fx + h − rR ≤ R − 1    (2.8)

0 ≤ fx + h − rR − uB ≤ B − 1.    (2.9)

We can then add these four extra inequalities to those already present defining the loop bounds. This forms, for each memory reference, an augmented system of linear inequalities that completely captures not only the iteration space but also the specific SDRAM rows and bursts accessed within the innermost loop.

Using standard unimodular loop transformations, we can transform this augmented polyhedral representation to expose those occasions where data items are reused by multiple loop iterations. After transformation, those redundant dimensions which only represent data reuse can be projected out of the resulting polyhedral representation to produce code which fetches each memory item only once from off-chip memory. From this representation, we can use standard loop reordering transformations to move the 'row' dimension to the outermost level of the loop nest, improving data-locality.

Figures 2.2(a)–(d) illustrate this process for the example code shown in Fig. 2.1. The code in Fig. 2.2(a) is augmented with the row iterator 'r' and the burst iterator 'u' to form Fig. 2.2(b). The 'r' variable can then be hoisted to the outermost loop level to give Fig. 2.2(c). Note now that the x2 variable is accessed only once in each loop iteration and can therefore be eliminated to give Fig. 2.2(d). The sequence of memory addresses accessed in Fig. 2.2(d) covers the same rows and bursts as the original source code, but now does so in a more efficient order, since fewer row-swaps (and their associated timing penalties) are incurred.
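Since Fig. 2.2 is not reproduced here, the following sketch of stages (a) and (d) for a hypothetical reference A[x1 + x2] may help; N1, N2, the row size R and the burst size B are illustrative, and use()/fetch_burst() are placeholder routines:

enum { N1 = 64, N2 = 64, R = 1024, B = 8 };
void use(int w);
void fetch_burst(const int *addr);

/* (a) original nest: successive iterations keep returning to rows that
   have already been opened and closed. */
void original(const int *A)
{
    for (int x1 = 0; x1 < N1; x1++)
        for (int x2 = 0; x2 < N2; x2++)
            use(A[x1 + x2]);
}

/* (d) after augmenting with r and u, hoisting r outermost and projecting
   out the redundant x2: every burst of a row is fetched exactly once
   before the row is precharged. */
void transformed(const int *A)
{
    int extent = N1 + N2 - 1;               /* addresses 0 .. extent - 1 */
    for (int r = 0; r * R < extent; r++)    /* 'row' hoisted outermost   */
        for (int u = 0; u * B < R && r * R + u * B < extent; u++)
            fetch_burst(&A[r * R + u * B]);
}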

When this technique is applied to code, it can significantly improve interface bandwidth efficiency. We show this in Fig. 2.3 for three benchmarks (Matrix–Matrix Multiply, Sobel Filter and Gaussian Back-substitution) parameterised with reuse buffers inserted at different levels of the loop nest. The insertion of the buffer at the outermost level of the loop nest (t = 1) allows reordering of all memory accesses and means less than 10% of memory access cycles are spent idle whilst DRAM rows are swapped, compared with >75% in the original code. The different levels of parameterisation allow a tradeoff between performance and the amount of on-chip memory dedicated to data buffering.


Fig. 2.2. Transformation steps in improving memory bandwidth efficiency.


Fig. 2.3. Memory bandwidth improvements: breakdown of interface commands by type.

2.4.3. What might this enable us to do in the future?

Looking beyond our existing work, the formal model of memory access provided by the Polyhedral Model is a promising representation for enabling other application-specific memory transformations. One emerging area of research is the exploration of how the Polyhedral Model enables the overlapping of off-chip memory operations with on-chip computation. This work makes use of mathematical advances [Ref. 29] which allow us to count the exact number of integer points contained within a polyhedron without enumerating them.

Knowledge of the exact lifetime of variables fetched into on-chip memory can enable more compact mapping of those variables into limited on-chip memory. Exploratory work on how to better utilise the multiple independent banks within a DRAM also seems like a promising direction, allowing us to further improve the efficiency of off-chip memory accesses.

Exact dependency analysis allows auto-parallelisation, but to support this, we need to ensure that enough on-chip memory ports are available to avoid contention. Emerging automatic array partitioning techniques [Ref. 30] ensure that contention for on-chip memory ports is minimised. This allows efficient use of the large on-chip bandwidth provided by block RAM resources which are ubiquitous in modern heterogeneous FPGAs.

The key theme is that there are significant opportunities opened up by expanding our synthesis tools to target complete reconfigurable systems including off-chip memory. The formal representation of memory access sequences provided by the Polyhedral Model allows tools to automatically produce efficient application-specific hardware with tailor-made memory systems.


2.5. Conclusion

Our view is that many problems in high-level design automation, once of concern to the small group of pioneers of reconfigurable computing, now arise in various guises in the much broader setting of computing generally, and embedded computing in particular. While high-level synthesis and compilation tools have progressed significantly over the past decade, we believe that there are two very significant gaps in existing tool flows: customisation of memory systems and auto-generation of finite precision arithmetic implementations. We have described our own approaches to these central problems. Our view is that the FPGA computing community is poised to play a central role in the evolution of computer architecture and compilers over the next decade. We must take up the baton.

2.6. Acknowledgements

We wish to acknowledge the inspirational mentorship of Professor Peter Y.K. Cheung, one of the small group of pioneers of FPGA-based computing. In addition, we would like to express our thanks to EPSRC for funding received to support the development of the ideas expressed in this paper (grants EP/G03157/1, EP/I012036/1, EP/I020357/1, EP/K034448/1).

References

1. S. Wilton. Architecture and Algorithms for Field-Programmable Gate Arrays with Embedded Memory, PhD thesis, University of Toronto, 1997.

2. S. Haynes, A.B. Ferrari, and P.Y.K. Cheung. Flexible Reconfigurable Multiplier Blocks Suitable for Enhancing the Architecture of FPGAs, in Proc. Custom Integrated Circuit Conference, pp. 16–19, 1999.

3. I. Kuon and J. Rose. Measuring the Gap between FPGAs and ASICs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2), 203–215, 2007.

4. J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach (5th ed.), Morgan Kaufmann, Burlington, MA, USA, 2012.

5. W.J. Dally. The End of Denial Architecture and the Rise of Throughput Computing, in Proc. Design Automation Conference, p. xv, 2009.

6. R. Dennard et al. Design of Ion-Implanted MOSFETs with Very Small Physical Dimensions, IEEE Journal of Solid State Circuits, SC-9(5), 256–268, 1974.

7. T.N. Mudge. Power: A First-Class Architectural Design Constraint, IEEE Computer, 34(4), 52–58, 2001.

8. A. Rafique, N. Kapre, and G.A. Constantinides. Avoiding Communication in GPU and FPGA-based Sparse Iterative Solvers: Algorithm and Architecture Interaction. In preparation.


9. H. Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling, in Proc. International Symposium on Computer Architecture, 2011.

10. Z. Zhang et al. "AutoPilot: A Platform-Based ESL Synthesis System", in eds. P. Coussy and A. Morawiec, High-Level Synthesis, pp. 99–112, Springer, Netherlands, 2008.

11. A. Canis et al. LegUp: High-level Synthesis for FPGA-based Processor/Accelerator Systems, in Proc. International Symposium on Field Programmable Gate Arrays, pp. 33–36, 2011.

12. E. Hartley et al. Predictive Control of a Boeing 747 Aircraft using an FPGA, in Proc. IFAC Nonlinear Model Predictive Control Conference, pp. 80–85, 2012.

13. A. Vladimirov. White paper: "Arithmetics on Intel's Sandy Bridge and Westmere CPUs: Not All FLOPs are Created Equal", 2012.

14. NVIDIA. NVIDIA Tesla Kepler GPU Computing Accelerators. [Online] Available at: http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf. [Accessed 23 April 2014].

15. G.A. Constantinides, P.Y.K. Cheung, and W. Luk. The Multiple Wordlength Paradigm, in Proc. International Conference on Field-Programmable Custom Computing Machines, 2001.

16. D. Boland and G. Constantinides. Wordlength Optimization Beyond Straight Line Code, in Proc. International Symposium on Field-Programmable Gate Arrays, 2013.

17. R.E. Moore. Interval Analysis, Prentice-Hall, Englewood Cliffs, NJ, USA, 1966.

18. L.H. de Figueiredo and J. Stolfi. Self-Validated Numerical Methods and Applications, Brazilian Mathematics Colloquium Monographs, IMPA/CNPq, Rio de Janeiro, Brazil, 1997.

19. D. Handelman. Representing Polynomials by Positive Linear Functions on Compact Convex Polyhedra, Pac. J. Math., 132(1), 35–62, 1988.

20. D. Boland and G. Constantinides. Bounding Variable Values and Round-off Effects using Handelman Representations, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 30(11), 1691–1704, 2011.

21. D. Boland and G.A. Constantinides. A Scalable Precision Analysis Framework, IEEE Transactions on Multimedia, 15(2), 242–256, 2013.

22. B. Cook, A. Podelski, and A. Rybalchenko. Proving Program Termination, Communications of the ACM, 2009.

23. D.-U. Lee et al. MiniBit: Bit-width Optimization via Affine Arithmetic, in Proc. Design Automation Conference, pp. 837–840, 2005.

24. D.M.H.-N. Nguyen and O. Sentieys. Novel Algorithms for Wordlength Optimization, in Proc. European Signal Processing Conference, pp. 1944–1948, 2011.

25. S. Bayliss and G.A. Constantinides. Optimizing SDRAM Bandwidth for Custom FPGA Loop Accelerators, in Proc. International Symposium on Field Programmable Gate Arrays, pp. 195–204, 2012.

26. C. Lengauer. Loop Parallelization in the Polytope Model, in Proc. International Conference on Concurrency Theory, pp. 398–417, 1993.

27. W. Kelly and W. Pugh. A Framework for Unifying Reordering Transformations, Technical Report UMIACS-TR-92-126.1, University of Maryland, College Park, MD, USA, 1993.

28. W. Pugh. The Omega Test: A Fast and Practical Integer Programming Algorithm for Dependence Analysis, Communications of the ACM, 8, 4–13, 1992.

29. A. Barvinok. A Polynomial Time Algorithm for Counting Integral Points in Polyhedra when the Dimension is Fixed, in Proc. Symposium on the Foundations of Computer Science, pp. 566–572, 1993.

30. P. Li et al. Memory Partitioning and Scheduling Co-optimization in Behavioral Synthesis, in Proc. International Conference on Computer-Aided Design, pp. 488–495, 2012.


Chapter 3

An FPGA-Based Floating Point Unit for Rounding Error Analysis

Michael Frechtling and Philip H.W. Leong
School of Electrical and Information Engineering,
The University of Sydney

Detection of floating-point rounding errors normally requires runtime analysis in order to be effective, and software-based tools are seldom used due to their extremely high computational demands. In this chapter we present a field programmable gate array (FPGA) based floating-point coprocessor which supports standard IEEE-754 arithmetic, user-selectable precision and Monte Carlo arithmetic (MCA). This coprocessor enables the detection of catastrophic cancellation and the minimisation of the required floating-point precision in reconfigurable computing applications.

3.1. Introduction

IEEE-754 [Ref. 1] has long been the standard for computing using floating-point (FP) numbers; however, as a finite precision arithmetic system it is capable of anomalous results. Rounding error during computation can significantly reduce the accuracy of a computation, and result in errors many times larger than expected [Ref. 2]. In order to properly implement and verify numerical software, techniques to determine the effects of such errors are required. Monte Carlo arithmetic (MCA) [Ref. 3] can track rounding errors at runtime by forcing inputs and outputs to behave like random variables. Analysis of repeated operations turns an execution into trials of a Monte Carlo simulation, allowing statistics on the effects of rounding errors to be obtained. MCA is typically performed using SW routines and as such its implementation involves a drastic reduction in performance. Field programmable gate arrays (FPGAs) offer a platform in which hardware (HW) acceleration can be applied to arbitrary algorithms.

In this work, we describe a complete HW accelerator for runtime error analysis. A novel MCA coprocessor architecture is employed which is implemented entirely using standard floating-point cores. System-level performance measurements are described, and a comparison with an existing (SW) implementation made.

This work was influenced by Professor Cheung's seminal research on optimizing floating-point bitwidths to advantageously utilize the reconfigurable nature of FPGAs. It is complementary to his approach to floating-point sensitivity analysis based on automatic differentiation [Ref. 4] and can be applied to customized hardware floating-point [Ref. 5] and dual fixed-point [Ref. 6] implementation schemes.

3.2. Background

3.2.1. IEEE-754 floating point

The binary IEEE-754 [Ref. 1] floating-point number system (β, p, emin, emax) is a subset of the real numbers with elements of the form

x = (−1)^s · m · β^e.    (3.1)

The system is characterized by the radix β, which is assumed to be 2 in this paper, the precision p, the exponent values emin ≤ e ≤ emax, the sign bit s ∈ {0, 1} and the mantissa m ∈ [0, β). Normalized values are most commonly used and are represented as a non-zero x with |m| ∈ [1, β) and emin < e < emax. De-normalized numbers are also supported and represent values of smaller magnitude than normalized numbers, with m ∈ [0, 1) and exponent e = emin. Other classes of numbers including +/− Zero, Infinity and Not a Number (NaN) are available with special formats. Without loss of generality, we assume the 32-bit IEEE single precision format in this paper with p = 24, emin = −125 and emax = 128. Real numbers are generally not exactly representable as FP numbers due to a number of factors including errors of measurement or estimation, quantization error or errors propagated from earlier parts of a computation. Although the IEEE-754 standard is used in all types of applications mostly without issue, if one or more non-exact numbers are subtracted, a loss of significant digits can occur due to normalization of the result [Ref. 2]. This phenomenon is called catastrophic cancellation and is one of the major causes of loss of significance.


3.2.2. Error analysis

Several systems have been developed for performing runtime error detection and analysis. Interval Arithmetic (IA) represents a value x by an interval [xlo, xhi]. Intervals are propagated through the calculation, e.g. [alo, ahi] − [blo, bhi] = [alo − bhi, ahi − blo]. IA can be used to track inexact values and rounding errors during computation; however, it often produces overly pessimistic error bounds [Ref. 7]. A limited number of hardware implementations of IA can be found in the literature [Refs. 8–10]. The CESTAC method [Ref. 11] is a special case of MCA that involves executing the same computation several times while randomly perturbing the rounding scheme of the arithmetic operators. By comparing the results from a number of different executions, the number of significant digits can be estimated. A hardware implementation of the CESTAC method has also been published [Ref. 12], but we are not aware of any system level implementation. An FPGA based implementation of MCA addition and multiplication with an area penalty of less than 22% over IEEE-754 was recently published by Yeung et al. [Ref. 13]. Compared with their implementation, this work presents a complete system for MCA rather than just an MCA core. In addition, while Yeung described a custom FPU for MCA, the FP computations in this work are performed using standard IEEE-754 FP primitives, providing significant benefits in terms of portability, flexibility and development time.

3.3. Monte Carlo Arithmetic

If x is a floating-point value of the form given in Eq. (3.1) we define the inexact function as

inexact(x) = x + β^(ex − t) · ξ,    (3.2)

where x ∈ ℝ; t is a positive integer representing the desired precision; ξ is a uniformly distributed random variable in the range [−1/2, 1/2); and mx, ex are the mantissa and exponent of x. It is assumed that 1 < t ≤ p. An operation ∘ ∈ {+, −, ×, ÷} is implemented as

x ∘ y = inexact(x) ∘ inexact(y)    (3.3)

x ∘ y = inexact(inexact(x) ∘ inexact(y)).    (3.4)

Adjustments to input operands are referred to as precision bounding and are used to detect catastrophic cancellation, while adjustments to outputs are referred to as random rounding and are used to detect round-off error [Ref. 3]. The system developed for this paper performs precision bounding using multiple floating-point computations. Using this method the operation can be performed without modifying the internal architecture of the standard IEEE-754 FPU; however, the use of standard FP operations results in FP rounding being applied multiple times within a single MCA operation. In the case of traditional MCA the average of a Monte Carlo simulation can be used to estimate the true result of an operation sensitive to rounding error, and the relative standard deviation used to detect catastrophic cancellation. In the case of the MCA implemented for this paper, the average value of the results of a Monte Carlo simulation cannot be used to estimate the true result of the tested operation, as this average is affected by the use of multiple rounding stages. In this case the system is only used for the detection of catastrophic cancellation. The value t shown in Eq. (3.4) is the virtual precision of the MCA operation. This value determines the number of places the value ξ is shifted to the right of the mantissa of the floating-point value x, and is used to control the level of random fluctuations applied during MCA. The virtual precision of the MCA operations is set to a positive integer less than or equal to the machine precision of the floating-point system being used: 1 < t ≤ p.

A large t value will result in a smaller exponent value for the operand perturbation, increasing the accuracy of the operation. Similarly, a smaller t decreases the accuracy. In practice, variation of t is used to determine what effect lowering or increasing the precision has on the accuracy of an operation. The results of this analysis can then be used to determine an appropriate value for the machine precision p of a floating-point operation that will maintain a required level of accuracy. The implementation developed for this paper performs variable precision MCA, and the value of t used by the coprocessor can be modified at any time during execution. Further details are provided in Sec. 3.4.
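In software, the inexact function can be sketched in a few lines of C (our own illustration, not the chapter's implementation; frexpf's exponent convention may differ from that of Eq. (3.1) by a constant offset, and rand() stands in for a proper Tausworthe generator):

#include <math.h>
#include <stdlib.h>

/* Perturb x at the t-th digit of its mantissa: inexact(x) = x + 2^(ex - t) * xi. */
float inexact(float x, int t)
{
    int ex;
    frexpf(x, &ex);                               /* x = m * 2^ex, |m| in [0.5, 1) */
    float xi = rand() / (float)RAND_MAX - 0.5f;   /* xi uniform in [-0.5, 0.5)     */
    return x + ldexpf(xi, ex - t);                /* add the scaled perturbation   */
}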

3.4. System Implementation

The MCA FPU is an FPGA coprocessor connected to a MicroBlaze soft processor through an AXI4-Stream bus. The coprocessor is capable of performing both standard floating-point arithmetic and MCA. The MCA FPU core was developed using the high-level C-to-RTL design software AutoESL (http://www.xilinx.com/tools/autoesl.htm). Using AutoESL, our MCA FPU is described using standard C statements, and during synthesis and implementation floating-point operations are translated into a set of floating-point modules based on the IEEE-754 floating-point library.

3.4.1. MCA FPU implementation

The MCA FPU is able to perform four basic arithmetic operations: add, subtract, multiply and divide. A fifth, configure, operation allows the precision value, t, and the MCA flag to be modified at runtime. The final operation combines the add and multiply operations to perform an FMA (fused multiply-add) operation, which calculates the result of the operation r = (a ∗ b) + c. The arithmetic operations are implemented by coupling standard IEEE-754 floating-point operator primitives with a configuration register and a perturbation generation module. Each perturbation module is used to determine a value that will be added to the operation based on the value of the operands and a random number. Random numbers are generated using maximally equidistributed Tausworthe generators (TRNGs) [Ref. 14]. The configuration information consists of a 1-bit Boolean flag indicating MCA or IEEE-754 mode, and a 32-bit unsigned value for t. These are stored in the configuration register for access during subsequent operations. In IEEE mode, the FPU fully supports the standard. A description of how the operators are implemented is given below.
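The operation set and configuration state might be modelled as follows (a minimal sketch; the names and encodings are our assumptions, not the coprocessor's actual interface):

#include <stdint.h>

typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_DIV, OP_FMA, OP_CONFIG } mca_op;

typedef struct {
    uint32_t mca_mode;   /* 1-bit flag: 1 = MCA, 0 = plain IEEE-754        */
    uint32_t t;          /* 32-bit virtual precision, updatable at runtime */
} mca_config;

/* Select the datapath for one coprocessor command. In MCA mode each
   operand would first pass through a perturbation module; only the
   operation selection is sketched here. */
float mca_execute(mca_op op, float a, float b, float c, const mca_config *cfg)
{
    (void)cfg;
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_MUL: return a * b;
    case OP_DIV: return a / b;
    case OP_FMA: return (a * b) + c;   /* r = (a * b) + c */
    default:     return 0.0f;          /* OP_CONFIG updates cfg elsewhere */
    }
}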

3.4.1.1. MCA addition/subtraction

Addition and subtraction operations are performed in terms of the inexact function (Eq. (3.3)): the result is perturbed by a combined value ξ formed from the two random values ξx and ξy, scaled according to the operand exponents. The magnitude of ξ can be calculated using only positive values, and an equal-probability choice of addition or subtraction is then made. Note that the distribution of ξ depends on the values of x and y. The floating-point value ξ can be calculated by first using fixed-point arithmetic to produce separate values for eξ and mξ, then combining these values, along with a randomly selected value for sξ, in the correct format to produce a floating-point number: ξ = (−1)^sξ · mξ · β^eξ.

To perform this operation two random values for ξx and ξy must be calculated. These values will be used to form mξ and as such must be 24-bit normalized fixed-point values, each lying in the required range. Two 32-bit values are produced (one from each TRNG) and the lower 22 bits assigned as the fixed-point value of ξx or ξy. The MSBs of each 32-bit number are used to calculate the sign bit sξ. Once the fixed-point values of ξx and ξy have been produced, the value of mξ can be calculated using fixed-point arithmetic. At this point we have produced a value mξ ∈ [0, 1). To produce the final value for ξ this value must be normalized, the β^(ex − t) shift applied and the value converted to IEEE-754 single precision format. This is done in the following stages:

(1) Determine the number of leading zeroes λξ. The leading zero detector (LZD) used for the Monte Carlo FPU is based on the LZD found in [Ref. 15]. The mantissa value mξ is then shifted left or right depending on the value of λξ, forming the final 24-bit normalized mantissa.

(2) Calculate the 8-bit exponent value based on the values of ex, λξ and t.

(3) Merge the sign, exponent and mantissa values to form the single precision floating-point value, using left-shifts to move the sign and exponent values to the correct location (a sketch of this packing step follows).
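In C, the merge of stage (3) might look like this (a minimal sketch under the assumption of a 32-bit IEEE-754 layout; the function name is ours):

#include <stdint.h>
#include <string.h>

/* Pack sign, biased 8-bit exponent and 23-bit fraction into a single-
   precision value using left-shifts, as in stage (3) above. */
float pack_float(uint32_t sign, uint32_t biased_exp, uint32_t frac23)
{
    uint32_t bits = (sign << 31) | ((biased_exp & 0xFFu) << 23) | (frac23 & 0x7FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bit pattern as a float */
    return f;
}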

Once a value for ξ has been produced, the second addition operation is performed, producing the final result.

3.4.1.2. MCA multiplication

The multiplication operation can be represented in terms of the inexact function shown in Eq. (3.4) as follows:

x × y = (x + β^(ex − t) ξx)(y + β^(ey − t) ξy).

The perturbation values in the above equation can be expanded and simplified to the following:

x × y = xy + β^(ex + ey − t)[mx ξy + my ξx] + β^(ex + ey − 2t) ξx ξy.

From the above equation it can be seen that the ξx ξy term will be shifted to the right by 2t places during the operation. Calculation of the perturbation value including this term would require the precision of the FPU to be extended, either by modifying the internal architecture of the FPU core or by performing the Monte Carlo calculation in a higher precision format. This can be avoided by not including the ξx ξy term in the calculation, which can be done without significantly affecting the results, as the large right shift results in an extremely small value for ξx ξy relative to mx ξy + my ξx. The Monte Carlo multiplication operation can therefore be simplified to the following:

x × y = r′ + ξ,

where r′ = x × y and ξ = β^(ex + ey − t)[mx ξy + my ξx]. The magnitude of ξ can be calculated using only positive values via

|ξ| = β^(ex + ey − t)[mx |ξy| ± my |ξx|],

where a randomized, equal probability choice of addition or subtraction is made. Note that |ξ| ∈ β^(ex + ey − t)[0, 2) and its distribution depends on the values of x and y. In MCA multiplication a similar method to addition is employed to produce the perturbation value. Two TRNGs are first used to produce 24-bit fixed-point random numbers and the corresponding sign bits; these represent ξx and ξy. Each value is then multiplied by the mantissa of the relevant operand and the resulting values added together, resulting in a value for mξ. This process produces a value mξ ∈ (−2, 2). This value is used to produce a single precision floating-point value for ξ as follows:

(1) Determine the number of leading zeroes λξ in mξ and normalize.

(2) Calculate the exponent value eξ.

(3) Merge the sign, exponent and mantissa values to form the 32-bit floating-point perturbation value.

Once the final perturbation value ξ has been produced it is added to the initial result r′ to produce the final result of the operation.

3.4.1.3. MCA division

The division operation differs from addition, subtraction and multiplication in that two individual floating-point perturbation values ξx and ξy are produced rather than a single perturbation value ξ. This operation can be described in terms of Eq. (3.4) as follows:

x ÷ y = (x + ξx) / (y + ξy).

The above equation cannot be easily simplified to a point where a combined value for ξ can be calculated, as for the previously discussed operators. Separate perturbation values (ξx and ξy) are therefore calculated and applied to the x and y operands, requiring the precision of the division operation to be extended. This is done by performing single precision (32-bit) MCA division using double precision (64-bit) floating-point division. The perturbation values are calculated as for the other operators, scaled by β^(ex − t) and β^(ey − t) respectively.

Each perturbation value is applied to the relevant operand using a standard IEEE-754 floating-point addition operation, after which a standard IEEE-754 double precision division operation is performed. Although this calculation requires a total of three FP operations, calculation and addition of the two perturbation values can be performed in parallel, and the only increase in overhead over addition/multiplication is from the double precision division operation. MCA division is performed as follows. Two TRNGs are used to produce 24-bit fixed-point mantissa values for ξx and ξy and their corresponding sign bits. The values are then converted to single precision floating-point format:

(1) Determine the number of leading zeroes λξi in each mantissa and normalize.

(2) Calculate the exponent values.

(3) Merge the sign, exponent and mantissa values.


Once the perturbation values ξx and ξy have been calculated, the final result is computed.
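Putting the pieces together, the double-precision divide path might look like this (a minimal sketch of the flow just described; the function name and the pre-computed perturbation arguments are our own framing):

/* MCA division via double-precision divide: perturb each operand with a
   single-precision add, then divide in double precision. The two adds are
   independent and can proceed in parallel in hardware. */
float mca_div(float x, float y, float xi_x, float xi_y)
{
    float xp = x + xi_x;                       /* standard IEEE-754 single add */
    float yp = y + xi_y;                       /* standard IEEE-754 single add */
    return (float)((double)xp / (double)yp);   /* IEEE-754 double divide       */
}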

3.5. Testing Methods

Testing of the FPU was conducted by comparing the performance of the coprocessor to a SW implementation of MCA. The test routines used are based on routines used by Parker in [Ref. 16], downloadable from http://www.cs.ucla.edu/~stott/mca. Details of equipment and parameters are in Table 3.1. In order to compile unmodified C source code to use MCA, two different versions of the gcc software floating-point library were developed. For the FPGA case, a library in which the addsf3, subsf3, mulsf3 and divsf3 routines were changed to utilize the MCA coprocessor was created. FP operations can then be redirected to the appropriate subroutine by invoking gcc with the -msoft-float option. PC implementations of MCA were compiled in a similar fashion using a different, software-only MCA library. For each test case discussed below three tests were performed: the first was on a PC using a floating-point unit (SW-FP); the second a PC with software MCA (SW-MCA); and finally, an FPGA using the Monte Carlo coprocessor (HW-MCA).
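With -msoft-float, every single-precision operation compiles to a libgcc call, so interception reduces to providing replacement entry points. A minimal sketch (mca_coprocessor_add() is a hypothetical routine standing in for the AXI4-Stream transaction):

/* Replacement for libgcc's single-precision soft-float add: with
   -msoft-float, the compiler lowers 'a + b' to a call to __addsf3. */
extern float mca_coprocessor_add(float a, float b);   /* hypothetical */

float __addsf3(float a, float b)
{
    return mca_coprocessor_add(a, b);
}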

Table 3.1. System parameters.

Item                    Version/Description

FPGA parameters
ISE Version             13.2
FPGA                    Virtex-6 LX240T (Speed Grade 3)
FPGA Board              Xilinx ML-605 Development Board
Processor Clock Speed   150 MHz
MCA Core Clock Speed    150 MHz

PC parameters
CPU                     Intel Core 2 Duo, 3 GHz
Memory                  4 GB
OS                      Ubuntu 12.04 32-bit
GCC Version             4.7.0

3.5.1. Cancellation (Knuth) test

The cancellation test performs a simple associativity test by calculating

u = (x + y) + z,
v = x + (y + z).

Over the real numbers, u should equal v; however, for the values x = 11111113.0, y = −11111111.0, z = 7.5111111, catastrophic cancellation occurs. Using these values the difference between |u| and |v| is calculated over 1000 samples and the standard deviation of the results is used to determine the accuracy of the calculation.
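The test is only a few lines of C (a minimal sketch of a single IEEE-754 trial; under MCA each evaluation would be repeated over many samples):

#include <stdio.h>

int main(void)
{
    float x = 11111113.0f, y = -11111111.0f, z = 7.5111111f;
    float u = (x + y) + z;
    float v = x + (y + z);   /* y + z loses the low-order digits of z */
    printf("u = %.7f  v = %.7f  u - v = %.7f\n", u, v, u - v);
    return 0;
}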

3.5.2. Cosine test

The cosine test calculates the cosine function using a power series expansion,

cos(z) = 1 − z²/2! + z⁴/4! − z⁶/6! + ···,

for z ∈ [0, π]. For each value of z over n steps a set of 100 samples is calculated and compared to the value of a single precision FP calculation of the same value, and the accuracy of the calculation measured at each step.
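A direct single-precision evaluation of the series looks like this (our own sketch; the term count and the number of steps are illustrative):

#include <math.h>
#include <stdio.h>

/* Evaluate cos(z) = sum_k (-1)^k z^(2k) / (2k)! in single precision. */
static float cos_series(float z)
{
    float term = 1.0f, sum = 1.0f;
    for (int k = 1; k <= 10; k++) {
        term *= -z * z / (float)((2 * k - 1) * (2 * k));   /* next series term */
        sum += term;
    }
    return sum;
}

int main(void)
{
    for (int i = 0; i <= 8; i++) {                         /* n = 8 steps */
        float z = (float)(i * 3.14159265 / 8.0);
        printf("z = %.5f  series = %.7f  libm = %.7f\n", z, cos_series(z), cosf(z));
    }
    return 0;
}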

3.5.3. Kahan test

The Kahan test performs an evaluation of a rational polynomial rp(·). The polynomial rp(u) is first evaluated for u = 1.60631924 using single precision IEEE arithmetic, then n results for rp(x) are calculated using MCA for increasing values of x, with ε = 2⁻²³. The difference value d = rp(x) − rp(u) is then calculated for each iteration. Results of both the MCA test and the standard IEEE-754 test can then be compared to determine the difference in result distribution.

3.5.4. LINPACK


The LINPACK benchmark determines system performance by measuring the time taken to solve a dense n × n system of linear equations Ax = b [Ref. 17]. Using this benchmark, performance of the system can be easily measured using an industry benchmark tool, and compared to the performance of an equivalent SW solution. Statistical measurements of the results for x have also been made, and precision testing performed using the benchmark to demonstrate the use of variable precision MCA. In this work, n = 300 was used.

3.6. Results

3.6.1. System performance and size

Tables 3.2 and 3.3 provide performance results and logic utilization figures for the MCA coprocessor. The primary test used to determine system performance is the LINPACK benchmark, as this is an industry standard example of an FP benchmark tool. In order to achieve maximum performance the LINPACK benchmark was profiled to determine which functions would provide the most benefit from optimization. These functions were optimized by performing operations of the type (a ∗ b) + c using the coprocessor FMA operation. The results show that the MCA coprocessor achieved a speed of 3.5 MFLOPS for the LINPACK test, a figure which corresponds with the average performance of the system during all test routines. This can be compared to the PC performing MCA in software, which achieved an average speed of 4.4 MFLOPS during the LINPACK test, again similar to the average speed of 4.75 MFLOPS achieved across all test routines. From this comparison two things are noted: first, there is a 200× to 600× decrease in performance for software-based MCA over standard FP; second, the FPGA implementation has comparable performance to the SW implementation. Table 3.3 shows that this performance has been achieved with a 5× increase in logic utilization over a single precision IEEE FP unit capable of performing the same arithmetic operations. Compared with Yeung et al. [Ref. 13], this design has similar throughput but a considerable area overhead due to implementing/interfacing each MCA operation separately in order to reduce I/O overhead.

Table 3.2. System performance (measured).


3.6.2. Improving performance

The FPGA MCA core implementation has not been fully optimized. Performance is limited by I/O overhead, a conflict between the -msoft-float and optimization flags in gcc, and the maximum clock speed of the implementation. To assess the scope for overlapping communication with computation, the LINPACK benchmark was profiled, with results indicating that 92% of computation time was spent in the daxpy subroutines. In addition, analysis of the coprocessor I/O overhead showed 80% of execution time spent transferring data and only 20% on computation. The maximum speed-up, S, for LINPACK can be calculated using Amdahl's Law [Ref. 18]:

S = 1 / (f + (1 − f)/P),

Table 3.3. System logic utilization.


where P = 5 is the maximum speed increase achievable by minimizing I/O, and f = (1 − 0.92) = 0.08 is the ratio of non-daxpy to daxpy computation; this gives S = 1/(0.08 + 0.92/5) ≈ 3.8. This value could be approached since the daxpy operations are vectorizable. Further performance improvements can be obtained using the gcc optimization flags, which were not used in this case due to conflicts with the -msoft-float option. Testing of routines described in this paper by directly modifying the source code showed that a consistent 2× speed improvement is achieved with −O1 optimization. Finally, the MicroBlaze maximum clock frequency limits the overall system clock frequency to 150 MHz, while the Zynq family of processors recently released by Xilinx is capable of clock frequencies of 800 MHz. In addition, Xilinx LogiCORE IP FP operator cores can achieve a maximum frequency of 400 MHz. Thus a conservative estimate for the maximum frequency of an optimized design is 400 MHz, a 2.5× speed-up over the reported design. The optimizations in this section are orthogonal and, taken together (3.8 × 2 × 2.5 ≈ 19), an additional ~20× speed-up may be possible. This would improve the performance of our approach to 60 MFLOPS.

3.6.3. Error detection

Figure 3.1 shows the results of two runs of the cancellation test. The histogram on the left shows results using the values given in Sec. 3.5.1 and demonstrates the distribution of results for operations susceptible to round-off error. From Table 3.2 it can be seen that the results of this test have a standard deviation of 0.51, this being of the same order as the mean, 0.59. It can thus be concluded that, for the given inputs, the coprocessor detected catastrophic cancellation. The distribution of these results can also be compared to the histogram on the right side of Fig. 3.1, which shows results for another execution of the cancellation test using different inputs. From the histogram, it can be seen that the standard deviation and relative standard deviation of the results are much lower and hence, for these inputs, lower sensitivity to rounding error is indicated. The results of the Kahan test are shown in Fig. 3.2. The plot on the right side of the figure shows the results of the Kahan test when performed using standard IEEE-754 FP operations, and as can be seen in the plot, the results do not show random rounding. The plot on the left side of the figure shows results obtained using the MCA coprocessor. Comparing these two sets of results it can be seen that MCA operations performed by the coprocessor produce randomly rounded results, and that statistical analysis of these results can be used to determine the sensitivity of the system to rounding error.

Fig. 3.1. Distribution of results for [(x + y) + z] − [x + (y + z)].

Fig. 3.2. Kahan test results.

3.6.4. Precision testing

The final set of testing and results demonstrates the ability of the MCA coprocessor to perform variable precision MCA, and to determine the minimum precision required to perform an operation to a specified accuracy. The LINPACK benchmark is executed using n = 300 and the virtual precision of MCA operations varied between 1 ≤ t ≤ 24. The value of the result vector x is then analyzed to determine the accuracy of the results, which are given in Fig. 3.3. For a required output accuracy (specified in terms of either the minimum number of significant figures or the minimum relative standard deviation), the minimum required machine precision is the corresponding abscissa in Fig. 3.3.

Fig. 3.3. Precision test results.

3.7. Conclusion

A floating-point unit for the runtime detection of round-off errors was designed and implemented using high-level synthesis tools. It was integrated in a MicroBlaze soft processor system, and verified to give results of accuracy equivalent to previously published software implementations. Measurements showed that the performance of the current implementation is similar to an equivalent PC-based SW implementation, and that a further speed-up of 20× is possible.

This work shows that HW accelerated implementations of error detection algorithms can provide accurate measurements of the effects of rounding error while not dramatically impacting performance. Future work will focus on better integration between the FPU and processor, improving performance.

References


1. IEEE. IEEE Standard for Floating Point Arithmetic, IEEE Std 754-2008, pp. 1–70, 2008.

2. D. Goldberg. "Computer Arithmetic", in eds. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, Burlington, MA, USA, 1990.

3. D.S. Parker, B. Pierce, and P.R. Eggert. Monte Carlo Arithmetic: How to Gamble with Floating Point and Win, Computing in Science and Engineering, 2(4), 58–68, 2000.

4. A.A. Gaffar et al. Unifying Bit-Width Optimisation for Fixed-Point and Floating Point Designs, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 79–88, 2004.

5. A.A. Gaffar et al. Automating Customisation of Floating-Point Designs, in Proc. International Conference on Field-Programmable Logic and Applications, pp. 523–533, 2002.

6. C.T. Ewe, P.Y.K. Cheung, and G.A. Constantinides. Dual Fixed-Point: An Efficient Alternative to Floating-Point Computation, in Proc. International Conference on Field-Programmable Logic and Applications, pp. 200–208, 2004.

7. W. Kahan. How Futile are Mindless Assessments of Round-off in Floating Point Computation. [Online] Available at: http://www.cs.berkeley.edu/~wkahan/Mindless.pdf. [Accessed 23 April 2014].

8. A. Amaricai, M. Vladutiu, and O. Boncalo. Design of Floating Point Units for Interval Arithmetic, in Proc. Ph.D Research in Microelectronics and Electronics, pp. 12–15, 2009.

9. M.J. Schulte and E.E. Swartzlander. A Family of Variable-precision Interval Arithmetic Processors, IEEE Transactions on Computers, 49(5), 387–397, 2000.

10. J.E. Stine and M.J. Schulte. A Combined Interval and Floating Point Multiplier, in Proc. Great Lakes Symposium on VLSI, pp. 208–215, 1998.

11. J. Vignes and R. Alt. An Efficient Stochastic Method for Round-Off Error Analysis, in Proc. Accurate Scientific Computations, pp. 183–205, 1985.

12. R. Chotin and H. Mehrez. A Floating Point Unit using Stochastic Arithmetic Compliant with the IEEE-754 Standard, in Proc. International Conference on Electronics, Circuits and Systems, vol. 2, pp. 603–606, 2002.

13. J.H.C. Yeung, E.F.Y. Young, and P.H.W. Leong. A Monte Carlo Floating Point Unit for Self-validating Arithmetic, in Proc. International Symposium on Field Programmable Gate Arrays, pp. 199–208, 2011.

14. P. L'Ecuyer. Maximally Equidistributed Combined Tausworthe Generators, Mathematics of Computation, 1996.

15. V.G. Oklobdzija. An Algorithmic and Novel Design of a Leading Zero Detector Circuit: Comparison with Logic Synthesis, IEEE Transactions on Very Large Scale Integration Systems, 2(1), 1–5, 1994.

16. D.S. Parker. Monte Carlo Arithmetic. [Online] Available at: http://www.cs.ucla.edu/~stott/mca/. [Accessed 23 April 2014].

17. J.J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK Benchmark: Past, Present and Future, Concurrency and Computation: Practice and Experience, 15(9), 2003.

18. G.M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities, in Proc. Spring Joint Computer Conference, pp. 483–485, 1967.


Chapter 4

The Shroud of Turing

Steve Furber and Andrew Brown
University of Manchester and University of Southampton

We make the case that computing is increasingly compromised by Turing's heritage of strictly sequential algorithms, and the time has come to seek radically different computing paradigms. Whilst we are unable to offer any instant solutions, we point to the existence proof of the vast, distributed, asynchronous networks of computing elements that form brains, and note that FPGAs offer a fabric capable of operating in a similar mode. We offer a computational model — partially ordered event-driven systems — as a framework for thinking about how such systems might operate.

4.1. Introduction

All modern computers are based on an idea first published by Alan Turing as a thought experiment to support his proof of key results about the fundamentals of computability. His concept of the Universal Machine forms the theoretical foundation for the stored-program computer, conceived in practical terms by von Neumann, Eckert and Mauchly, and which first became an engineering reality in the Manchester Baby machine and soon thereafter as a practical computing service supported by Maurice Wilkes' EDSAC. At the heart of this idea is the concept of sequential execution: each "instruction" starts with the state of the machine when the preceding instruction has completed, and leaves the machine in a well-defined state for its successor. High-speed implementations bend the actual timing of instruction execution as far as it will go without breaking, but still emulate the sequential model.

The history of advances in computing has revolved around making this very simple execution model go faster, partly through bending the timing of instruction execution as noted above, but mainly through making transistors smaller, faster, and more energy-efficient, all thanks to Moore's Law. This approach has delivered spectacular progress for over 50 years, but then hit a brick wall — the power wall. Since then largely illusory (i.e. marketing) advances in performance have been delivered through multi-core and then many-core parallelism — putting a modest number of sequential execution engines on the same chip, whose potential (i.e. marketing) performance can rarely be realized due to the difficulty inherent in trying to make sequential programs work together in parallel.

The time has come to look again at the fundamentals of computation, and in particular to cast off the shroud of Turing — to abandon sequential instruction execution as the only model of a computation. Alternatives to sequential execution are all around us, including the vast parallel computing resource on a modern FPGA, as studied by Peter Cheung, and the vast complex of biological neurons inside each of our brains. The only minor difficulty is that we do not yet have any general theory of computing on such vast, distributed, networked resources. It's time that changed. Sequential computing is not natural, it's not efficient and, in fact, the only thing it has going for it is that it's easy.

4.2. Sixty Years of Progress

Few would argue that electronics has not burgeoned in the handful of decades since Faraday worried about where and how he might sensibly obtain insulated conductors (silk-covered wire used as stiffening in the manufacture of ladies' hats). This trajectory has been ably documented in many places, and we attempt to do no more than précis the highlights below. However, for a variety of reasons, we argue that this trajectory has begun — or is starting to begin — an inflection. Digital electronics, in all its manifestations, is beginning to converge.

4.2.1. Von Neumann hardware

The initial development trajectory of electronic computers is relatively easy to plot: Table 4.1. The arrival of the transistor — invented in 1947, penetrating computer construction in 1955 — changed everything. Power consumption, weight and cost came down by orders of magnitude (but so too did mean-time between failures (MTBF) — the Harwell CADET [Ref. 1], introduced in 1955, had no valves, ran at 58 kHz and had an MTBF of 90 minutes). (ENIAC provided remarkable reliability by the simple expedient of never being powered down in the eight years of its life: valves were hot-plugged — in every sense of the phrase — and although programs might be interrupted, the system was never closed down.)

The way forward was clear: faster calculations, more memory, less power, weight and cost. Each of these parameters was attacked vigorously by industry and governments hungry for more performance, and each frontier was successfully and relentlessly pushed back: "By the time that anyone had time to write anything down, it was obsolete." Notwithstanding the diversity of the approaches brought to bear on every aspect of machine development, the legacy of Turing survived almost untouched: machines fetched instructions and data, the former operated on the latter and the results were stored away. The pinch-point of the sequential fetch-execute cycle — if anyone worried about it at all — was way down on the list of rate-limiting attributes that had to be tackled.

4.2.2. Software

Arguably the first person to realise the need for some kind of high(er)-level method by which one might program the machines was Zuse [Ref. 2]. He defined the language Plankalkül ("plan calculus") in 1945, but WWII interrupted his work and it was not implemented until 1998. Plankalkül contains subroutines, conditionals, loops, floating-point, nested structures, assertions, and exceptions. Figure 4.1 shows a development chart of the more common high-level languages.

Table 4.1. Early development trajectory of the electronic computer. (Source: http://en.wikipedia.org/wiki/History_of_Computers.)


Of these, PL/1 is the first to contain (built in) the notion of a control thread that might be split and recombined under programmer control. PL/1 was developed by IBM at their Hursley Research Park, UK; IBM was probably — at the time — one of the few organisations that possessed sufficient computing hardware to sensibly make use of such a construct. Nevertheless, it acknowledged that forcing an instruction stream through a single ALU — however fast — might be something to be avoided if possible.

Fig. 4.1. Inheritance chart of major programming languages. There is a more comprehensive diagram in [Ref. 12] containing over 150 languages, including the CSP-OCCAM and CPL-BCPL-B-C branches.

4.2.3. Hardware platforms

After valves came point-contact transistors, then junction devices, then integrated circuits. These offered unprecedented opportunities for those with the resources, but these resources became more and more costly. To overcome this, the concept of an uncommitted wafer was introduced: a silicon die would be created — with all the costs that that incurred — complete up to and including a metallization layer. The metallization layer would not be etched, so the end user had only to supply the final metal mask, and the silicon would be committed to the user's design. The market benefitted from the volume discounts (the NRE would be amortized to something very small), and the user had only to pay for one mask, as opposed to the entire set. (If you wanted to, you could build a computer like this.) More sophisticated platforms emerged — programmable logic arrays, where the underlying silicon could be reconfigured multiple times. The final step in this progression is the field-programmable gate array, where pre-designed blocks of complex logic may be reconfigured via a programmable interconnect fabric that is every bit as complicated as the logic blocks it services.

4.2.4. (Hardware) description languages

As with computing engines, the growing complexity of the hardware led to an awareness that some form of formal specification mechanism was necessary to get the most out of the hardware, and hardware description languages started to appear. These were prima facie similar in appearance to software languages, but unlike these — which allowed the user to dictate the flow of data, and may contain notions of concurrency — hardware description languages contain the explicit notion of time embedded within them, and the user responsibility shifts subtly to require the specification of control, rather than data. Table 4.2 lists a set of common languages. Like software, each has its devotees, but the underlying principle is similar in all of them.

The designer is encouraged to think hierarchically, in terms of functional blocks, and to choreograph the flow of information between them. The notion of a sequential flow of control is either absent, or localized within a specific block at some level in the overall hierarchy.

Meanwhile silicon platforms get bigger, faster, and contain more and more high-level functional blocks. Cores, for example.

4.3. And Then It All Stopped...

The development of many spheres of activity can be characterized by an S-curve, shown in Fig. 4.2.

The curve characterizes many aspects of human activity, from the speed of aeroplanes to the probability of having your car radio stolen. In Phase I, a technology or activity is unexplored, its potential unknown, so few folk enter the field. In Phase II, the potential begins to become realized, and growth becomes rapid as folk realize what can be achieved, and what return on investment can be obtained. Funding, research and exploitation form a positive feedback cycle and the field grows exponentially. Finally, in Phase III, society realises that the boom is over. The outstanding problems become too expensive or just too difficult to solve, and research interest — and with it, the necessary funding — wanes. The product/area has become a commodity. If it breaks, don't fix it, just get a new one. It follows from this that the number of people required to fix it (i.e. the number of people required to understand it) decreases.

Moore's Law [Ref. 3] states that the number of devices on an integrated circuit will approximately double every two years. This is an exponential rate of growth, and exponentials — in any field — are not sustainable in nature. This rate of change predicts that in around 150 years there will be more memory cells on a square centimetre of silicon than there are atoms in the universe.

Something is going to break; what and why?

Table 4.2. Hardware description languages [Ref. 13].

Name | Provenance
Advanced Boolean Expression Language (ABEL) |
Altera Hardware Description Language (AHDL) | Altera
AHPL | A Hardware Programming Language
Bluespec | Haskell
Bluespec SystemVerilog (BSV) | Bluespec
C-to-Verilog | Converter from C to Verilog
Chisel (Constructing Hardware in a Scala Embedded Language) | Scala (embedded DSL)
CUPL (Compiler for Universal Programmable Logic) | Logical Devices, Inc.
HHDL | Haskell (embedded DSL)
Hardware Join Java (HJJ) | Join Java
HML | SML
Hydra | Haskell
Impulse C | Another C-like HDL
ParC (Parallel C++) | C++ extended with HDL-style threading and task communications
JHDL | Java
Lava | Haskell (embedded DSL)
M | Mentor Graphics
MyHDL | Python (embedded DSL)
PALASM | Programmable Array Logic (PAL) devices
ROCCC (Riverside Optimizing Compiler for Configurable Computing) | Free and open-source C-to-HDL tool
RHDL | Ruby
SystemC | A standardized class of C++ libraries for high-level behavioural and transaction modelling
SystemVerilog | A superset of Verilog, with enhancements to address system-level design and verification
SystemTCL | Tcl
THDL++ (Templated HDL inspired by C++) | VHDL extension with inheritance, templates and classes
Verilog | A widely used and well-supported HDL
VHDL (VHSIC HDL) |

Fig. 4.2. Universal development "S-curve".

4.3.1. The design roadblock

The resources required to design a state-of-the-art integrated circuit have grown exponentially, following the growth in the transistor resources on the chip itself. Whereas in the 1980s it was possible to design a competitive chip with a small team in a year or so, today's consumer systems-on-chip (SoCs) require teams of hundreds of designers.

Design costs for SoCs are in the $10Ms, so the manufacturing volumes have to be very large to amortize these costs. The number of SoC design starts per year has been going down since 2000; fabless semiconductor start-ups — a popular entrepreneurial model in the 1990s — have all but vanished since 2000 as the investment to break even has exceeded $100M, which stretches the risk-taking aspect of the venture capital investment too far, leaving SoC design as an enterprise only large companies with established volume markets can undertake. The pricing-out of start-up companies has greatly compromised opportunities for innovation in the microchip business.

4.3.2. The power roadblock

Moore's Law is delivered through manufacturing ever-smaller transistors, and as transistors are made smaller they become faster, cheaper and more energy-efficient. But these energy-efficiency gains are more than cancelled out by the increasing number of transistors on a chip, so the chip power budget has grown to the point where it is becoming very difficult, for a high-performance chip, to get the electrical energy in and to get the thermal energy out.

As this trend continues the industry is facing the prospect of "dark silicon" — future chips will simply not be able to operate with all of their functions active at the same time, so at any time several parts of the chip will have to be powered down to avoid melt-down.

4.3.3. The variability roadblock

As transistors are made smaller, controlling their operating parameters becomes increasingly problematic. The statistics of variations in the positions of individual atoms within the active region of the transistor are already a significant factor in design. Critical device parameter spreads must be compensated by design techniques, but whereas in the past designers have had to cope with variations of percentages, in the future these will be orders of magnitude larger.

4.3.4. Device physics

Atomic-level devices simply don't behave in the manner described by macroscopic models. As transistors shrink towards atomic scales various quantum-level physical phenomena, such as electron tunneling, make themselves felt.

4.3.5. The interconnect roadblock

The scaling of transistors to ever-smaller physical sizes has, on the whole, made those transistors better. The same is not true of the interconnect — the wires that join those transistors together. As the transistors have gotten faster, the wires have gotten proportionately slower, and the performance of chips today is limited by the wiring, not the transistors.

On-chip interconnect absorbs both silicon real estate and power budgets; physics places strict limits on causality wave fronts. Off-chip interconnect is a major issue, effectively limiting the ability of external memory to keep many-core processors supplied with code and data.

4.3.6. The economic roadblock

The cost of building and running a chip manufacturing facility (a "fab") has grown exponentially alongside Moore's Law, and is now out of reach of many governments and all but the largest multinational companies.

4.3.7. May's law

David May, the British computer architect responsible for the Inmos transputer, made the following observation, now known as May's Law: software efficiency halves every 18 months, compensating for Moore's Law.

The causes responsible for May's Law are a mixture of:

• A shortage of programming skills;
• The tendency of programmers to add too many features;
• Copy–paste programming;
• Massive overuse of windows and mouse-clicks;
• Reliance on Moore's law to solve inefficiency problems.

It is observed, for example, that Microsoft Office 2007 performs (on 2007 hardware) at half the speed of Microsoft Office 2000 (on 2000 hardware) [Ref. 4].

4.3.8. All in all

Moore's Law is not a fundamental law; it is not a verifiable hypothesis; it is not an extrapolated prediction (well, it is, actually...); it is not a philosophy. It is a business model — arbitrarily created, trivially discarded. It became a self-fulfilling prophecy, and had significant impact on various roadmaps, but it was based on an unsustainable exponential process. It's time to move on.

4.4. A Condensing of Concepts

With all these obstacles impeding further progress with sequential Turing machines, how do we find a way forward? Let's try to put a few observations and ideas together that can provide a platform for progress in a new direction.

4.4.1. The world is not synchronous

Most digital circuits in use today are synchronous: the design is based around the idea that the signal values are binary, and the entire system shares a common notion of time, defined by some clock signal distributed throughout the system. Asynchronous systems, on the other hand, also have binary state values, but there is no notion of an externally dictated passage of time. Sub-circuits use handshaking to choreograph the movement of data throughout the system. Things happen when they happen, and other things predicated on this wait until some "ready" signal is asserted. (One way of looking at this is to assert that every individual part of the circuit has its own, local notion of time, dictated by a local clock, but this clock is driven by the previous process block, and may not necessarily have equally distributed ticks.) Compared to their synchronous counterparts, asynchronous circuits have:

• Lower power consumption, because there are no "wasted" clock transitions;
• A higher data throughput, because block latencies are all local, and there is no requirement to make the circuit tolerant of the worst global latency;
• Lower electromagnetic emissions, because there is no "centre frequency";
• More tolerance of process variation, because timing can be made insensitive to wire and device delays;
• No clock skew problems.

On the other hand, the handshake circuitry itself carries an area and poweroverhead, and the lack ofmainstream acceptance—and hencewillingness oflargeEDAvendorstoprovidetools—hasitselfbecomeaviciousspiral.Butitisanotherthreadinthegrowingstreamoftechnologiesthatarecomfortablewith

Page 70: Transforming Reconfigurable Systems: A Festschrift Celebrating the 60th Birthday of Professor Peter

thingshappeningintheirowntime.

4.4.2.Thesiliconcompiler

The notion that the behaviour of a system can be captured and translatedautomatically intoaphysicalstructure, inamanneranalogous to that inwhichhigh-level computing constructs are translated into machine code by aconventional compiler, is called behavioural synthesis, or silicon compilation.Recall that themost significant difference between a software and a hardwarelanguageistheadditionoftemporalinformation;naively,thedifferencebetweenasoftwarecompilerandasiliconcompiler is that thesiliconcompilerrequiresknowledge of the temporal cost of each “instruction”, so that the dataflowdictatedbytheuserdescriptionisnotviolatedbytheimplementation.

And, of course, the silicon compiler is free to exploit whatever parallelism it can extract from the user description.
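
As an illustration of the temporal knowledge a silicon compiler needs, the following Python sketch (with invented per-operation latencies) performs a simple as-soon-as-possible scheduling pass over a tiny dataflow description; a real behavioural synthesis tool does far more, but the principle is the same.

```python
latency = {"mul": 3, "add": 1}        # assumed "instruction" costs, in cycles

# op name -> (kind, list of ops it depends on); computes (a*b) + (c*d)
dataflow = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "s1": ("add", ["m1", "m2"]),
}

# ASAP schedule: each op starts as soon as all its inputs are available
# (dataflow is declared in dependency order, so one pass suffices)
start = {}
for op, (kind, deps) in dataflow.items():
    start[op] = max((start[d] + latency[dataflow[d][0]] for d in deps), default=0)

print(start)  # {'m1': 0, 'm2': 0, 's1': 3}: both multipliers run in parallel
```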

4.4.3. FPGAs — today and tomorrow

FPGAs [Ref. 5] have been treading their own Moore's Law. The earliest FPGA chips had a few thousand logic blocks, each of which typically consisted of a latch and a few gates of glue logic, that could be configured by an EDA tool only as one complete operation. Today, FPGAs consist of a mass of logic blocks, as before, although the complexity and size of each block has increased, and typically there will be many orders of magnitude more of them. The fabric will also be replete with fast IO ports, RAM blocks, DSP blocks, crypto blocks, and explicit cores. The connectivity can be reprogrammed in part or in whole, and sometimes this can be achieved on the fly, under the control of the FPGA itself — truly self-modifying systems.

The key point is that the capability for exploiting parallelism — both fine- and coarse-grained — is multiplying. The user can implement part of the design in a sea-of-blocks, part on (several) explicit fetch-execute cores, part on dedicated high-functionality hardware blocks.

4.4.4. Multi-core silicon — today and tomorrow

At the risk of going out of date before the ink has dried, a handful of multi-core technologies illustrate that the multi-core field has its own momentum.

Anton [Ref. 6] is a special-purpose supercomputer consisting of 512 custom ASICs arranged in a high-bandwidth 3-D torus network designed for simulating molecular dynamics (MD) problems. Each ASIC contains a high-throughput compute pipeline tailored for calculating the interactions between pairs of atoms, and a programmable subsystem for calculating FFTs and integrating the particle trajectories. The specific design of the functional units allows Anton to simulate a 62.2 Å cubic box containing 23,558 atoms for 14.5 µs of simulation time in one day of wall-clock time by foregoing general-purpose computation support. Network traffic is primarily based around small multicast packets because a high volume of data is constantly streamed through the machine.

Intel have produced a prototype chip that features 48 Pentium-class IA-32 processors, arranged in a 2-D 6 × 4 grid network optimised for the message passing interface [Ref. 7]. All the processors are coherent and are capable of booting Linux simultaneously, but they are designed as coprocessors rather than as a stand-alone system. A special message-passing buffer shared by the processors, together with the introduction of a message-passing memory type, implements communication between the cores using caches.

Centip3De is a 130 nm stacked 3-D near-threshold computing (NTC) chip design that distributes 64 ARM Cortex-M3 processors over four cache/core layers connected by face-to-face interface ports [Ref. 8]. The 3-D network provides a large area reduction inside the chip but cannot be used for chip-to-chip networks. Four processors share instruction and data caches that are part of a 3-D network that permeates the device to ensure coherency and to provide access to the 256 MB of system memory.

The Swizzle-switch network [Ref. 9] is a 128-bit 64-input 64-output single-stage swizzle-switch network (SSN) which is similar to a crossbar switch but also supports multicast messages. A least-recently granted (LRG) arbiter is embedded in the network fabric to implement channel switching that must be explicitly released by the originating master. It is capable of providing on-chip bandwidths as high as 4.5 Tbps, but the large wire count makes it impractical for chip-to-chip networks.

TILE64™ is a chip-multiprocessor architecture design that arranges 64 × 32-bit VLIW processors in a 2-D 8 × 8 mesh network that supports multiple static and dynamic routing functions [Ref. 10]. High-bandwidth input/output (I/O) ports allow the TILE64 to act as either a coprocessor or a system processor optimised for streaming data tasks such as data-center network routing or deep packet inspection algorithms. A 5-port crossbar-switch-based wormhole router attached to each processor tile allows caches to be connected together and to the system memory to provide a mechanism for coherence.

As the size of parallel systems increases, the proportion of resource consumption (including design effort) absorbed by "non-computing" tasks (communications and housekeeping) increases disproportionately. Architectures that sidestep these difficulties with unconventional approaches are gaining traction in specialised areas.

4.4.5.Where’sitallheading,then?Convergence...

Our hypothesis, then, is that future progress in computing is dependent on models of computation that abandon Turing's heritage of sequential algorithmic execution, and instead see computation as a property of a vast distributed, asynchronous, non-deterministic network of agents. On this basis it will be hard — in some number of years' time — to tell the difference between a state-of-the-art FPGA and a state-of-the-art multi-core computer platform.

When we get to the point that — for all practical purposes — the hardware is free, we may as well make the individual point cells as complicated as we like. It may seem anathematic to use an ARM core as a dedicated arithmetic operator, but if we have an infinite number of cores available and the operation is not on the critical path, why not?

As a way to explore this space, let's build a machine with a million cores on it, and see what it can do.

4.5.We’reCloserthanYouMightThink:ComputingonVast,DistributedNetworkedResources

There are already many examples of systems that implement some form of information processing on vast, distributed networks. These can be found in nature and in engineering. Natural systems include:

• The brain, of course;
• Colonies of social insects;
• Businesses and industries;
• Human society?

Engineered examples include:

• The internet.

4.5.1. How does Nature do it? Of brains, neurons and synapses

The brain is the prime example in nature of the sort of system that supports our hypothesis that meaningful (albeit non-deterministic) computation can be performed on vast distributed networks of asynchronous agents. The fundamental drawback here is that science cannot yet offer anything approaching a complete description of how this organ, so central to the lives of each of us, carries out its vital functions.

Brains are composed of neurons, and neurons are, like logic gates, multiple-input single-output devices. True, they tend to have a lot more inputs: logic gates typically have two, three or four inputs, whereas neurons typically have 1,000, 10,000 or even sometimes 100,000 inputs, so there is a difference of scale. Neurons connect through synapses, and communicate principally by sending impulses when their inputs resemble a pattern they are tuned to. Synapses are plastic — they change under the influence of the activities of the pre- and post-synaptic neurons — and the wiring diagram of the neural circuit is itself subject to change over time (unlike most electronic circuits, although dynamically reconfigurable FPGAs can change their effective wiring at run-time).

Brains also display structural regularities — neuroscientists talk in terms of the 6-layer micro-architecture of the cortex, which looks similar throughout, including at the rear of the brain where it supports low-level image processing that is to some degree understood, and at the front of the brain where it supports natural language processing and higher levels of thought, where we currently have little clue as to what is going on. Yet the fact that these very different processes run on the same substrate must be telling us something about the algorithms that are in use? Likewise the 2-D nature of the cortex and the general use of 2-D topographic maps suggests that the 2-D map may be a general principle of operation, though this is little more than a hypothesis at this stage.

4.6. A Computer Engineer's Approach

The world of computing is obliged to move into unknown territory. The only existence proof that there is something out there that works is the brain, but there is the slight difficulty that we have no idea how the brain works. The computer engineer's approach is then to ask if we can use what we do know about building computers to help understand the inner workings of the brain. This line of thought has led to the SpiNNaker project, an attempt to build a massively parallel computer inspired by, and optimized to model, what is known about the detailed working of the brain.

4.6.1. SpiNNaker

SpiNNaker [Ref. 11] (Fig. 4.3) is a multi-core message-passing computing engine based upon a completely different design philosophy from conventional machine ensembles. It possesses an architecture that is completely scalable to a limit of over a million cores, and the fundamental design principles disregard three of the central axioms of conventional machine design: the core-core message passing is non-deterministic (and may, under certain conditions, even be non-transitive); there is no attempt to maintain state (memory) coherency across the system; and there is no attempt to synchronize timing over the system.

Fig. 4.3. A 48-node (864-processor) SpiNNaker board.


Notwithstanding this departure from conventional wisdom, the capabilities of the machine make it highly suitable for a wide range of applications, although it is not in any sense a general-purpose system: there exists a large body of computational problems for which it is spectacularly ill-suited. Those problems for which it is well-suited are those that can be cast into the form of a graph of communicating entities. The flagship application for SpiNNaker — neural simulation — has guided most of the hard architectural design decisions, but other types of application — for example mesh-based finite-difference problems — are equally suited to the specialized architecture.

There is a low-level software infrastructure, necessary to underpin the operation of the machine. It is tempting to call this an operating system, but we have resisted this label because the term induces preconceptions, and the architecture and mode of operation of the machine does not provide or utilize resources conventionally supported by an operating system. Each of the million (ARM9) cores has — by necessity — only a small quotient of physical resource (less than 100 kbytes of local memory and no floating-point hardware). The inter-core messages are small (≤ 72 bits) and the message passing itself is entirely hardware-brokered, although the distributed routing system is controlled by specialized memory tables that are configured with software. The boundary between soft-, firm- and hardware is even more blurred than usual.

SpiNNaker is designed to be an event-driven system. A packet arrives at a core (delivered by the routing infrastructure) and causes an interrupt, which causes the (fixed size) packet to be queued. Every core polls its incoming packet queue, passing each packet to the correct packet handling code. These packet event handlers are (required to be) small and fast. The design intention is that these queues spend most of their time empty or, at their busiest, containing only a few entries. The cores react quickly (and simply) to each incident packet; queue sizes ≫ 1 are regarded as anomalous (albeit sometimes necessary). If handler ensembles are assembled that violate this assumption, the system performance rapidly (and uncompetitively) degrades.
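
The following Python sketch caricatures this event-driven regime (the names and structure are our own, not the SpiNNaker software interface): packets are queued by an interrupt and dispatched to small handlers, with any queue backlog flagged as anomalous.

```python
from collections import deque

incoming = deque()                       # per-core incoming packet queue

def interrupt(kind, payload):
    incoming.append((kind, payload))     # hardware delivery raises an interrupt

handlers = {
    "neuron": lambda payload: None,      # placeholder: update local state, fast
}

def core_loop():
    while incoming:                      # poll the incoming packet queue
        if len(incoming) > 1:            # queues should be (almost) empty
            print("anomalous backlog:", len(incoming))
        kind, payload = incoming.popleft()
        handlers[kind](payload)          # pass the packet to its event handler

interrupt("neuron", 0x2A)
core_loop()
```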

4.7. Partially Ordered Event-Driven Systems (POEDS)

Here we attempt to begin to develop a formal system model that can be used to describe the operation of biological neural systems (such as the brain) and computational models of such systems. This work is motivated by a desire to find useful ways to think about information processing in the brain, and by a desire to produce a formal semantics that can underpin reliable operation of the SpiNNaker machines.

We introduce three models at different levels of abstraction, progressing from the biology of neural systems down to the details of the SpiNNaker machine.

4.7.1. A hybrid-system model

The system is a set of dynamical processes that communicate purely through event communications using a set of event channels. Each dynamical process evolves in time under the influence of received events; typically a process will depend on a subset of the event channels, not all of them.

Each event channel carries events that are either generated by a process, or come from the environment, and go to all the other processes that depend on them. Normally an "event" is a pure asynchronous event that carries no information other than that it has occurred, so an event channel can be thought of as a time series of identical impulses.

We can consider the event channel to be instantaneous, so that events arrive at all of their destinations at the same time that they are generated by their source process, though causality allows us to view this as "after" they are generated, albeit by a vanishingly small delay. Likewise, if an incoming event causes a process to generate an outgoing event, this causality is captured by the output being "after" the input.

The outputs from the model are simply a subset of the total set of events.
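
A minimal sketch of this hybrid-system model in Python (the class structure is our own invention) makes the channel semantics concrete: an emitted event is delivered to every dependent process at, notionally, the time it was generated.

```python
class Channel:
    def __init__(self):
        self.subscribers = []       # processes that depend on this channel

    def emit(self, t):
        for process in self.subscribers:
            process.on_event(t)     # causally "after", by a vanishing delay

class Process:
    def __init__(self, *inputs, output=None):
        self.output = output
        for channel in inputs:
            channel.subscribers.append(self)

    def on_event(self, t):
        # evolve internal state under the influence of the received event;
        # here we simply re-emit, showing input-to-output causality
        if self.output is not None:
            self.output.emit(t)

stimulus, relay = Channel(), Channel()
Process(stimulus, output=relay)     # depends on one channel, drives another
Process(relay)                      # a sink process observing the output
stimulus.emit(t=0.0)                # an event arriving from the environment
```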

4.7.2. Biological neurons

Biological neurons are complex living cells that have a cell body (the soma), a single output (the axon) that carries action potentials, and a complex multi-branched input structure (dendrites) that collects inputs. The axon from one neuron couples to the dendrite of another through a synapse, which is a complex adaptive component in its own right.

Action potentials are sustained and propagated by electro-chemical processes in the axon that allow them to be viewed as pure asynchronous events.

Long axons incur significant delays, but these can be rolled into the transmitting and/or receiving process. Where there are different delays from a single source to different targets, for example a short delay to proximal targets and a long delay to distal targets, the hybrid-system model allows this to be captured either by different delays in the receiving process or by the source transmitting separate events with different source delays, or some combination of these.

We therefore claim that the hybrid-system model captures the essential features of biological neurons that exchange information principally through action potentials.

Action potentials are not the whole story, however. Some neurons produce chemical messages, for example dopamine, that modulate the activity of other neurons within a physical region. Some neurons make analogue dendritic connections with their neighbours. These phenomena are outside the hybrid-system model, but we hope that their principal effects can be captured through back-channel processes of some sort.

In addition, biological systems do not have static connectivity — they develop and grow, gaining and losing neurons and connections to their "event channels". But these changes happen slowly relative to the real-time information flow, and again we hope to implement connectivity changes through back-channel processes as and when we get to that point.

Biological systems are also very noisy, but we can accommodate this by using noisy processes.

4.7.3. An abstract computational model

We cannot compute a continuous process exactly as in the hybrid-system model, so for efficiency it is important to approximate the process in some way. Most neuron models are some form of system of differential equations, so it is common practice to compute these using a form of Euler integration over discrete time-steps.

For real-time modeling, the Euler integration can be implemented by introducing an additional "time-step" event. Now time is just another, regular (e.g. 1 ms) event, from an external source, and time can be removed from the model.

We can, at least in principle if our computer is sufficiently fast, ignore the time taken for a process to handle an event. Each event is handled as it arrives, and each process is simply a set of rules defining how that process's state is changed by every possible input event. This is the event-driven aspect of POEDS.
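
To make this concrete, here is a minimal sketch assuming a leaky integrate-and-fire neuron (a model chosen for illustration, not specified in the text): time appears only as a regular time-step event, and the process is nothing more than one rule per input event type.

```python
class LIFProcess:
    def __init__(self, tau_ms=20.0, threshold=1.0, dt_ms=1.0):
        self.v = 0.0                    # membrane state
        self.tau, self.threshold, self.dt = tau_ms, threshold, dt_ms

    def on_spike(self, weight):
        self.v += weight                # rule for an incoming neuron event

    def on_time_step(self):
        self.v += self.dt * (-self.v / self.tau)  # one Euler step of dv/dt = -v/tau
        if self.v >= self.threshold:
            self.v = 0.0                # reset after firing
            return "spike"              # outgoing event, emitted "at" the time step
        return None

n = LIFProcess()
n.on_spike(0.6)
n.on_spike(0.6)                         # pushes v over threshold...
print(n.on_time_step())                 # ...so the next time-step event emits "spike"
```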

It is clear that a process is active only in response to an input event, and therefore any output events it generates must also occur at the same time as (though causally after) an input event. Note that this does not preclude an internal time delay between a neural input and the output it causes: the input can change the state of the process, which then progresses through several time-step events before producing an output. But the output will eventually be produced in response to, and at the same time as, a time-step event. As the only representation of time in the system is the time-step event, time is discretized.

Since the time-step event connects to many (if not all) processes, there may be many events generated just after it. These events, from different processes, have no implicit order. This gives rise to the partially ordered aspect of POEDS. Each process to which some of these concurrent events are inputs will impose an arbitrary order on their reception (at notionally the same time), and as a consequence the system behaviour is non-deterministic at this point.

4.7.4. A SpiNNaker computational model

SpiNNaker is a massively parallel system with an interconnect fabric designed specifically to convey events generated by a program running on one processor to all of the processors to which that event is an input. The SpiNNaker fabric must initially be configured to put the necessary connections in place, but once so configured the hardware looks after the event connections. Processors then receive events intended for them and issue events with no knowledge of where they are destined to go.

Unfortunately the processors on SpiNNaker aren't infinitely fast, so a process takes a finite time to complete its response to an input event. While it is running, another event may arrive, demanding preemption. The time-step event may not be synchronized across the machine (although near synchronization might be possible using a technique such as fire-fly synchronization).

A further complication is that SpiNNaker processors keep some of their state in off-chip SDRAM, access to which incurs high latency costs. In general we aim to hide this latency by exploiting DMA subsystems attached to each processor to handle SDRAM transfers while the processor gets on with other stuff.

These (and other) niceties apart, SpiNNaker aims to implement the abstract computational model as faithfully as it can, subject to all of the constraints of the physical system, delivering a reasonably efficient solution, and minimizing energy consumption.

SpiNNaker models may attempt to implement the abstract computational model faithfully, in which case they will aim to synchronize the (notional) 1 ms time-step across the machine and complete all the work in every 1 ms to stay in lock-step across the machine. In this case the peak process load must complete within the 1 ms for correct operation. Or they may adopt an asynchronous model where there is no attempt to align a 1 ms period in one process with that in another, in which case the average process load must complete within 1 ms for correct operation.

4.7.5. Spiking neurons on SpiNNaker

Each processor on a SpiNNaker machine handles one process, where each process models a number of neurons. As incoming events from other processes are very similar, they are handled by one event handler. The simplest model of a SpiNNaker process then handles two event types:

1. Incoming neuron event: locate & process synaptic data, updating local neural state accordingly.

2. Time-step event: perform an Euler integration step for all local neurons, possibly generating outgoing events.

As an implementation detail, the neuron event handler will usually invoke a DMA transfer to bring the synaptic connectivity data in from SDRAM, but as this is internal to the process we hope to hide the DMA as much as possible from the application code.

This model does not handle the important aspect of synaptic plasticity, but already creates some interesting data consistency issues if event 2 occurs while the (fairly long) event 1 process is running and preempts it. These data consistency issues are avoided if no input is allowed to affect state that is used in the current time step, which amounts to imposing a minimum axonal delay of 1 ms.
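
A minimal sketch of this consistency rule (the double-buffering scheme below is our own illustration, not the SpiNNaker source code): incoming events accumulate into a "next" buffer, so a preempting neuron event can never modify state that the current time step is already using, which imposes exactly the one-time-step (1 ms) minimum delay.

```python
class SynapticInput:
    def __init__(self):
        self.current = 0.0      # read by the time-step (Euler) handler
        self.next = 0.0         # written by the neuron-event handler

    def on_neuron_event(self, weight):
        self.next += weight     # may preempt the time-step handler safely

    def on_time_step(self):
        self.current, self.next = self.next, 0.0   # rotate the two buffers
        return self.current     # input becomes visible one time step later
```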

In general a SpiNNaker implementation will use a very simple real-time kernel of some sort, with drivers for the event communication system, DMA, etc. It will need queue management, priority scheduling, buffer overflow procedures, and so on. But it will maintain a strongly event-driven nature, spending any idle time in a low-power wait-for-interrupt state.

4.8. Conclusions


We have come a long way building computers on Turing's sequential foundations, but we've reached the end of the road. It's time to try something new, but what? The brain is an existence proof of a different way to process information, but we don't know how it works. Computers can help in the scientific quest to understand the brain, and SpiNNaker has been designed with that end in mind. It offers a different way to think about computation. Where that will lead, we do not know.

References

1. S. Lavington. Early British Computers, Manchester University Press, Manchester, UK, 1980.
2. K. Zuse. Über den allgemeinen Plankalkül als Mittel zur Formulierung schematisch kombinativer Aufgaben, Archiv der Mathematik, 1(6), 441–449, 1948.
3. G. Moore. Cramming More Components onto Integrated Circuits, Electronics, 38(8), 114–117, 1965.
4. R.C. Kennedy. Fat, Fatter, Fattest: Microsoft's Kings of Bloat. [Online] Available at: http://www.infoworld.com/t/applications/fat-fatter-fattest-microsofts-kings-bloat-278?page=0,4. [Accessed 23 April 2014].
5. P.Y.K. Cheung, G.A. Constantinides, and J.T. de Sousa. Guest Editors' Introduction: Field Programmable Logic and Applications, IEEE Trans. Computers, 53(11), 1361–1362, 2004.
6. D.E. Shaw et al. Anton, a Special-purpose Machine for Molecular Dynamics Simulation, ACM SIGARCH Computer Architecture News, 51(7), 91–97, 2008.
7. J. Howard et al. A 48-core IA-32 Message-passing Processor with DVFS in 45nm CMOS, in Proc. International Solid-State Circuits Conference Digest of Technical Papers, pp. 108–109, 2010.
8. D. Fick et al. Centip3De: A 3930 DMIPS/W Configurable Near-threshold 3D Stacked System with 64 ARM Cortex-M3 Cores, in Proc. International Solid-State Circuits Conference, pp. 190–192, 2012.
9. S. Satpathy et al. A 4.5 Tb/s 3.4 Tb/s/W 64 × 64 Switch Fabric with Self-updating Least-recently-granted Priority and Quality-of-service Arbitration in 45nm CMOS, in Proc. International Solid-State Circuits Conference Digest of Technical Papers, pp. 478–480, 2012.
10. S. Bell et al. TILE64™ Processor: A 64-Core SoC with Mesh Interconnect, in Proc. International Solid-State Circuits Conference Digest of Technical Papers, pp. 88–89, 2008.
11. S.B. Furber et al. Overview of the SpiNNaker System Architecture, IEEE Transactions on Computers, 62(12), 2454–2467, 2012.
12. Diagram & History of Programming Languages. [Online] Available at: http://rigaux.org/language-study/diagram.html. [Accessed 23 April 2014].
13. Hardware Description Language. [Online] (Updated 3 April 2014) Available at: http://en.wikipedia.org/wiki/Hardware_description_languages. [Accessed 23 April 2014].


Chapter 5

Smart Module Redundancy — Approaching Cost Efficient Radiation Tolerance

Jano Gebelein, Sebastian Manz, Heiko Engel, Norbert Abel, and Udo Kebschull

Infrastructure and Computer Systems for Data Processing (IRI), Goethe-University Frankfurt

This chapter deals with the problems encountered when operating FPGAs in radiation environments and presents practical realizations that implement techniques to mitigate radiation effects.

The focus here is on smart module redundancy, which replaces conventional TMR (triple module redundancy) and aims for higher cost efficiency. The basic idea is to take a closer look at the FPGA design and classify its parts. Each class is then protected in its own most efficient way.

Current implementations are used in readout boards by the Compressed Baryonic Matter collaboration in test setups for high-energy physics detectors at GSI/FAIR in Darmstadt, Germany. These physics experiments are challenging in two ways: they come with very high radiation rates and demand very high cost efficiency.

In summer 2009 the last author of this chapter had the opportunity to join Peter Cheung's research group for a short research visit at Imperial College. During his stay in London, he got in touch with PhD students, postdocs and lecturers who worked on topics like chip aging effects, fault tolerance, and characterizing field-programmable gate arrays (FPGAs) for better optimization of timing and resource usage. Many of the ideas discussed with Peter and his students had an influence on the radiation mitigation techniques described in this chapter.

The authors' strong focus on radiation tolerance is motivated by their involvement in high energy heavy ion collider experiments like the ones currently run at CERN in Geneva/Switzerland and at GSI/FAIR in Darmstadt/Germany. Here, FPGAs are used for the readout of experiment data, first stages of data reduction, and feature extraction of sensor data. Experiment data needs to be reduced in the very early stages due to extremely high data rates, and therefore it is necessary to operate the FPGAs as close as possible to the collision point. While space applications in low-earth orbit (LEO) usually need to handle a few errors per month, detector applications in heavy ion experiments need to handle up to ten errors per second.

This chapter focuses on ways to protect an FPGA design against radiation-induced single event upsets (SEUs) in a cost-efficient way. The gained insights show that the implemented techniques help to securely operate standard FPGAs in high-radiation environments.

5.1. Introduction

FPGAs are well-known devices in the world of reconfigurable hardware. They are physically built of various logical components like look-up tables (LUTs), multiplexers, and flip-flops, supplemented by a couple of manufacturer-specific parts like clock managers or highly specialized digital signal processors. A chip-wide interconnection routing network is provided for internal crosslinking and signal transfer between all of these components. Nowadays, two major commercial FPGA types are available on the market: Flash-based non-volatile and static random-access memory (SRAM) based volatile CMOS architecture chips. Both models provide many well-known advantages but nevertheless some major disadvantages, especially when used in radiation-susceptible environments such as avionics, space applications or particle accelerators. Therefore, FPGA manufacturers like Xilinx provide chip features which allow partial reconfiguration that can be used to refresh configuration data like routing information and static LUT content at runtime: so-called scrubbing [Ref. 1].

The FPGA material's interaction with ionizing particles is rooted in the SRAM architecture's general characteristics. The penetrating radiation causes the semiconductor chip's doped silicon to change its electrical properties. This physical separation of electron-hole pairs results in spontaneous single event effects (SEEs). These SEEs can for example change a signal temporarily from logic 0 to logic 1, or vice versa. If this signal hazard is picked up by a flip-flop, the temporary hazard gets stored and thus becomes a permanent SEU [Ref. 2]. In general, these errors may lead to spontaneous unexpected system behavior, e.g. finite state machines (FSMs) entering undefined states. In the worst case they can lead to a total system halt, known as a single event functional interrupt (SEFI).

To mitigate these radiation effects, various approaches are available: extensive hardware shielding, the use of radiation-hardened materials for manufacturing, up to a triple-chip design complemented by an external voter circuit. None of these approaches is applicable for use in particle accelerators, because of the tremendous size or financial requirements for higher quantities. Another option is the integration of fault tolerance in a single FPGA using spatial as well as temporal circuit redundancy. However, this can be extremely resource-intensive, as discussed in the following section.

5.2. State of the Art

Owing to the well-known susceptibility of the SRAM CMOS architecture to radiation, some major circuit design hardening technologies were developed [Refs. 3–5]. One of their major disadvantages is their inapplicability to circuits providing re-programmability features, like FPGAs with universally and evenly spread configurable logic blocks. The associated increase in cost is also not negligible. Thus, very specialized shielded radiation-hard materials have been and currently are being developed for military and space grade FPGAs, realized within the Xilinx Virtex QPro II, 4 and 5 series, complemented by the static scrubbing feature.

As these chips are not available for custom non-US applications, commercial off-the-shelf (COTS) devices are provided with special total ionizing dose (TID) annealing techniques and SEE failsafe combinations of logic blocks. Basically, such ambitions in securing the logical hardware design layer can be realized by three different approaches: spatial redundancy, temporal redundancy, and various combinations of both. Spatial redundancy features synchronous data sampling of combinatorial logic at multiple routes to mitigate SEUs. The addition of adjacent voters is required to analyze processed data values. Well-known candidates using this principle are dual and triple modular redundancy (DMR/TMR), mostly accompanied by error detection and correction codes (EDAC). Temporal redundancy enables a single combinatorial logic circuit to be sampled at multiple times. Additional voter circuitry compares all of the results and decides whether an error occurred or not.

The combination of both spatial and temporal redundancy leads to SEU immunity at the price of multiplied resource requirements and decreased timing performance. Complementary fault tolerance techniques like self-replicating temporal sampling [Ref. 6] try to re-use chip resources to assure data integrity at lower costs. Furthermore, multiple-bit upsets can be handled using weighted voting techniques [Ref. 7]. However, some of these additional design methods have been proven to increase susceptibility to radiation instead of reducing it when used in a generalized context [Ref. 8].

These days, TMR in combination with scrubbing has become an unofficial industry standard for adding fault tolerance to logic circuits. FPGA vendors like Xilinx provide tools that take a given FPGA design, triple it and add voters as part of the place and route process. The advantage of this method is that it is fully automated. The huge drawback of this approach is the area consumption that comes with it. Spatial requirements typically grow up to six times the original size. This is a big problem regarding cost efficiency, since using a six-times bigger FPGA can easily translate to more than 20 times the cost, depending on the current market situation.

5.3. Smart Module Redundancy

Our basic approach is to take a closer look at a given design and its components and to ask for each component individually: what do we actually need to do to make it radiation tolerant?

It turned out that there are principally three classes of components. Each class demands its own mitigation technology:

(1) Data Path — buses, registers and FIFOs that transport, store and process a data stream.

(2) I/O — specific FPGA components used to communicate with the outside world.

(3) Control Logic — registers, FSMs or even embedded processors used to monitor and control the data stream processing.

5.3.1. Data path

Most data paths are not limited to a single FPGA, but spread across several devices. Using a communication medium like optical fiber, these devices can even be located far away from each other. This naturally comes with the danger of media-introduced errors. Thus, nearly all communication protocols (such as Ethernet) contain redundancy mechanisms like a parity bit or a CRC that allow the detection of introduced errors. Normally, these protection mechanisms focus more on the world outside the FPGA. A typical FPGA-based Ethernet switch would for example check the CRC of an incoming stream, strip the CRC, and re-attach a CRC on the outgoing stream.

In order to make the data path radiation-hard, it is sufficient to keep these protection mechanisms and use them inside the FPGA (for example, enhance the internal data stream with a CRC). This makes it possible to detect and to correct SEUs without the need for TMR — which finally leads to a radiation-tolerant system that is much smaller than a fully triplicated system. This is especially important, since the data path is generally the biggest part of the FPGA.

For example, a design implementing a 128-bit data stream with 20 pipeline stages and 3 FIFO stages consumes 320 registers and 3 FIFOs. The TMR version of this data stream including voters uses about 1,250 registers and at least 9 FIFOs. Instead, adding a 32-bit CRC to the 128-bit data stream leads to a consumption of 400 registers. Today's FIFOs provide additional bits for CRCs and other protection mechanisms, so it is quite safe to assume that the CRC design still only needs three FIFOs. In this simple case example, smart module redundancy saved 850 registers and 6 FIFOs.
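
The principle can be sketched in a few lines of Python (the framing below is our own illustration, not the GET4 firmware): a 32-bit CRC travels with each data word, so a word corrupted in flight is detected at the consumer.

```python
import zlib

def protect(word):
    return word, zlib.crc32(word)      # producer side: attach the CRC

def intact(word, crc):
    return zlib.crc32(word) == crc     # consumer side: verify the CRC

payload, crc = protect(b"\x12\x34\x56\x78" * 4)    # one 128-bit data word
upset = bytes([payload[0] ^ 0x01]) + payload[1:]   # single-bit upset in flight

print(intact(payload, crc))   # True: word arrived unmodified
print(intact(upset, crc))     # False: upset detected, word can be handled
```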

5.3.2. I/O

The FPGA's I/O is in many ways highly comparable to the data path. The basic rule for I/O is: if one places an FPGA in a radiation environment and lets it communicate with other devices, mitigation technologies always have to focus on the system as a whole, not only on the exposed FPGA. In the experiments at GSI/FAIR the radiation-exposed FPGA communicates via optical fibre with an FPGA located in a shielded control room. The communication protocol between these two FPGAs has been designed in a way that handles transmission errors. Therefore it is not really important if these transmission errors are caused by an SEU in the FPGA's output buffer or by poor signal quality. Most conventional communication protocols (such as TCP/IP) are tailored to handle these transmission errors. Thus, a very cost-intensive triplication of the I/O pins is not necessary. At GSI/FAIR, the communication protocols were additionally specified to be robust against a temporary complete device failure happening on the exposed FPGA.

5.3.3. Control logic

The most complex part of the FPGA design, in terms of mitigation methodologies, is control logic. Here, SEUs can cause effects that are both hard to predict and hard to monitor. The basic reason behind this is: every monitoring or mitigation logic implemented on the FPGA introduces a new radiation-sensitive element that can itself become a victim of SEUs — and thus may cause more problems than it actually solves. Examples of effects caused by SEUs in control logic are:

• Sequence counters that jump to a nonsense value.
• Status bits that change content (e.g. fifo empty).
• FSMs that enter an undefined state or violate the defined state transition matrix.

Trying to find a clever way to protect control logic has proven to be very complex — and extremely prone to misjudgement, ending in bad surprises [Ref. 8]. Thus, our observation is that the safest way to protect the control logic is to make use of automated DMR or TMR. The cost of this is not too high, since control logic makes up only a small part of a typical FPGA design. For example, in the case of GSI/FAIR's GET4 Read-Out Controller [Ref. 9] it only represents 10% of the entire design.

Having said that, long-term experiments at GSI/FAIR revealed that one cannot fully guarantee that the protection mechanisms will work. If an SEU hits a clock net or another critical part of the FPGA, even the best mitigation technology can fail. Due to this, it is very important to have radiation-hardened watchdogs in place (normally implemented on a radiation-hard device outside the FPGA) that permanently monitor the FPGA and are able to reset the whole SRAM chip if necessary. The communication protocols have to be robust against this temporary device failure (as described above).

The most complex control logic scenario is the utilization of a CPU. In many cases this simply leads to not using a CPU in radiation environments in the first place — but this is not always possible. CPUs come with the advantage of high flexibility and very high adaptability. Moreover, software running on a CPU can execute very complex tasks with a much lower resource consumption than a comparable pure-firmware solution. Hence, talking about cost efficiency means talking about CPUs. For this reason, our workgroup developed a fault-tolerant VHDL soft-core CPU fully compatible with the MIPS R-3000 architecture. It is described in detail in Sec. 5.4.2.

5.4. Test Setups and Measurements

This section presents two test scenarios in which COTS FPGAs have been exposed to high radiation doses in order to prove the effectiveness of smart module redundancy. The first setup focuses on design classification and the corresponding mitigation strategies. The second setup focuses on a radiation-tolerant CPU.

5.4.1. Test setup I: design classification [Ref. 10]

The main objective of this test was to measure the radiation mitigation capability of smart module redundancy on an operational detector readout design running on an SRAM-based FPGA in a high radiation level environment.

The tested design is the firmware for the FPGA-based readout controller for the GET4 ASIC [Ref. 9]. This ASIC is the planned time-to-digital converter chip for the time of flight (ToF) detector in the compressed baryonic matter (CBM) experiment that is currently being constructed in Darmstadt, Germany as part of the Facility for Antiproton and Ion Research (FAIR).

The experiment was carried out at the Cooler Synchrotron (COSY) particle accelerator at the Forschungszentrum Jülich in Germany. The accelerator was configured to provide a 2.1 GeV/c proton beam with a particle rate in the order of 10⁷ s⁻¹·cm⁻².

The beam particle rate was roughly 1,000 times higher than the expected fast hadron flux in the environment of the detector. The expected failure rate is significantly lower for a single FPGA; however, since the number of FPGAs that are required to read out the full CBM-ToF detector is also on the order of 1,000, the failure rates measured in this beam test are very well comparable to the failure rate that is expected for the fully equipped CBM-ToF detector.

5.4.1.1. The device under test

For the beam test we used the CBM readout controller board which is widely used by the CBM collaboration [Ref. 11]. The board is equipped with an SRAM-based Xilinx Virtex-4 FX20 FPGA as the core data processing device. A small, flash-based Actel ProASIC3 A3P125 FPGA in conjunction with an on-board flash memory is used as configuration controller for the Virtex-4. The Actel is programmed to continuously scrub the Virtex-4 FPGA configuration (blind scrubbing).


Fig. 5.1. Depiction of the setup. Two boards are mounted in the beam line; one of them represents the main device under test (DUT) which is running the readout firmware, the other one is a reference board with an identical FPGA counting the SEUs.

5.4.1.2. The firmware under test

The firmware under test was not an academic test design, but the actual core of an operational design which is being used to read out real detector front-end electronics (the GET4 chip). The logic was initially classified and then modified to operate in radiation environments:

• Control Logic — critical parts of the control logic have been triplicated and state machines were designed to recover to normal operation even if they entered an undefined state for any reason [Ref. 12].

• I/O — the communication protocols were specified to be robust against temporary device failure.

• Data path — a CRC checksum has been added to the data path.

5.4.1.3. Test procedure algorithm

The test procedure which was running on the data acquisition (DAQ) computer has been specifically designed to evaluate the efficiency of the scrubbing technique. Figure 5.2 illustrates the sequence of the algorithm.

First, in the Init step, some essential parameters were logged. This included the SEU rates of both devices in the beam, which were recorded in parallel for three minutes in order to cross-check whether the SEU rates of both devices were indeed the same. Next, the processing of the main loop began with Take Data, dumping three seconds (~15 MB) of raw data to the hard disk for later offline analysis. Following this, in the Readback step, the configuration memory of the SEU counter device was read out to record the current SEU rate.

Fig. 5.2. The key steps of the test procedure (Test DUT) are performed twice to allow the DUT to be repaired by scrubbing.

Finally, in Test DUT, the functional status of the device under test (DUT) was inspected based on the analysis of 2,000 data samples. Since the data samples were sent by a deterministic data generator, their consistency could easily be evaluated. If all data samples were flawless, the DUT was considered to be fully operational and the test procedure would continue with the next loop iteration, again with Take Data. If one or more corrupt data samples were detected, the DUT was considered to be not fully operational, most likely due to a critical SEU. In this case, instead of instantly reprogramming the device, the very same data consistency check was run again (Test DUT Again) to allow the setup to be repaired by the scrubbing technique. A scrubbing cycle is much shorter (~80 ms) than the time between the two consistency checks (~1 s). Only if the second check had also failed did the algorithm enter the Reprogram state, where the whole setup would be completely reset. This method made sure that the system could recover even from a SEFI.
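
In pseudocode form, the loop looks roughly as follows (a Python sketch with placeholder callables, not the actual DAQ software): a failed consistency check is retried once after about a second, giving the ~80 ms scrubbing cycle a chance to repair the DUT before falling back to a full reprogram.

```python
import time

def run_test_loop(take_data, read_seu_rate, dut_is_consistent, reprogram,
                  iterations=100):
    for _ in range(iterations):
        take_data()                  # Take Data: dump ~3 s of raw data
        read_seu_rate()              # Readback: record the current SEU rate
        if dut_is_consistent():      # Test DUT: check 2,000 data samples
            continue
        time.sleep(1.0)              # ~1 s, much longer than one scrub cycle
        if dut_is_consistent():      # Test DUT Again: scrubbing repaired it
            continue
        reprogram()                  # persistent error: full reset (SEFI-safe)
```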

5.4.1.4. Results

Figure 5.3 gives a direct impression of the values recorded during the beam time. Both diagrams show the results of a three-hour run, where scrubbing was disabled (Fig. 5.3(a)) or enabled (Fig. 5.3(b)). Please keep in mind that, in order to accelerate the tests, the FPGA was placed in the middle of the particle beam and thus the beam particle rate was roughly 1,000 times higher than it will be for the actual setup at GSI/FAIR.

The plots marked with squares in Fig. 5.3 represent the time period since the last persistent error was detected and the test procedure entered the Reprogram state. Significant differences exist between the runs with enabled and disabled scrubbing. Without scrubbing the system fails within about one minute and is only stable during the time of technical stops. However, if scrubbing is enabled, the system can survive for several minutes. This shows that scrubbing in combination with smart module redundancy reduced the downtime of the setup by a factor of almost 50.


Fig. 5.3. The plot marked with circles refers to the number of SEUs collected in the reference board. The plot marked with squares shows the time period since the last full reset of the setup (Reprogram state in Fig. 5.2). During the highlighted time slots the beam was shut down for technical reasons. (a) Scrubbing is disabled: a full reset of the setup is required in less than a minute, and the setup is only stable when the beam is turned off. (b) Scrubbing is enabled: the setup runs stably for several minutes.

The analysis of the data obtained in the Take Data step shows that scrubbing in combination with smart module redundancy also significantly improved the data quality. When scrubbing was not applied, about 7% of the data became corrupted, whereas when scrubbing was enabled, only 0.03% of the data was inconsistent.


5.4.2. Test setup II: a fault-tolerant CPU [Ref. 13]

In order to be able to use a CPU in radiation environments, a fault-tolerant VHDL soft-core CPU, fully compatible with the MIPS R-3000 architecture and instruction set [Ref. 14], has been developed.ᵃ This enables a standard GNU GCC MIPS cross compiler to be used for software development. Data processing within the CPU is extremely SEE-critical, since effects may show up many cycles after their actual occurrence in almost any part of the five-staged pipeline and the surrounding elements. The in-order command execution itself is extremely susceptible, and modifications may quickly lead to a SEFI and therefore to a system halt. Thus, occurring errors are immediately detected to prevent faulty calculated data from being written back to memory. Figure 5.4 shows a sketch of the implemented CPU. The major advantage of this design lies in the combination of chip-area-saving DMR (avoiding extensive TMR) with the static scrubbing technique. Only the most sensitive part, the program counter, has been realized with TMR to guarantee integrity. At runtime, the two duplicated pipeline signals are continuously compared with each other subsequent to every calculation step. Error detection immediately leads to a pipeline interruption, in which all contents are marked invalid, preventing execution of faulty commands. Both pipelines are flushed with the instruction that was in the memory stage when the error occurred, and the CPU restarts calculation from the point where the error occurred. In case an SEU has caused the error in data or clock signals, calculation continues correctly within the following calculation cycle. If an SEU within the FPGA's routing network is responsible for the miscalculation, the calculation is retried until a scrubbing cycle repairs the defective circuits. In parallel, a signal pulse from an external scrubbing controller, indicating the refresh cycle completion state, is counted. If the CPU is detected to be irreparably stuck, the processor's registers are cleared and the CPU entirely restarts.
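
The compare-and-retry behaviour can be caricatured in software as follows (an illustrative Python analogue of the DMR scheme, not the VHDL design): each step runs on two redundant pipeline copies, a mismatch invalidates the step and it is retried, and a persistent mismatch triggers a full restart.

```python
def dmr_step(step_fn, state, max_retries=8):
    for _ in range(max_retries):
        result_a = step_fn(dict(state))     # pipeline copy A
        result_b = step_fn(dict(state))     # pipeline copy B
        if result_a == result_b:            # compare after every calculation step
            return result_a                 # safe to write back to memory
        # mismatch: flush both pipelines and restart from the failing step,
        # retrying until a scrubbing cycle would have repaired the fault
    # irreparably stuck: clear the registers and restart the CPU entirely
    raise RuntimeError("persistent mismatch; full restart required")
```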


Fig. 5.4. Sketch of the fault-tolerant CPU. Two concurrent pipelines are implemented to allow detection of configuration and user bit upsets. Both pipelines share the same register bank.

The CPU system has been practically tested under experimental conditions within different particle accelerator beams. The deployed DUT consists of a central Xilinx XC4VFX20 FPGA, all of the required scrubbing components as well as different readout and test interfaces. The flip-chip manufactured FPGA was placed directly within the center of the particle beam line to get comprehensible results at maximum ionization impact. The chip had not been prepared with additional substrate or case thinning, to simulate conditions found in regular application scenarios.

Test beams for the fault-tolerant CPU design used ⁹⁶Ru particles at 1.69 GeV/u with a flux of 1.7–5.0 · 10⁵ ions per cm² per 15-second spill, and ¹²C particles at 200–500 MeV/u with a measured flux of 5.6 · 10² to 2.1 · 10⁶ ions per second at the top of the FPGA. The ruthenium beam theoretically leads to 2,000 to 15,000 errors. Accordingly, an unhardened version of the developed CPU with a single pipeline stopped right after 1.2 seconds (2,894 measurements), whereas the fault-tolerant double-pipeline CPU ran for 15.7 seconds on average (4,930 measurements) without dropping faulty results or getting stuck. This results in a SEFI rate of 163 per hour, which is quite a high value, but still a very good one for that amount of beam intensity. The carbon beam used lower flux rates and did not raise SEUs as expected, but the higher rates caused SEFI rates of up to 29 per hour (200 MeV/u, 2.1 · 10² at the FPGA).


5.5. Conclusion

In this chapter we introduced the idea of smart module redundancy, which can be used in combination with scrubbing to gain radiation tolerance in a cost-efficient way. The basic idea is to classify the FPGA design's components and to treat each component class separately. Three classes have been identified: data path, I/O and control logic. While data path and I/O mainly depend on the usage of error detection codes like CRC, the control logic is best protected by automated DMR or TMR. Special focus has been on the implementation of a radiation-tolerant CPU, since this is both the most powerful and the most demanding part of the FPGA's control logic.

Two test scenarios have been used to verify the proposed approach. The first one took a GET4 Read-Out Controller and protected it against radiation via smart module redundancy. The measurements are extremely promising and show that the resulting FPGA design is ready to be used in real-world scenarios at GSI/FAIR. Moreover, it demonstrates the power of smart module redundancy: an unprotected design used 54% of the flip-flops and 36% of the LUTs, while the protected design used 71% of the flip-flops and 78% of the LUTs. These are factors of 1.3 and 2.1 respectively, which is outstanding compared to the factor of 6 usually found with conventional TMR. The second test scenario focused on the implementation of a radiation-tolerant CPU. Here, DMR instead of TMR has been used and proven to work.

Based on these very positive results, all our future designs will use smart module redundancy instead of TMR in order to create FPGA designs that can withstand the high radiation at GSI/FAIR.

References

1. C. Carmichael, M. Caffrey, and A. Salazar. Correcting Single-event Upsets through Virtex Partial Configuration, Technical report, Xilinx Inc., 2000.
2. M. Caffrey et al. Single-Event Upsets in SRAM FPGAs, in Proc. Military and Aerospace Applications of Programmable Logic Devices, 2002.
3. D. Bessot and R. Velazco. Design of SEU-hardened CMOS Memory Cells: the HIT Cell, in Proc. European Conference on Radiation and its Effects on Components and Systems, pp. 563–570, 1993.
4. T. Calin, M. Nicolaidis, and R. Velazco. Upset Hardened Memory Design for Submicron CMOS Technology, IEEE Transactions on Nuclear Science, 43(6), 2874–2878, 1996.
5. Q. Shi and G. Maki. New Design Techniques for SEU Immune Circuits, in Proc. NASA Symposium on VLSI Design, pp. 4.2.1–4.2.16, 2000.
6. D. Mavis. Single Event Transient Phenomena – Challenges and Solutions, in Proc. Microelectronics Reliability and Qualification Workshop, 2002.
7. S. Baloch, T. Arslan, and A. Stoica. Design of a Single Event Upset (SEU) Mitigation Technique for Programmable Devices, in Proc. International Symposium on Quality Electronic Design, pp. 330–345, 2006.
8. K. Morgan et al. A Comparison of TMR with Alternative Fault-Tolerant Design Techniques for FPGAs, IEEE Transactions on Nuclear Science, 54(6), 2065–2072, 2007.
9. H. Deppe and H. Flemming. The GSI Event-driven TDC with 4 Channels GET4, in Proc. Nuclear Science Symposium Conference Record, pp. 295–298, 2009.
10. S. Manz et al. Radiation Mitigation Efficiency of Scrubbing on the FPGA based CBM TOF Readout Controller, in Proc. International Conference on Field Programmable Logic and Applications, pp. 1–6, 2013.
11. V. Friese and C. Sturm. CBM Progress Report, 2011.
12. J. Gebelein and U. Kebschull. Investigation of SRAM FPGA based Hamming FSM Encoding in Beam Test, in Proc. Radiation Effects on Components and Systems, 2012.
13. J. Gebelein, H. Engel, and U. Kebschull. An Approach to System-wide Fault Tolerance for FPGAs, in Proc. International Conference on Field Programmable Logic and Applications, pp. 467–471, 2009.
14. G. Kane and J. Heinrich. MIPS RISC Architecture (2nd Edition), Prentice Hall, 1991.
15. H. Engel. Development of a Fault Tolerant Softcore CPU for SRAM based FPGAs, PhD thesis, Kirchhoff Institute for Physics, Heidelberg University, Germany, 2009.

ᵃ For extensive details on the radiation-tolerant CPU please see [Ref. 15].


Chapter 6

Analysing Reconfigurable Computing Systems

Wayne Luk
Department of Computing, Imperial College London

The distinguishing feature of a reconfigurable computing system is that the function and the interconnection of its processing elements can be changed, in some cases during runtime. However, reconfigurability is a double-edged sword: it only produces attractive results if used judiciously, since there are various overheads associated with exploiting reconfigurability in computing systems. This chapter introduces a simple approach for analysing the performance, resource usage and energy consumption of reconfigurable computing systems, and explains how it can be used in analysing some recent advances in design techniques for various applications that produce runtime reconfigurable implementations. Directions for future development of this approach are also explored.

6.1. Introduction

The exponential growth of the fabrication cost of integrated circuits makes reconfigurable computing increasingly attractive. Most industrial applications involving reconfigurable computing today, however, adopt just compile-time configuration: once a device such as an FPGA (field-programmable gate array) is configured, its configuration either does not change over its lifetime, or changes only when a new application is required. Since most reconfigurable devices can, in principle, be reconfigured as many times as needed, many researchers are curious about the conditions under which reconfigurability can be exploited effectively to improve the capabilities of an implementation, and how such implementations can be characterised to enable estimates of performance, resource usage, energy consumption and so on.

This curiosity has been driving the research of Peter Cheung and me for many years, since I joined Imperial College London and started collaborating with him. We have worked with many of our outstanding students, including Tobias Becker, Thomas Chau, Gary Chow, Joern Gause, Pete Sedcole, Shay Seng and Nabeel Shirazi, in advancing the exploitation of reconfigurable technology, especially runtime reconfigurability, for computing applications.

In the following, motivations for runtime reconfigurability will first be given. An approach for analysing reconfigurable computing systems will then be introduced. It will be followed by a description of some recent advances in design techniques. Finally, directions for future development of this approach will be explored.

6.2. Why Runtime Reconfigurability?

Many computer systems, including those for high-performance computing and for embedded applications, require designers to cope with three competing requirements: adaptability, high performance, and reduced time-to-deployment.

In a continuously evolving environment, embedded systems must adapt to changes in both function and performance. The adaptation may entail in situ programming to meet new protocols or to upgrade to new algorithms. Similarly, accelerators for cloud computing should support a variety of functions and workloads. All these systems must deliver increasingly high performance within challenging power, size, and fault tolerance constraints, often precluding a conventional processor approach.

The requirement for both adaptability and high performance conflicts with another goal of minimising time-to-deployment. Recent hardware compilation tools are beginning to support rapid design development; they often involve application-specific design customisation to meet constraints in performance, resource usage and power consumption. Some tools also support a systematic approach to runtime adaptation.

An introduction to reconfigurable computing, including a brief description of runtime reconfigurability, is available [Ref. 1]. Runtime reconfigurability can deliver the following benefits [Refs. 2 and 3]:

• implementing a large design by time multiplexing
• accelerating demanding applications
• improving power and energy consumption
• supporting health monitoring
• enhancing reliability and fault tolerance
• speeding up the design cycle by enabling incremental development.


Next, an approach for analysing reconfigurable computing systems will be introduced. Some advances in design techniques targeting implementations with runtime reconfigurability will then be presented.

6.3. An Analysis Approach

There have been many laudable advances in techniques for performance analysis of reconfigurable computing systems [Refs. 4–9]. However, they are either designed to model various effects, such as those of computation and communication, in detail, or focused only on specific optimisations or on a specific application.

In contrast, we present below a simple approach for analysing the performance, resource usage and energy consumption of reconfigurable computing systems. This approach is not intended to replace the existing work cited earlier, but to provide an abstract model which can be further refined to include various specific effects. Our approach is designed to focus on the key essential elements that transcend implementation details of particular technologies or systems.

Our approach involves three factors, ς, ρ and ε, capturing respectively potential improvements in completion time, resource usage and energy consumption of a reconfigurable computing system against a conventional reference system. In the appropriate context, the conventional reference system can be an instruction processor, an application-specific integrated circuit, or a reconfigurable processor that does not support runtime reconfiguration. This will be illustrated below.

6.3.1. Analysis of completion time

Let us adopt ς to denote the ratio of the completion time for a conventional reference system to the completion time for the corresponding reconfigurable computing system. The completion time for the reconfigurable computing system includes two components: the execution time and the reconfiguration time.

Let Nc denote the number of cycles of a particular application for the conventional reference system, Ne denote the number of cycles for execution for the corresponding reconfigurable computing system, and Nr denote the number of cycles for its reconfiguration. Let Tc denote the cycle time for the conventional reference system, and Te and Tr denote respectively the cycle time for execution and for reconfiguration for the corresponding reconfigurable computing system. One can then define that

\[ \varsigma = \frac{N_c T_c}{N_e T_e + N_r T_r}. \]

Clearly ς shows the speed benefit of reconfigurability relative to the conventional reference system; ς = 1 means that reconfigurability does not deliver any speed-up. However, the reconfigurable system can still be attractive even when ς < 1, as long as the system meets the speed requirement while having significant reduction in resource usage or in energy consumption, or both. Before we look at these cases, let us study the above equation in more detail.

First, consider the conventional reference system being an instruction processor. In that case Nc = Ni Ci, where Ni is the number of instructions for that application, and Ci is the average number of cycles per instruction for that application.

Second, the equation above can easily be generalised to cover designs that support multiple reconfigurations at runtime:

ς = NcTc / Σj (Ne,jTe + Nr,jTr),    (6.2)

where Ne,j and Nr,j denote respectively the number of execution and reconfiguration cycles associated with the jth configuration.

Third, the effects of optimisations such as configuration prefetching can be approximated by including an additional parameter γj to account for the reduction in reconfiguration time due to prefetching:

ς = NcTc / Σj (Ne,jTe + γjNr,jTr),    (6.3)

where 0 ≤ γj ≤ 1.

Fourth, the above model can be used in exploring the trade-offs between different reconfiguration regimes. For example, it is possible that an increase in parallelism would reduce the execution time while increasing the reconfiguration time; an optimal amount of parallelism can be derived which would offer the largest reduction in completion time [Ref. 8].

Fifth, once ς is known, the effect due to Amdahl's law can be estimated: given that only a fraction β of the completion time can benefit from acceleration, the overall improvement factor δ is given by

δ = 1/((1 − β) + β/ς).    (6.4)
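
As an aid to intuition, this completion-time analysis can be expressed directly in code. The following is a minimal sketch in C; the function names, parameter names and figures are ours, for illustration only, and the formulas follow Eqs. 6.1 and 6.4 above (with ς written as sigma).

    #include <stdio.h>

    /* Speed-up factor (Eq. 6.1): conventional completion time divided by
       reconfigurable completion time (execution plus reconfiguration). */
    double sigma(double Nc, double Tc, double Ne, double Te,
                 double Nr, double Tr) {
        return (Nc * Tc) / (Ne * Te + Nr * Tr);
    }

    /* Overall improvement under Amdahl's law (Eq. 6.4), where only a
       fraction beta of the completion time benefits from acceleration. */
    double delta(double beta, double s) {
        return 1.0 / ((1.0 - beta) + beta / s);
    }

    int main(void) {
        /* Hypothetical cycle counts and cycle times, for illustration. */
        double s = sigma(1e9, 5e-9, 2e8, 5e-9, 1e7, 5e-9);
        printf("sigma = %.2f, delta(beta = 0.9) = %.2f\n", s, delta(0.9, s));
        return 0;
    }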

6.3.2. Analysis of resource usage

Let R1 and R2 denote the resources needed for two hardware elements that are not needed concurrently for a given application. A nonreconfigurable design would still need sufficient resources R1 + R2 to accommodate both elements, even if only one of them is active at any one time.

In contrast, a runtime reconfigurable solution would only need sufficient resources for max(R1, R2) + Rr, where Rr denotes the resource overhead required by the reconfigurable design to support, for example, additional storage for capturing the state between successive configurations.

The benefit ρ of reconfigurability on resource usage can then be calculated as follows:

ρ = (R1 + R2)/(max(R1, R2) + Rr).    (6.5)

As before, this equation can easily be generalised to model designs that support multiple reconfigurations at runtime:

ρ = Σj Rj /(maxj Rj + Rr).    (6.6)

The largest gain is obtained when all the Rj elements have the same resource usage.

For FPGAs, there are different kinds of resources, such as fine-grained logic blocks, coarse-grained arithmetic and signal processing elements, and configurable memory elements. One method is to apply the above analysis to each kind of resource separately. Another method is to estimate the number of transistors required for each of these resources so that an aggregate can be obtained; this would enable us to compare resource usage with other forms of processors, as long as they use the same transistor technology.

6.3.3. Analysis of energy consumption

Let Pc denote the power consumption of the conventional reference system. Since energy consumption is the product of power consumption and the associated active time, its energy consumption is given by PcNcTc where, as before, Nc and Tc denote respectively the number of cycles and the cycle time for a given application executing on the conventional reference system.

Similarly, let Pr, Nr and Tr denote respectively the power consumption, the number of cycles and the cycle time for reconfiguration of a given application on the corresponding reconfigurable computing system, and Pe, Ne and Te denote respectively the power consumption, the number of cycles and the cycle time for its execution. The total energy consumption for the reconfigurable computing system is then given by PrNrTr + PeNeTe.

The benefit ε of reconfigurability on energy consumption can be calculated as follows:

ε = PcNcTc/(PrNrTr + PeNeTe).    (6.7)
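
The resource and energy factors can be sketched in the same style as before (a minimal illustration in C with names of our own choosing, following Eqs. 6.5 and 6.7):

    /* Resource benefit (Eq. 6.5) for two mutually exclusive elements,
       where Rr is the overhead of supporting reconfiguration. */
    double rho(double R1, double R2, double Rr) {
        double max = (R1 > R2) ? R1 : R2;
        return (R1 + R2) / (max + Rr);
    }

    /* Energy benefit (Eq. 6.7): conventional energy divided by the sum
       of reconfiguration energy and execution energy. */
    double epsilon(double Pc, double Nc, double Tc,
                   double Pr, double Nr, double Tr,
                   double Pe, double Ne, double Te) {
        return (Pc * Nc * Tc) / (Pr * Nr * Tr + Pe * Ne * Te);
    }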

The analysis above is deliberately kept straightforward. It will be used in the next section to illustrate how design techniques promoting reconfigurability can improve the performance, resource usage or energy efficiency of implementations.

6.4. Design Techniques for Runtime Reconfigurability

This section presents three techniques that target runtime reconfigurable implementations. They are based on mixed-precision computation, multi-stage processing, and elimination of idle functions. The main consideration is to partition a design into multiple optimised configurations, which can then be placed onto the target reconfigurable engine at the appropriate instant during runtime.

6.4.1. Mixed-precision computation

This approach involves having multiple configurations of different precisions, and it is assumed that the computation can benefit from having multiple datapaths operating concurrently. Since a datapath operating in low precision is smaller than one operating in high precision, one would include as many datapaths as possible in each configuration to maximise parallelism.

The main challenges of this approach are to find appropriate precisions that would deliver correct results with sufficient accuracy while maximising parallelism, and to schedule multiple configurations at appropriate instants.

This approach can be analysed based on the techniques introduced in the previous section. There are two steps.

(1) Let Rf denote the resources for a full-precision datapath, and Nf denote the number of such datapaths that can be accommodated on a given FPGA. Also let Rs denote the resources for a small datapath with reduced precision, and Ns denote the number of such datapaths that can be accommodated on the same FPGA as before. Hence ideally NsRs = NfRf ≤ RT, where RT denotes the total amount of resources on a device. Since Rf > Rs, we have Nf < Ns.

Let N be the total number of iterations required for this computation. When the FPGA contains only full-precision datapaths, the completion time is given by N/Nf. When the FPGA contains both full-precision and reduced-precision datapaths, if a fraction β of the iterations can take place in reduced precision, the completion time is given by βN/Ns + (1 − β)N/Nf. So the improvement in completion time µ due to mixed-precision computation can be expressed as

µ = (N/Nf)/(βN/Ns + (1 − β)N/Nf) = Ns/(βNf + (1 − β)Ns).    (6.8)

(2) The benefit of reconfigurability on resource usage can be derived from Eq. 6.5, assuming negligible reconfiguration overhead:

ρ = (NfRf + NsRs)/max(NfRf, NsRs).    (6.9)

Let τf and τs denote respectively the execution time for the application when the system has a single full-precision datapath and when it has a single reduced-precision datapath, and let τr denote the time it takes to reconfigure the device. Eq. 6.1 then becomes

ς = (τf/Nf)/(βτs/Ns + (1 − β)τf/Nf + τr).    (6.10)
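
To make the first step concrete, the following small C sketch (ours, for illustration) computes µ from Eq. 6.8 for hypothetical datapath counts:

    /* Improvement in completion time (Eq. 6.8) when a fraction beta of
       the iterations runs on Ns reduced-precision datapaths and the
       rest on Nf full-precision datapaths. */
    double mu(double beta, double Nf, double Ns) {
        return Ns / (beta * Nf + (1.0 - beta) * Ns);
    }

    /* Example: Nf = 4, Ns = 16 and beta = 0.95 give mu of about 3.5,
       i.e. the mixed-precision design finishes about 3.5 times sooner. */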

The mixed-precision approach has been applied to three application domains: Monte Carlo simulation [Ref. 10], function comparison [Ref. 11], and mathematical optimisation [Ref. 12]. While computation in full precision can also be performed in software on a general-purpose processor, if the reconfiguration overhead is low while the communication cost between the reconfigurable engine and the general-purpose processor is high, then it would be profitable to adopt runtime reconfiguration in conjunction with having both full-precision and reduced-precision computation in the reconfigurable engine. The following provides an overview of mixed-precision computation for these three application domains; the performance reported below does not include runtime reconfigurability, so there is scope for further improvement.

First, Monte Carlo simulation [Ref. 10]. It involves the use of datapaths with reduced precision, with the resulting errors corrected by auxiliary sampling. An analytical model for speed and resource usage can be developed to enable optimisation based on mixed integer geometric programming, determining the optimal reduced precision and the optimal resource allocation among the Monte Carlo datapaths and correction datapaths. Experiments show that the mixed-precision approach requires up to 11% additional evaluations, while less than 4% of all the evaluations are computed in full precision. The resulting designs, with the full-precision correction operations implemented in software, are up to 7.1 times faster and 3.1 times more energy efficient than baseline double-precision FPGA designs, and up to 163 times faster and 170 times more energy efficient than quad-core software designs optimised with the Intel compiler and Math Kernel Library. This approach also produces designs for pricing Asian options which are 4.6 times faster and 5.5 times more energy efficient than NVIDIA Tesla C2070 GPU implementations.

Second, function comparison [Ref. 11]. Our approach improves comparison performance by using reduced-precision datapaths while maintaining accuracy by using full-precision datapaths. The approach adopts reduced-precision datapaths for preliminary comparison, and full-precision datapaths when the accuracy of the preliminary comparison is insufficient. As in Monte Carlo simulation, an analytical model for performance estimation can be developed, with optimisation based on integer linear programming used for determining the optimal precision and the optimal resource allocation for each of the datapaths. The effectiveness of this approach is evaluated using a common collision detection problem with full-precision computation performed in software. Performance gains of 4 to 7.3 times are obtained over baseline fixed-precision designs on the same FPGAs. The mixed-precision approach leads to FPGA designs which are 15.4 to 16.7 times faster than software running on multi-core CPUs of the same technology.

Third, mathematical optimisation [Ref. 12]. It involves the use of reduced-precision optimisers for searching potential regions containing the global optimum, and full-precision optimisers for verifying the results. An empirical method is used in determining parameters for the mixed-precision approach. Its effectiveness is evaluated using a set of optimisation benchmarks, with full-precision optimisers implemented in software. It is found that one can locate the global optima 1.7 to 6 times faster when compared with a quad-core optimiser. The mixed-precision optimisations search up to 40.3 times more starting vectors per unit time when compared with full-precision optimisers, and only 0.7% to 2.7% of these searches are refined using full-precision optimisers. This approach allows us to accelerate problems with more complicated functions or to solve problems involving higher dimensions.

6.4.2. Multi-stage processing

If a computation can be partitioned into multiple stages such that each stage requires different amounts of resources, then each stage can be captured in a configuration, and successive stages can be realised by runtime reconfiguration. Two examples will be used to illustrate this approach.

First, short-read sequence alignment [Ref. 7]. This computation involves determining the positions of millions of short reads relative to a known reference genetic sequence. Our approach consists of two key components: an exact string matcher for the bulk of the alignment process, and an approximate string matcher for the remaining cases. Based on techniques similar to those in Sec. 6.3, interesting regions of the design space, including homogeneous, heterogeneous and runtime reconfigurable designs, can be characterised to provide performance estimates of the corresponding designs. An implementation of a runtime reconfigurable architecture targeting a single FPGA can be up to 293 times faster than running the BWA algorithm on an Intel X5650 CPU, and 134 times faster than running the SOAP3 program on an NVIDIA GTX 580 GPU.

Second, adaptive particle filters [Ref. 13]. This is a statistical method for dealing with dynamic systems having non-linear and non-Gaussian properties. It has been applied to real-time applications including object tracking, robot localisation, and air traffic control. However, the method involves a large number of particles, resulting in long execution times, which limits its application in real-time systems. An approach has been developed to adapt the number of particles dynamically and to utilise runtime reconfigurability of the FPGA for reduced power and energy consumption. The performance of designs based on this approach can be analysed using techniques similar to those in Sec. 6.3. Implementations for simultaneous mobile robot localisation and people tracking show that adaptive particle filters can reduce computation time by up to 99%. Using runtime reconfiguration, a 34% reduction in idle power and a 26–34% reduction in system energy can be achieved. Moreover, the proposed system is up to 7.39 times faster and 3.65 times more energy efficient than an Intel Xeon X5650 CPU with 12 threads, and 1.3 times faster and 2.13 times more energy efficient than an NVIDIA Tesla C2070 GPU.

6.4.3. Eliminating idle functions

Idle functions have a detrimental effect on performance, area and power consumption. A design approach has been developed to automatically identify and exploit runtime reconfiguration opportunities [Ref. 14], while optimising resource utilisation by eliminating idle functions. The approach is based on a hierarchical graph structure which enables runtime reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and runtime solution generation.

The performance of designs based on this approach can be analysed in a similar way to those in Sec. 6.4.1. In this case, by eliminating idle functions, more datapaths can be accommodated in the various configurations than in a design without runtime reconfiguration, which must retain functions that may be idle during runtime due to, for example, data dependence constraints.

Three applications, targeting barrier option pricing, particle filtering, and reverse time migration, are used in evaluating the proposed approach. Our reconfigurable implementations are 1.31 to 2.19 times faster than optimised static designs, up to 28.8 times faster than optimised CPU reference designs, and 1.55 times faster than optimised GPU designs.

6.5. Future Development

Runtime reconfigurability has shown good promise in producing efficient implementations. However, much research remains to be conducted before runtime reconfigurable design can be adopted as a mainstream vehicle for realising computer systems.

First, it would be beneficial to develop further the theoretical foundations of runtime reconfigurability; the approach in Sec. 6.3 can be regarded as preliminary work in this direction. Such theories would, for example, provide an analytical treatment of optimality which would be useful for developing design techniques. An attempt has been made to provide such an analytical treatment, modelling the combined effects of parallelisation and reconfiguration on performance [Ref. 8] and on energy efficiency [Ref. 9]; it reveals that there is often an optimal design with the highest performance or the highest energy efficiency. We hope to extend and generalise the above work and related techniques to provide the basis for tools that generate optimised designs.

Second, while there is an increasing number of tools targeting runtime reconfigurable designs, developing and optimising such designs is still much more difficult and tedious than, for example, software optimisation. Promising approaches addressing this issue include aspect-oriented techniques [Ref. 15] that promote separation of concerns and design re-use; incremental design [Ref. 3] that helps to speed up design implementation; and the use of domain-specific descriptions [Ref. 16] to raise the level of abstraction for descriptions that target application builders.

Third, there appears to be significant synergy between reconfigurable computing and machine learning. In particular, machine learning techniques have been used in speeding up reconfigurable design [Ref. 17], while reconfigurable design has also been used in speeding up machine learning [Ref. 18]. We hope to understand more deeply the connections between these two areas, and how such synergy can contribute to advances in both.

6.6. Summary

Reconfigurable computing in general, and runtime reconfigurability in particular, have come a long way in the last 20 years. Exciting advances have recently been made, resulting in some of the fastest and most energy-efficient designs for various applications. We shall accelerate the progress of our research on various aspects of reconfigurable computing and extend its influence, so that we come closer to realising Peter Cheung's vision for the next 20 years: that reconfigurability will be found in all integrated circuits, not just in the FPGAs we know today.

Acknowledgements

Many thanks to James Arram, Tobias Becker, Thomas Chau, Gary Chow, Andreas Fidjeland, Qiwei Jin, Maciej Kurek, Philip Leong, Qiang Liu, Stephen Muggleton, Xinyu Niu, Henry Styles, David Thomas, Brittle Tsoi and Steve Wilton for contributing to this paper. I am grateful to Peter Cheung for his advice, enthusiasm and friendship over many years. This research is supported in part by the European Union Seventh Framework Programme under grant agreement numbers 257906, 287804 and 318521, by the UK EPSRC, by the Maxeler University Programme, by Altera, and by Xilinx.

References

1. T. Todman et al. Reconfigurable Computing: Architectures and Design Methods, IEE Proceedings - Computers and Digital Techniques, 152(2), 193–207, 2005.

2. P. Lysaght et al. Enhanced Architectures, Design Methodologies and CAD Tools for Dynamic Reconfiguration of Xilinx FPGAs, in Proc. International Conference on Field Programmable Logic and Applications, pp. 1–6, 2006.

3. T. Frangieh et al. PATIS: Using Partial Configuration to Improve Static FPGA Design Productivity, in Proc. IPDPS Workshop, pp. 1–8, 2006.

4. E. El-Araby, I. Gonzalez, and T. El-Ghazawi. Exploiting Partial Runtime Reconfiguration for High-performance Reconfigurable Computing, ACM Transactions on Reconfigurable Technology and Systems, 1(4), 21:1–21:23, 2009.

5. E. Holland, K. Nagarajan, and A. George. RAT: RC Amenability Test for Rapid Performance Prediction, ACM Transactions on Reconfigurable Technology and Systems, 1(4), 22:1–22:31, 2009.

6. K. Papadimitriou, A. Dollas, and S. Hauck. Performance of Partial Reconfiguration in FPGA Systems: A Survey and a Cost Model, ACM Transactions on Reconfigurable Technology and Systems, 4(4), 36:1–36:24, 2011.

7. J. Arram et al. Reconfigurable Acceleration of Short Read Mapping, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 210–217, 2013.

8. T. Becker, W. Luk, and P. Cheung. Parametric Design for Reconfigurable Software-Defined Radio, in Proc. International Symposium on Applied Reconfigurable Computing, pp. 15–25, 2009.

9. T. Becker, W. Luk, and P. Cheung. Energy-aware Optimisation for Runtime Reconfiguration, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 55–62, 2010.

10. G. Chow et al. A Mixed Precision Monte Carlo Methodology for Reconfigurable Accelerator Systems, in Proc. International Symposium on Field Programmable Gate Arrays, pp. 57–66, 2012.

11. G. Chow et al. Mixed Precision Processing in Reconfigurable Systems, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 17–24, 2011.

12. G. Chow, W. Luk, and P. Leong. A Mixed Precision Methodology for Mathematical Optimisation, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 33–36, 2011.

13. T. Chau et al. Heterogeneous Reconfigurable System for Adaptive Particle Filters in Real-time Applications, in Proc. International Symposium on Applied Reconfigurable Computing, pp. 1–12, 2013.

14. X. Niu et al. Automating Elimination of Idle Functions by Runtime Reconfiguration, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 97–104, 2013.

15. J. Cardoso et al. Specifying Compiler Strategies for FPGA-based Systems, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 192–199, 2012.

16. D. Thomas and W. Luk. A Domain Specific Language for Reconfigurable Path-based Monte Carlo Simulations, in Proc. International Conference on Field-Programmable Technology, pp. 97–104, 2007.

17. M. Kurek, T. Becker, and W. Luk. Parametric Optimisation of Reconfigurable Designs using Machine Learning, in Proc. International Symposium on Applied Reconfigurable Computing, pp. 134–145, 2013.

18. A. Fidjeland, W. Luk, and S. Muggleton. A Customisable Multiprocessor for Application-optimised Inductive Logic Programming, in Proc. Visions of Computer Science — BCS International Academic Conference, 2008.

Chapter 7

Custom Computing or Vector Processing?

Simon W. Moore, Paul J. Fox, A. Theodore Markettos and Matthew Naylor
Computer Laboratory, University of Cambridge

FPGAs are famously good for constructing custom arithmetic pipelines for video processing and other data-intensive tasks, but building these custom pipelines is time-consuming. Moreover, large computation tasks typically require large data-sets, and managing the memory bottleneck is crucial. This chapter demonstrates that vector processing can not only deliver high computational performance, but that the memory bottleneck can be effectively managed too. Whilst more general-purpose vector compute is unlikely to deliver the full performance of a custom pipeline, we demonstrate that for a non-trivial neural computation case study we can get within a factor of two of a custom solution.

7.1. Introduction

There is a great deal of research on efficiently mapping algorithms onto FPGAs to produce custom computation pipelines, which aim to exploit the massively parallel computation resources available on today's FPGAs. Examples from Professor Peter Cheung's work include video processing [Ref. 1], wavelets [Ref. 2], elliptic curve cryptography [Ref. 3] and prime number validation [Ref. 4]. Constructing complex custom pipelines is time-consuming; suitable abstractions such as C-to-gates improve productivity, but often at the expense of performance. Vector processing can be an attractive alternative to C-to-gates [Ref. 5], yielding good performance with the convenience of software programming and debugging.

Applications demanding massive compute often require large data-sets, and this can lead to streaming data from memory external to an FPGA becoming a bottleneck for a broad class of applications, whose pinnacle of performance is reached when external memory bandwidth is saturated with useful data transfer, not when the FPGA compute resources are maximally used [Ref. 6]. This is known as the "memory wall", and is an increasing problem for both ASICs and FPGAs [Ref. 7], where compute resources are more plentiful than external memory bandwidth. In this chapter, we focus on this class of application with an in-depth case study of neural computation, which has demanding data interdependencies and large data-sets that need to be held in external memory.

Our previous work on custom neural computation pipelines for FPGAs resulted in a high-performance design described using Bluespec HDL [Ref. 8], capable of efficiently streaming neuron and synapse parameters from DDR2 memory to achieve real-time performance with 64k neurons and 64M synapses per Altera Stratix IV 230 FPGA [Ref. 9].

This custom pipeline implementation took around three man-years to complete, and resulted in us having a deep understanding of the Izhikevich spiking neuron model. While highly parameterised, this implementation is still rather inflexible (e.g. if the neuron model has to be changed) and is therefore of less utility to neuroscientists (our prospective customers) than existing software-based neural computation systems. However, this work did identify that, given some modest parallel compute, external memory bandwidth becomes a performance bottleneck for this application. As a consequence, we explore vector processing for neural computation, with a particular focus on making efficient use of external memory bandwidth using burst transfers.

Section 7.2 presents BlueVec, a vector coprocessor for an Altera NIOS II. Section 7.3 uses neural computation as a case study to compare and contrast custom computing, vector processing, and multi-core implementations; the results of this case study are given in Sec. 7.4. Section 7.5 provides conclusions and considers their implications for future research directions.

7.2. BlueVec Architecture

Recent work at the University of British Columbia has led to a series of soft vector processors [Refs. 5, 10–12] allowing the rapid development of high-performance, low-area program accelerators on FPGAs. Of these, VIPERS [Ref. 5] is perhaps the most interesting for neural computation applications: it supports lane-local memories that can be addressed independently and in parallel using a vector of addresses and, as we discover, this feature is ideal for parallel distribution of synaptic updates to neurons scattered throughout memory. Unfortunately VIPERS lacks an important feature for these applications: burst memory access for high-performance streaming of data from external memory. While the successors to VIPERS [Refs. 10–12] have made great progress towards optimising external memory bandwidth efficiency, they all omit lane-local memories. Therefore we have developed our own soft vector processor — BlueVec — to meet both requirements.

7.2.1. Vector width

BlueVec [Ref. 13] is a minimalist vector coprocessor — written in around 1,000 lines of Bluespec HDL — with two external interfaces: (1) a custom instruction slave interface for connection to a NIOS II, and (2) a memory-mapped master interface for connection to external memory. Assuming a NIOS II clock frequency of 200 MHz, and a DDR2 external memory transferring 64 bits of data on both edges of a 400 MHz clock, the maximum data transfer rate between processor and memory is 256 bits per NIOS II clock cycle. This motivates processing vectors of 256 bits per clock cycle, which can be treated as any of the following (a C-style view is sketched after the list):

• 8 × 32-bit words (W instructions), or
• 16 × 16-bit half-words (H instructions), or
• 32 × 8-bit bytes (B instructions).
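
A single 256-bit vector register can thus be pictured, in C terms, as the union below. This is an illustrative sketch only — BlueVec registers are not memory-mapped C objects — but it captures the three views named above.

    #include <stdint.h>

    /* Three views of one 256-bit BlueVec vector (illustrative only). */
    typedef union {
        uint32_t w[8];   /* W instructions: 8 x 32-bit words       */
        uint16_t h[16];  /* H instructions: 16 x 16-bit half-words */
        uint8_t  b[32];  /* B instructions: 32 x 8-bit bytes       */
    } vec256;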

7.2.2. Register file and instruction set

Given that a NIOS II custom instruction is defined as containing three 5-bit register operands, the obvious design choice for BlueVec is a three-operand vector instruction set with a 32-element register file. We take this option, but there are alternatives, e.g. a large scratchpad in place of a register file with support for long vectors, which would allow a greater number of vector lanes and reduce loop overhead [Ref. 11].

An illustrative portion of the BlueVec instruction set is shown in Fig. 7.1. Note the use of v and s prefixes to distinguish BlueVec vector registers and NIOS II scalar registers respectively. All BlueVec instructions are implemented as C macros which expand to inline assembly code. Hence any valid C expression or variable can be used in place of a scalar register, but vector registers must be constants in the range v0...v31. The following sections discuss parts of the BlueVec instruction set in more detail.

7.2.3. External memory

Vectors can be loaded from external memory using the instruction

Load(vDest, sAddr, burstLength).

Fig. 7.1. An illustrative portion of the BlueVec [Ref. 13] instruction set. Register names prefixed with v denote 256-bit vectors and those prefixed with s denote 32-bit scalars. Index i ranges from 0 to 31, 15, and 7 for byte (B), half-word (H), and full-word (W) instructions respectively. LOCALi denotes lane-local memory i, and MEM denotes external memory with a 256-bit data bus.

When executed, a burstLength-element sequence of 256-bit vectors beginning at address sAddr is read into registers

vDest, (vDest + 1), ..., (vDest + burstLength − 1).

However, the register file is not modified until a Commit instruction is issued. This allows the latent Load instruction to be a nonblocking operation that can be issued well before its result is actually needed, where need is signified by a blocking Commit. For example, data for the next iteration of a loop can be fetched while data for the current iteration is processed. In principle, Commit instructions could be inferred in hardware using register scoreboarding, but we have opted for an explicit design to keep the hardware simple.
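
The following sketch shows how Load and Commit might be used to overlap fetching and processing. The instruction spellings follow Fig. 7.1 and the text above, but the loop structure, the bare Commit spelling, the register allocation and the byte addressing are our assumptions for illustration, not code from the actual system.

    /* Double-buffered streaming: prefetch the next burst of eight
       256-bit vectors (256 bytes, assuming byte addressing) while
       the current burst is processed. */
    Load(v0, addr, 8);                 /* nonblocking prefetch       */
    for (i = 0; i < total; i += 256) {
        Commit;                        /* wait for outstanding burst */
        Load(v8, addr + i + 256, 8);   /* prefetch the next burst    */
        /* ... process v0..v7 here ... */
        /* swap the roles of v0..v7 and v8..v15 on alternate
           iterations (omitted for brevity) */
    }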

The corresponding store instruction is already a nonblocking operation and, at the time of writing, does not support bursts:

Store(vSrc, sAddr).

7.2.4. Lane-local memories

Each half-word vector lane has its own local Block RAM, giving a local memory that is accessible by a vector of 16 addresses. The instruction

LoadLocalH(vDest, vAddr)

loads LOCALi[vAddr[i]] into vDest[i] for each of the lane-local memories LOCALi where i ∈ {0 . . . 15}. The size of each LOCALi is a BlueVec design parameter that can be altered on a per-application basis. The corresponding store instruction is

StoreLocalH(vSrc, vAddr).

Only half-word versions of these instructions are supported, since 16-bit addresses are more appropriate for medium-sized block RAMs than 8 or 32 bits. Full-word variants can be coded by duplicating each 16-bit address to form a 32-bit address.

7.2.5. Pipelining

In order to achieve a clock frequency above 200 MHz, i.e. not inhibit the NIOS II clock frequency, BlueVec uses a 3-stage pipeline:

• F: operand fetch (from register file)
• E: execute instruction
• W: write back result (to register file).

Most instructions execute in a single cycle and provide a result that can be used immediately. This is made possible by register forwarding: the result and destination register of stages E and W are inspected by stage F and used to override, if necessary, the values fetched from the register file.

There are three instructions which do not fully complete in a single cycle:

• LoadLocalH completes in a single cycle, but the caller must wait one further cycle before reading the result; register forwarding is not possible at stage E since the output of block RAM is not yet available. Waiting can be achieved using NoOp or any other instruction that does not read the destination register.

• Mul completes in two cycles, but the caller must wait two further cycles before reading the result. This is due to a 3-cycle latency on FPGA multiplier blocks clocked at over 200 MHz. A deeper pipeline could alleviate this delay.

• Index takes 3 cycles to complete, since it must pass through the whole pipeline before a result can be returned to the NIOS II.

7.2.6. Record/playback

We observed that the NIOS II is unable to issue custom vector instructions at the maximum possible rate of one per cycle. As a workaround, we introduced a record/playback facility which allows sequences of instructions to be written to a local instruction memory inside BlueVec, and played back at the maximum rate by issuing a single instruction.

To illustrate this mechanism,

    Record(start);
    // Sequence of vector instructions
    Record(end);

records an instruction sequence that can be played back by calling Playback(start, end).

7.3. Neurocomputing Case Study

As a case study we compare our custom neural computation pipeline [Ref. 9] for the Izhikevich spiking neuron model [Ref. 14] to an implementation of the same algorithm using BlueVec. External memory bandwidth is critically important since the FPGA has insufficient BRAM to hold all of the neural parameters. Figure 7.2 illustrates the data that need to be stored:

(1) The neuron firing rule equation parameters are stored sequentially in memory.

(2) When the rule fires, a pointer is used to dereference a list of fan-out tuples containing (delay, pointer).

(3) After the appropriate delay, a long list of (neuron ID, weight) pairs is read, which needs to be processed by summing each weight into the appropriate neuron's I-value.

These data structures are intended to optimise the size of burst reads to external memory made by each phase of the algorithm, and hence bandwidth usage efficiency. Further details can be found in [Ref. 15].

Fig. 7.2. Layout of data in external memory.

The custom neural computation pipeline uses a separate computation unit for each of the three phases of the neural computation algorithm, each performing burst reads to external memory, and with pointers passed between them using FIFO communication channels. While the use of Bluespec SystemVerilog [Ref. 8], rather than conventional Verilog, improved productivity, this implementation still required three man-years to complete.

The BlueVec implementation of the Izhikevich spiking neuron model uses the same data structures in external memory as the custom pipeline implementation. Its implementation required one man-week to complete, in addition to the time required to implement the BlueVec architecture itself. While productivity for this implementation was aided by reuse of the design effort needed for data structures and common components such as DDR2 memory controllers, this still represents a striking difference.

To provide a more specific example of the reduction in design effort afforded by using a BlueVec vector coprocessor rather than a custom computation pipeline, we will focus on the application-of-synaptic-updates phase of the Izhikevich spiking neuron model (I-value accumulation). Details of the full algorithm can be found in [Ref. 15].

7.3.1. I-value accumulation

A spiking neural network consists of neurons connected by synaptic connections. In a typical biologically plausible network, each neuron has synaptic connections to around 10³ other neurons. Each synaptic connection has an associated delay and a weight, which signifies the strength of the connection. Collectively the combination of target neuron, delay and weight is known as a synaptic update. When a neuron spikes, each synaptic update needs to be delayed and then summed with other synaptic updates targeted at the same neuron to produce a total input current (termed an I-value) for each neuron. We refer to this process of summing synaptic updates as I-value accumulation.

If the I-value of neuron n is denoted ivalues[n], and its target connections and associated weights are stored in arrays targets and weights respectively, then the I-value accumulation process required if neuron n spikes is defined by the following loop:

    for (i = 0; i < numTargets; i++)
        ivalues[targets[i]] += weights[i];

In both the custom pipeline and BlueVec implementations, each array has elements which are 16 bits in size. Since the number of I-values is equal to the number of neurons, there is ample capacity for ivalues to be stored in on-FPGA Block RAM, which has a total size of 2 MB on a Stratix IV 230. However, as the number of synaptic connections is typically 10³ × the number of neurons, the targets and weights arrays for each neuron must be stored in external memory.

7.3.2. Implementation A: Custom pipeline

Assuming the targets and weights arrays are interleaved in memory to give an array of (target, weight) pairs called update tuples, our custom pipeline implementation of the I-value accumulation loop (known as the accumulator block in previous work) is shown in Fig. 7.3. The on-FPGA Block RAM used to store the I-values is partitioned into eight banks, since eight update tuples can be obtained (in a single 256-bit DDR2 memory transfer) per clock cycle when efficient burst reads are used. Each bank is surrounded by a pipeline which processes update tuples. Each update tuple in an incoming word is then allocated to the bank that holds the I-value for the target neuron, with FIFO queues and arbiters used to allow multiple update tuples in the same 256-bit word to target the same bank.

While the FIFO queues do provide some tolerance of uneven load between banks, in practice it was found that the highest performance was achieved when update tuples were arranged in 256-bit words such that they are effectively statically scheduled, with the update tuple in position x of a 256-bit word always targeting a neuron whose identifier modulo 8 is equal to x (or else being empty, denoted by zero weight).

Fig. 7.3. Custom pipeline implementation of the I-value accumulation phase. Four banks are shown for clarity — there are actually eight banks.

As a result of this static scheduling, the complex accumulator block effectively becomes a vector of independent blocks, and hence the function it performs is amenable to implementation using the BlueVec vector processor.

7.3.3. Implementation B: Vector processing

Figure 7.4 shows a BlueVec implementation of I-value accumulation. While this vectorised loop gives a good speed-up over the simple scalar loop, it is markedly improved by changing the burstLength argument of each Load instruction from 1 to 8. Consequently, the loop increment changes from 16 to 128, and each of the four instructions at the end of the loop is performed eight times, as follows:

Fig. 7.4. BlueVec implementation of the I-value accumulation phase (without bursts and record/playback). I-values are stored in lane-local memories.

    LoadLocalH(v0, v8);  NoOp;
    AddH(v0, v0, v16);   StoreLocalH(v0, v8);
    ...
    LoadLocalH(v0, v15); NoOp;
    AddH(v0, v0, v23);   StoreLocalH(v0, v15);

This results in a very long sequence of vector instructions that can be efficiently issued at a rate of one per clock cycle using the record/playback feature.

7.4. Results

We now discuss the performance and productivity of our custom computing and vector processing approaches to neural computation. Each approach was implemented on a Terasic DE4 evaluation board with a Stratix IV 230 FPGA, using a single DDR2 external memory bank.

7.4.1. Custom computing performance

Figure 7.5 illustrates the neural spike pattern for a synthetic neural network with biologically plausible numbers of synapses and firing rate running on the full custom pipeline system. The aim was to achieve real-time performance, which is demonstrated by Fig. 7.6: it plots the number of neurons spiking in each millisecond time interval (top diagram), the number of clock cycles needed to complete the resulting work (bottom diagram), and the time bound (horizontal line in bottom diagram), given that the system is running at 200 MHz and the neural model is running at a sampling interval of 1 ms. Although not plotted, we can report that the DDR2 bandwidth utilisation is around 75% for this model.

Fig. 7.5. Spike pattern for a computation of 64k neurons.

Fig. 7.6. Neurons spiking per sampling interval and clock cycles of work per sampling interval for a computation of 64k neurons.

We can see that real-time performance is met, so 1 s of neural model time takes 1 s to complete. We will see that, for the vector processor versions, 1 s of neural model time takes longer to complete, due to inefficiencies introduced by using a more general-purpose, software-programmable approach.

7.4.2. Single-core performance

Table 7.1 shows the times taken by our scalar (without BlueVec) and vector (with BlueVec) Izhikevich neural computation systems to compute 1 s of neural activity in a benchmark network consisting of 64k neurons with 64M synaptic connections [Ref. 15]. Note that the scalar version has been optimised for performance: I-values are stored in a large Block RAM, and the NIOS II has a 4 kB data cache with 256-bit cache lines that can be filled by single DDR2 memory transfers. Our custom pipeline implementation operates in real time, around 80× faster than the scalar version, but only 4× faster than the vector version.

The performance profiles in Table 7.1 are split into three phases. The time for the I-value accumulation phase (discussed in Sec. 7.3) is reduced by 40× using vector processing. The neuron update and spike delay phases have not been discussed here, but details can be found in [Ref. 15].

We observed that DDR2 bandwidth utilisation for the single-threaded vector version was 16%. The fact that a single BlueVec is not able to saturate memory bandwidth is a consequence of an imbalance between memory access and compute. For example, while the states of 16 neurons can be fetched in 6 memory transfers, 44 instructions are required to process them. There is scope for improvement by increasing the number of operational units in the vector processor. In the meantime, we explore the use of multiple cores to saturate memory bandwidth.

Table 7.1. Run time and % of total for phases of the Izhikevich neuron model with and without a BlueVec coprocessor.

7.4.3. Multi-core performance

Neural computation is a highly parallel task, and our benchmark neural network is easily split into smaller networks that can be processed in parallel with negligible communication. Table 7.2 shows the performance improvement obtained with multiple NIOS II and BlueVec cores accessing shared DDR2 memory and distributed Block RAMs for stack and instruction memory. The cores are connected in a star network, with a master core connected bidirectionally to all slave cores. Notably, the quad-core BlueVec configuration gives performance that is well within a factor of two of our custom pipeline. Interestingly, the two have almost identical logic utilisation.

Both the scalar and vector versions show good scaling to multiple cores. However, the sheer number of scalar cores required to keep up raises several concerns:

• The number of scalar cores that would be required to match the quad-core BlueVec implementation is well beyond the capacity of a single FPGA.

• As the number of cores increases, memory accesses become increasingly fragmented, and the performance of DDR2 memory drops.

• Inter-processor communication requirements grow as a neural network is divided into increasingly small chunks. In general, a simple star network will not scale, and more logic will be needed to efficiently connect the multitude of cores together.

Table 7.2. Run time, logic utilisation and bandwidth utilisation for a multi-threaded Izhikevich neuron simulator with varying numbers of cores and vector coprocessors.

7.4.4. Productivity

Table 7.3 shows the number of lines of code in our neural computation systems. The almost 3k lines needed for the custom pipeline implementation is particularly striking, indicating the extra level of detail that a hardware designer must express. In fact, this line count would be even higher if it were to include general-purpose libraries developed in-house. Hardware development cycles can be slow for other reasons too, such as long synthesis times, the trial-and-error refinements needed to meet tight timing constraints, and a lack of convenient I/O mechanisms for debugging.

The convenience of a software-based approach has allowed us to develop other efficient neural computation systems in a very short time. Figure 7.7 shows a screenshot of our leaky integrate-and-fire (LIF) system running the Nengo [Ref. 16] model for digit recognition on a DE4 FPGA with touch-screen. Using a single vector processor, we were able to achieve a 20× speed-up over a scalar implementation with just two days' work. We do not believe that implementing a custom pipeline LIF system in this time-frame is possible, even with re-use of components from the Izhikevich system.

Table 7.3. Lines of code for various implementations of the Izhikevich and LIF neuron models.

Fig. 7.7. An implementation of the Nengo digit-recognition model on a DE4 FPGA board with touch-screen.

7.5. Conclusion

We have demonstrated that a quad-core BlueVec (NIOS II + vector) machine has performance that falls within a factor of two of the custom pipeline for our neural computation case study, and has similar logic usage. The Bluespec code for BlueVec, together with the vectorised C code, proved to be much more compact and easier to develop than the custom pipeline, despite the custom pipeline itself being written in a high-level language (Bluespec).

Vector processing also allows the memory bottleneck to be managed through efficient use of burst access to DDR2/DDR3 memory. Given that memory bandwidth can easily become the bottleneck, inefficiency in the vector compute structures on the FPGA becomes mostly irrelevant. Perhaps it is not too surprising that a vector multiprocessor can achieve excellent performance on an FPGA, given that GPUs (which are also vector multiprocessors) are highly competitive [Ref. 17]. FPGAs do, however, offer the flexibility to tailor the vector unit to the application, e.g. using custom arithmetic operations. But perhaps more importantly for massively parallel systems, FPGAs offer the ability to customise the inter-chip communication network to suit the application.

Acknowledgements

Many thanks are due to our colleagues Prof. S.B. Furber (University of Manchester), Prof. A.D. Brown (University of Southampton) and Mr. Steven Marsh (University of Cambridge) for collaborating on neural simulation techniques. The UK research council, EPSRC, provided much of the funding through grant EP/G015783/1.

References

1. N.P. Sedcole et al. Run-Time Integration of Reconfigurable Video Processing Systems, IEEE Transactions on VLSI Systems, 15(9), 1003–1016, 2007.

2. M.E. Angelopoulou et al. A Comparison of 2-D Discrete Wavelet Transform Computation Schedules on FPGAs, in Proc. International Conference on Field-Programmable Technology, pp. 181–188, 2006.

3. R.C.C. Cheung, W. Luk, and P.Y.K. Cheung. Reconfigurable Elliptic Curve Cryptosystems on a Chip, in Proc. Design, Automation & Test in Europe Conference & Exhibition, pp. 24–29, 2005.

4. R.C.C. Cheung et al. A Scalable Hardware Architecture for Prime Number Validation, in Proc. International Conference on Field-Programmable Technology, pp. 177–184, 2004.

5. J. Yu et al. Vector Processing As a Soft Processor Accelerator, ACM Transactions on Reconfigurable Technology and Systems, 2(2), 12:1–12:34, 2009.

6. Q. Liu et al. Compiling C-like Languages to FPGA Hardware: Some Novel Approaches Targeting Data Memory Organization, The Computer Journal, 54(1), 1–10, 2011.

7. S.A. McKee. Reflections on the Memory Wall, in Proc. Conference on Computing Frontiers, p. 162, 2004.

8. R.S. Nikhil and K.R. Czeck. BSV by Example, CreateSpace, 2010.

9. S.W. Moore et al. Bluehive — A Field-Programmable Custom Computing Machine for Extreme-Scale Real-Time Neural Network Simulation, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 133–140, 2012.

10. P. Yiannacouras, J.G. Steffan, and J. Rose. VESPA: Portable, Scalable, and Flexible FPGA-based Vector Processors, in Proc. International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 61–70, 2008.

11. C.H. Chou et al. VEGAS: Soft Vector Processor with Scratchpad Memory, in Proc. International Symposium on Field Programmable Gate Arrays, pp. 15–24, 2011.

12. A. Severance and G. Lemieux. VENICE: A Compact Vector Processor for FPGA Applications, in Proc. International Conference on Field-Programmable Technology, pp. 261–268, 2012.

13. M. Naylor et al. Managing the FPGA Memory Wall: Custom Computing or Vector Processing?, in Proc. International Conference on Field Programmable Logic and Applications, 2013.

14. E.M. Izhikevich. Simple Model of Spiking Neurons, IEEE Transactions on Neural Networks, 14(6), 1569–1572, 2003.

15. P.J. Fox. Massively Parallel Neural Computation. Technical Report UCAM-CL-TR-830, University of Cambridge, Computer Laboratory, 2013. Available at: http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-830.pdf.

16. C. Eliasmith et al. A Large-Scale Model of the Functioning Brain, Science, 338(6111), 1202–1205, 2012.

17. B. Cope et al. Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study, IEEE Transactions on Computers, 59(4), 433–448, 2010.

Chapter 8

Maximum Performance Computing with Dataflow Technology

Michael Munday*, Oliver Pell*, Oskar Mencer*,† and Michael J. Flynn*,‡
*Maxeler Technologies
†Imperial College London
‡Stanford University

Reconfigurable computers, generally based upon field-programmable gate array (FPGA) technology, have been used successfully as a platform for performance-critical applications in a variety of industries. Applications targeted at reconfigurable computers can exploit their fine-grained parallelism, predictable low-latency performance and very high data throughput per watt. Traditional techniques for designing configurations are, however, generally considered time-consuming and cumbersome, and this has limited commercial reconfigurable computer usage. To solve this problem Maxeler Technologies Ltd, working closely with Imperial College London, have developed powerful new tools and hardware based on the dataflow computing paradigm. In this chapter we explore these tools and provide examples of how they have enabled the development of a number of high-performance commercial applications.

8.1. Introduction

The continuously increasing speed and memory capacity of supercomputers has, over the past decades, allowed for the creation of ever more complex and accurate mathematical simulations. There are challenges facing high-performance computing (HPC), however. Chief among these are the monetary and environmental costs involved in purchasing and running an HPC system. The electricity costs alone for an exascale supercomputer are estimated to be more than $80 million a year [Ref. 1].

To create a supercomputer that achieves the maximum possible performance for a given power/space budget, the architecture of the system needs to be tailored to the applications of interest. This involves optimally balancing resources such as memory, data storage and networking infrastructure based on detailed analysis of the applications. As well as these high-level optimizations, the architecture of the chips in the system needs to provide both speed and low power consumption.

Currently the top 500 supercomputers [Ref. 2] are built from relatively general-purpose servers which rely on CPUs (and, more recently, general-purpose GPUs) for computation. The architectures used by these chips are suitable for a wide range of tasks; however, this also means that their low-level architectures are not necessarily optimal for the applications the supercomputer is designed to run. Figure 8.1 shows how little of a modern CPU is dedicated to actual computation. The rest of the chip is dedicated to subsystems such as caches, branch predictors and schedulers designed to speed up programs. Far higher performance and efficiency can be had by designing the architecture such that it is a perfect fit for an application.

Fig. 8.1. Simplified diagram of an Intel Westmere 6-core processor chip, highlighting the approximate portion of the chip performing computation versus other functions.

The underlying architecture of a computer system can be optimized by developing application-specific integrated circuits (ASICs). Designing ASICs is, however, a very costly exercise, and limits how the supercomputer can be adapted and improved over time. Peter Cheung has made many contributions to reconfigurable computers based on chips such as field-programmable gate arrays (FPGAs); they are a lower-cost way of unlocking the gains that architectural customization can bring, while retaining the programmability that makes general-purpose computing chips so popular.

8.2. Dataflow Technology

Maxeler's dataflow computing paradigm is fundamentally different to computing with conventional CPUs. The technology is an evolution of the dataflow computer [Ref. 3] and systolic array processor [Ref. 4] concepts developed in the 1970s and 1980s.

Figure 8.2 shows how computing with CPU cores compares to the model used in a dataflow computer. Rather than constantly fetching from and writing to the main memory, data is read once and then moved through dataflow cores placed exactly where required. The underlying architecture is adapted to the application and the data movement is minimized.

8.2.1. Dataflow engines

Maxeler has developed dataflow engines (DFEs) to provide a high-performance reconfigurable dataflow computing platform. A DFE has at its heart a modern high-speed reconfigurable chip. Wired to this chip is a significant quantity of memory for storing application data and various high-bandwidth interconnects that allow the DFE to communicate quickly with other DFEs, networks and general-purpose computers.

Fig. 8.2. Computing with a control flow core (a) compared to dataflow cores (b).

The reconfigurable chips used so far in DFEs have been FPGAs. FPGAs have a very flexible fabric in which dataflow graphs can be emulated. The development of reconfigurable integrated circuits designed specifically for DFEs could bring significant performance improvements to applications and improve programmability. There are also modifications that could be incorporated into FPGAs to make them more suitable for dataflow applications. The buffers used to enforce schedules in dataflow graphs, for example, are currently implemented using dedicated FPGA memory resources such as block RAMs, with separate addressing logic implemented in configurable logic blocks (CLBs). This creates routing congestion around the memory block and reduces the frequency at which the buffers can be clocked. Autonomous memory blocks [Ref. 5] could solve this problem by creating a dedicated configurable addressing circuit within the memory block itself. As well as reducing congestion, this would reduce the amount of work required from compilation tools to place the CLBs.

8.2.2. MaxCompiler

DFE configurations are designed in the Java-like MaxJ language and are compiled using MaxCompiler. Figure 8.3 shows how MaxCompiler fits into the compilation system used for an application.

Fig. 8.3. Compilation flow of a C/Fortran-based MaxCompiler application.

MaxCompiler designs use an architecture which is conceptually similar to the globally asynchronous locally synchronous (GALS) architecture previously used in FPGA design [Ref. 6]. Designs consist of a number of locally synchronous kernels connected together, and to other asynchronous DFE resources, using the manager. This architecture allows the clock rate of each kernel to be individually tweaked to provide the optimum data throughput vs. routability trade-off.

Once a configuration has been compiled and loaded onto a DFE, it is communicated with using a simple, automatically generated API accessible from a variety of languages such as C, R and Python.

8.2.2.1. Kernels

Kernels are at the heart of the MPC concept. Kernels describe computation and data manipulation in the form of dataflow graphs.

MaxCompiler transforms kernels into fully pipelined synchronous circuits. Figure 8.4 shows how a simple MaxJ kernel maps to a dataflow graph. The circuits are scheduled automatically to allow them to operate at high frequencies and make optimal use of the resources on the DFE. As well as enabling the maximum exploitation of low-level parallelization, the fully pipelined nature of kernels allows them to be easily modeled at a high level, which helps to avoid the wasted programming effort sometimes seen with other, less predictable programmable systems.

Fig. 8.4. MaxJ source code for (a) a simple moving-average kernel and (b) the resulting generated dataflow graph.
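
For readers unfamiliar with MaxJ, the computation performed by the kernel of Fig. 8.4 corresponds to the plain-C loop below. This is our sketch of a three-point moving average; the real kernel expresses the same arithmetic as a dataflow graph over a stream rather than as a loop.

    /* Three-point moving average over n samples; the first and last
       outputs are omitted for clarity (boundary handling varies). */
    void moving_average(const float *x, float *y, int n) {
        for (int i = 1; i < n - 1; i++)
            y[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0f;
    }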

8.2.2.2. The manager

The manager describes the asynchronous parts of a DFE configuration, which are connected together using a simple point-to-point interconnection scheme [Ref. 7]. The manager allows developers to connect resources such as memory and PCIe streams to kernels to keep them fed with data.

8.2.2.3. SLiC (Simple Live CPU) API

DFEs are controlled from a CPU using the SLiC API. This API comes in a variety of flavours, ranging from the basic and advanced automatically generated static APIs to the more flexible dynamic API. As well as a C API, MaxCompiler can generate bindings in a variety of other languages so that DFEs can be used to accelerate portions of Matlab, R and Python code without the need for users to write their own wrappers.

8.3. Financial Valuation and Risk Analysis

There is a need in the finance industry to compute ever more complex mathematical models in order to accurately price instruments and calculate exposure to risk. These applications often rely on complex control flow, which can be difficult to accelerate using high-performance single-instruction multiple-data (SIMD) architectures.

8.3.1. Tranched credit derivatives

Credit derivatives are financial instruments used to transfer some of the risk associated with an asset, such as a bond, to an entity other than the lender in return for regular payments.

Collateralized debt obligations (CDOs) are a type of credit derivative created from a pool of assets. This portfolio is divided into layered tranches. Each tranche represents a risk class; junior tranches have a high risk of default and so provide higher returns, whereas senior tranches have a low risk of default and provide lower returns. This differential exists because if the value of the underlying assets drops, investors holding the senior tranches will be paid before those holding junior tranches.

Dataflow technology has been used to accelerate CDO valuation, showing a 31× speedup when compared to 8-core Xeon servers [Ref. 8]. The random factor loading model [Ref. 9] is used to value the CDO tranches. This application highlights the power savings that can be achieved using dataflow technology, achieving the 31× speedup while using only 6% more power than the reference system.

8.3.2. Interest rate derivatives

Interest rate derivatives are financial products which allow investors to incorporate interest-rate-based instruments into their portfolios. One example of an instrument which can be the underlying asset behind an interest rate derivative is an interest rate swap. In a swap, one party commonly pays a floating rate of interest linked to the interest rate set by an authority such as a central bank, and a second party pays a fixed rate of interest. Often the fixed-rate interest is paid directly to the party paying the floating rate. In this case the first party is taking a risk in return for the opportunity to make a profit should the floating rate be lower on average than the fixed rate.

Calculating the value of derivatives based on instruments such as swaps and other assets is complicated and needs to take into account a wide range of variables such as foreign exchange, interest and equity rates. It is impractical for financial institutions to simulate every possible scenario, and so the Monte Carlo method is often adopted, simulating many possible scenarios and combining the results to calculate a price.
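
In outline, a Monte Carlo pricer averages the discounted payoff over many simulated scenarios. The following minimal C sketch illustrates the idea; the payoff model is a placeholder of our own, standing in for the application-specific market simulation, and none of this is Maxeler code.

    #include <stdlib.h>

    /* Placeholder scenario model: a real pricer would simulate market
       variables such as interest, FX and equity rates here. */
    static double simulate_payoff(void) {
        return (double)rand() / RAND_MAX;
    }

    /* Average discounted payoff over nScenarios simulated scenarios. */
    double monte_carlo_price(long nScenarios, double discount) {
        double sum = 0.0;
        for (long s = 0; s < nScenarios; s++)
            sum += simulate_payoff();
        return discount * (sum / nScenarios);
    }

Because the scenario simulations are independent, it is this inner loop that lends itself to massively parallel evaluation on a DFE.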

The Monte Carlo method has been used as the basis for a derivative pricing system which achieved wall-clock speedups of up to 200× versus single cores [Ref. 10]. The use of Maxeler's dataflow technology meant that a complex design could be developed incorporating:

• Multi-asset classes.
• Flexibility to accommodate idiosyncratic payoff functions.
• Accurate and stable finite difference risk calculations.

8.4. Geophysics

Dataflow technology is well suited to a number of geophysics applications, and speedups of approximately 200× compared to single CPU core implementations have been recorded [Refs. 11, 12]. Geophysics applications often involve manipulating large datasets and many of the algorithms used can be concisely described in kernels. Applications based on fast Fourier transforms [Ref. 13] and explicit finite difference [Ref. 14] have been successfully implemented using dataflow technology.

8.4.1. Reverse time migration

Reverse time migration (RTM) is a computationally intensive algorithm used in the oil and gas industry for subsurface imaging. As part of the exploration process, acoustic waves are broadcast through the Earth's surface and the reflections recorded. Geophysicists use these reflections to create models of structures beneath the ground. Computers running RTM are then used to model the wave propagation forwards from the source (mirroring the physical process) followed by the propagation of the reflections backwards from the receiver (the reverse of the physical process). The two are correlated and combined into an image which is then used to further refine the earth model.

Finite difference is a well-known way of solving the partial differential equations used to model acoustic wave propagation. Figure 8.5 shows a basic form of the algorithm used. RTM based on finite difference and using Maxeler's MaxGenFD tool has been accelerated using dataflow technology, achieving speedups of around 200× when compared to single CPU cores [Ref. 15]. These speedups enable the use of higher frequency modeling, improving the quality of the results.

Fig. 8.5. Pseudo-code for a finite difference wave propagator.
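A minimal 1-D version of such a propagator might look as follows. This is a sketch of the standard explicit second-order update only; it assumes p and pPrev hold the current and previous wavefields and that dvv folds together the earth-model velocity and the time/space step sizes. Production RTM stencils are 3-D and of much higher order.

// Minimal 1-D explicit finite difference wave propagation sketch, in the
// spirit of the pseudo-code of Fig. 8.5 (illustrative assumptions throughout).
public class WavePropagator1D {
    static void timeStep(double[] p, double[] pPrev, double[] dvv) {
        double[] next = new double[p.length];
        for (int x = 1; x < p.length - 1; x++) {
            double laplacian = p[x - 1] - 2.0 * p[x] + p[x + 1]; // 2nd-order stencil
            // Standard explicit update: p(t+1) = 2p(t) - p(t-1) + dvv * laplacian
            next[x] = 2.0 * p[x] - pPrev[x] + dvv[x] * laplacian;
        }
        System.arraycopy(p, 0, pPrev, 0, p.length);
        System.arraycopy(next, 0, p, 0, p.length);
    }
}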

8.4.2. MaxGenFD

MaxGenFD [Ref. 14] is a framework built on top of MaxCompiler which simplifies the development of high performance explicit finite difference solvers for seismic processing applications. MaxGenFD manages domain decomposition, boundary conditions, convolution and data set handling. Applications such as forward modeling, RTM and full waveform inversion can be quickly developed and optimized using MaxGenFD and put into production.

Figure 8.6 shows how data is moved through the DFE in a typical MaxGenFD application. The Earth model remains constant throughout a simulation whereas the wavefields are updated as part of each time step.


MaxGenFD extends the flexible stream typing system provided by MaxCompiler (first showcased by Maxeler as part of the Photon compiler [Ref. 16]). The typing system allows developers to prototype a design using floating point types and then quickly transition to a variable-width fixed point implementation of the same design. Additionally, MaxGenFD globally scales fixed point values at runtime to maximise precision. One of the limitations of this technique is that geophysicists need to know the relative maximums of the wavefield inputs to their kernel in order to maximise their individual precision. Lightweight dynamic scaling systems such as Dual FiXed-point [Ref. 17] could be adopted in the future to make this system more flexible without resorting to computationally expensive floating point operations. For example, the second exponent could be used to catch peaks in the wavefield, allowing the bulk of the wavefield to be computed in a higher precision using the first exponent.

Fig. 8.6. Diagram showing the flow of data in a typical MaxGenFD application.
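To illustrate the dual fixed-point idea referenced above, the following sketch encodes a value using one select bit that chooses between two exponents, so that large peaks and small-amplitude regions share one compact word. The field widths and exponent values here are illustrative assumptions, not the format of [Ref. 17].

// Sketch of the dual fixed-point (DFX) idea: a select bit picks one of two
// scalings. All widths and exponents below are assumed for illustration.
public class DualFixedPoint {
    static final int FRAC_BITS = 14;  // fraction width (assumed)
    static final int EXP1 = 8;        // extra range of the "coarse" scaling

    // Encode: use the fine scaling if the value fits, else the coarse one.
    static int encode(double v) {
        long fine = Math.round(v * (1 << FRAC_BITS));
        if (Math.abs(fine) < (1 << (FRAC_BITS + 1))) {
            return (int) (fine << 1);                 // select bit = 0 (fine)
        }
        long coarse = Math.round(v * (1 << (FRAC_BITS - EXP1)));
        return (int) ((coarse << 1) | 1);             // select bit = 1 (coarse)
    }

    // Decode: the select bit chooses which scaling to undo.
    static double decode(int w) {
        boolean coarse = (w & 1) != 0;
        int mant = w >> 1;
        int shift = coarse ? (FRAC_BITS - EXP1) : FRAC_BITS;
        return mant / (double) (1 << shift);
    }
}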

Research into multiple-wordlength optimization (for example [Ref. 18]) is ongoing and is important for tools such as MaxGenFD as it allows them to make the best use of the available resources on reconfigurable chips. As MaxGenFD designs are a relatively high-level description of a mathematical algorithm, it might even be possible in the future to automatically work out the optimal types given an acceptable error constraint. A similar concept has been demonstrated for floating point implementations of Fourier transforms using automatic differentiation [Ref. 19].

8.4.3. CRS seismic trace stacking

CRS stacking [Ref. 11] is an algorithm used to process seismic survey data to compute zero-offset traces. These can be easily computed from the input data given eight parameters to the CRS equation; however, the values for these parameters are unknown. The stacking application must determine good values for the eight parameters before it can compute the stack. This is done by performing a search in 8-D space. A fitness function is evaluated at each point in the space to determine the quality of the current parameter set.

On conventional CPU-based computers, the execution time to compute CRS for a typical survey can be of the order of one month using 1,000 CPU cores, and this runtime is dominated by the computation of the semblance (fitness) and traveltime. The traveltime function calculates the location that should be read from a data trace based on the eight CRS parameters, while the semblance function operates on a window of data values from that location.
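Structurally, the computation can be pictured as in the following sketch, in which candidate parameter sets are scored one by one. The traveltime placeholder and the simplified single-sample semblance here are assumptions for illustration only; the real functions depend on the full CRS equation and a window of samples per trace.

// Structural sketch of the CRS parameter search (assumed organization):
// traveltime() locates the sample to read from each trace for a candidate
// parameter set, and semblance() scores the coherence of those samples.
public class CrsSearchSketch {
    static double[] bestParams(double[][] traces, double[][] candidates) {
        double best = Double.NEGATIVE_INFINITY;
        double[] bestP = null;
        for (double[] p : candidates) {             // points in the 8-D space
            double fitness = semblance(traces, p);  // fitness of this point
            if (fitness > best) { best = fitness; bestP = p; }
        }
        return bestP;
    }

    // Semblance-style coherence ratio over one sample per trace (real CRS
    // semblance uses a window of samples around the traveltime location).
    static double semblance(double[][] traces, double[] p) {
        double num = 0.0, den = 0.0;
        for (double[] trace : traces) {
            int idx = traveltime(trace, p);         // where to read this trace
            double v = trace[Math.max(0, Math.min(trace.length - 1, idx))];
            num += v; den += v * v;
        }
        return den == 0.0 ? 0.0 : (num * num) / (traces.length * den);
    }

    // Placeholder: the real traveltime is a function of all eight parameters
    // and the source/receiver geometry.
    static int traveltime(double[] trace, double[] p) {
        return (int) Math.abs(p[0] * 10 + p[1]);
    }
}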

The application is accelerated by moving the semblance and traveltime functions onto a DFE. The code which controls the search through the parameter space remains untouched on the CPU.

The extra computational capability of the DFEs compared to the CPU means that some algorithmic changes are required to reach the full potential performance. When a single search is performed, many data items are read, but relatively little computation is performed on each item. By transforming the algorithm so multiple sets of computation are performed in parallel, the computational intensity can be increased and this allows the DFEs' performance to be fully exploited without being limited by memory bandwidth.

With both semblance and traveltime computation on the DFE, over 99% of the total runtime from the CPU is accelerated. The final implementation gives a speedup of approximately 230× compared to a single core for land datasets and 190× for marine datasets [Ref. 11], with approximately 30× greater performance per watt.


8.5. Conclusions

Research has shown that dataflow computers using reconfigurable logic can both significantly improve the speed at which computation is performed and reduce power consumption when compared to traditional computer technology. Dataflow technology is a proven way to push the performance of reconfigurable computers while maintaining an easy-to-use and flexible development system.

References

1. O. Pell and O. Mencer. Surviving the End of Frequency Scaling with Reconfigurable Dataflow Computing, SIGARCH Computer Architecture News, 39(4), 60–65, 2011.
2. Top500 Supercomputer Types. [Online] Available at: http://www.top500.org. [Accessed 4 April 2013].
3. J.B. Dennis. Data Flow Supercomputers, Computer, 13(11), 48–56, 1980.
4. H.T. Kung. Why Systolic Architectures?, Computer, 15(1), 37–46, 1982.
5. W.J.C. Melis, P.Y.K. Cheung, and W. Luk. Autonomous Memory Blocks for Reconfigurable Computing, in Proc. International Conference on Circuits and Systems, vol. 2, pp. 581–584, 2004.
6. A. Royal and P.Y.K. Cheung. Globally Asynchronous Locally Synchronous FPGA Architectures, in Proc. International Conference on Field Programmable Logic and Applications, pp. 355–364, 2003.
7. T.S.T. Mak et al. On-FPGA Communication Architectures and Design Factors, in Proc. International Conference on Field Programmable Logic and Applications, pp. 1–8, 2006.
8. S. Weston et al. Accelerating the Computation of Portfolios of Tranched Credit Derivatives, in Proc. Workshop on High Performance Computational Finance, pp. 1–8, 2010.
9. L. Andersen and J. Sidenius. Extensions to the Gaussian Copula: Random Recovery and Random Factor Loadings, Journal of Credit Risk, 1(1), 29–70, 2004.
10. S. Weston et al. Rapid Computation of Value and Risk for Derivatives Portfolios, Concurrency and Computation: Practice & Experience, 24(8), 880–894, 2012.
11. P. Marchetti et al. Fast 3D ZO CRS Stack: An FPGA Implementation of an Optimization Based on the Simultaneous Estimate of Eight Parameters, in Proc. European Association of Geoscientists and Engineers Conference, 2010.
12. T. Nemeth et al. An Implementation of the Acoustic Wave Equation on FPGAs, SEG Technical Program Expanded Abstracts, 27(1), 2874–2878, 2008.
13. C. Tomas et al. Acceleration of the Anisotropic PSPI Imaging Algorithm with Dataflow Engines, in Proc. Annual Meeting and International Exposition of the Society of Exploration Geophysics, 2012.
14. O. Pell et al. Finite Difference Wave Propagation Modeling on Special Purpose Dataflow Machines, IEEE Transactions on Parallel and Distributed Systems, 24(5), 906–915, 2013.
15. D. Oriato et al. FD Modeling Beyond 70Hz with FPGA Acceleration, in Proc. SEG HPC Workshop, 2010.
16. J.A. Bower et al. A Java-based System for FPGA Programming, in Proc. FPGA World Conference, 2008.
17. C.T. Ewe, P.Y.K. Cheung, and G.A. Constantinides. Dual Fixed-point: an Efficient Alternative to Floating-point Computation, in Proc. International Conference on Field Programmable Logic and Applications, pp. 200–208, 2004.
18. G. Constantinides, P.Y.K. Cheung, and W. Luk. The Multiple Wordlength Paradigm, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 51–60, 2001.
19. A. Gaffar et al. Floating-point Bitwidth Analysis via Automatic Differentiation, in Proc. International Conference on Field-Programmable Technology, pp. 158–165, 2002.


Chapter 9

Future DREAMS: Dynamically Reconfigurable Extensible Architectures for Manycore Systems

Lesley Shannon
School of Engineering Science, Simon Fraser University

In today's world of cloud computing and petascale computing facilities (with discussions of exascale computing system design), manycore systems are indeed a reality. The challenge of increasing computational power, while managing/reducing power consumption, is leading to the consideration of new compute solutions. This chapter proposes how reconfigurability may become part of the solution for these new systems and asks whether it is time to reconsider some of the underlying components in traditional computing system design.

9.1. Introduction

Since the sale of the first commercially available microprocessors in 1971 (Intel's 4004) [Ref. 1], computing system design has been on an exponential trajectory, leveraging the exponential increase in transistor counts and clock frequencies. Innovations range from RISC instruction sets, to caches, to branch predictors; however, the fundamental CPU architecture proposed by von Neumann and others has changed very little. Although more innovative architectures have been proposed (e.g. dataflow machines, dynamically reconfigurable computers), they have not garnered the commercial success of those based on the original von Neumann style architecture.

Recently, computing design innovations have come in the form of multicore/manycore architectures and networks-on-chip, stamping out an increasing number of traditional computing cores on a single die to increase performance. However, this has not led to corresponding increases in computational efficiency due to the overheads and challenges inherent in parallel programming and resource sharing. Given the future of systems with thousands/hundreds of thousands of processing elements, we have an opportunity to reexamine computing system design and perhaps revisit some of the architectural fundamentals that have changed minimally over the past forty years.

This chapter questions the future of computing system design for manycore systems and discusses the place of reconfigurability and reconfigurable computing in this future. First is a brief description of computing innovation in the 1900s. Next, the changes and challenges of the past twelve years are presented. Finally, a vision for the future of computing and the opportunities and problems that lie ahead are outlined.

9.2. Yesterday

Since the first commercial CPUs of the 1970s, computer architectures have been on an exponential growth track, leveraging Moore's Law. Datapaths grew from 4 bits to 8 bits, followed by 16 bits, 32 bits, and now 64 bits (even 128 in specialized cases). Additional highlights from these many years of research as to how to improve the efficiency and processing throughput of a single processor's architecture include investigations of:

• effective instruction set architectures (ISAs) (e.g. CISC versus RISC),
• pipelining,
• how to leverage additional on-chip memories (i.e. caches),
• how to increase a system's memory using virtual memory, and
• how to efficiently guess what to do when encountering a conditional branch (i.e. branch predictors).

The impact of shrinking process technology sizes, resulting in the exponential increase in the number of transistors that can be fabricated on a chip with each new generation, combined with all the other new technological advances, has had an astounding impact on computing system design. As a simple example, the ever popular Commodore 64 [Ref. 2] from the early 1980s had 64 kB of memory, whereas today's personal computing systems easily have a terabyte worth of system memory (increasing by more than 15 million times in size over the last 30 years) and 8 GB of main memory (an increase of more than 130 thousand times). Computing clock frequencies have also similarly increased, if at a somewhat slower rate, up until 2004. Concurrently, improvements in the accompanying software technology (e.g. compilers and operating systems) have provided abstractions that have made these platforms accessible to a large user base.

9.2.1. Alternative computing system designs

In conjunction with the more traditional avenues of investigation, researchers have also explored other computing models that relied more on spatial computing than on the temporal sharing of resources. One form of spatial computing examined dataflow architectures such as dataflow process networks [Ref. 3], Kahn process networks [Ref. 4], and YAPI [Ref. 5]. These architectures, like systolic arrays, assume that computing elements are networked together via communication links that are used to move data through the system. Although this style of architecture is very useful for data-intensive applications (e.g. streaming applications), it is not well suited to applications that have numerous conditional branches, as the resources for both options of a branch must be allocated even though only one of them is ever used. Furthermore, unless the connectivity pattern between computing elements can be dynamically reconfigured between applications, the architecture is effectively hard coded to perform one set of operations.
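The communication model of such process networks can be illustrated with two threads connected by a FIFO channel and communicating only through blocking reads and writes. This minimal Java sketch illustrates the model only and is not any of the cited frameworks; the bounded queue stands in for a KPN's conceptually unbounded channel.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two processes connected by a FIFO channel: the producer writes tokens,
// the consumer blocks until tokens arrive. No shared state is used.
public class KpnSketch {
    public static void main(String[] args) {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) channel.put(i * i);  // write tokens
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++)
                    System.out.println(channel.take());           // blocking read
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start(); consumer.start();
    }
}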

Inspired by the reconfigurable hardware platforms provided by field programmable gate arrays (FPGAs), researchers have also considered dynamically reconfigurable computing architectures. These architectures have ranged from dynamic arrays of computing elements that can be reconfigured at a relatively fine grain for bit-wise operations [Refs. 6 and 7], to more coarse-grain reconfigurable fabrics that operate on data words [Refs. 8 and 9]. These reconfigurable fabrics can then be integrated into the computing system as a reconfigurable functional unit (RFU). The RFU may be tightly integrated into the processor's pipeline [Refs. 6 and 7] or more loosely integrated as a co-processing unit that can operate somewhat independently of the rest of the CPU [Refs. 8 and 9].

Since the spatial portion of these computing architectures is dynamically reconfigurable, their resources can be adapted and reused to solve any compute problem, making them more area efficient than a fixed dataflow architecture. However, reconfiguration comes at a cost in terms of time and space. The time required to reconfigure a fabric is directly proportional to the granularity of its configuration. Fine-grain fabrics require the largest number of configuration bits and, therefore, have the largest area overhead for their configuration infrastructure as well as the longest configuration times. Conversely, coarse-grain fabrics need fewer configuration bits and do not require as long to reconfigure. However, their area overhead costs arise from unusable resources or inefficient use of the fabric, as the additional configuration bits in fine-grain fabrics make it easier to customize them more precisely to the needs of the computation.

Although reconfigurable computing architectures can accelerate certain applications, the overhead of dynamic reconfiguration is generally too costly to make this approach worthwhile. As such, in spite of the potential for hardware acceleration, dynamic partially reconfigurable computing architectures have not had much success to date.

9.2.2. Summary

From one perspective, even a traditional CPU can be viewed as dynamically and partially reconfigurable: the instruction decoder "reconfigures" the CPU's datapath for each instruction to be executed. The key is that this configuration is so coarse-grain that the "configuration bits" (i.e. machine code instruction) can be stored in relatively few bytes and the datapath can be "reconfigured" on a cycle by cycle basis for each instruction. Single-instruction-multiple-data (SIMD), multiple-instruction-multiple-data (MIMD), and instruction windows enable the CPU to share its datapath for the execution of multiple instructions concurrently, to increase instruction level parallelism. Furthermore, their programming models and compiler flows are considerably simpler than those of actual dynamic partially reconfigurable computing architectures as their parallelism is limited. However, this restricted parallelism, due to limited computing resources and lack of application-specific customization, means they cannot compare with the efficiency of application execution using custom hardware.

9.3. Today

Today's high-performance processors rely on parallelism in lieu of increasing clock rates to improve computing efficiency. Not only can threads share a CPU's pipeline through simultaneous multi-threading, but most modern processors have multicore architectures, where multiple CPU cores (two, four, or eight cores) are stamped out on the same die. Although these architectures originally had a shared bus format, network-on-chip architectures are becoming more prevalent with the increasing number of cores (16, 32, 64).

9.3.1. Manycore computing

As these systems become larger (hundreds/thousands of cores) and more prevalent, sharing resources efficiently among users, applications, and threads becomes more challenging. Realizing that the cost of moving data around the network from different "domains of coherency" can more than double execution time [Ref. 10], not to mention energy consumption, means that efficient methods of dynamically allocating resources to different applications are essential. To make these complex platforms more accessible to users, abstractions such as the "cloud" have been created. However, recent reports indicate that information technology represents 10% of the world's electricity consumption [Ref. 11]. Therefore, as we look forward, we need to question how we design these systems to be both computationally and energy efficient.

Based on power consumption alone, exascale computing systems will not be able to be built solely from high performance processors. One possibility is to build these new systems using low end processors with lower power consumption, such as those popular in embedded computing devices. Another possibility is to consider a more heterogeneous system, with FPGAs to implement hardware accelerators, general purpose graphics processing units (GPGPUs) for floating-point-intensive modules, and general purpose processors. Although programming and allocating resources in a homogeneous system is an easier problem, the heterogeneous solution may provide better overall performance at comparable costs, assuming the appropriate software infrastructure is in place.

9.3.2. Challenges and opportunities

Manycore systems provide both challenges and opportunities for researchers and programmers. What types of computing components should they comprise? What type of synthesis and compilation flow is needed to map designs to these systems? How will the system's scheduler adapt to the various workloads? What additional software infrastructure/changes in programming models are needed to expose an application's parallelism and facilitate its mapping to a manycore system's resources? The remainder of this chapter will focus on discussing a vision of a solution to some of these challenges.


9.4. A Vision of Tomorrow

This section proposes a vision for the programming, compilation, scheduling, and design of a dynamically reconfigurable extensible architecture for manycore systems (DREAMS). To allow for the greatest flexibility in this model, it is assumed that these systems may be composed of heterogeneous computing (processing) elements.

9.4.1. Programming and compilation

The programming model for the proposed system should rely on high-level languages. The model should support domain-specific languages as well as the more generic ones popular with programming experts and novices. Although not all languages may be compiled down to the same efficiency of execution on the platform, they should all work. Instead, the expectation is that great advances in compiler technology are needed.

Specifically, the compiler will have to uncover parallelism and generate executable subcomponents that can be executed on multiple processing element (PE) architectures to support the platform's heterogeneity. As shown in Fig. 9.1(a), current synthesis/compilation flows translate multiple source modules into a single executable format for a specified platform. However, as shown in Fig. 9.1(b), the proposed solution for heterogeneous manycore systems is to compile the source modules into a set of executable modules. In this scenario, not only should these executable modules be mappable to a variety of PE types, but they should also be mappable to platforms with a varied number of PEs so that the design's final executable format remains portable to different platforms. Finally, it is expected that the compiler will accept additional input, along with the traditional program source files, that will be evaluated by an additional analysis phase beyond what is found in current compilers.


Fig. 9.1. (a) Current synthesis and compilation model. (b) Proposed synthesis and compilation model.

The additional input to the compiler will be in the form of metadata from the user. It can be used to indicate information ranging from the criticality of high speed execution to the number of PE resources a user wishes to allocate to the problem (as this may impact the cost of performing cloud-based processing). Additionally, information such as compilation time and effort may be included along with the user's perspective as to key or bottleneck processing subcomponents.

The additional analysis phase will be used to perform a static analysis of the program and the libraries that it can use. Ideally, this will generate additional metadata (i.e. the analysis metadata shown in Fig. 9.1(b)) that can be used by the platform's scheduler to effectively allocate resources so as to increase processing efficiency. This information may affect the number and/or type of resources allocated to the application's execution. Finally, instead of only generating application threads based on the programmer's software description of the implementation, the compiler will be able to generate executable modules comprised of task-level threads of execution to be mapped to various PEs, adding the necessary synchronization primitives needed to ensure that the final implementation is functionally equivalent to the original description.

9.4.2. Scheduling


To schedule multiple different applications on a heterogeneous fabric concurrently and efficiently, the scheduling software needs information about the programs currently being scheduled. As described in the previous section, this can come in the form of metadata about the program from the compiler and user. In fact, it is expected that this type of information will be embedded in the executable as a header that can be used to allocate PEs and schedule task execution.
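For illustration, such a header might carry fields along the following lines. Every field name here is a speculative assumption, since the chapter deliberately leaves the format open.

// Speculative sketch of an executable metadata header for DREAMS-style
// scheduling; all fields are illustrative assumptions, not a defined format.
public final class ExecutableMetadata {
    public final int    requestedPEs;        // PE budget the user wants to allocate
    public final int    priority;            // criticality of high-speed execution
    public final String preferredPEType;     // e.g. "APE", "GPE", "FPE"
    public final double expectedParallelism; // from the compiler's static analysis

    public ExecutableMetadata(int requestedPEs, int priority,
                              String preferredPEType, double expectedParallelism) {
        this.requestedPEs = requestedPEs;
        this.priority = priority;
        this.preferredPEType = preferredPEType;
        this.expectedParallelism = expectedParallelism;
    }
}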

To assist the scheduler and enable it to adapt to unknown workloads at runtime, runtime performance-profiling hardware should be built into the computing fabric. This can be used to obtain additional runtime statistics about the application's threads: what resources they need, how efficiently they run on different types of PEs, etc. Ideally, these profiling resources can be used both by the programmer to understand the program's runtime behaviour so as to better use processing resources and by the OS to adapt its resource allocation to the needs of all threads in the current workload.

9.4.3. Proposed architecture

In manycore systems with hundreds/thousands of processors, the cloud abstracts the "sea" of computing resources from the user. However, a hardware abstraction is also needed at the execution level to "hide" the platform's complexity from the individual applications executing on it. Figure 9.2 provides a simple overview of the proposed DREAMS architecture with a homogeneous sea of PEs, reflective of current manycore systems. Ideally, this sea will instead be a heterogeneous mix of PEs, as shown in Fig. 9.3, comprising PEs with asymmetric configurations (APEs), or GPGPUs (GPEs), and even FPGAs (FPEs). This variety of processing elements ensures that different application types (e.g. streaming, floating-point intensive, etc.) can execute more efficiently by mapping them to their appropriate PE types.


Fig. 9.2. A potential hierarchical architecture for a homogeneous sea of processing elements.

Fig. 9.3. A potential hierarchical architecture for a heterogeneous sea of processing elements.

The proposed architecture is hierarchical, with a fabric allocator & extender at the highest level. The function of this unit is twofold. First, it provides a method of connecting existing computing fabric resources to newly added units to extend the computing fabric, without requiring all of the PEs themselves to be aware of the new additions. Secondly, it reads the metadata in the header, and only the header, and uses the profiling data in the header in conjunction with a record of the workload currently allocated to the fabric and being monitored by the resource allocators & schedulers.


To ensure that the fabric allocator & extender unit does not become the bottleneck for processing on the fabric, it only reads the data in the application executable's header, as this is a point of serialization; it does not read any of the executable code, but directly routes it to the second tier of the hierarchy, the resource allocator and scheduler modules. The resource allocator and scheduler uses a portion of the application header to determine how many resources of each type it would like to allocate to the application and to generate an application-specific communication network between the PEs (see Fig. 9.4). If a resource allocator would like additional resources to allocate to the workload it is managing, it can request that the fabric allocator either: (1) reassign some of the workload to another portion of the fabric managed by a different resource allocator, or (2) get a resource allocator to share some of its PE resources with an application running on the adjacent PE fabric controlled by an adjacent resource allocator (see Fig. 9.5).

Fig. 9.4. An application-specific communication network within the PE fabric dynamically generated by the resource allocator.

The resource allocator & scheduler uses the information in the application header in conjunction with the information on the workload it is currently managing on its portion of the fabric to decide how many and what types of PEs to allocate to the application. Depending on the nature of the workload it is executing, the PE resources may be shared between threads and workloads. For each application, the resource allocator & scheduler will use the on-chip performance-profiling hardware to monitor the workload executing on the PEs in its fabric. This data will be recorded on a per-application basis to enable the allocator to adapt to its workload's needs at runtime. Furthermore, this data will be aggregated at a workload level and fed back to the fabric allocator & extender to help the upper level of the platform ensure a balanced workload across the fabric. Finally, the profiling hardware will be used to detect failures in PEs. The resource allocator will mark these devices as offline (see Fig. 9.6) and ensure that no future applications are scheduled to use them as processing resources.

Fig. 9.5. Resource allocators extending their fabric for application execution by "borrowing" PEs from an adjacent resource allocator.

Fig. 9.6. Dynamically detecting PE failures and removing them from the resource allocator's list of usable processing resources.


The final elements of the processing infrastructure are the PEs. The PEs in the fabric are responsible for executing the tasks for the workload allocated to the resource allocator. The PEs may be heterogeneous or homogeneous with respect to each resource allocator. However, to ensure better locality and faster communication and data throughput, the tasks will be allocated spatially to the PEs and the underlying communication fabric that connects the PEs will be configured specifically for each application (recall Fig. 9.4). This fabric can be reconfigured on a per-application basis and the network and PE resources can be time-shared among applications.

9.5. Conclusions

There are numerous research opportunities that become apparent when trying to envision how to make a sea of processing units work effectively and efficiently. The abstraction of the cloud provides users with a way to use these resources that hides the hardware. However, the question of what that hardware should really look like is an important consideration, as is the design flow required to facilitate their programming and ensure their efficient operation.

While repeatedly stamping out the same resources or simply combining hundreds or thousands of the same chips into a fixed structure (i.e. a homogeneous solution) might be easy to build, it is not efficient (in terms of power and performance) as it cannot adapt. Furthermore, given the cost of building these systems, it is important to ensure that they are able to cope with component failures, etc. and to ensure their longevity and reliability for their users. These types of computing systems will not only have to adapt to their applications but they will also have to be able to self-monitor and self-heal (or at the very least adapt to unit failures so that the system operates efficiently until it is repaired). Addressing these research challenges promises to provide another 40 years of exciting innovation in computing system design.

Acknowledgements

Peter: Your drive, enthusiasm, body of work, and life balance are inspirational. You are a wonderful mentor, edifying even the most confused soul. Your generosity of time and spirit are uplifting and my gratitude for your friendship is immeasurable.

In short, "Thank you Peter, for everything." I am greatly indebted to you for hosting me at Imperial College London and giving me this opportunity to work with so many wonderful people; this time and experience are priceless. I know that my sabbatical here will be a memory that I will cherish forever.

I look forward to being your friend for many years to come and wish you many happy and exciting adventures in the future. All the best on this special day; I hope we, your friends, are able to convey at least some of the joy you bring into our lives. My best always, Lesley.

References

1. The Story of the Intel 4004. [Online] Available at: http://www.intel.com/content/www/us/en/history/museum-story-of-intel-4004.html. [Accessed 23 April 2014].
2. Commodore 64 - 1982. [Online] Available at: http://oldcomputers.net/c64.html. [Accessed 23 April 2014].
3. E. Lee and T. Parks. Dataflow Process Networks, Proc. IEEE, 83(5), 773–799, 1995.
4. G. Kahn. The Semantics of a Simple Language for Parallel Programming, in Proc. IFIP Congress, 1974.
5. E. Kock et al. YAPI: Application Modeling for Signal Processing Systems, in Proc. Design Automation Conference, pp. 402–405, 2000.
6. S. Hauck et al. The Chimaera Reconfigurable Functional Unit, IEEE Transactions on VLSI Systems, 12(2), 206–217, 2004.
7. T.J. Callahan, J.R. Hauser, and J. Wawrzynek. The Garp Architecture and C Compiler, Computer, 33(4), 62–69, 2000.
8. S. Goldstein et al. PipeRench: A Coprocessor for Streaming Multimedia Acceleration, in Proc. International Symposium on Computer Architecture, pp. 28–39, 1999.
9. C. Ebeling, D.C. Cronquist, and P. Franklin. RaPiD: Reconfigurable Pipelined Datapath, in Proc. International Workshop on Field-Programmable Logic and Applications, pp. 126–135, 1996.
10. S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing Shared Resource Contention in Multicore Processors via Scheduling, in Proc. Conference on Architectural Support for Programming Languages and Operating Systems, pp. 129–142, 2010.
11. J. Clark. IT Now 10 Percent of World's Electricity Consumption, Report Finds. [Online] Available at: http://www.theregister.co.uk/2013/08/16/it_electricity_use_worse_than_you_thought/. [Accessed 23 April 2014].


Chapter 10

Compact Code Generation and Throughput Optimization for Coarse-Grained Reconfigurable Arrays

Jürgen Teich, Srinivas Boppu, Frank Hannig and Vahid Lari
Hardware/Software Co-Design, Department of Computer Science,
University of Erlangen-Nuremberg, Germany

We consider the problem of compact code generation for coarse-grained reconfigurable architectures consisting of an array of tightly interconnected processing elements, each having multiple functional units. Switching from one context to another one is realized on a cycle basis by employing concepts similar to those in programmable processors, i.e. a processing element possesses an instruction memory, an instruction decoder, and a program counter and thus can be seen as a simplified VLIW processor. Such architectures support both loop-level parallelism, by having several processing elements arranged in an array, and instruction-level parallelism, by their VLIW nature. They are often limited in instruction memory size to reduce area and power consumption. Hence, generating compact code while achieving high throughput for such a processor array is challenging. Furthermore, software pipelining allows such an array to execute more than one iteration at a time, making the code generation and compaction even more ambitious. In this realm, we present a novel approach for the generation of compact and problem-size-independent code for such VLIW processor arrays on the basis of control flow graph notations. For selected benchmarks, we compare our proposed approach for code generation with the Trimaran compilation infrastructure. As a result, our approach may reduce the code size by up to 64% and may achieve up to 8.4× higher throughput.

10.1. Introduction

In high-performance embedded computing systems, coarse-grained reconfigurable arrays (CGRAs) can be used to speed up parts of an application. As an example, they can serve as loop accelerators for many signal processing algorithms and thus boost the overall performance of the application. Examples of CGRAs include, for instance, PACT XPP [Ref. 1] or ADRES [Ref. 2], both of which are arrays that can switch between multiple contexts by runtime reconfiguration. Other commercial processor array examples that contain small VLIW processors are the picoArray architecture [Ref. 3], which consists of 430 16-bit 3-way VLIW processors, or Adapteva's Epiphany architecture with up to 1,024 dual issue processors [Ref. 4].

Our work starts with the specification of an arbitrarily nested loop program in a high-level language and mapping it onto a CGRA architecture, respecting a set of given resource and throughput constraints. Here, one of the main challenges is to generate the programs (codes) for each of the computation-centric, but instruction-memory limited VLIW processors automatically. The general perception of VLIW processors is that they may suffer from an overhead in code size, e.g., when not always employing all functional units. Therefore, numerous software pipelining techniques such as [Ref. 5] have been proposed, which allow a single processor to execute more than one iteration at a time and thus optimize the resource utilization and consequently also the throughput. As a basis of our work, we consider an extension of a class of CGRAs called WPPA (weakly programmable processor array) as proposed by Kissler and others [Ref. 6]. Here, instead of computing loop bound control in each processor and producing corresponding overheads there, a controller is proposed that generates branch control signals that are issued once and propagated to the individual receiving processors in a pipelined manner. This type and sharing of CGRA loop control thereby provides a unique feature of multi-dimensional zero-overhead loop processing.

We use partitioning techniques [Refs. 7–9] to map a given loop program onto a given fixed size processor array to meet the given resource constraints such as number of processors, number of functional units in each VLIW processor, register and program memory restrictions, etc. [Ref. 10]. Once partitioning is done, the whole array is divided into a set of processor classes, where all processors that belong to the same class execute the same assembly program, only with possibly a delayed control. In the following, we consider the problem of code size compaction subject to a given schedule of loop iterations assigned to a processor. For reasons of scalability, two levels of hierarchy (symmetry) are exploited: first, such a minimization must only be done once for each processor class, as each processor belonging to a class will be configured with the same binary code to execute. The code generation approach is therefore independent in complexity of the number of processors of a CGRA. Moreover, the compaction method proposed is shown to generate programs of length independent of the number of iterations to be executed. For this purpose, a special control flow graph will be introduced whose nodes represent blocks of instruction sequences and the edges represent the control flow between the blocks. In summary, our contribution is an approach for compact code generation for loop programs targeting large scale CGRAs consisting of an array of VLIW processor cores and providing the following unique features:

• Scalability by reduction of the code generation problem to individual processor classes only instead of each processor (CGRA size problem independence).
• Single-core code generation and compaction preserving a given mapping and schedule of loop iterations.
• Generated code is independent in length of (loop) problem size by extraction and exploitation of repetitive instruction sequences called program blocks in an intermediate control flow graph specification (loop specification problem size independence).
• Exploitation of multi-dimensional zero-overhead loop branching.

The rest of the chapter is organized as follows: in Sec. 10.2, our considered class of CGRA architecture is briefly explained. In Sec. 10.3, a few basic definitions, our mapping flow, and the background of the problem of code generation are explained. In Sec. 10.4, our approach of scalable and compact code generation and an algorithm to generate the outlined control flow graph are described. Case studies are presented and discussed in Sec. 10.5. Related work is given in Sec. 10.6, and finally, Sec. 10.7 concludes the chapter.

10.2. Coarse-Grained Reconfigurable Arrays

We consider CGRAs consisting of an array of VLIW processors arranged in a 1-D or 2-D grid with local interconnections as proposed in [Ref. 6], see also Fig. 10.1 [Ref. 11]. The main application domain of such CGRAs are highly compute-intensive loop specifications. Such tightly coupled processor arrays may exploit both loop as well as instruction parallelism while often providing an order of magnitude higher area and power efficiency than standard processors. The architecture is based on a highly parameterizable template, hence offering a high degree of flexibility, in which some of its parameters have to be defined at synthesis-time and some others can be reconfigured at runtime. For example, different types and numbers of functional units (e.g. adders, multipliers, logical units, shifters and data movers) can be instantiated as separate functional units. The processor elements (PEs) are able to operate on two types of signals, i.e. data signals whose width can be defined at synthesis-time and control signals which are normally one-bit signals used to control the flow of execution in the PEs. Therefore, two types of registers are realized inside the PEs: data registers and control registers. These registers consist of four kinds of storage elements: general purpose registers, input and output registers, and feedback registers. The general purpose registers may store local data inside a PE and are named RDx in case of data and RCx in case of control registers. Input and output registers are mainly ports to communicate with neighboring PEs and are named IDx, ODx for data input/output registers, respectively, and ICx, OCx for control input/output registers. Input registers are implemented as shift registers of length l, whose input delay can be configured at runtime from 1 to l. So-called feedback data registers FDx are special rotating registers that are provided for the support of loop-carried data dependencies and cyclic data reuse. Each PE consists of a small instruction memory whose size can be defined at synthesis time. Moreover, a multi-way branch unit is employed in each PE, capable of evaluating multiple branch conditions, based on functional unit flags and control signals, in a single cycle. The branch unit takes one of 2n targets in case of an n-way branch unit.

Fig. 10.1. Considered CGRA architecture template and inner structure of a customizable VLIW PE [Ref. 11], including multiple functional units, a VLIW instruction memory, and a set of registers for local data processing RD, input registers ID, output registers OD, and feedback registers FD for cyclic data reuse. The dark box surrounding each PE is called a wrapper. It allows the flexible configuration of the circuit-switched interconnections to neighbor PEs.


10.2.1. Reconfigurable inter-processor network

In order to provide support for many different interconnection topologies, a structure of multiplexers inside a so-called wrapper unit [Ref. 6] around each PE is provided, which allows the flexible reconfiguration of inter-PE connections. Thereby, many different network topologies, some of them shown in Fig. 10.2, may be realized. As illustrated in Fig. 10.1, the interconnect wrappers themselves are connected in a mesh topology.

Fig. 10.2. Different network topologies, such as (a) a 4-D hypercube, (b) a 2-D torus, and (c) a systolic array, are implemented on a 4 × 4 processor array. In each case, the top figure depicts the logical connections between different nodes in the network, and the bottom figure illustrates the physical connections on the processor array established by configuration of interconnect wrappers.

Thanks to such a circuit-switched interconnect, a fast and reconfigurable communication infrastructure is established among PEs, allowing data produced in a PE to be used by a neighboring PE in the next cycle. To configure any particular interconnect topology, an adjacency matrix is specified for each interconnect wrapper in the array at synthesis-time. Each adjacency matrix defines how the input ports of its corresponding wrapper and the output ports of the encapsulated PE are connected to the wrapper output ports and the PE input ports. If multiple source ports are allowed to drive a single destination port, then a multiplexer with an appropriate number of input signals is generated [Ref. 6]. The select signals for such generated multiplexers are stored in configuration registers and can be changed even dynamically, i.e., different interconnect topologies, also irregular ones, can be established and changed at runtime.
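A software model of this mechanism could look as follows. The data layout is an assumption for illustration only; in the actual architecture this is a synthesis-time hardware structure, not software.

// Sketch of adjacency-matrix-driven interconnect configuration (assumed
// layout): adj[dst][src] == true means source port src is allowed to drive
// destination port dst. Destinations with several allowed sources get a
// multiplexer whose select value sits in a runtime-writable register.
public class WrapperConfigSketch {
    final boolean[][] adj;   // fixed at synthesis time
    final int[] muxSelect;   // runtime-reconfigurable select registers

    WrapperConfigSketch(boolean[][] adj) {
        this.adj = adj;
        this.muxSelect = new int[adj.length];
    }

    // Route destination port dst to one of its allowed sources at runtime.
    void select(int dst, int src) {
        if (!adj[dst][src]) throw new IllegalArgumentException("not wired");
        muxSelect[dst] = src;
    }
}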


Figure 10.2 shows some well-known network topology configurations such as a 4-D hypercube, a 2-D torus and the connections of a systolic array implemented on a 4 × 4 processor array. In each case, the logical connection topology is depicted on the top and the corresponding array interconnect configuration is shown on the bottom of Figs. 10.2(a)–(c). Here, each interconnect wrapper has three connection channels to each neighboring PE. Depending on the desired topology, some of them are activated (shown in different colors) and some are inactive (shown in light gray color). It is worth mentioning that for some connections, wrappers may also be used to bypass a neighbor PE (connections with a length of more than one PE distance). This capability is highly useful, e.g. to implement a 2-D torus. As mentioned above, data may be transmitted on each port connecting two PEs within a single cycle. However, bypass capabilities are usually restricted to pass a maximum of two to four hops. Figure 10.2(c) shows how diagonal connections may be realized on CGRAs. This capability is used especially when optimally mapping (embedding) nested loop programs onto CGRAs, which is the main focus of our work. In the next section, we therefore explain how loop applications may be systematically mapped and scheduled on CGRAs.

10.3. Mapping Applications onto CGRAs

10.3.1. Front end

In this section, we give an overview of the front end (see Fig. 10.3) of our approach for the mapping of loop nests onto massively parallel CGRA architectures. The approach strongly relies on the mathematical foundation of the polyhedron model [Ref. 12]. In Step (1), an algorithm is specified in single assignment code (SAC); hence the parallelism is explicitly given. The SAC is closely related to a set of recurrence equations called piecewise linear algorithms [Refs. 7 and 13] that are defined as follows:


Fig. 10.3. Flow for mapping a loop specification onto a CGRA.

Definition 10.1. (PLA) A piecewise linear algorithm, see, e.g. [Refs. 7 and 9], consists of a set of X quantified equations S1[I], ..., SX[I] where each Si[I] is of the form

xi[Pi I + fi] = ℱi(..., xj[Qj I − dji], ...)  if I ∈ ℐi.    (10.1)

Pi, Qj are constant rational indexing matrices and fi, dji are constant rational vectors of corresponding dimension. I ∈ ℐ is called iteration vector and ℐ ⊆ ℤn is an iteration space called linearly bounded lattice [Ref. 7] defining the set of iteration vectors I of the loop for which a quantified equation Si[I] is defined, respectively needs to be computed. Note that each quantified equation Si[I] may be defined over a different iteration space ℐi through the definition of iteration-dependent conditionals in Eq. (10.1). A PLA is finally called piecewise regular algorithm (PRA) if the matrices Pi and Qj are the identity matrix. If in addition fi is zero, an algorithm is in output normal form. In this case, the dji are also called dependence vectors, which each express a data dependency from variable instance xj[I − dji] to variable instance xi[I].

Finally, in order not to be restricted to loop programs with static data dependencies, runtime-dependent conditionals have been introduced in [Refs. 10, 14]. There, it was shown that a certain class of dynamic data dependencies may be expressed and incorporated into data-dependent functions of the form ifrt(condition, branch1, branch2), extending PLAs to so-called dynamic PLAs or simply DPLAs.

For the textual representation of PLAs, we use the PAULA programming language [Refs. 10, 14] in our compiler for CGRAs. As our running example, we introduce an FIR filter example in the following.

Example 10.1. An FIR (finite impulse response) filter can be described by the simple difference equation

y(i) = Σ j=0..N−1 a(j) · u(i − j)

with 0 ≤ i < T, N denoting the number of filter taps, T denoting the number of samples over time, a(j) the filter coefficients, u(i) the filter inputs, and y(i) the filter results. The difference equation after parallelization, localization of data dependencies [Ref. 15], and embedding of all variables in a common iteration space can be written as a PAULA program with the iteration domain ℐ = {I = (i j)T ∈ ℤ2 | 0 ≤ i ≤ T − 1 ∧ 0 ≤ j ≤ N − 1}.
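The PAULA listing itself is not reproduced here; as an illustration only, the effect of localization can be rendered as the following loop nest, written in plain Java and assuming the standard FIR localization with dependence vectors (1 0)T for a, (1 1)T for u and (0 1)T for y (reads outside the domain are treated as zeros).

// Illustrative rendering of the localized FIR recurrences (not the chapter's
// actual PAULA program): after localization, every right-hand-side read comes
// from a neighboring iteration, so all communication is local.
class FirLocalizedSketch {
    static void fir(int T, int N, double[] a_in, double[] u_in, double[] y_out) {
        double[][] a = new double[T][N], u = new double[T][N];
        double[][] x = new double[T][N], y = new double[T][N];
        for (int i = 0; i < T; i++) {
            for (int j = 0; j < N; j++) {
                a[i][j] = (i == 0) ? a_in[j] : a[i - 1][j];            // d = (1 0)
                u[i][j] = (j == 0) ? u_in[i]
                        : (i == 0 ? 0.0 : u[i - 1][j - 1]);            // d = (1 1)
                x[i][j] = a[i][j] * u[i][j];
                y[i][j] = (j == 0) ? x[i][j] : y[i][j - 1] + x[i][j];  // d = (0 1)
            }
            y_out[i] = y[i][N - 1];                                    // y(i)
        }
    }
}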

In Step (2) of our mapping flow depicted in Fig. 10.3, various high-level source-to-source transformations such as constant and variable propagation, common subexpression elimination, loop perfectization, loop unrolling, expression splitting, dead-code elimination, affine transformations of the iteration space, and strength reduction of operators (usage of shift and add instead of multiply or divide) are performed. Algorithms with non-uniform data dependencies are usually not suitable for mapping onto regular processor arrays as they result in expensive global communication or memory. For that reason, a well-known transformation called localization [Ref. 15] exists, which allows the transformation of affine dependencies into uniform dependencies, i.e. a PLA into a PRA. In our mapping flow, a PRA is internally represented by a reduced dependence graph (RDG).

Definition 10.2. (Reduced dependence graph) For a given n-dimensional PRA, its RDG G = (V, E, D, ℐ) is a network where V is a set of nodes and E ⊆ V × V is a set of edges. Each edge e = (vj, vi) corresponding to a data dependency is annotated with the corresponding dependence vector dji ∈ ℤn. Nodes therefore represent variable names and there is an edge between vj and vi if and only if there exists a quantified equation Si[I] of the form xi[I] = ℱi(..., xj[I − dji], ...) if I ∈ ℐi.

Example 10.2. For the running FIR filter specification introduced in Example 10.1, the corresponding RDG is shown in Fig. 10.5(b). For each variable name, there is a node and for each data dependency, there is a corresponding directed edge. Apart from data dependence vectors (not shown in the figure), the RDG serves as an important intermediate data structure for subsequent code generation and will be decorated with much additional information in the course of compilation.

Next, we briefly explain partitioning, a high-level transformation that is used to map a given loop iteration space onto a fixed size processor array.

10.3.1.1. Partitioning

Loop partitioning is a well-known transformation which covers the iteration space of a computation using congruent hyperplanes, hyper-quaders, or parallelepipeds called tiles. One purpose of partitioning is to map a loop nest of given size onto an array of processors of fixed size and therefore map tiles to processors and schedule these accordingly. However, the transformation has also been studied in detail for compilers with the purpose of acceleration through better cache reuse on sequential processors (known as loop tiling or blocking [Ref. 16]), and for targeting implementations on parallel architectures ranging from supercomputers to multi-DSPs and FPGAs. Well-known partitioning techniques are multi-projection, LSGP (local sequential global parallel, often also referred to as clustering or blocking) and LPGS (local parallel global sequential, also referred to as tiling) [Ref. 8]. Formally, partitioning decomposes a given iteration space ℐ using congruent tiles into the spaces ℐ1 and ℐ2, i.e., ℐ ↦ ℐ1 ⊕ ℐ2. ℐ1 ⊂ ℤn represents the points (iteration vectors) within the tile and ℐ2 ⊂ ℤn accounts for the regular repetition of tiles, i.e. the space of non-empty tile origins.

Example 10.3. For the running FIR filter example, the iteration space is shown in Fig. 10.4 for an N = 32 tap filter. In order to map this filter onto a CGRA with four processors using LSGP partitioning, the iteration space needs to be partitioned into exactly four tiles, each being assigned to one processor. The tiling matrix chosen for partitioning the filter onto a 1 × 4 processor array is the diagonal matrix P = diag(T, 8), i.e. each tile spans all T iterations in the i direction and eight consecutive iterations in the j direction. For instance, it can be verified that the index point I = (10 9)T is uniquely mapped to I1 = (10 1)T, I2 = (0 1)T, where I1 = (i1 j1)T ∈ ℐ1 and I2 = (i2 j2)T ∈ ℐ2 with ℐ1 = {I1 ∈ ℤ2 | 0 ≤ i1 ≤ T − 1 ∧ 0 ≤ j1 < 8}, and ℐ2 = {I2 ∈ ℤ2 | i2 = 0 ∧ 0 ≤ j2 < 4}.

Fig. 10.4. Iterations of an N = 32 tap FIR filter including data dependencies, partitioned to be executed on a linear 4 PE (1 × 4) CGRA using LSGP partitioning. For this purpose, the iteration space is divided into 4 tiles. The iterations of each tile will then be executed in a pipelined way by each PE. Shown also is a classification of iterations of the same type of instructions to be performed at each iteration by colors and named program blocks.


In Step (3) of our mapping flow according to Fig. 10.3, a space–time mapping (allocation and scheduling) of the transformed program needs to be carried out.

10.3.1.2. Space–time mapping

A space–time mapping assigns each iteration point I ∈ ℐ to a processor (PE) index p ∈ 𝒫 (allocation) and a time index t ∈ 𝒯 (scheduling); both may be obtained by the following transformation

p = Q · I,  t = λ · I,    (10.2)

where I ∈ ℐ, Q ∈ ℤ(n−1)×n and λ ∈ ℤ1×n. 𝒯 ⊂ ℤ is a set called time space that contains the set of start times t of all the iterations I ∈ ℐ. 𝒫 ⊂ ℤ2 is a set called processor space and the vector p a processor index.

In the case of LSGP partitioning, all computations of iterations within a tile are executed in a pipelined way by the one processor assigned to the tile, and multiple tiles may be executed in parallel as well by the different processors. The corresponding space–time mapping is given as follows:

p = I2,  t = λ1 · I1 + λ2 · I2  with λ = (λ1 λ2).    (10.3)

As can be seen, the processor space has the cardinality of ℐ2, the number of tiles to cover all iterations. Moreover, the so-called schedule vector λ = (λ1 λ2) has to be determined such that intra-tile as well as inter-tile data dependencies are satisfied.

Based on the above definitions, we may finally determine the start time t(vi, I) of each operation (= computation of the left-hand side variable xi[I]) within the given loop body of iteration I as t(vi, I) = λ · I + τ(vi), where τ(vi) denotes an individual offset of the computation of the variable corresponding to vi with respect to the beginning of variable computations belonging to loop iteration I.

10.3.1.3. Scheduling, resource allocation and binding

The purpose of scheduling is to determine a schedule vector λ according to Eq. (10.2) such that the global latency GL (total execution time) of a loop specification is minimized. As has been said, scheduling may also determine a relative start time τ(vi) for each node vi in the RDG corresponding to variable xi in the program. The execution latency required for computing all variables xi defined at an iteration I is called local latency LL. However, the time between starting the computation of two successive iterations assigned to a processor may be typically much smaller and is called iteration interval II. An iteration interval II < LL therefore implies the (software) pipelining of iterations.

The purpose of resource binding is to assign to each node of the RDG a functional unit (resource) which computes the corresponding variable by evaluating the expression on the right-hand side of the corresponding loop specification. The problem of optimally scheduling a loop nest under constraints of limited physical resources (PEs as well as functional units within each PE) is an instance of a resource-constrained scheduling and binding problem. It is solved here using a mixed integer linear programming (MILP) formulation [Refs. 13 and 10]. Note that this approach obtains a globally latency-optimal schedule for the entire loop nest instead of traditional modulo scheduling techniques [Ref. 5] which may schedule only the innermost loop. Additionally, besides the number of considered PEs and functional units within them, our approach also encompasses register and interconnect constraints. By solving the MILP using a state-of-the-art LP solver such as CPLEX [Ref. 17], both a schedule vector λ and the offset τ(vi) of each node vi are obtained, given an iteration interval II.

Example 10.4. For the running FIR filter, a feasible space–time mapping of the LSGP-partitioned iteration space is defined by Eq. (10.4):

p = j2,  t = (8 1) · I1 + (0 8) · I2.    (10.4)

An abstract view of the obtained processor array architecture for computing the N = 32 filter taps is shown in Fig. 10.5(a). The resulting architecture is a 1 × 4 processor array and the corresponding processing elements are denoted PE(0), PE(1), PE(2) and PE(3), respectively. The arrows between the PEs in Fig. 10.5(a) show required communications of computed variable results.


Fig. 10.5. In (a), abstract view of a 4 PE (1 × 4) processor array showing also the interconnection configuration between the shown PEs. In (b), RDG of the FIR filter specification introduced in Example 10.1. In (c), a possible register allocation for the first four consecutive loop iterations is shown using four registers.

In Fig. 10.5(b), a simplified version of the corresponding RDG is shown. After scheduling, the RDG is annotated with the local schedule (scheduling offsets) of nodes, denoted as τ (see Fig. 10.5(b)). From both schedule vector λ and offsets τ, the exact execution time of any operation (statement in the program) for each iteration becomes known. For instance, variable x at loop iteration I1 = (7 7)T (I2 = (0 0)T) is computed by processor PE(0) at time λ1 · I1 + λ2 · I2 + τ(x), i.e. at (8 1) · (7 7)T + (0 8) · (0 0)T + 2 = 65 cycles. Moreover, for the FIR filter as scheduled above, an iteration interval (II: the time in clock cycles between the evaluation of two successive iterations) of II = 1 cycle is assumed, and a local latency (LL: the time in clock cycles required to execute one iteration) of four cycles is obtained on the basis of a given VLIW processor specification. The optimized total time to execute the FIR filter (the global latency GL) then evaluates to GL = 8(T − 1) + 7 + 24 + 4 = 8T + 27 cycles. It is determined by the interval between the start of the first iteration of PE(0) and the last iteration being finished by PE(3), which has the coordinates I1 = ((T − 1) 7)T.

During the solution of the resource-constrained scheduling problem, only the number of available functional units and registers is considered (allocation), whereas the binding is performed in a subsequent step. Since our considered CGRA architecture supports single cycle operations and thanks to functional pipelining, a new operation can start in each cycle in each functional unit. Moreover, the binding of scheduled operations to functional units becomes pretty simple. In contrast, the register binding is more difficult because variables may have lifetimes of multiple cycles.

10.3.1.4. Register binding

Register binding is the process of assigning each program variable a register of the processor without violating any lifetime constraints. For each variable x of the given loop program, a lifetime interval LTI (the time interval beginning at the production of the variable, LTIlb, and ending at the time LTIub after which no more operations consume it) and the lifetime LT = LTIub − LTIlb can be annotated in the RDG (see Fig. 10.5(b)). This information is necessary for the allocation of registers.

Example 10.5. For instance, let variable a of our scheduled FIR filter specification have a lifetime of LT(a) = 1. Hence, even for the minimal iteration interval of II = 1, the same register RD0 is used in each iteration, as depicted in the Gantt chart in Fig. 10.5(c).

In our approach, register constraints within a processor may be considered as well. Here, our assumption is that enough registers are available to support a desired user-specified iteration interval II. Otherwise, scheduling will fail, since register spilling, which is available in the case of standard processors, is not supported. Assigning proper input (ID), output (OD), general purpose (RD) and feedback data (FD) registers to variables is not a trivial task but can be achieved automatically based on the space–time mapping, dependence analysis, and a modified left edge algorithm [Ref. 18]. For our running FIR filter specification introduced in Example 10.1, register binding is explained qualitatively below.
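For illustration, the core of the classic (unmodified) left edge algorithm can be sketched in a few lines of Python. The function name and the half-open lifetime interval convention are our assumptions; the modified variant of [Ref. 18] additionally distinguishes the register types ID, OD, RD and FD.

    # A minimal sketch of the textbook left edge algorithm on lifetime intervals.
    def left_edge(intervals):
        # intervals: list of (lb, ub) lifetimes, one per variable (ub exclusive).
        # Returns a register index per variable such that no two variables with
        # overlapping lifetimes share a register.
        order = sorted(range(len(intervals)), key=lambda v: intervals[v][0])
        reg_of, reg_end = {}, []   # reg_end[r]: cycle at which register r frees up
        for v in order:
            lb, ub = intervals[v]
            # reuse the first register whose previous occupant has expired
            r = next((r for r, end in enumerate(reg_end) if end <= lb), None)
            if r is None:
                r = len(reg_end)
                reg_end.append(ub)
            else:
                reg_end[r] = ub
            reg_of[v] = r
        return reg_of

    # Three variables with lifetimes [0,2), [1,3), [2,4) fit into two registers:
    print(left_edge([(0, 2), (1, 3), (2, 4)]))  # {0: 0, 1: 1, 2: 0}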

Example 10.6. (FIR filter continued) The input variables a_in and u_in in the FIR filter program are assigned to input registers ID0 and ID1, respectively. Variable a is used to store the input a_in (filter coefficient) for reuse. However, it is assigned not only to one internal processor register RD0, but also to a feedback register FD0, for the following reasons:

• RD0: Every cycle, a new iteration is started according to the iteration interval II = 1. This implies that a new instance of all RDG operations will start executing in each cycle. In the first eight iterations, and hence cycles, a new coefficient is read from the input ID0 (variable a_in). This is copied (assigned) to variable a. Here, RD0 is used to store variable a for PE-internal use within an iteration. To move the data from the input, functional unit DPU0e is used.

• FD0: Feedback data registers are used to store variables that are cyclically reused. They are used to implement loop-carried dependencies (dependencies spanning multiple iterations). For example, when processor PE(0) is computing iteration (i1 j1)T = (1 0)T, the value of a is the same as at iteration (i1 j1)T = (0 0)T (due to localization for efficient data reuse); hence, when computing a at iteration (i1 j1)T = (0 0)T, it has to be stored in a feedback data register so that it may be used again at the cycle where (i1 j1)T = (1 0)T is computed within the same processor. The length (adjustable delay in clock cycles) of the feedback data register may be determined easily by multiplying the intra-tile schedule vector λ1 and the dependency vector d associated with variable a. In our example, λ1 · d = (8 1) · (1 0)T = 8 (cf. Fig. 10.4). That is, eight coefficients of the FIR filter are read once from an input, and are then reused cyclically by reading from feedback register FD0.

Variable u is used to propagate the filter input u_in. It is assigned to the registers RD1 and FD1 for similar reasons as explained for variable a. However, u must even be assigned to a third register OD0 (output register).

• OD0: Due to partitioning, dependencies are split into local communication and data to be propagated to neighbor processors. For instance, PE(1) must receive the filter inputs u from PE(0) (see Fig. 10.4). Hence, in order to support this communication, variable u is also assigned to an output register OD0 that is physically connected to the input register ID1 of the right neighbor processor; see also Fig. 10.5(a).

Variable x is assigned to register RD2 only, as it is used purely internally and only during the computation of an iteration. Finally, variable y is used to compute a partially accumulated sum of the filter output y_out that is output at the rightmost processor (PE(3) in our example). y is assigned to register RD3. As this partial filter output is also communicated to the right neighbor processor, y is also assigned to register OD1.


Eventually, the functional unit and register binding is also annotated to the RDG (shown in Fig. 10.5(b)). From this tagged RDG, one can finally select the instruction for each node by looking at the incoming source nodes, i.e., the assigned registers, and the functional and register binding for that node. For instance, the instruction for node x in Fig. 10.5(b) would be MUL0 MUL RD2 RD1 RD0, in which MUL0 is the used functional unit and MUL is the mnemonic. RD2 is the destination operand, RD1 is source operand 1, and RD0 is source operand 2.

10.3.2. Backend

In this section, we give an overview of our backend mapping flow as depicted in Fig. 10.3. The backend generates (a) the configuration code for the interconnection of processors and (b) the assembly code (programs) for the processors, as explained below.

10.3.2.1. Interconnect configuration

In order to synthesize the configuration of the correct processor interconnection structure, the data dependencies of the PRA have to be analyzed. For this purpose, we introduce the terms processor displacement and time displacement by considering a PRA equation with a data dependency xj[I − dji] → xi[I] together with the space–time mapping given by Eq. (10.2).

The synthesis of a processor interconnection for the data dependency dji from xj[I − dji] → xi[I] is done by first determining the processor displacement, i.e., the difference between the processor coordinates assigned by the space–time mapping to the target iteration I and to the source iteration I − dji.

For each pair of processors pt and ps related in this way, with ps and pt in the processor array, a corresponding connection has to be configured from the source processor ps to the target processor pt. The time displacement denotes the number of time steps the value of variable xj[I − dji] must be stored before it is used to compute variable xi[I]. Let vi and vj denote the corresponding RDG nodes for xi and xj. Then, given a schedule λ, the time displacement may be computed as

λ · dji + τ(vi) − τ(vj) − wj,

where wj denotes the computation time for computing variable xj. In the processor array, the time displacement corresponds to the number of delay registers on the respective input port of the target processor pt.
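To make the bookkeeping concrete, the following small Python sketch evaluates the time displacement; the function name and the example numbers are illustrative, and the formula is the reconstruction given above rather than the chapter's exact equation.

    # Time displacement of a dependency d_ji under schedule lambda:
    # cycles the value x_j[I - d_ji] must be delayed before x_i[I] consumes it.
    def time_displacement(lam, d_ji, tau_i, tau_j, w_j):
        return sum(l * d for l, d in zip(lam, d_ji)) + tau_i - tau_j - w_j

    # Illustrative values: lambda = (8, 1, 0, 8), d_ji = (0, 1, 0, 0),
    # tau(v_i) = 2, tau(v_j) = 1, w_j = 1 -> one delay register.
    print(time_displacement((8, 1, 0, 8), (0, 1, 0, 0), 2, 1, 1))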

10.3.2.2. Scalable compact code generation

Obviously, code generation involves the generation of binary codes, one for each processor of the CGRA architecture, to be loaded and then executed in sync with the others. However, many processors will receive exactly the same program, since (a) the given schedule is global, including the iteration interval II and the local latency LL; hence, code optimization may not be performed independently for each set of iterations assigned to an individual processor but must obey a global schedule of iterations. Moreover, (b) often there are no iteration-dependent conditionals specified in the loop body. As a result, the computations to be performed are not only iteration-independent but often not even different from processor to processor.

Unfortunately, our loops (PLA specifications) allow for iteration-dependent definitions of computations. Therefore, the number of different codes to generate may be more than one; we will nevertheless show that, through the definition of so-called processor classes, we may distinguish the minimal set of different cases that require individual assembly codes to be generated. Informally, a so-called processor class is given by a unique combination of iteration-dependent conditions of loop variables to be computed. The classes are given implicitly by the possible intersections of iteration spaces of variables in the given loop nest.

Example 10.7. For the FIR filter shown in Fig. 10.4, three different processor classes may be distinguished. The leftmost processor PE(0) distinguishes itself from the others by reading the filter inputs u_in from the left border of the array. Similarly, the rightmost processor PE(3) is the only one to compute equation S9 (variable y_out) in addition to S8 (variable y). The processors assigned to the tiles in between all have the same set of statements to compute. Hence, we have to generate only three types of codes, one for each processor class. Note that this number is independent of the problem size N of the filter specification.
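As a toy illustration (not part of the tool flow), the class of a PE in the 1 × 4 FIR array can be read off directly from its position; the names below are ours.

    # Processor classes of the FIR example: the class follows from which
    # iteration-dependent conditions a PE's tile intersects (illustrative names).
    def processor_class(pe, pe_num=4):
        if pe == 0:
            return "left"    # reads u_in from the array border
        if pe == pe_num - 1:
            return "right"   # additionally computes S9 (variable y_out)
        return "middle"      # same statements for all inner tiles

    print([processor_class(p) for p in range(4)])
    # ['left', 'middle', 'middle', 'right'] -> three codes, independent of N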

We therefore restrict ourselves next to compact code generation for each processor class individually. Here, based on a given scheduled loop program, flat yet lengthy code could be generated automatically by synthesizing assembly code iteration after iteration. Even when allowing the overlap of iteration executions in case II < LL (software pipelining), the code size would explode easily and require LL + det(P) · II − 1 VLIW words, making the generated code dependent on the problem size, i.e., on the iteration space size. Indeed, the code generated for one processor might not easily meet the stringent constraints of the tiny instruction memories inside each PE of a CGRA.

Moreover, in the case of our FIR filter specification, imagine that the specified number of samples T might not even be finite. In that case, the technique above cannot be used at all, as det(P) = T × N/PEnum also becomes infinite, with PEnum denoting the number of processors.

In the following, we therefore address the problem of code compaction by (a) identifying and grouping together code sequences that are executed repeatedly, through the notion of so-called program blocks (PB), and (b) generating as compact as possible code for a considered processor class by looping between these unique blocks at provably zero overhead (no additional cycles), so as to preserve the given schedule λ.

Definition 10.3. (Program block) A program block is a unique sequence of assembly instructions whose length amounts exactly to the local latency LL. Sets of iterations characterized by the same sequence of instructions (same program block) may be described by an iteration space called the program block iteration space ℐPBi. Each iteration within this space requires the execution of the same instruction sequence. All program block iteration spaces ℐPBi partition the tile iteration space assigned to one processor: ℐ1 = ℐPB0 ∪ ℐPB1 ∪ ... ∪ ℐPBm−1 and ℐPBi ∩ ℐPBj = ∅ for all 0 ≤ i, j ≤ m − 1, i ≠ j.

We explain the notion of program blocks for processor class 1 (PE(1) and PE(2)) with the following example. A similar explanation is valid for the other two processor classes.

Example 10.8. PE(1) of the CGRA shown in Fig. 10.5(a) has to execute six different program blocks as shown in Fig. 10.6(b), differentiated by different rectangles and denoted PB0, PB1, PB2, PB3, PB4, and PB5. For each program block, its RDG, its program block iteration space ℐPBi, and its LL = 4 cycle VLIW assembly code can be seen in Fig. 10.7.


Fig. 10.6. (a) RDG for the processor class of the middle processors (PE(1), PE(2)) of the FIR filter. In (b), partition of the iterations of the tile specific for this processor class into program blocks. The numbers annotated to the iterations show the start time of each iteration based on a given schedule λ1. From the schedule, the program block control flow graph shown in (c) may be extracted. With one node per program block, the edges indicate the traversal of control flow based on a scanning of the tile iteration space as prescribed by the schedule. The edge weights indicate how often the corresponding transition between two program blocks is taken.

One notable difference between PB0 and PB1 is that although y in both program blocks is defined by equation S8 (y = x + y), in PB0 one of the inputs is ID2, whereas the same input in PB1 comes from the internal register RD3. The assembly-level code for computing y would be ADD RD3 RD2 ID2 in the case of PB0 and ADD RD3 RD2 RD3 in the case of PB1, respectively. Moreover, in PB0, u_in is accessed from an input port, but in PB1, u_in is always zero (constant); i.e., they differ in the instruction where u_in is read.

The difference between PB1 and PB2 is that in PB1, the partial sum (y) is written to the internal data register of the processor (ADD RD3 RD2 RD3). In PB2, the partial sum is written to the output register of the processor (ADD OD1 RD2 RD3) for transfer to the right neighbor processor. Hence, they differ in at least one instruction.


Fig. 10.7. Program blocks of PE(1) of the FIR filter shown in Fig. 10.5(a). (a) RDG of PB0, program block iteration space, and naive code. Since the local latency LL = 4, each program block consists of 4 VLIW instructions, denoted as 0, 1, 2 and 3, which may contain NOPs as well. Similarly, the respective RDGs, iteration spaces and codes for program blocks PB1 in (b), PB2 in (c), PB3 in (d), PB4 in (e), and PB5 in (f) are shown. Different unique VLIW instructions from all program blocks are labeled A, B, C, D, ... to differentiate them easily from one another.

For variable a in PB3, its value has to be read from a feedback data register instead of an input port (as in PB0). In PB4, both the u and a values have to be read from feedback shift registers, as there is no direct access to inputs. Moreover, in PB5, variable y has to be written to an output port, in contrast to PB3 and PB4; thus, PB3, PB4, and PB5 differ in instructions from each other and from PB0, PB1, and PB2, respectively (see Fig. 10.7). As a result, the iteration space is partitioned into a total of six different segments of code.

Putting it all together, given a nested loop specification, its iteration space gets partitioned during partitioning into as many tiles as there are available CGRA processors. For each identified processor class, a code compaction problem has to be solved that preserves a given schedule of iterations as well as the functional unit and register assignment. Each tile belonging to a processor class itself consists of different program blocks that describe unique sequences of instructions to execute one iteration of the partitioned loop program. In the following, we describe the process of code compaction based on a characterization of program blocks.

10.4. Compact Code Generation

Obviously, the execution of iterations in a tile assigned to one processor shall follow a given schedule. Note that in the typical case where the local latency LL of a program block is longer than the iteration interval II, the corresponding iteration executions do overlap (software pipelining). Given this understanding, we explain in the following how to generate the most compact code for such a given schedule of iterations: rather than generating flat code for each iteration, we exploit repetitive sequences of program block executions by introducing looping between program blocks, under the assumption that such branches between program block codes do not create any additional execution cycles (zero-overhead looping).

Since each iteration uniquely belongs to one program block, the program execution according to a given schedule can also be represented visually as a traversal of a program block graph in which nodes correspond to program blocks.f Based on this observation, the procedure for generating compact code mainly involves the following three steps:

(1) Generation of a program block control flow graph which represents the execution order of successive program block executions.

(2) Minimal unfolding of cycles in this graph, which may be required to reflect restrictions caused by register binding and variable lifetimes.

(3) Generation of assembly code once for each block, including the insertion of proper branching subwords at block boundaries to guarantee a correct schedule of program block executions. These have to be scheduled inside a processor's branch unit at zero time overhead so as to preserve the throughput and latency of the given schedule.

In a program block control flow graph (GCF, see Fig. 10.6(c)), each node corresponds to a program block. The number annotated to each edge indicates how many times the execution is transferred from the predecessor to the successor program block in the overall traversal of the iteration space assigned to the respective processor.


10.4.1. Generation of program block control flow graph

In this section, we present an algorithm to generate the program block control flow graph GCF, on whose basis CGRA code is emitted. As each processor must execute the iterations assigned to it in a scanning order imposed by the given intra-tile schedule λ1, we introduce the notion of a path stride matrix to simplify the computation of GCF as follows.

Definition 10.4. (Path stride matrix) The path strides of an n-dimensional parallelotope may be represented by a matrix S ∈ ℤn×n consisting of n (column) vectors s1, ..., sn and given as S = (s1 · · · sn). The path stride matrix defines the scanning (execution) order of iterations within a tile.

Example 10.9. For the LSGP-partitioned FIR filter shown in Fig. 10.4, the path stride matrix respecting the intra-tile schedule given by λ1 = (8 1) consists of the strides s1 = (0 1)T and s2 = (1 −7)T. Each tile is processed by a processor in the order imposed by S. For PE(1) in our example, according to Fig. 10.6(b), the iteration space is executed starting at (i1 j1)T = (0 0)T to (i1 j1)T = (0 7)T, scanning in the direction of s1 until the tile end is reached. Then, adding s2, the next iteration scanned is (i1 j1)T = (1 0)T, and it is executed. Then again, the scanning in the direction of s1 continues, and so on.

In the following, we assume that the schedule λ1 has been determined so as to minimize the global latency GL for a given throughput (iteration interval II) and that the corresponding stride matrix S reflecting this schedule has been constructed accordingly.


We now propose Algorithm 10.1, which constructs the program block control flow graph for a given processor class according to a given schedule λ1, respectively its corresponding stride matrix S. The graph contains as many nodes as there are program blocks. In order to determine which transitions occur between pairs of program blocks, we need to find out whether it is possible, and how often, for the scanning of the tile space to involve a transition from one block to another. This is achieved in Algorithm 10.1 by checking for each program block PBi whether the space obtained by adding a column vector sk of the stride matrix S to its iteration space ℐPBi has a nonzero intersection with the iteration space of other program blocks, including its own. If the intersection is empty, there is no transition between the program blocks. Else, there is a transition between them, and an edge is inserted between the corresponding two program block nodes. Moreover, by counting the number of points in each intersection, the corresponding edge weights, denoting the number of times the execution is transferred from one program block to the other, may also be determined and annotated to the edges of the graph GCF.
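As an illustration of this construction, the following Python sketch derives the GCF of PE(1) by explicit enumeration of a small tile (T = 8) instead of by the intersections of iteration spaces used in Algorithm 10.1; the helper names and the encoding of the program block shapes of Fig. 10.6(b) are ours.

    from itertools import product

    def pb_of(p):
        # Program blocks of PE(1) following Fig. 10.6(b): the first tile row
        # (i1 = 0) holds PB0, PB1 x 6, PB2; the remaining rows PB3, PB4 x 6, PB5.
        i, j = p
        first, mid, last = ("PB0", "PB1", "PB2") if i == 0 else ("PB3", "PB4", "PB5")
        return first if j == 0 else (last if j == 7 else mid)

    def successor(p, strides, tile):
        # Next iteration in the scanning order imposed by S: try the column
        # vectors s1, s2, ... in order until the result stays inside the tile.
        for s in strides:
            q = tuple(pi + si for pi, si in zip(p, s))
            if q in tile:
                return q
        return None  # p was the last iteration of the tile

    def build_gcf(tile, strides):
        # Edge weights of GCF: how often execution moves between program blocks.
        edges = {}
        for p in tile:
            q = successor(p, strides, tile)
            if q is not None:
                key = (pb_of(p), pb_of(q))
                edges[key] = edges.get(key, 0) + 1
        return edges

    tile = set(product(range(8), range(8)))    # 8 x 8 tile of PE(1) for T = 8
    print(build_gcf(tile, [(0, 1), (1, -7)]))  # s1 = (0 1)^T, s2 = (1 -7)^T

The printed edge weights reproduce the traversal of Example 10.10, e.g., five PB1 → PB1 self-transitions for the six consecutive executions of PB1.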

Example 10.10. For the FIR filter shown in Fig. 10.4, processor PE(1) is shown in Fig. 10.6(b) and its GCF in Fig. 10.6(c). Here, the execution enters PB0 and executes the corresponding iteration, then enters PB1 and executes the code of this block six times consecutively. Then, the control flow enters PB2, executing one iteration. Afterwards, the execution enters PB3, again executing one iteration, and then executes PB4 six times consecutively before entering and executing the one iteration belonging to PB5. Finally, the execution follows the sequence PB3 → 6 × PB4 → PB5 again and again.

10.4.2. Graph transformation and final code emission

Before the emission of final code, cycles in the control flow graph GCF might first need to be partially unfolded in order to satisfy given register binding constraints [Ref. 19]. More specifically, instructions from different consecutively executed iterations belonging to a program block (node in GCF) are arranged contiguously according to the schedule in order to form so-called overlapped codes. Afterwards, these overlapped codes of a program block have to be combined with those of all direct successor nodes in the GCF. Although we are using the term overlapped code, note that only in the last step is 'real' code emitted.


Fig. 10.8. Overlapped codes of program blocks (a) PB0OC, (b) PB1OC, (c) PB2OC, (d) PB3OC, (e) PB4OC, (f) PB5OC for the partitioned and scheduled FIR filter example and for the processor class of middle processors (PE(1), PE(2)). Note that in (a) there is only one iteration in this program block; thus the overlapped code for this program block is the same as the non-overlapped code. (g) Optimized overlapped code of PB1, PB1OC. (h) Optimized overlapped code of PB4, PB4OC. (i) Resulting overlapped codes from step two, i.e., combining the overlapped codes from different program blocks. (j) Final optimized code consisting of 12 VLIW instruction words. The arcs annotated to the code sequence denote runtime-selected branch targets.

Example 10.11. An example of the overlapped codes of the program blocks is shown in Fig. 10.8. The overlapped code of PB0, denoted PB0OC, is shown in Fig. 10.8(a). There is only one iteration that gets executed when the execution enters PB0; hence there is no possibility for a contiguous arrangement of instructions from different iterations. For PB1, in contrast, the overlapped code (PB1OC) is shown in Fig. 10.8(b), in which instructions from six different iterations are arranged contiguously, since six iterations are executed consecutively whenever the execution enters PB1. Similarly, the overlapped codes of the other program blocks PB2, PB3, PB4, and PB5, denoted PB2OC, PB3OC, PB4OC, and PB5OC, are shown in Figs. 10.8(c), (d), (e), and (f).

A closer look at the overlapped code of PB1, PB1OC, shown in Fig. 10.8(b) reveals that there is redundancy in the instructions. In particular, instructions (4) and (5) are the same as instruction (3). The overlapped code for PB1 can therefore be further optimized as shown in Fig. 10.8(g). In this figure, a branch (jump) instruction denoted by an arc is introduced from instruction (3) to (3). It is the responsibility of the controller to generate appropriate control signals so that this branching happens at the appropriate iteration. However, introducing such a branch instruction does not affect the other data flow operations and preserves the given schedule. Afterwards, these optimized overlapped codes of program blocks are arranged contiguously again with respect to the given schedule and GCF, as shown in Fig. 10.8(i).

For the code in Fig. 10.8(i), there is still a further possibility to reduce the code size, by folding the epilogue instructions [Ref. 20] into the prologue instructions as shown in Fig. 10.8(j). This adds an extra branch/jump instruction from instruction 11 to instruction 0. In the case of the running FIR filter example, our approach finally emits a final optimized code (FOC) of only 12 VLIW instructions, as shown in Fig. 10.8(j), whereas flat and lengthy code would require 4 + (T × N/PEnum) − 1 VLIW instructions, which would evaluate to 67 already for just T = 8 samples (N = 32 filter taps, PEnum = 4 processors). It can easily be verified that the code size will remain constant and independent of the size of the iteration space as long as the schedule and allocation remain the same. Therefore, this code is as compact as possible and at the same time iteration-space-independent. For instance, either increasing the number of filter taps N or the number of input samples T will only change the configuration parameters of the controller, but not the length of the assembly codes of the processors.

10.4.3. Controller

In the considered CGRA architecture, the reconfigurable controller (see Fig. 10.1) plays an important role: the entire static loop control flow (i.e., the branching between program blocks) of the given nested loop program is captured by this controller. Its purpose is to scan the entire iteration space according to a given schedule and to issue a set of control signals to the processors each cycle, encoding the information when to jump and to which program block segment to jump next. This basically determines the static control flow within the programs of each PE. The control signal values thereby depend on the current iteration I and the program block it belongs to, as shown in the following Algorithm 10.2. According to this generic algorithm, the number of control signals needed is equal to the number m of program blocks (one-hot coding, with one binary control signal issued per program block). In the branching unit of each PE, these control signals are received and decoded in parallel. A decision is taken in the same cycle whether and where to jump, i.e., which program block to execute next.

In the hardware implementation of the control algorithm, registers are first allocated and initialized with the iteration variables and their bounds. Depending on the schedule and iteration interval, the appropriate iteration variables need to be incremented or decremented in each cycle so that the current iteration vector I and the program block to which it belongs are known and the proper control signals determined. Note that the hardware overhead of such a programmable controller unit occurs only once. However, the implementation imposes a single-cycle computation and emission requirement on the bundle of control signals denoted IC0,1,...,m−1.

In the future, we will show that through optimization, and when using a more efficient encoding of the binary control signals, the number of control signals may be greatly reduced to the binary logarithm of the maximum number of outgoing edges of a program block in the program block control flow graph; see, e.g., Fig. 10.6(c). This approach will also reduce the hardware cost of the controller.


Example 10.12. In continuation of our FIR filter example, consider instruction (10) in Fig. 10.8(j). When this instruction is executed, there are two possibilities for the next instruction to be executed. The execution must either jump to instruction (10) again, if the next iteration vector I belongs to the iterations of program block PB4 again (1 ≤ i1 ≤ T − 1, 1 ≤ j1 ≤ 6); otherwise, program block PB5 needs to be executed (1 ≤ i1 ≤ T − 1, j1 = 7). Obviously, the control decision which way to branch depends only on the separating hyperplane j1 = 6, resulting in a small segment of control code that is executed in hardware, producing one set of control signals every II cycles.

ICx is a binary control signal which is decoded by the branch unit as part of the execution of instruction (10) as follows: IF ICx, JMP 11, JMP 10. If ICx = 1, the execution will proceed with the next instruction (11); else it will repeat instruction (10) again.
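A software model of this branch decision (the actual controller is a hardware unit; the names below are illustrative) might look as follows.

    # One-hot control signal for the PB4 -> PB5 transition of Example 10.12:
    # the decision depends only on the separating hyperplane j1 = 6.
    def ic_x(j1):
        return 1 if j1 == 6 else 0

    # While PB4 executes for j1 = 1..6, the branch unit evaluates
    # "IF ICx, JMP 11, JMP 10" once per iteration:
    for j1 in range(1, 7):
        print(j1, "-> instruction", 11 if ic_x(j1) else 10)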

10.5. Results and Discussion

Our approach for compact code generation has been integrated into an existing loop compiler [Ref. 14], written in C++. As a CGRA platform, a processor array of VLIW cores with reconfigurable interconnect is selected and configured. In a first set of evaluations, the scalability of our proposed code generation technique for different problem sizes and algorithms is studied; see Table 10.1.

In the first set of experiments, the FIR filter latency is explored for different numbers N of filter taps (problem size), assuming a constant number of processors PEnum (1-D processor array of size 1 × PEnum). Here, it can be seen that if the problem size N increases, the latency increases but the code size remains the same. The primary reason for this behavior is that if the problem size increases, the number of program blocks remains the same; however, program blocks may need to be repeated more often, which is orchestrated by the controller. In other words, the number of iterations that are assigned to and executed by a processor increases, but not at the expense of an increased code size. Second, it can also be seen how a tradeoff in global latency GL may be achieved depending on the number of processors PEnum available in the CGRA. The same experiment is carried out for a matrix multiplication example mapped to 2-D array configurations of different problem and array sizes.

Table 10.1. Scalability and code size independence for different applications for variable (a) problem size and (b) CGRA processor number PEnum.

In the second experiment, we consider again the FIR filter application, a matrix multiplication (MM), and in addition a sum of absolute differences (SAD) application stemming from H.264 video coding. The generated codes were simulated on a cycle-accurate simulator. In the following experiment, our code generation approach is tested and compared, with respect to achievable throughput, code size and overheads, against single-processor targets such as the HPL-PD architecture and the Trimaran compilation and simulation infrastructure [Ref. 21].g

In Trimaran, modulo scheduling [Ref. 5] as well as rotating register files were used. Since our CGRA architecture is a streaming architecture, special address generators (see Fig. 10.1), which are responsible for fetching the correct data from the input buffers and writing the results back to the output buffers, are needed at the borders of the array. In the case of Trimaran, only load/store units are needed. In order to match our test environment with Trimaran, we have modified the HPL-PD architecture description in the following way: we specified three memory units, as all the applications that we considered need two inputs and one output. Also, to make a fair comparison, we have specified three extra adders (integer units) in Trimaran for address generation and for loop counter incrementation. Finally, all instructions are assumed to be single-cycle operations in the HPL-PD architecture as well as in our architecture. With these modifications, the Trimaran environment matches well with our test environment.

Table 10.2. Single-processor comparison of the proposed approach with Trimaran [Ref. 11]. An N = 32 tap FIR filter and a matrix-matrix multiplication (MM) are analyzed for two matrices of size 16 × 16. Finally, a SAD application is analyzed and compared based on a block size of 16 × 16, with respect to the achieved average iteration interval, code length and overhead.

In Table 10.2, (a) the average iteration intervalh (IIavg), (b) the number of VLIW instructions, and (c) the overheadsi for different applications of our approach are compared against Trimaran. Our experimental results reveal the following facts:

(1) The average iteration interval IIavg is reduced in our approach by up to 88% over Trimaran, which means 8.4× higher throughput. This is mainly due to the controller, which generates all branching control signals for loop bound control without any timing overhead. The same configurable controller generates the control signals for all processor classes; therefore, the cost of this controller is incurred only once. The extra hardware cost of the controller is almost compensated by the extra comparators (for loop bound checking) and adders (for loop index incrementation) that are needed in Trimaran. The global latency is always lower in our approach, as we are able to achieve a better IIavg.

(2) The number of VLIW instructions is greatly reduced. This is mainly due to the separation of the control flow from the data flow in our considered architecture.


(3) Table 10.2 shows the overheads in Trimaran for the different applications. In our approach, the overhead is always zero, whereas in Trimaran the total execution time may increase by up to 26.36% for the set of considered examples.

Despite the small hardware overhead of the controller and address generators, our approach is promising, since it scales very well with the size of the iteration space of a loop program, and the number of instructions needed for a given arbitrarily nested loop application is always lower compared with Trimaran. In summary, we are able to achieve a much smaller average iteration interval IIavg (higher throughput); see Fig. 10.9 [Ref. 11].

10.6. Related Work

Often, there is only a fine line between on-chip processor arrays and CGRAs, since the provided functionality is similar. We refer to [Ref. 22] for an excellent overview of CGRAs. Even though only little research work exists that deals with compilation to CGRAs, we want to distinguish ours from it. The authors in [Ref. 23] describe a compiler framework to analyze SA-C programs, perform optimizations, and automatically map applications onto the MorphoSys architecture [Ref. 24], a row-parallel or column-parallel SIMD architecture. This approach is limited in the sense that the order of the synthesis is predefined by the loop order and no data dependencies between iterations are allowed. Another approach for mapping loop programs onto CGRAs is presented by Dutt and others [Ref. 25]. Remarkable in their compilation flow is the target architecture, the Dynamically Reconfigurable ALU Array, a generic reconfigurable architecture template which can represent a wide range of CGRAs. The mapping technique itself is based on loop pipelining and the partitioning of a data flow graph into clusters, which can be placed on a line of the array. However, all the aforementioned mapping approaches follow a 'traditional' compilation approach, where first transformations such as loop unrolling are performed, and subsequently the intermediate representation in the form of a control/data flow graph is mapped by using placement and routing algorithms. Unique to our approach is that, thanks to the use of loop tiling in the polyhedron model [Ref. 12], the placement and routing is implicitly given, i.e., obtained for free and much more regular.


Fig. 10.9. Relative reduction in terms of VLIW instructions and throughput for the applications FIR, MM, and SAD compared to Trimaran.

In embedded applications such as signal processing and multimedia, most of the execution time is spent in nested loops. However, there is no loop execution without loop control (e.g., loop bound condition checking, incrementing, and branching). This control overhead can be minimized by employing zero-overhead looping schemes [Ref. 26], which are available in most modern DSPs for innermost loops. In [Ref. 27], Kroupis et al. proposed a single-processor controller and compilation technique which maps not only the innermost loop control but also the outer loop control onto a controller realized in hardware, so as to support multi-dimensional zero-overhead looping.

Our approach for mapping loops onto a processor array differs from [Ref. 27] in the following ways: first, we consider multiprocessor arrays and not single processors. Second, we strictly separate the control flow from the data flow (i.e., scheduling is as tight as possible and restricted only by data dependencies and by functional unit and register resource constraints). Branching is performed based only on the values of control signals that are generated completely outside a processor (propagated from the controller located at the border of the CGRA). This separation of control flow is unique to our proposed code generation approach and is finally also more energy-efficient, as the control signals needed for branching are computed only once and are propagated, unmodified and in a delayed fashion, to the proper processors. Furthermore, our scheduling and mapping approach is able to exploit both instruction-level and loop-level parallelism, respectively the pipelining of iterations (software pipelining). As this may typically result in increased code sizes, we proposed a systematic way of code compaction, based on the idea of compressing naive loop code by exploiting repetitive segments of code called program blocks and multi-dimensional zero-overhead looping.

10.7. Conclusions


We presented an approach for compact code generation and optimization for a class of CGRAs. For the fast and energy-efficient execution of many types of nested loop programs, the proposed CGRA offers many configuration possibilities at the level of the VLIW core specification and also a flexible topology configurability. Also unique in the processor architecture specification are the reconfigurable feedback registers. Systems with hundreds of cores are, however, limited in the instruction and data memory available per processor. We therefore presented a methodology for compact code generation for large-scale CGRAs, starting from a quite general class of loop programs. Our main result is that, typically, code does not need to be generated for each processor separately, but only for a small number of so-called processor classes that distinguish themselves by different functionality. Hence, each processor belonging to the same processor class will receive the same program. The program loading may be achieved in parallel using a multi-cast configuration method presented in [Ref. 6]. Subsequently, in order to avoid the explosion of code for a single processor, repetitive patterns of instruction sequences called program blocks are extracted. Moreover, the presented code generation and optimization methods based on program blocks are able to emit a most compact assembly code by preserving a given schedule of iterations and by emitting the branching between the program block codes only once instead of replicating it. The final code has been shown to be independent of the loop problem size and also of the size of the target CGRA.

Our approach also shows promising results compared to the Trimaran compilation infrastructure. It may be used for fine-tuning an application with respect to its instruction memory requirements and hence can also be used during design space exploration [Ref. 28].

In the future, we would like to investigate symbolic code generation techniques [Refs. 29 and 30], which are especially beneficial when the number of available processing elements in a CGRA becomes known only at runtime [Ref. 31].

10.8. Acknowledgment

This contribution is dedicated to Professor Peter Y. K. Cheung on the occasion of his 60th birthday and in recognition of his outstanding contributions to the field of reconfigurable computing in general and field-programmable gate arrays (FPGAs) in particular. The first author has followed the pioneering work of Professor Cheung since being a PhD student more than 20 years ago. Constantly fascinated by his inspiring thoughts, and with admiration for his steady success in solving fundamental problems that have advanced the field of reconfigurable computing tremendously, we are very proud to congratulate him with this chapter on our current research.

This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre 'Invasive Computing' (SFB/TR 89).

References

1. V. Baumgarte et al. PACT XPP — A Self-Reconfigurable Data Processing Architecture, The Journal of Supercomputing, 26(2), 167–184, 2003.

2. F. Bouwens et al. Architecture Enhancements for the ADRES Coarse-grained Reconfigurable Array, in Proc. International Conference on High Performance Embedded Architectures and Compilers, pp. 66–81, 2008.

3. A. Duller, G. Panesar, and D. Towner. Parallel Processing — the PicoChip Way!, Communicating Process Architectures, pp. 125–138, 2003.

4. A. Olofsson. A 1024-core 70 GFLOP/W Floating Point Manycore Microprocessor, in Proc. Annual Workshop on High Performance Embedded Computing, 2011.

5. B. R. Rau. Iterative Modulo Scheduling: an Algorithm for Software Pipelining Loops, in Proc. International Symposium on Microarchitecture, pp. 63–74, 1994.

6. D. Kissler et al. A Dynamically Reconfigurable Weakly Programmable Processor Array Architecture Template, in Proc. International Workshop on Reconfigurable Communication Centric System-on-Chips, pp. 31–37, 2006.

7. J. Teich, L. Thiele, and L. Zhang. Scheduling of Partitioned Regular Algorithms on Processor Arrays with Constrained Resources, Journal of VLSI Signal Processing, 17(1), 5–20, 1997.

8. H. Dutta, F. Hannig, and J. Teich. Hierarchical Partitioning for Piecewise Linear Algorithms, in Proc. International Conference on Parallel Computing in Electrical Engineering, pp. 153–160, 2006.

9. J. Teich and L. Thiele. Exact Partitioning of Affine Dependence Algorithms, in Proc. Embedded Processor Design Challenges, pp. 135–153, 2002.

10. F. Hannig. Scheduling Techniques for High-Throughput Loop Accelerators, Dissertation, University of Erlangen-Nuremberg, Germany, 2009.

11. S. Boppu, F. Hannig, and J. Teich. Loop Program Mapping and Compact Code Generation for Programmable Hardware Accelerators, in Proc. International Conference on Application-specific Systems, Architectures and Processors, pp. 10–17, 2013.

12. P. Feautrier and C. Lengauer. "Polyhedron Model", in ed. D. Padua, Encyclopedia of Parallel Computing, pp. 1581–1592, Springer, New York, NY, USA, 2011.

13. F. Hannig and J. Teich. Resource Constrained and Speculative Scheduling of an Algorithm Class with Run Time Dependent Conditionals, in Proc. International Conference on Application-specific Systems, Architectures, and Processors, pp. 17–27, 2004.

14. F. Hannig et al. PARO: Synthesis of Hardware Accelerators for Multi-Dimensional Dataflow-Intensive Applications, in Proc. International Workshop on Applied Reconfigurable Computing, pp. 287–293, 2008.

15. L. Thiele and V. Roychowdhury. Systematic Design of Local Processor Arrays for Numerical Algorithms, in Proc. International Workshop on Algorithms and Parallel VLSI Architectures, pp. 329–339, 1991.

16. M. Wolfe. High Performance Compilers for Parallel Computing, Addison-Wesley, Boston, MA, USA, 1996.

17. ILOG, CPLEX Division. ILOG CPLEX 12.1, User's Manual, 2011.

18. F. Wu et al. Simultaneous Functional Units and Register Allocation Based Power Management for High-level Synthesis of Data-intensive Applications, in Proc. International Conference on Communications, Circuits and Systems, 2010.

19. S. Boppu, F. Hannig, and J. Teich. Loop Program Mapping and Compact Code Generation for Programmable Hardware Accelerators, in Proc. International Conference on Application-Specific Systems, Architectures and Processors, pp. 10–17, 2013.

20. B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. Code Generation Schema for Modulo Scheduled Loops, ACM SIGMICRO Newsletter, 23(1–2), 158–169, 1992.

21. Trimaran. An Infrastructure for Research in Backend Compilation and Architecture Exploration. [Online] Available at: http://www.trimaran.org/. [Accessed 23 April 2014].

22. T. Todman et al. Reconfigurable Computing: Architectures and Design Methods, IEE Proc. Computers and Digital Techniques, 152(2), 193–207, 2005.

23. G. Venkataramani et al. Automatic Compilation to a Coarse-grained Reconfigurable System-on-Chip, ACM Transactions on Embedded Computing Systems, 2(4), 560–589, 2003.

24. H. Singh et al. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications, IEEE Transactions on Computers, 49(5), 465–481, 2000.

25. J. Lee, K. Choi, and N. Dutt. An Algorithm for Mapping Loops onto Coarse-Grained Reconfigurable Architectures, in Proc. Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 183–188, 2003.

26. G.-R. Uh et al. Effective Exploitation of a Zero Overhead Loop Buffer, in Proc. Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 10–19, 1999.

27. N. Kroupis et al. Compilation Technique for Loop Overhead Minimization, in Proc. Euromicro Conference on Digital System Design, Architectures, Methods and Tools (DSD), pp. 419–426, 2009.

28. F. Hannig and J. Teich. Design Space Exploration for Massively Parallel Processor Arrays, in Proc. International Conference on Parallel Computing Technologies, vol. 2127, pp. 51–65, 2001.

29. S. Boppu et al. Towards Symbolic Run Time Reconfiguration in Tightly-Coupled Processor Arrays, in Proc. International Conference on Reconfigurable Computing and FPGAs, pp. 392–397, 2011.

30. J. Teich, A. Tanase, and F. Hannig. Symbolic Parallelization of Loop Programs for Massively Parallel Processor Arrays, in Proc. International Conference on Application-specific Systems, Architectures and Processors, pp. 1–9, 2013. Best Paper Award.

31. J. Teich. Invasive Algorithms and Architectures, IT-Information Technology, 50(5), 300–310, 2008.

a For the sake of better visibility, control registers and control I/O ports are not shown in Fig. 10.1. However, it should be noted that, similar to data-path-specific resources such as the registers RDx, IDx, and so on, there also exist equivalent control path resources, e.g., registers RCx, ICx, etc.
b This is the reason to call them piecewise linear, respectively regular.
c For parallelotope-shaped tiles, ℐ1 ⊕ ℐ2 = {I = I1 + P · I2 | I1 ∈ ℐ1 ∧ I2 ∈ ℐ2 ∧ P ∈ ℤn×n}. Here, P is called the tiling matrix.


d Note that in case this expression is more complex than a unary or binary operator that is typically mappable to a single processor instruction, complex expressions may be split into simple equations by introducing intermediate variables that are assigned subexpressions with one or two operands only.
e A DPU is a functional unit that may move data from a source to a destination register.
f Keep in mind that their executions may, however, overlap.
g Hence, the iteration space of the whole loop program is partitioned into a single tile only, to reflect that all iterations will be assigned to a single processor (LSGP partitioning used).
h The average iteration interval IIavg is the average time between the starts of two successive loop iterations. It is calculated by dividing the total execution time of a loop nest by the total number of iterations executed.
i The overhead is evaluated as the amount of time spent executing anything other than the innermost loop body, compared to the total execution time.


Chapter 11

Some Statistical Experiments with Spatially Correlated Variation Maps

David B. Thomas
EEE Department, Imperial College London

This chapter uses Gaussian random fields as a means of modelling delay variation within chips. It is shown that for certain scales of correlation, it is possible to increase yield by spatially decorrelating the placement of components within a path.

11.1. Introduction

Peter's work on variation has always piqued my interest, particularly due to the statistical nature of the processes involved. Whenever Zhenyu or Justin brandishes a graph of variation maps [Ref. 1], I've always wondered what high-level impact we should expect: beyond the practical place-and-route implications, how does the existence of variation change what we should expect from the circuits?

This paper is a hopelessly naive attempt to translate the practical results and discussions I've seen in Peter's research meetings into a more statistical approach. The only real result is to show that if spatial correlation in variance is an important factor, then logic elements should be spatially decorrelated in order to increase yield. This is a rather obvious conclusion, and I'm sure it is well known amongst silicon people, but as a software engineer it was fun to discover.

11.2. Properties of Uncorrelated Fields

Assume a rectangular n × m grid of components, each of which has some delay di,j. To start, we'll consider the delays to be independent and identically distributed (IID) Gaussian, so

di,j ∼ N(µ, σ²),  P[di,j ≤ x] = Φ((x − µ)/σ),

where Φ(·) is the standard Gaussian cumulative distribution function (CDF).

Now we'll consider any w × h sub-grid, 1 ≤ w ≤ n, 1 ≤ h ≤ m. Because the delays within the grid are IID, any w × h grid should be statistically identical, so the exact location doesn't matter, and we can just consider the rectangle starting at (1, 1).

First let us consider the distribution of the average delay within the sub-grid:

d̄ = (1/(wh)) Σi=1..w Σj=1..h di,j.

This is simply a scaled sum of wh IID Gaussians, so the average delay is also Gaussian distributed:

d̄ ∼ N(µ, σ²/(wh)).

From a circuit point of view, what is more interesting is how bad things can get, which in terms of delay is determined by the worst delay within an area. If we define this as

dmax = max{di,j : 1 ≤ i ≤ w, 1 ≤ j ≤ h},

then we can describe the CDF of dmax in terms of the CDF of di,j:

P[dmax ≤ x] = Φ((x − µ)/σ)^(wh).

Similarly, the smallest delay is described by

P[dmin ≤ x] = 1 − (1 − Φ((x − µ)/σ))^(wh).
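These order statistics are easy to sanity-check by simulation. The following sketch (numpy; all parameter values are illustrative) compares the analytic CDF of dmax with an empirical estimate.

    # Empirical check of P[d_max <= x] = Phi((x - mu)/sigma)^(wh) on an IID grid.
    import math
    import numpy as np

    mu, sigma, w, h, x = 1.0, 0.1, 4, 4, 1.2           # illustrative values
    phi = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    analytic = phi ** (w * h)
    rng = np.random.default_rng(1)
    samples = rng.normal(mu, sigma, (200_000, w * h))  # 200k random sub-grids
    empirical = (samples.max(axis=1) <= x).mean()
    print(analytic, empirical)                         # should agree closely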

11.3. Introducing Correlation

So far we have considered IID variance, so the delay of one element is unrelated to the delay of any other element. One hypothesis, which runs through Zhenyu's work, is that there is a strong spatially correlated component: if one area of silicon has higher delay, then it is likely that close components also have higher delay. To put this in a statistical framework, let us assume that the delay of each element is coupled to all immediately adjacent elements (left, up, right, down) with correlation ρ. Given this correlation structure, we need to determine some feasible covariance matrix Σ over the element delays.

The correlation matrix Aw,h will be a positive-definite matrix, with all elements in [−1, +1], and ones along the diagonal. For a w × h rectangle, the matrix will have wh rows and columns, as it needs to express every pair-wise correlation.

For 1 × 2 and 2 × 1 the covariance matrix is trivial, but for larger matrices we need to make sure that the matrix is consistent. Let us start with the matrix for a 2 × 2 rectangle, ordering the elements (1, 1), (1, 2), (2, 1), (2, 2):

A2,2 =
( 1  ρ  ρ  ? )
( ρ  1  ?  ρ )
( ρ  ?  1  ρ )
( ?  ρ  ρ  1 )

We have a number of entries (marked ?) which are not defined by our simple left-up-right-down correlation pairs, specifically the (1, 1) ↔ (2, 2) and (1, 2) ↔ (2, 1) correlations.

Intuitively these unknown correlations should be related to the Euclidean distance across the diagonals, but it isn't clear exactly what they should be. We could find some sort of consistent matrix using a Kirchhoff's-Law-like approach, but this would have the unfortunate side-effect that correlations at the edges would be different from correlations in the centre. Assuming our original n × m rectangle was cut out of a much larger piece of silicon, it makes sense that the original local correlation process crossed the boundaries, then was sawn in half.

For this reason we'll assume that the local variation is represented by a Gaussian random field [Ref. 2]. Absent any better model for variation, an exponential correlation function will be used, which has the desirable property that correlations decrease monotonically from one. No negative correlations are ever introduced, but I'm not aware of any particular physical process which would introduce anti-correlations (though I'm sure they exist).

The continuous correlation between two points p1 and p2 is now given as

ρ(p1, p2) = exp(−||p1 − p2||/λ),

with λ acting as a scale parameter. For the elements of our grid, we'll simply discretise at integer co-ordinates to get the matrix

A(i,j),(k,l) = exp(−√((i − k)² + (j − l)²)/λ).

Armed with our new correlation structure for a grid, we can now return to the statistical properties of rectangles (woo!). The correlated elements follow a multivariate normal distribution, determined by a combination of the correlation matrix and our original parameters σ and µ:

d ∼ N(µ1, σ²Aw,h).
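One way to draw realisations of this distribution is via a Cholesky factor of the correlation matrix. A minimal numpy sketch (parameter values illustrative) is:

    # Draw one correlated delay map d ~ N(mu*1, sigma^2 * A) for an n x m grid.
    import numpy as np

    def exp_correlation(n, m, lam):
        # A[(i,j),(k,l)] = exp(-euclidean distance / lam), as an (nm x nm) matrix.
        pts = np.array([(i, j) for i in range(n) for j in range(m)], dtype=float)
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        return np.exp(-dist / lam)

    def sample_field(n, m, mu, sigma, lam, rng):
        A = exp_correlation(n, m, lam) + 1e-9 * np.eye(n * m)  # jitter for stability
        L = np.linalg.cholesky(A)
        return (mu + sigma * (L @ rng.standard_normal(n * m))).reshape(n, m)

    rng = np.random.default_rng(0)
    print(sample_field(4, 4, mu=1.0, sigma=0.1, lam=1.0, rng=rng))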

In terms of the expected mean, nothing has changed; it is still just µ. The minimum and maximum become more complicated, as the spreading of the distribution will vary with correlation. In particular, as λ is increased, making correlations stretch over longer and longer distances, the variation within the sample reduces. Figure 11.1 gives examples of random correlated fields for increasing values of λ.

This is easy to understand intuitively by considering the two extremes for λ. When λ = 0 we find

ρ(p1, p2) → 0 for all p1 ≠ p2,

which is simply saying that the correlation matrix is the identity matrix, making each element independent. If all elements are independent then we are maximising the chance that one of them will manage to produce a value out in the tails. By contrast, with λ = ∞ we find

ρ(p1, p2) → 1 for every pair of points,

so all elements are perfectly correlated. Perfect correlation means there is only one global source of uncertainty, so it is much less likely that a single random sample will reach into the tails of the distribution.

Fig. 11.1. Examples of variation as modelled by a Gaussian random field, with the same underlying random factors but different correlation distances.

Figure 11.2 shows this effect visually, with each column having a different spatial correlation, and the random realisations within the column sorted by average delay. Looking at the realisations half-way up the right column, it appears as if the variance is much lower than in the left column. However, when the entire set of realisations is considered, the overall picture becomes clearer, with some instances at the top having very low delay, and some at the bottom having very high delay.


Fig. 11.2. Each column shows different realisations of a particular spatial correlation, for λ = (0.125, 1, 8, 64). Within each column the fields are sorted by mean delay.

11.3.1. Behaviour of paths under spatial correlation


Rectangles are useful, but a more important problem is the behaviour of paths within a circuit. We will define a path P = p1 · · · pk which passes through k elements as a sequence of k integer pairs, with each pair representing the location of an element. For now we impose no restrictions on the path except that each element can only appear once.

We can define the delay of a path as the sum of the delays of all its components:

DP = Σi=1..k dpi, with expected value E[DP] = kµ.

Under the uncorrelated model (λ = 0), all elements are IID Gaussian, which means the variances add:

Var[DP] = kσ².

At the other end, with perfect correlation (λ = ∞), all elements have the same delay, so we have

Var[DP] = k²σ².

An equivalent statement is to put it in terms of the standard deviation of the path delay:

SD[DP] = √k σ (uncorrelated) versus SD[DP] = kσ (perfectly correlated).

The perfectly correlated version has much higher variance for large k, which seems worse, but the truth is more subtle. The perfectly correlated version is essentially described by one random number: if that number is high, then the entire chip is pretty much useless, but there is also a 50% chance that the entire chip will be 'above average', in which case every single path on the chip will run faster than average. By comparison, the uncorrelated version contains many fast and slow elements, so the different elements within a path will tend to even out the delay. However, across many paths in a chip, the likelihood that any single path is slow grows quite quickly.

To examine 0 < λ < ∞, it is necessary to consider which paths to check, as the spatial nature of the paths will affect their delay distribution. Here we'll consider an abstract design consisting of four elements connected in a chain, and our yield metric will be based on the tolerable deviation from the mean. Based on static analysis, with no knowledge of the actual chip variation, the tools could assume that the standard deviation of each path will simply be √k σ, so in this case 2σ. Let us assume the tools work to a slack s, and any given four-element circuit will work as long as P < 4µ + s.

The target design contains multiple paths, and a given chip only works if all paths meet the slack target. Because each chip will have a different delay distribution, we'll define a yield metric Y(s). The yield estimates the number of chips which will function correctly when designed for a particular slack.

In this simple study, there are only four elements in each path, and it is natural to pack them together into a square. Figure 11.3(a) shows this layout. Another common layout is to arrange them into columns (e.g. carry-chains), shown in (b). While these arrangements should be equivalent in an uncorrelated or perfectly correlated field, in spatial fields there is a potential difference: if one of the elements in a 2 × 2 block happens to get a large delay, it is also likely that the other elements will have high delays, so the chance of the entire path exceeding the slack is much higher. In contrast, the column has a greater chance of staying within slack: if the top of the column has a high delay there is still some hope that the bottom will be far enough away to be decorrelated, and have a much smaller delay.

The idea that space between elements increases the likelihood of any given path working suggests that other arrangements might be even more efficient. For example, if we were to take pairs of columns and swap every other element, then the zig-zag arrangement shown in Fig. 11.3(c) arises. Although superficially similar, now all elements in a path are at least a distance √2 from each other, which means the maximum pair-wise correlation drops from exp(−1/λ) to exp(−√2/λ).


Fig. 11.3. Layouts for paths of four elements, contained within a 4 × 4 grid.

Going further, it is possible to interleave the paths as shown in (d), though at the cost of more routing. Now the maximum correlation is exp(−2/λ), a great improvement over the original.

Given these four arrangements, we can calculate their yield for different λ and sizes empirically, by generating lots of different variation maps. The maximum path delay within any variation map will determine the minimum allowable slack, and counting the number of random instances with slack less than s will provide an estimate of Y(s).
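A Monte Carlo sketch of this experiment follows (numpy; the parameters and path layouts are illustrative, with the dense 2 × 2 layout corresponding to Fig. 11.3(a)). It reuses the Cholesky-based sampling idea from the sketch in Section 11.3.

    # Estimate Y(s): the fraction of sampled chips on which every path meets
    # its slack target sum(delays) < k*mu + s.
    import numpy as np

    def chol_factor(n, m, lam):
        pts = np.array([(i, j) for i in range(n) for j in range(m)], dtype=float)
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        return np.linalg.cholesky(np.exp(-dist / lam) + 1e-9 * np.eye(n * m))

    def estimate_yield(paths, n, m, lam, s, mu=1.0, sigma=0.1, trials=5000):
        rng = np.random.default_rng(0)
        L = chol_factor(n, m, lam)
        idx = [[i * m + j for (i, j) in p] for p in paths]
        ok = 0
        for _ in range(trials):
            d = mu + sigma * (L @ rng.standard_normal(n * m))  # one variation map
            if all(d[ix].sum() < len(ix) * mu + s for ix in idx):
                ok += 1
        return ok / trials

    # Four dense 2 x 2 paths packed into a 4 x 4 device:
    dense = [[(i, j), (i, j + 1), (i + 1, j), (i + 1, j + 1)]
             for i in (0, 2) for j in (0, 2)]
    print(estimate_yield(dense, 4, 4, lam=1.0, s=2 * 0.1))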

Figure 11.4 shows the changing yield for a device with 4 × 4 elements, with each device used to hold four paths. The gently curving yellow line across the middle is the completely correlated case, which doesn't reach high yields even for very large slack. The upper line is the completely uncorrelated case, which provides the best possible yield. Both these cases are independent of the placement. The remaining lines show the effect of the different placements when λ = 1. Each placement provides a small increment in yield; for example, at a slack of 2, moving from dense placement to interleaved placement provides about a 10% increase in yield.


Fig. 11.4. Yield vs. slack for a 4 × 4 device.

Fig. 11.5. Yield vs. slack for a 32 × 32 device.

Figure 11.5 performs the same experiment, but now with a 32 × 32 grid. Because there are now so many more independent paths to test, the chance of any one of them requiring a large slack is much higher, so the benefits of removing spatial correlation are larger. At a slack of 3.5 the yield increases by 50% if the dense placement is replaced with the interleaved one. Even the much less drastic conversion from linear to zig-zag has a significant effect, with yield rising by just over 10%.


11.4.Conclusion

We assumed that delay due to variation could be modelled with a spatiallycorrelated Gaussian random field, and then looked at the yield of chipscontaining many identical paths. Superficially, the maps look like thoseextractedfromtheDE0boardsbyPeter’sresearchgroup[Ref.1],butitwouldbeinterestingtotestthehypothesismorerigorously.Attemptingtocharacterisethescaleofvariationwouldalsobeinteresting,astheabilitytodealwithspatialcorrelationdependsalotonhowfaritreaches.Thereisalsoaninteractionwiththereliabilityandvariationoftheinterconnect[Ref.3],somodellingbothatthesametimemightsuggestthatspatialdecorrelationisabadidea.

My original goal in writing this chapter was to look at the statistical properties of self-repair mechanisms in the context of spatially correlated variation, using some of the approaches used for wear-levelling [Ref. 4]. However, time did not permit, so this is left for future discussion with Peter's research group.

References

1. Z. Guan et al. A Two-stage Variation-aware Placement Method for FPGAs Exploiting Variation Maps Classification, in Proc. International Conference on Field Programmable Logic and Applications, pp. 519–522, 2012.

2. P. Abrahamsen. A Review of Gaussian Random Fields and Correlation Functions. [Online] Available at: http://www.math.ntnu.no/omre/TMA4250/V2007/abrahamsen2.ps. [Accessed 23 April 2014].

3. N. Campregher et al. Yield Modelling and Yield Enhancement for FPGAs Using Fault Tolerance Schemes, in Proc. International Conference on Field Programmable Logic and Applications, pp. 409–414, 2005.

4. E. Stott and P.Y.K. Cheung. Improving FPGA Reliability with Wear-levelling, in Proc. International Conference on Field Programmable Logic and Applications, pp. 323–328, 2011.


Chapter 12

On-Chip FPGA Debugging and Validation: From Academia to Industry, and Back Again

Steve Wilton
Department of Electrical and Computer Engineering
University of British Columbia

In recognition of Peter Cheung's 60th birthday and his extensive experience in digital system design research and education, this chapter discusses recent developments in on-chip debugging with a vision towards the future. We focus primarily on techniques to increase observability but also touch upon other aspects of the problem. We end with a discussion of the potential of extending debugging techniques to the world of designs generated by high-level synthesis.

12.1. Introduction

Unprecedented advances in integrated circuit fabrication technology have changed our way of life. Mobile computing, the high-speed internet, and computing equipment that analyses and manipulates information have changed the way we do business, relax, and interact with each other across the planet. The impact of increased computing power is like no other revolution our society has ever seen. Despite technology challenges, these advances show no sign of slowing down; in fact, the dramatically increased rate of communication has actually accelerated the design of the new technology that underpins future advances.

A key challenge with these new technologies, however, is keeping the cost of exploiting the technology affordable. As on-chip integration increases, the complexity of the underlying technology has also increased, leading to dramatically increased costs and financial risk. Today, very few companies are capable of creating high-end integrated circuits. For everyone else, field-programmable gate arrays (FPGAs) have become the implementation medium of choice for many digital circuits. FPGAs can be configured to implement any digital circuit, allowing designers to immediately test designs without the cost, risk and delay of producing a VLSI implementation. They provide companies with large-scale integration without requiring access to a state-of-the-art chip fabrication plant. The improvement of FPGA technology, the associated CAD algorithms, and ways in which FPGA technology can be used, has formed the backdrop for much of Peter Cheung's research over the past several decades.

One of the key challenges in designing large digital systems is verifying that the designs are correct (verification and validation), and finding the root cause of design errors when they are observed (debugging). A recent study from Mentor Graphics showed that half of all design effort was spent on functional verification [Ref. 1], and that the situation is getting worse: designer productivity doubles only every 39 months [Ref. 1], while silicon density doubles every 18 months due to Moore's Law. This has led many researchers around the world to investigate techniques to accelerate the verification, validation, and debugging of digital systems.

In this paper, we focus on one aspect of this puzzle: debugging. As we will describe, a primary challenge in debugging large digital systems is the lack of visibility into the internal state of a system. This has motivated much work, and we will discuss some of it here. To make the discussion concrete, we will use an example case study that traces a particular technology development from an academic lab, to an industry startup, through a successful acquisition, and then back to the academic lab. Although the focus will be on our own research, this type of research is very much related to the overall problems being studied by Peter Cheung; we hope that by reading this paper, the audience can better understand some of the challenges in this field, challenges that Peter works to address every day.

12.2. The Need for On-Chip Debugging

When digital systems were small, it was possible to test and debug these circuits entirely in simulation. Today, this is rarely possible [Ref. 2]. Using today's technology, the task of booting Linux on a SoC would take roughly 2,000 years to simulate. To put this in context, when the Romans were in London, if they had started a simulation in Peter's lab, it would be completing "any day now" (this is assuming the Romans had computing technology equivalent to today's, which seems unlikely). If the simulation showed an error, and had to be re-run, we would need to wait another 2,000 years to get the results, significantly delaying the graduation of the students involved. Yet, it is easy to imagine that there are many bugs that cannot be "activated" without booting an operating system (at least). Clearly, simulation (even when accelerated) is inadequate to thoroughly exercise any large digital system.
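
As a sanity check on the 2,000-year figure, here is a back-of-envelope calculation with assumed inputs (the chapter does not state its own): booting Linux takes on the order of a minute of wall-clock time at around 1 GHz, while full-chip RTL simulation of a large SoC proceeds at roughly one cycle per second.

```python
boot_cycles = 60 * 1e9          # ~1 minute of real time at 1 GHz (assumed)
sim_rate = 1.0                  # simulated cycles per second (assumed)
years = boot_cycles / sim_rate / (3600 * 24 * 365)
print(f"{years:,.0f} years")    # ~1,900 years: the right order of magnitude
```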

A second challenge with simulation is that it is difficult, if not impossible, to test the system in-situ. Many bugs will only be apparent when the digital system receives real-world stimulus. Trying to recreate the stimulus in a testbench is often difficult, and suffers from the problem that it is the unexpected stimulus that often causes errors, and those types of inputs are unlikely to be encoded in a testbench. Secondly, most digital systems operate in an ecosystem with either other chips or other embedded intellectual property (IP) blocks; bugs often occur because designers fail to understand (exactly) the interface requirements of these blocks (often undocumented) or the peculiarities of how the blocks should be used (again, often undocumented).

Finally, any interaction with embedded software or firmware is a common source of buggy behaviour, yet those interactions are often difficult or impossible to adequately cover without running a real system.

For all these reasons, the only way to completely test a system, and the only way to find many bugs, is using an actual working chip.

12.3. Key Challenge: On-Chip Visibility

A key challenge when debugging a real working system is observability. Moore's Law continues to provide more transistors (despite many obstacles); however, the number of I/O connections on a chip is not increasing as fast. The situation may get worse with the emerging 2.5-D and 3-D packaging technologies, simply because there is more logic inside one package.

In simulation, visibility is not an issue, since the user can "probe" any signal he or she wants to observe in the software model of the system. This is a vital step in the debugging of a circuit; observing internal signals is the best way for an engineer to "narrow down" their understanding of the potential state a circuit may be in, in order to help deduce the cause of unexpected behaviour (indeed, this is how we teach our undergraduates to debug: look at important signals to help narrow down the cause of an error).

In a hardware system, unless a signal is connected to an external pin, it is impossible to observe (chip probing equipment does exist, but it is expensive, error-prone, and is often limited to observing only the top few layers of metal). In our undergraduate classes, we teach students how to connect internal signals to pins during debugging (and then recompile their design), but this is time-consuming, and limits debug productivity.


12.4. Case Study: From Academia to Industry

To make our discussion concrete, in this section we trace the development of a particular technology (or family of technologies) that addresses this problem, starting from its inception in academia, to its exploitation in the startup Veridae Systems.

12.4.1. Trace buffers and monitoring circuits

Observability can be enhanced using embedded trace buffers, as shown in Fig. 12.1. The idea is to connect signals that are deemed "important" to one or more embedded memories, and record the values of those signals during normal operation [Refs. 3 and 4]. When an error is observed, the system can be halted (halting a system precisely is difficult; addressing this "skidding" problem is an interesting research opportunity), and the history of these signals can be read out. Tools such as SignalTap II and ChipScope are produced by FPGA vendors to automate the inclusion of this instrumentation, and other forms of instrumentation have been described [Ref. 5].

Fig. 12.1. Integrated circuit instrumented with a trace buffer.
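
As an illustration of the idea (a toy model, not any vendor's implementation), the following Python sketch captures the essential behaviour of a trace buffer: a fixed-depth window of recent samples that can be read out after the system halts.

```python
from collections import deque

class TraceBuffer:
    """Toy software model of the embedded trace buffer of Fig. 12.1."""
    def __init__(self, depth):
        self.samples = deque(maxlen=depth)  # oldest samples fall off, as in hardware

    def clock(self, signals):
        """Record the watched signals every cycle during normal operation."""
        self.samples.append(dict(signals))

    def dump(self):
        """After the system halts on an error, read out the signal history."""
        return list(self.samples)

buf = TraceBuffer(depth=4)
for cycle in range(10):
    buf.clock({"cycle": cycle, "valid": cycle % 2, "data": 3 * cycle})
print(buf.dump())  # only the last four cycles survive, as in a real trace buffer
```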

There are a number of challenges with this approach. First, only a limited amount of data can be recorded in a trace buffer. There has been work on both lossless and lossy compression for trace buffer data, which increases the amount of information stored [Ref. 6].

A second challenge is the need to determine, a priori, which signals are "important" so that these signals can be hardwired to the trace buffers. This signal selection problem has been well-studied but remains difficult [Refs. 7–10].


12.4.2. Academic efforts: Embedded logic analyzers

To address some of these problems we proposed the use of programmable logic cores as embedded logic analysers, as shown in Fig. 12.2. The idea is to incorporate a programmable logic (FPGA) region within a fixed-function chip, such as an application-specific integrated circuit (ASIC). This programmable logic region can be configured to implement any digital circuit at runtime by setting the values of configuration bits, just as can be done in an FPGA. At runtime, when an engineer is searching for the cause of unexpected behaviour, he or she can configure this region (perhaps, for example, creating complicated gating functions) to record trace buffer data only when a specific pattern on a number of other signals (perhaps over time) is observed. The programmable logic region could also be used to compress data, count the number of predetermined events that occur, or implement assertions that might be helpful during debug [Ref. 11].

Fig. 12.2. Integrated circuit instrumented with programmable logic fabric.
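
The kind of gating function such a core might implement can be sketched in the same toy style; the event pattern and signal names below are invented for illustration.

```python
def make_trigger(pattern):
    """Build a predicate for a configured event pattern (hypothetical)."""
    return lambda sample: all(sample.get(k) == v for k, v in pattern.items())

trigger = make_trigger({"valid": 1, "data": 9})
captured = []
for cycle in range(10):
    sample = {"cycle": cycle, "valid": cycle % 2, "data": 3 * cycle}
    if trigger(sample):          # record trace data only when the pattern
        captured.append(sample)  # is observed, as the programmable region would
print(captured)                  # only cycle 3 matches
```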

Furthermore, it is conceivable that, under certain circumstances, the programmable logic core could be used to over-write certain values. This provides a measure of controllability to the debug process. One example where this might be useful during debug is to suppress an error signal that is activated, possibly allowing the system to "limp along" after an error occurs.

This technique still requires selecting signals to hardwire to the programmable logic core. In our case, we proposed using a concentrator network which can be configured, at runtime, to efficiently connect a larger number of signals (on the order of 1,000s) to the smaller number of embedded programmable logic core pins [Ref. 12]. Since this concentrator can be configured at runtime, it is possible to change the set of signals used for each debug scenario.
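
A minimal sketch of the concentrator idea follows; the signal counts and the data-structure representation are mine, chosen only to show how a runtime-configurable selection maps many internal signals onto a few debug pins.

```python
class Concentrator:
    """Toy model: route a chosen subset of many signals to a few pins."""
    def __init__(self, n_inputs, n_outputs):
        self.n_inputs, self.n_outputs = n_inputs, n_outputs
        self.select = list(range(n_outputs))    # default configuration

    def configure(self, chosen_inputs):
        assert len(chosen_inputs) <= self.n_outputs
        self.select = list(chosen_inputs)       # rewired between debug scenarios

    def route(self, input_values):
        return [input_values[i] for i in self.select]

con = Concentrator(n_inputs=4096, n_outputs=16)
con.configure([7, 42, 1024, 2048])   # the signals for this debug scenario
signals = [0] * 4096
signals[42] = 1
print(con.route(signals))            # -> [0, 1, 0, 0]
```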

There are several challenges with this approach. First, the programmable logic fabric is large. Research has suggested that programmable logic can be 35× larger than the equivalent ASIC circuitry [Ref. 13], so this limits the complexity of the debug logic that can be implemented in a core. However, we were able to show that by limiting the overhead to 10% (this was deemed acceptable by many of our industrial contacts, since this extra logic does not need to yield in a production environment), we could implement a large number of debug scenarios.

12.4.3. Industrial efforts: Shifting the technology

In 2011, we decided to commercialise this approach, and created a Vancouver-based company called Veridae Systems. Veridae started out at the University of British Columbia, but quickly moved to an off-campus office. Initially, there were six programmers, hired using our own seed money and various government grants and related sources.

As we began to commercialise the technology, it became clear that the technology developed at the university was not suitable for our purposes. In particular, we quickly determined that a general programmable logic core was far more flexible than was necessary. Although we wanted to support a variety of debug scenarios, most of these debug scenarios were very similar, and so much of the programmable logic could be replaced by much more efficient hard logic. In particular, we created very complex compression circuitry out of hard logic (in typical cases, our compression circuitry made it possible to gather data for seconds of runtime by filtering out uninteresting events on a bus, for example). We also found that the full flexibility afforded by the programmable logic core was not needed for common matching applications; we thus replaced this logic with fixed, hand-optimised flexible mapping circuitry. The concentrator network also evolved as we commercialised the technology. As a result, the eventual technology that was commercialised looked very different from the academic work that had led to the company in the first place [Ref. 14].

The other shift in the technology was the focus on usability. In our academic work, usability was not a key concern; however, within the company, one of our primary values to our customers was the ability to easily and quickly insert debug instrumentation. Thus, creating an intuitive yet flexible interface became important. A very significant part of our development was aimed towards creating a GUI infrastructure (based on Python and QT).

Another difference between the industrial and academic effort is that in the academic world, it was sufficient if our tools worked on just a "handful" of user circuits. In the industrial world, our tool had to work on all user circuits, whether they be written in VHDL, Verilog, or SystemVerilog. We quickly came to appreciate the semantic differences (differences in the way code can be written) available to designers in all three languages. We used software from Verific to help in this process; however, we found that a significant amount of effort was spent ensuring our tool handled all corner cases of all three languages correctly.

12.4.4. Industrial efforts: Shifting the priorities

A second shift occurred as we commercialised our technology. Although we had initially focused on ASIC designers, it quickly became clear that (a) there are not many ASIC designers today, since more and more design starts are shifting to FPGAs, and (b) the design cycles for ASICs are much longer than for FPGAs, meaning it is hard for a startup to have the right timing to seamlessly fit into a customer's design schedule. Thus, a key strategic decision was to put significant effort into an FPGA product.

A second strategic decision was the realization that there are many ASIC designers that use FPGA prototyping to test and debug their designs. Recently, Intel has described how their Atom processor was prototyped on a single FPGA running at 50 MHz [Ref. 15], and how their i7 Nehalem processor was prototyped using five FPGAs running at 520 kHz [Ref. 16]. These sorts of prototyping systems have become essential as designers create larger and more complex designs [Refs. 17 and 18]. It became apparent that it was during FPGA prototyping that many bugs were uncovered, so the company also put resources into addressing that market. A key challenge when addressing the FPGA prototyping market was to handle asynchronous interactions between different devices, and to later reconstruct signals based on measured time references.

12.5. Back to Academia: Future Directions

In 2011, Veridae Systems was sold to Tektronix. Our experience in industry motivated a significant amount of new academic work, including more efficient incremental trace techniques, overlay networks, and better signal selection methods. We also considered a related problem: coverage monitoring. In this section, we highlight some of this future work, some of which has begun, and some of which is still in its infancy. For readers not working in this field, we hope this section can provide an appreciation of the breadth and excitement of some of the work remaining to be done in this area. Perhaps this section may even provide some motivation for Peter to start addressing some of these challenges in the future.

12.5.1. Coverage monitoring

In the software design world, companies regularly use "code coverage" to determine how effectively a set of tests "covers" a design. It is not immediately clear how to translate this to a hardware design environment. Coverage techniques such as statement coverage, branch coverage, condition coverage, path coverage, functional coverage, mutation coverage and tag coverage have all been proposed.

Each of these coverage metrics can easily be measured using simulation techniques; however, for the same reasons as described in the previous section, effective coverage measurements can only be done using real running hardware. Hardware validation is accepted as the only method to thoroughly exercise a design. However, gathering these coverage numbers through validation requires instrumentation, possibly with significant overhead.

As a specific example, consider statement coverage. An interesting question would be: when I boot Linux on my SoC, what statement coverage do I achieve? The answer to this question would give an indication of whether booting Linux is a "useful" test, and what other tests need to be performed before a design is deemed correct (or correct enough to ship). However, from the above, measuring this in simulation would take thousands of years. A naïve approach would entail adding instrumentation for every instruction that would trigger if this instruction has been "exercised". We have shown [Refs. 19 and 20] that the overhead can be reduced by instrumenting at the basic block level, and further reduced using compiler techniques such as those by Agrawal [Ref. 21]. However, the overhead is still unreasonable, especially if it is not economically feasible to create a new spin to remove the instrumentation before shipping. Clearly more research is needed here.
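
The basic-block idea can be illustrated with a toy Python sketch (the program and the block names are invented): because every statement in a basic block executes together, one flag per block is enough to derive statement coverage.

```python
coverage = {}

def bb(block_id):
    """Mark a basic block as exercised: one bit of state per block."""
    coverage[block_id] = True

def clamp(x, lo, hi):
    bb("entry")                  # all statements in a block run together,
    if x < lo:                   # so one flag certifies every one of them
        bb("below")
        return lo
    if x > hi:
        bb("above")
        return hi
    bb("inside")
    return x

clamp(5, 0, 10)
print(sorted(coverage))          # ['entry', 'inside']: two blocks not yet covered
```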

Another approach is to employ an FPGA prototype, and measure the coverage using the FPGA prototype. One challenge here (especially when it comes to coverage metrics such as path coverage) is that the FPGA implementation of the user circuit may not correspond exactly (gate to gate) to the original ASIC implementation. When mapping the design to an FPGA prototyping platform, different optimisations are performed to ensure the FPGA resources are used efficiently. Nonetheless, it is possible to instrument the FPGA implementation to gather information regarding the coverage, as would be seen if this test suite was run on the original ASIC.

A final opportunity is the ability to unify the debug infrastructure, such as that described earlier in this paper, and the instrumentation required to measure coverage. Indeed, measuring coverage is about visibility, so techniques that can be used to enhance visibility during debug can also be used to measure the coverage of a set of tests. This is interesting work that has received little attention.

12.5.2. Future directions: Debugging from high-level synthesis

Decreased electronic product life cycles necessitate shorter times from product conception to market introduction. It is clear that "time to market" is becoming increasingly important. To address this, FPGA vendors have recently invested in high-level synthesis (HLS) technologies that automatically transform a software program into a hardware circuit (e.g. Altera has OpenCL support and Xilinx supports AutoESL C-based design flows). Raising the abstraction level for design-entry to support software-like languages has two advantages. First, it has been reported that there are 10× more software developers than hardware developers [Ref. 22]; increasing accessibility to this large labour pool allows more companies to leverage hardware's dramatically higher data throughput and energy efficiency benefits. Second, a higher abstraction level increases the productivity of each designer, just as the move from gate-level design to register-transfer-level design did in the 1980s. This new technology is disruptive and could dramatically change the computing landscape [Ref. 23].

However, this abstraction technology can only be effective if it is part of an ecosystem in which designers using software-like methodologies can visualise and analyse their design's behaviour. Such an ecosystem requires that software engineers can create correct and efficient hardware structures, without requiring detailed knowledge of digital hardware design. Hardware design is characterised by cycle-by-cycle operations, low-level optimisation, and a dataflow-oriented view of a computing problem. However, software engineers typically view designs as interacting sequential processes, ignoring their cycle-by-cycle behaviour, low-level mapping and operations. An ecosystem that spans these two very different abstraction layers is essential to the accelerated time-to-market promised by HLS-based technologies.

12.5.2.1. Motivation

We believe there is a critical need to investigate and develop key technologies for this ecosystem. Using today's tools, it is possible for a software designer to use HLS tools to create hardware running on an FPGA programmable platform by writing only software. However, debugging is still challenging. Today, a common test and debug methodology is to port, compile, and run the code directly on a workstation to verify the functionality and help find bugs before compiling to hardware. Although appropriate for initial verification, this emulation approach has two limitations. First, the software emulation will run much slower than the target hardware (typically 20 to 200 times slower), limiting the thoroughness of the tests that can be performed. Second, this will not uncover problems related to interactions with the environment or with other modules in the system; experience from our company indicates this is where most bugs occur. Thus, it is necessary to test the design in situ by executing the synthesised hardware on the target FPGA.

Today, in order to perform this hardware verification, a designer would compile the software using an HLS tool (e.g. LegUp [Ref. 24]) to a hardware specification (more precisely, a register transfer level (RTL) design). This specification can then be mapped to hardware using standard tools. Debugging packages such as the one described earlier in this paper can then be used to provide visibility into the design to help the engineer understand the operation of the hardware and narrow down the cause of a suspected bug.

The challenge with this approach is that these tools provide visibility that has meaning only in the context of the generated RTL hardware. A software designer typically would not have an understanding of the underlying hardware; in fact, this is the primary reason that HLS methodologies are able to deliver high design productivity. A software designer's perspective centres on sequentially executed statements, written in a high-level language, to realise computations and control flow; there is no notion of a clock. On the other hand, the key advantages of hardware arise from the use of interconnected dataflow components operating in parallel across multiple clock cycles. Further, HLS reschedules operations across clock boundaries, creating a disconnect between the software and hardware views of the state of a system. A C-level instruction is not mapped to the hardware as a unit, but rather is optimised along with other instructions to create composite hardware units that execute within a single clock cycle. We argue that this mismatch between how a software designer views a design, and how current on-chip debugging tools present results, is the most important challenge facing HLS methodologies today. In the long term, this will limit the ability of a large class of designers (software designers) to obtain the performance and power advantages of FPGA programmable platforms.

12.5.2.2. Key technical challenges

The key technical challenge is to create a bridge between a software view of a design and the synthesised hardware. This can be addressed by developing and employing two complementary technologies: instrumentation and transforms.

As described earlier in this paper, effective debug and optimisation can best be achieved through instrumentation, that is, by adding small amounts of logic to a circuit/program to provide visibility and controllability to the design. The tool we described inserts instrumentation into the RTL (hardware) circuit; as a result, the instruments gather and control signals at the hardware level, creating the disconnect. It is also possible to instrument at other levels of abstraction; in particular, if we instrument the original software, we would immediately gather and control signals that the software programmer understands. However, this would significantly constrain the optimisation opportunities available to the HLS compiler, leading to slower, larger, and more power-hungry circuits. Likely, the best solution combines instrumentation at various levels of abstraction, balancing the overhead against the observability and controllability gained.

Another approach to bridging this disconnect is to develop and use transforms that relate structures and time references in the hardware circuit to those in the software design. Work by Hemmert [Ref. 25] provides the groundwork for creating such transforms. However, his techniques were not designed or evaluated in the context of a commercial HLS tool that performs more than 50 optimisations before the hardware is generated (it is these optimisations that create the disconnect we wish to address). Nonetheless, Hemmert's work provides optimism that such transforms are possible and can be effective.

12.5.3. Integration with runtime binding

In the past few years, Peter Cheung has been doing important work in the area of runtime binding [Refs. 26 and 27]. Through runtime binding it is possible to tolerate, and even take advantage of, manufacturing differences between individual devices (faults or simply parametric differences).


An exciting research direction would be to unify on-chip debugging and visibility-enhancement techniques with Peter's runtime binding work. Runtime binding requires knowledge of the particulars of the device being used, and it is conceivable that this sort of information could be gathered using an infrastructure like those used for debug in this paper. Rather than trace buffers, it may be possible to integrate sensors such as those developed by Peter and his group [Refs. 28–31]. In the long term, this may lead to systems that not only detect errors, but are more reliable [Ref. 32], and eventually those that self-heal.

12.6. Conclusions

The world is a different place than it was even a few decades ago. We are now all wirelessly connected to each other, around the world, and this has had a dramatic impact on not only how we work, but how we relate to each other and live our lives. This transformation has been enabled by technology. Throughout (the first part of) his career, Peter Cheung has been central to this revolution. He has made extremely important contributions that have helped take us into this new world.

As an example of the types of challenges that we need to address to continue this revolution, this chapter has talked about techniques to accelerate the debug of integrated circuits. We focused on a specific case study, and traced the development of a particular technology from academia to industry, and then showed how the circle was completed by inspiring new research in the academic realm. We hope that some of the ideas in this paper might be inspiring for others, both within Peter's research group and beyond.

References

1. H. Foster. Challenges of Design and Verification in the SoC Era, Report, Oct. 2011.

2. A. Nahir, A. Ziv, and S. Panda. Optimizing Test-generation to the Execution Platform, in Proc. Asia and South Pacific Design Automation Conference, pp. 304–309, 2012.

3. M. Abramovici et al. A Reconfigurable Design-for-Debug Infrastructure for SoCs, in Proc. Design Automation Conference, pp. 7–12, 2006.

4. H. Ko, A. Kinsman, and N. Nicolici. Design-for-Debug Architecture for Distributed Embedded Logic Analysis, IEEE Transactions on Very-Large Scale Integration Systems, 19(8), 1380–1393, 2011.

5. E. Matthews, L. Shannon, and A. Fedorova. A Configurable Framework for Investigating Workload Execution, in Proc. International Conference on Field-Programmable Technology, pp. 409–412, 2010.

6. E. Anis and N. Nicolici. Low Cost Debug Architecture Using Lossy Compression for Silicon Debug, in Proc. Design Automation & Test in Europe Conference & Exhibition, pp. 1–6, 2007.

7. H.F. Ko and N. Nicolici. Algorithms for State Restoration and Trace-Signal Selection for Data Acquisition in Silicon Debug, IEEE Transactions on Computer-Aided Design of Circuits and Systems, 28(2), 285–297, 2009.

8. X. Liu and Q. Xu. Trace Signal Selection for Visibility Enhancement in Post-Silicon Validation, in Proc. Design Automation & Test in Europe Conference & Exhibition, pp. 1338–1343, 2009.

9. E. Hung and S. Wilton. On Evaluating Signal Selection Algorithms for Post-Silicon Debug, in Proc. International Symposium on Quality Electronic Design, 2011.

10. S. Wilton, B. Quinton, and E. Hung. Rapid RTL-based Signal Ranking for FPGA Prototyping, in Proc. International Conference on Field-Programmable Technology, pp. 1–7, 2012.

11. M. Boulé and Z. Zilic. Generating Hardware Assertion Checkers: for Hardware Verification, Emulation, Post-Fabrication Debugging and On-Line Monitoring, Springer, New York, NY, USA, 2008.

12. B. Quinton and S. Wilton. Concentrator Access Networks for Programmable Logic Cores on SoCs, in Proc. International Symposium on Circuits and Systems, pp. 45–48, 2005.

13. I. Kuon and J. Rose. Measuring the Gap between FPGAs and ASICs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2), 203–215, 2007.

14. B. Quinton, A. Hughes, and S. Wilton. Post-Silicon Debug of Complex Multi Clock and Power Domain ICs, in Proc. International Workshop on Silicon Debug and Diagnosis, 2010.

15. P. Wang et al. Intel Atom Processor Core Made FPGA-Synthesizable, in Proc. International Symposium on Field-Programmable Gate Arrays, pp. 209–218, 2009.

16. G. Schelle et al. Intel Nehalem Processor Core Made FPGA Synthesizable, in Proc. International Symposium on Field-Programmable Gate Arrays, pp. 3–12, 2010.

17. S. Mitra, S. Seshia, and N. Nicolici. Post-Silicon Validation Opportunities, Challenges, and Recent Advances, in Proc. Design Automation Conference, pp. 12–17, 2010.

18. B. Heaney. Keynote Address: Designing a 22 nm Intel Architecture Multi-CPU and GPU, in Proc. Design Automation Conference, 2012.

19. K. Balston et al. Post-Silicon Code Coverage for Multiprocessor System-on-Chip Designs, IEEE Transactions on Computers, 62(2), 242–246, 2013.

20. M. Karimibiuki et al. Post-silicon Code Coverage Evaluation with Reduced Area Overhead for Functional Verification of SoC, in Proc. High Level Design Validation and Test Workshop, pp. 92–97, 2011.

21. H. Agrawal. Dominators, Super Blocks, and Program Coverage, in Proc. Symposium on Principles of Programming Languages, pp. 25–34, 1994.

22. U.S. Bureau of Labor Statistics. Occupational Outlook Handbook, Report, 2012.

23. Q. Liu et al. Compiling C-like Languages to FPGA Hardware: Some Novel Approaches Targeting Data Memory Organization, The Computer Journal, 54(1), 1–10, 2011.

24. A. Canis et al. LegUp: High-level Synthesis for FPGA-based Processor/Accelerator Systems, in Proc. International Symposium on Field Programmable Gate Arrays, pp. 33–36, 2011.

25. K. Hemmert et al. Source Level Debugger for the Sea Cucumber Synthesizing Compiler, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 228–237, 2003.

26. Z. Guan et al. A Two-Stage Variation-Aware Placement Method for FPGAs Exploiting Variation Maps Classification, in Proc. International Conference on Field Programmable Logic and Applications, pp. 519–522, 2012.

27. P.Y.K. Cheung. Process Variability and Degradation: New Frontier for Reconfigurable, Reconfigurable Computing: Architectures, Tools and Applications, 5992, 2, 2010.

28. J. Levine et al. Online Measurement of Timing in Circuits: for Health Monitoring and Dynamic Voltage and Frequency Scaling, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 109–116, 2012.

29. J. Levine et al. Health Monitoring of Live Circuits in FPGAs Based on Time-Delay Measurement, in Proc. International Symposium on Field-Programmable Gate Arrays, pp. 284–284, 2011.

30. J.S.J. Wong, P. Sedcole, and P.Y.K. Cheung. Self-Measurement of Combinational Circuit Delays in FPGAs, ACM Transactions on Reconfigurable Technology and Systems, 2(2), 10:1–10:22, 2009.

31. J.S.J. Wong, P.Y.K. Cheung, and P. Sedcole. Combating Process Variation on FPGAs with a Precise Delay At-speed Test Measurement Method, in Proc. International Conference on Field-Programmable Logic and Applications, pp. 703–704, 2008.

32. E. Stott and P.Y.K. Cheung. Improving FPGA Reliability with Wear-Levelling, in Proc. International Conference on Field-Programmable Logic and Applications, pp. 323–328, 2011.


Chapter 13

Enabling Survival Instincts in Electronic Systems: An Energy Perspective

Alex Yakovlev
School of Electrical and Electronic Engineering,
Newcastle University

The writing of this chapter has been inspired by the motivating ideas of incorporating self-awareness into systems, which have been studied by Professor Cheung in connection with dealing with variability and ageing in nano-scale electronics. We attempt here to exploit the opportunities for making systems self-aware and, taking it further, to see them in a biological perspective of survival under harsh operating conditions. Survivability is developed here in the context of the availability of energy and power, where the notion of power-modulation will navigate us towards the incorporation into system design of mechanisms analogous to instincts in the human brain. These mechanisms are considered here through a set of novel techniques for reference-free sensing and elastic memory for data retention. This is only a beginning in the exploration of system design for survival, and many other developments, such as the design of a self-aware communication fabric, are further on the way.

13.1. Introduction

Complex information and communication systems have been studied for a long time. Many approaches and methodologies for their modelling, analysis and design exist to date. Amongst the properties of interest in those studies, a prominent place is occupied by the ability of systems to stay alive and function in spite of harsh environmental conditions that may surround them. Typically such conditions are assumed to generate higher rates of errors, such as those that are, for example, caused by radiation. They are considered mostly in the scope of information processing, and to a lesser extent in the domain of resource availability (for example, the availability of energy, the mother of all resources). While a system may remain fully functional under nominal conditions of energy supply, its behaviour may be highly unpredictable when the energy flow to the system is impaired for one reason or another. Design of systems with varying power modes is a rapidly emerging area of research, and it comes from many different directions; for example, intelligent autonomous systems, systems with energy harvesting, green computing etc. Much of this research is about systems that are sufficiently complex that even their most energy-frugal mode of action still requires a certain stable level of energy flow. What about systems that have to 'live on the poverty line', in conditions where power levels drop to zero, and systems that have to self-recover upon the arrival of the 'first beam of sunlight'?

In this paper we shall look at the first glimpses of, perhaps still naive, approaches to building electronic computer systems whose power sources can be defined in a wide band of modes. Such systems will effectively need survival instincts as part of their intrinsic characteristics. An important element of this new design discipline is a close link between the design methods required for power conditioning and those necessary for computational blocks, as the latter form the load in the overall power chain. This proximity and even interplay of energy and information flows, and the associated holistic nature of system development activities, is what drives us towards a new type of co-design, which involves new methods for modelling, simulation, synthesis and hardware and software implementation. This chapter will address a number of paradigms for such designs, such as power-modulated computing and elastic system design. It will present examples of problems formulated and solutions obtained in the context of research on the new generation of systems with higher self-awareness for survivability. A prominent place in this exploration is taken by what we call reference-free sensing, which allows the system to check its power conditions without relying on external references in voltage or clock.

On-chip sensing is generally a very important area of research in modern times due to the high variability of devices produced in nanometer technologies. Before any piece of fabricated silicon is put into action, it has to be measured and tuned to help its performance best meet its individual characteristics. Ageing is another factor that requires adaptation of functional settings, and voltage and frequency scaling, throughout the lifetime of the system. This has been realised by Professor Peter Cheung and his co-workers at Imperial College, who investigate methods for health monitoring of chips, exploring their individual character and looking for ways of run-time performance optimisation (e.g. [Ref. 1]). In many respects the various built-in self-awareness facilities for adaptation to variations and ageing are similar to those for survival. This interesting relationship and long-term professional friendship with Professor Cheung have inspired the author in writing this chapter for such a wonderful occasion!

Before we start our journey into the subject of this work, it would be pertinent to bring two important quotations:

"The very essence of an instinct is that it is followed independently of reason." 1871: C. Darwin, Descent of Man I. iii. 100.

"The operation of instinct is more sure and simple than that of reason." 1781: E. Gibbon, Decline & Fall (1869) II. xxvi. 10.

We bring these quotations with one purpose: for our study of certain basic functionalities in electronic systems that are retained in conditions of austerity, we need an analogy with biology. The biological world is the realm where survival is a key property of organisms, whether it concerns each organism individually or organisms as a species. As we postulated above, instincts are seen as something which is inherent to survival. Hence the importance of these quotations: they define the place and role of instinct alongside and in comparison with reason, something that is regarded as the highest form of biological activity. Armed with this analogy, we will start looking at the ways that electronic systems can be built where their 'reason' parts operate along with their 'instinct' parts. The outline of topics discussed in this chapter is as follows:

• Bio-inspiration: survival instincts in real life.
• 'Survival instincts' in ICT systems.
• Energy-power modulation and layers of functionality.
• Mechanisms in energy and data processing:
  ∘ Reference-free sensing,
  ∘ Elastic memory for data retention,
  ∘ Elastic power supply for survival.
• Future developments.

13.2. Survival and Instincts in Real Life

So, what are survival and instinct in general terms? Among the many definitions of survival and instinct that can be found in the OED, perhaps the following serve our needs best: "Survival: the continuing to live after some event; remaining alive, living on". "Instinct: (a) an innate propensity in organized beings (esp. in the lower animals), varying with the species, and manifesting itself in acts which appear to be rational, but are performed without conscious design or intentional adaptation of means to ends. Also, the faculty supposed to be involved in this operation (formerly often regarded as a kind of intuitive knowledge). (b) Any faculty acting like animal instinct; intuition; unconscious dexterity or skill".

If we were looking at instincts from a biological or even psychological perspective, we would have distinguished between instinct and intuition. In our present analysis, we will also do that, and see intuition as, perhaps, the highest form of instinct, one that is close to reasoning. It is akin to prediction in information systems, which often connects higher forms such as reasoning with sensory-signalling forms. In our analysis we will not go to the level of the intuition analogy, but rather stay at the level of basic instincts. What's more, we will mostly approach instincts from the perspective of energy in the system, and see how energy or power levels determine the role of instincts, particularly focusing on their manifestation under low-energy conditions.

To get a better sense of how instincts may reveal themselves both structurally and behaviourally, we illustrate them in the following way. Firstly, we bring an example of a 'case study' which shows the energetic aspect of instincts quite vividly. A few years ago, the world heard a story about a French cave explorer, Jean-Luc Josuat, who got lost in a cave and spent five weeks there without food and water before he was found by his rescuers. During this ordeal his first (conscious) reaction was to actively search for food, due to orexin, a hormone produced in the hypothalamus; orexin is normally generated to trigger alertness and all parts of the body to work faster. However, at a later stage, some 'more hardwired' instincts (inherited by humans from more primitive species through evolution) started to prevail in the brain and everything slowed down to ensure survival when energy sources became short. There is a video about this case on YouTube that can be accessed from this website: http://videos.howstuffworks.com/discovery/6835-human-body-built-for-survival-video.htm

Secondly, a good illustration of where instincts rest in humans is provided by Paul MacLean's triune model. The model states that the human brain has three independent (and behaviourally concurrent!) brains, which were developed successively in response to evolutionary needs. They are the reptilian (responsible for survival), the paleomammalian or limbic (responsible for emotions) and the neomammalian or neocortex (responsible for higher-order thinking). The lowest one, the reptilian brain (or R-complex), is the one which is inherited from reptiles. This is where our instincts rest. This brain is active all the time, even in deep sleep. We do not sense this reptilian brain in our consciousness under normal conditions. However, in conditions like those of Jean-Luc Josuat's ordeal, the R-complex takes control of our bodies to help them survive.

So in this chapter, we strongly hypothesise that the manifestation of these different brains is driven by the energy levels in the body, and with this hypothesis we enter the cyber-world and think of electronic systems of the future, with the idea of Darwinian evolution also being transferred to the cyber-world.

13.3. Survival and Survivability in Electronic Systems

Let's now turn our attention to artificial systems, like information systems, and raise two key questions about survival: 'survival from what?' and 'survival of what?' First of all, let's see what sort of 'disasters' we should imagine that the systems would need to survive. We can roughly categorise them into the following three groups:

(1) Faults and degradation inside the system: defects, ageing, transients (inside gates, crosstalk on signal lines, IR drops).
(2) Upsets outside the system: radiation, power supply drops, signal distortions.
(3) Miscellaneous physical effects (both internal and external): temperature fluctuations, electro-magnetic interference.

Now, what aspects of the systems can we consider for survival? They are mainly, but not exclusively: structure, behaviour, and specific (or purposeful) functionality (defined by the system's user, for example).

Combining the sources of impairments and their effects on the system, one would conventionally consider ways in which the system could react to them. Here, the reader might see some relationships, if not similarities, between the property of survivability and the following properties, sufficiently well explored in the ICT domain: tolerance, resilience, recoverability, longevity etc. (It is very tempting to start thinking about even 'more biological' properties such as reproducibility, especially if our notion of survival may one day stretch to thinking about genetics and preservation of species; well, in a few years, with the developments in DNA computing, we may have a chance!) Let's, at least, briefly contrast survivability with two fairly common properties:

• Dependability (fault-tolerance):
Dependable systems typically want to restore their full functionality, hence they incur large costs for redundancy; survivability is supposed to be less resource-demanding, or in other words the system may continue to work even with incomplete power levels.


• Graceful degradation:
Gracefully degrading systems typically have a smooth (often quantitative) reduction in their performance (cf. today people talk about approximate computations and tradeoffs between accuracy and quality of service), rather than 'qualitative' transitions to a more restricted (more critical) set of functionalities as needed for survival.

From these two brief comparisons we can see that the key difference between survivability and other seemingly similar properties lies in the way we approach the energy aspect. We start to talk about survivability when the system's power is variable, intermittent, sporadic etc. Of course, the scale and range of power and energy disruptions would matter here as well, but in our simple approximation, the notion of survivability, similar to biology, refers first of all to the power conditions. For years, ICT systems have been designed to be fault-tolerant, robust and resilient to faults, ageing etc., but they have always been assumed to be fully powered. Of course: otherwise, how can one activate the fault-detection and correction procedures and engage recovery mechanisms?

At this point, however, the reader might actually stop us by saying that survivability has been studied in ICT. Indeed, it has; but conventional survivability in ICT is more about software systems (cf. [Ref. 2]) that make transitions between different services depending on the operating environment.

What we are interested in here is different. It is what we call 'Deep, or Instinct-based, Survival', as opposed to conventional survivability where, again, as it is about software, there is very little scope to think about serious power-related issues, such as power deficiency or interruptions.

So, conventional survivability does not consider deep, embedded layers of hardware/software that work in proportion to the level of available energy/power resources. Thus, Deep Survival is a new concept, inspired by nature, which maintains operation in several structural and behavioural layers, with mechanisms ('instincts') developed and accumulated in bodies due to biological evolution. So, we end this section by postulating that survivability cannot be achieved in a system without providing it with sufficient back-up in the form of instinct. And, as we can see from our quotations of Darwin and Gibbon, we must really talk about an independent layer of activity in the system's structure, so independent that even the ways of powering it are independent of those of the 'reasoning' layers. We will therefore have to first look at how power may modulate the system's functionality, the subject of our next section.


13.4. Power-Modulated Computing and Functionality Layers

In this paper we postulate that the principle of power (energy)-modulated computing [Ref. 3] is fundamental for deep survival. In other words, until and unless we start designing systems in such a way that the incoming power is actually the driver of the functional behaviour, we will not be able to build systems that can survive. Putting it even more strongly: as long as we limit our design approaches to power-efficiency rather than power-modulation, our systems will not be fully survivable. Here are some further arguments in favour of this view.

Any piece of electronics becomes active and performs to a certain level of its delivered quality in response to some level of energy and power. A quantum of energy, when applied to a computational device, can be converted into a corresponding amount of computational activity. Depending on their design and implementation, systems can produce meaningful activity at different power levels. As power levels become uncertain, we cannot always guarantee completely certain computational activity. Good characterisation of power profiles for the system in space and time is important for designing systems for survival. Figure 13.1 illustrates this idea.

At any moment in time we have a determinate trace of the power supply in the past, but the future is indeterminate. The system, thanks to its sensing abilities and initial forms of intuition, can make some localised prediction from every moment at present, and its ability to compute (say, in terms of the rate of activity) will be determined by the actual power levels. This brings us to the link with the recently published ideas of power-proportional computing [Refs. 3 and 4]. Power proportionality, however, has two forms. One, more conventional, form concerns the fact that a system is power-proportional when its power consumption is proportional to its service demand. When systems are driven by service demand they tend to follow the principle of multi-modality, where the system 'consciously' switches between a full-functionality mode and a hibernating mode, primarily depending on the data-processing requirements. Survival aspects here are limited to the ability of mode management.


Fig. 13.1. Power profile in time, its uncertainty, and an illustration of power-modulated computing.

But what if the power level drops? Here we are faced with the second form of power-proportionality, which in our view lends itself to a more general form of survivability. To extend the frontier of survivability, system design should also follow the power-modulation approach, and this leads to structuring the system design along partially or fully independent layers (cf. Darwin's "The very essence of an instinct is that it is followed independently of reason.").

Multiple layers of the system architecture can turn on/off at different power levels (cf. analogies with living organisms' nervous systems, or underwater life, or layers of expensive/cheap labour in most of the resilient economies). As power goes lower, higher layers turn off, while the lower layers ('back up') remain active; this is where instincts take charge!

Fig. 13.2. Layered computational activity in response to power levels.

The more active layers the system has, the more resourceful and capable of surviving it is. This layered view is reflected in Fig. 13.2, which puts it in analogy with the sea layers and the ability of different forms of life to survive in different conditions of sunlight penetration.


Figure 13.3 illustrates the difference between traditional and energy-modulated system design. In the next section we will attempt to present our list of the most basic instincts that a system needs to maintain for survivability.

13.5. Basic Instincts: Self-Awareness and New Sensing

The following categories of instincts can be identified in electronic computer systems to help them be better equipped for survival. The most important is probably energy/power-awareness, i.e. sensing, detection and prediction of power failures. The next one is the ability to store energy 'for a rainy day'. Other instincts involve mechanisms for retaining key data, reactive and optimising mechanisms, and layers of power-driven functionality.

These instincts cannot work without the following basic abilities and associated actions:

• ability to accumulate some energy, initially and at any time after a long interruption, say by charging a passive element;
• ability to switch, e.g. generate events;
• ability to decide, e.g. whether there is an event or not.

Fig. 13.3. Traditional versus energy-modulated design.


These actions underpin two major categories of instinct-supporting mechanisms:

• Mechanisms in the energy and data processing domains:
  — Reference-free self-sensing and monitoring [Refs. 5 and 8],
  — Retention memory for survival [Ref. 9],
  — Elastic power-management for survival [Ref. 10].
• Mechanisms in the communication fabric:
  — Monitoring progress in transactions (link-level failures, deadlock detection) [Refs. 11 and 12],
  — Power noise and thermal monitoring [Ref. 13],
  — Non-blocking communications [Ref. 14].

In this chapter we restrict ourselves to discussing only the first category of mechanisms. An interested reader may find a description of the mechanisms in the second category in our papers [Refs. 11–15]. Our main focus here is on self-awareness, hence sensing is our priority. Sensors must work in changing environments with uncertainty, where constant and reliable references are not available. Traditionally, sensors used in electronic systems are quite heavy: their purpose is to convert some physical form of information into digital form so that it can be processed in the computing system. Normally, this is done with the purpose of digital signal processing, with fairly high requirements for fidelity and signal-to-noise ratio. This leads to sensors with fully fledged A-to-D converters involving accurate voltage or time references supplied from outside. In systems that are autonomous, and in conditions where the aim is to survive, this is not possible. Hence our target is to design an entirely different sort of sensor. In this chapter we focus on so-called reference-free sensors, where we will consider the following options:

• Sensing by charge-to-digital conversion;
• Sensing by differentiators in delays;
• Sensing by crossing characteristic mode boundaries such as oscillations;
• Sensing by measuring metastability rates.

All of these sensors have some digital parts whose behaviour is modulated by the voltage that they sense, and this voltage is connected to the power terminal of the digital part. In this way these sensors are inherently power-modulated. We shall now describe some of these sensors.


13.5.1. Sensing by charge-to-digital conversion

This method involves sampling the input signal into a capacitor in the form of its electric charge, and then discharging the capacitor in such a way that its charge is converted to a digital code. Basically, it is inspired by the challenge of building a sensor that is powered by the energy of the sensed signal itself. So, the principle of operation of such a sensor is that the energy sampled in the capacitor as charge is proportional to the sensed voltage. It is then discharged through some load registering the quantity of energy (just like in a water wheel!). For such a load we can use a self-timed counter, as shown in Fig. 13.4. The bottom part of the figure shows the voltage on the capacitor as a function of time. We have investigated this relationship and found that it is subject to the complex behaviour of the switching gates in the counter, which is defined by the characteristics of their constituent transistors in different modes and mechanisms, including superthreshold, subthreshold, leakage etc. Under reasonable approximations, the analytical characteristic of voltage versus time is a hyperbola rather than an exponential while the transistors operate in superthreshold mode [Ref. 15].

Fig. 13.4. Charge-to-digital conversion principle.
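
The counting principle can be illustrated numerically. The model below is a deliberately crude assumption of mine, not the chip's measured behaviour: each counter transition is taken to draw roughly one gate capacitance's worth of charge, Cgate·V, from the sampling capacitor, and counting stops when the voltage reaches the internal reference Vd discussed below.

```python
def charge_to_code(vin, vd=0.6, c_big=10e-9, c_gate=20e-12):
    """Toy model: count how many counter transitions the stored charge funds.

    Each transition removes roughly c_gate * V of charge from the sampling
    capacitor (assumption), so V decays geometrically and the latched code
    grows with the sensed voltage Vin.
    """
    if vin <= vd:
        return 0                       # not enough energy to reach the reference
    v, code = vin, 0
    while v > vd:
        v -= (c_gate / c_big) * v      # charge drawn by one counter transition
        code += 1
    return code

for vin in (0.7, 0.9, 1.1, 1.3, 1.5, 1.8):
    print(f"Vin = {vin:.1f} V -> code = {charge_to_code(vin)}")
```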

Let's now discuss the reference-free issue in this method. In the absence of external voltage and time references, we still need to control time in order to decide when to stop the discharging process while the level of voltage in the counter is sufficiently high, so that the code stored in the counter can be recorded before the counter stops counting. We should stop counting after the same time irrespective of Vin, i.e. with a constant sensing/conversion delay.

However, this 'same time' implies a timing reference or some clock. Hence we need to produce a voltage level Vd that is a constant reference. Vd could be based on some internal constant such as the threshold of a transistor (similar to the idea of a bandgap).

Fig. 13.5. Sensor control and its internal reference generator, the timing controlled by RG, and the measured code vs input voltage (data from the fabricated 180 nm chip).

The circuit shown in Fig. 13.5 illustrates how the control circuit and internal reference generator can be built. The waveform in this figure shows that the event of crossing the second threshold corresponds to stopping the counting and latching the code from the counter. We have designed and fabricated a sensor chip in 180 nm TSMC via Europractice. We connected the chip to a 10 nF sampling capacitor and tested the sensor; the results are plotted in the above figure.

This experiment has shown the feasibility of building a sensor that is powered by the signal it senses and that is reference-free. In the following sections we will show ideas for building sensors that can be used in highly variable conditions. We have not yet brought them to the same level of experimental implementation as the above sensor, but there are plans to do so.

13.5.2. Sensing by delay differentiators

The idea of sensing using delay differentiators is as follows. We need to design two circuits which can operate in a range of voltages of our interest. The circuits, however, must have their delays scaled differently with the supplied voltage, as shown in Fig. 13.6 (left-hand side). If this is the case, then the difference between these delays will represent some characteristic form of dependence on the supply voltage (ideally, proportional in some critical range of interest). The right-hand side of Fig. 13.6 shows that the digital value of the measured voltage can be obtained by measuring the time when Circuit 1 finishes against Circuit 2. We thus need a mechanism for registering where the signal is in Circuit 2 when Circuit 1 finishes.

For example, we have observed such a difference (mismatch) in delay scaling between SRAM cells and logic gates, as shown in Fig. 13.7. This mismatch rapidly increases (in terms of the number of inverters needed to match the delay of the SRAM cell, which acts as Circuit 1) when Vdd drops below 0.7 V (for 90 nm technology). Replacing the line of inverters with a self-timed counter (similar to the one used in the charge-to-code converter) acting as Circuit 2, started together with the SRAM cell and stopped when the reading (or writing) of the cell finishes (we used a self-timed SRAM with explicit completion detection), allows us to register a binary code for the delay difference. This is shown in Fig. 13.8 on the basis of Spice simulations for a 90 nm technology node. Although the linearity of this sensor is quite limited, it can still be used for the kind of condition monitoring we are interested in, and, importantly, it is completely reference-free.
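A minimal behavioural sketch of this idea follows; the two delay laws are illustrative stand-ins for the SRAM and inverter curves of Fig. 13.7 (not fitted data), chosen only so that Circuit 1 slows down faster than Circuit 2 as Vdd drops.

    import math

    # Sketch of delay-differentiator sensing (illustrative model).
    # The counter value captured when the SRAM access (Circuit 1)
    # completes encodes Vdd. Both delay laws are assumptions.

    def sram_delay(vdd: float) -> float:
        return 1e-9 * math.exp(1.4 / vdd)     # steep slowdown at low Vdd

    def counter_period(vdd: float) -> float:
        return 0.1e-9 * math.exp(0.7 / vdd)   # milder gate-delay scaling

    def sense(vdd: float) -> int:
        """Counter ticks accumulated while the SRAM access is in flight."""
        return int(sram_delay(vdd) // counter_period(vdd))

    for vdd in (0.5, 0.6, 0.7, 0.9, 1.1):
        print(f"Vdd = {vdd:.1f} V -> code = {sense(vdd)}")

As in Fig. 13.8, the code is largest at low Vdd and flattens as the supply rises, so the sensor is most sensitive exactly where condition monitoring needs it.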


Fig. 13.6. Principle of delay-difference based sensing.

Fig. 13.7. Mismatch between inverter chain and memory cell delay (90 nm technology).

13.5.3. Sensing by oscillation detection

It is often the case that we need to sense a voltage, say the power supply, only to the point where it crosses a certain level, for example, the level at which some ‘reasoning’ parts of the system can no longer be trusted. This kind of sensing can be done with a circuit which changes its operating mode, for example, from stable to oscillatory. An example of such a threshold-crossing oscillator is shown in Fig. 13.9. It consists of two stages, each containing a pair of forward (F) inverters and a pair of cross-coupled (CC) inverters. The circuit has two operating modes: oscillation and latching/locking. When the supply voltage Vdd drops below a certain level Vthr, the circuit oscillates, as shown in Fig. 13.10.


Fig. 13.8. Voltage sensing result using memory-logic mismatch.

Fig. 13.9. Voltage-modulated oscillator.

The specific value of Vthr can be defined at design time by setting the ratio between the transistor sizes.


Fig. 13.10. Voltage-modulated oscillation.

The effect of this ratio on Vthr is shown in Fig. 13.11. The overall setup for detecting an event of Vthr crossing via oscillation is shown in Fig. 13.12. It uses a self-timed counter, initially reset to zero, which introduces some counting delay until the most significant bit is set to 1, guaranteeing that the oscillations are stable before the event is reported. As before, it is easy to see that this method of sensing is free from external references: the behaviour is completely determined by the internal characteristics of the devices.
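The following sketch models this detection setup under assumed behaviour: the oscillator produces edges only while Vdd is below Vthr, and the event is reported only once the counter's most significant bit is set, which filters out short excursions. The threshold, the counter width and the resetting of the counter when oscillation ceases are all illustrative assumptions.

    # Sketch of oscillation-based threshold detection (cf. Fig. 13.12).
    # The counter MSB acts as a stability filter: only sustained
    # oscillation (at least 2**MSB edges) signals a genuine crossing.

    V_THR = 0.55   # threshold set at design time by transistor sizing
    MSB = 3        # counter width proxy: 8 edges needed for a detection

    def detect(vdd_trace) -> bool:
        count = 0                      # self-timed counter, reset to zero
        for vdd in vdd_trace:
            if vdd < V_THR:
                count += 1             # oscillator running: one more edge
            else:
                count = 0              # oscillation stops (assumed reset)
            if count >> MSB:           # MSB set: oscillation is stable
                return True
        return False

    print(detect([0.9, 0.8, 0.54, 0.9]))   # brief dip only: False
    print(detect([0.9] + [0.5] * 12))      # sustained undervoltage: True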


Fig. 13.11. Transistor size ratio vs Vdd at which the circuit oscillates.

Fig. 13.12. Setup for oscillation-based sensing.

13.5.4. Sensing by measuring metastability rates

Finally, we present another technique for voltage sensing (it is also applicable to temperature sensing). It is based on the use of metastability in bistable devices. Metastability offers a nice way of removing external references from a voltage or temperature sensor. When the setup and hold time conditions of a flip-flop are not met, the flip-flop may become metastable. A metastable flip-flop takes extra time to decide whether to go logic high or low (decision time = clock-to-Q delay). The ‘decision making’ time constant τ is a function of Vdd. So the idea of the method is to use the time constant τ to quantify Vdd. What we need to do is to count the rate at which the flip-flop fails to decide!

Fig. 13.13. Circuit for metastability rate measurement.

The sensor circuit shown in Fig. 13.13 works as follows. Firstly, the left-most flip-flop (call it FF1) often becomes metastable because its input is asynchronous. Secondly, when FF1’s output is delayed, the early and late samples of FF1’s output (captured at the following falling and rising edges respectively) will differ. Finally, the counter counts these instances. Its output after a fixed period of time is an exponential function of the time constant τ, which is determined by the sensed parameter. The advantages of this method are that it is purely digital, very compact, and offers sufficiently high precision. We proved the concept on an FPGA (Altera Cyclone II), and the results are shown in Fig. 13.14 (in a semi-logarithmic plot).
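A statistical sketch of the measurement principle follows, using the standard exponential model of metastability resolution, Prob(resolution time > t) = exp(-t/τ), together with an assumed dependence of τ on Vdd; the constants are illustrative, not the Cyclone II figures.

    import math, random

    # Sketch of metastability-rate sensing (cf. Fig. 13.13).
    # Each cycle FF1 goes metastable with probability P_META and then
    # fails to resolve within the early/late sampling window W with
    # probability exp(-W/tau), making the two samples disagree and the
    # counter increment. tau(Vdd) and all constants are assumptions.

    P_META = 0.05        # chance per cycle that FF1 enters metastability
    W = 0.4e-9           # window between early and late samples (assumed)
    CYCLES = 1_000_000   # fixed measurement period, in clock cycles

    def tau(vdd: float) -> float:
        return 50e-12 * math.exp(1.0 - vdd)   # slower resolution at low Vdd

    def measure(vdd: float, rng=random.Random(1)) -> int:
        p_fail = P_META * math.exp(-W / tau(vdd))
        return sum(rng.random() < p_fail for _ in range(CYCLES))

    for vdd in (0.8, 0.9, 1.0, 1.1, 1.2):
        print(f"Vdd = {vdd:.1f} V -> count = {measure(vdd)}")

The count varies over orders of magnitude as Vdd changes, which is why results of this kind are best viewed on a semi-logarithmic plot, as in Fig. 13.14.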

13.6. Elastic Memory for Data Retention in Instincts

We now illustrate a way of designing retention storage (SRAM) for survival. We call it elastic because it is completely self-timed and operates correctly over a wide range of supply voltages, both stable and time-varying. This SRAM can be built around different types of cells; for example, we have designs for 6T and 10T cells. One can use a 6T solution for energy efficiency and a 10T solution for core-function survivability. One can build control for such an SRAM array with different types of completion detection, again depending on the need to mitigate variation between columns. For example, a version with more economical completion detection (data bundling) is shown in Fig. 13.15.


Fig. 13.14. Voltage sensing results for the FPGA prototype.

A speed-independent control circuit for the SRAM is shown in Fig. 13.16. The timing diagram in Fig. 13.17 shows the simulation trace for the SRAM performing a data write with a time-varying Vdd supply. It is easy to notice that the response time with which the memory sends the Wack signal is modulated by Vdd (for smaller Vdd the delay between Wreq and Wack is longer).
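This request/acknowledge elasticity can be conveyed with a small handshake model; the delay law below is an assumed exponential, chosen only so that Wack arrives later when the supply sags (a behavioural sketch, not the control circuit of Fig. 13.16).

    import math

    # Behavioural sketch of the elastic write handshake (illustrative).
    # The memory acknowledges (Wack) only when the write has genuinely
    # completed; completion takes longer at lower Vdd, so the handshake
    # stretches instead of failing. The delay law is an assumption.

    def write_delay(vdd: float) -> float:
        """Wreq-to-Wack latency in ns (assumed exponential scaling)."""
        return 2.0 * math.exp(2.5 * (1.0 - vdd))

    t = 0.0
    for i, vdd in enumerate([1.2, 1.0, 0.8, 0.9, 1.1]):  # wandering supply
        d = write_delay(vdd)
        print(f"write {i}: Wreq at {t:6.1f} ns, Vdd = {vdd:.1f} V, "
              f"Wack after {d:5.1f} ns")
        t += d + 1.0      # the next request is issued 1 ns after the ack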

For the 6T case we have built an ASIC prototype to prove the concept. The layout of the die is shown in Fig. 13.18. The chip was successfully tested, and one of the traces of the Wack signal captured by an oscilloscope clearly shows the switching behaviour modulated by Vdd (one can observe Vdd changing in a quick-charge, slow-discharge shape).


Fig. 13.15. SRAM array with data bundling.

Fig. 13.16. Speed-independent control circuit for SRAM.


Fig. 13.17. Write simulation for time-varying Vdd.

Fig. 13.18. SRAM chip layout (UMC 90 nm, Europractice) and control signal trace for varying Vdd.

While testing the chip we discovered interesting effects in the self-timed SRAM which confirm its time elasticity and its useful properties for survival. Although in simulation we saw the circuit working down to a level of 190 mV, the real silicon showed that the SRAM worked steadily for Vdd above 0.75 V, after which its control logic ‘froze’ in either its setting or its resetting phase. This can be observed in the trace of Fig. 13.18, where the Wack signal gets ‘stuck’ in either the low or the high state. Interestingly, due to the speed-independent nature of the circuit, it smoothly recovers from the ‘frozen’ state as soon as Vdd returns above 0.75 V. What is important is that while Vdd is below 0.75 V the data is safely retained in the SRAM (this was checked during the testing process); the data is retained as long as Vdd stays above 0.4 V.

The above behaviour shows that a fully speed-independent SRAM is excellent as retention storage for survival in power-deficient regimes. By ‘freezing’ it provides self-detection of the power condition, an early warning that arrives well before the system starts to lose its data.

13.7. Retaining Energy: Elastic Power Management for Instincts

The design of ICT systems destined for survival will increasingly be more holistic: it will have to take care not only of the data-processing parts, such as sensing and computing electronics, but also of the power supply electronics. We are exploring new ideas in this direction. They involve a more active use of switched-capacitor circuits for DC/DC conversion. Conventional switched-capacitor DC/DC converters (SCCs) convert a constant input Vdd to a constant output Vdd according to a set of ratios. However, SCCs usually rely on the availability of stable time and voltage references, and under harsh operating conditions such references may not be available. Hence we are developing a different type of switched-capacitor circuit that is aware of the presence of self-timed circuits as its load. We call these capacitor bank blocks (CBBs). We have also designed hybrid blocks that can work as SCCs or as CBBs depending on the conditions and on whether the load electronics is synchronous or asynchronous. Details of this method can be found in [Ref. 10].
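For intuition about the conventional baseline, an ideal SCC delivers roughly ratio × Vin for a discrete set of topology ratios; the toy selector below picks the smallest ratio whose ideal output still meets a target (the ratio set and droop figure are assumptions, and the reference-driven target is exactly what a CBB, being load-driven, dispenses with).

    from fractions import Fraction

    # Toy model of a switched-capacitor converter's ratio selection.
    # An ideal SCC outputs ratio * Vin minus some load-dependent droop;
    # we pick the smallest ratio that still meets the target voltage.
    # The ratio set, droop and target are illustrative assumptions.

    RATIOS = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(1, 1)]
    DROOP = 0.05   # volts lost to output impedance under load (assumed)

    def pick_ratio(v_in: float, v_target: float):
        for r in RATIOS:              # smallest adequate ratio wastes least
            if float(r) * v_in - DROOP >= v_target:
                return r
        return None                   # even 1:1 cannot reach the target

    for v_in in (3.0, 2.0, 1.2):
        print(f"Vin = {v_in:.1f} V -> ratio = {pick_ratio(v_in, 0.9)}")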

13.8. Conclusions and Outlook

As stated in the abstract and the introduction, this chapter was inspired by the ideas of incorporating self-awareness into systems, which have been studied by Professor Cheung in the context of improving the performance of electronic systems under process variations and ageing. We take self-awareness further and, with the help of a biological analogy, consider survival instincts. The chapter has focused almost exclusively on techniques and examples of circuits for survivability that support an ‘instinct layer’, which is supposed to remain alive and operational under conditions of power instability and lack of power.

We are currently involved in an EPSRC-funded project, ‘Staying alive in variable, intermittent, low-power environments’ (SAVVIE), in collaboration with Dr Bernard Stark of the University of Bristol. The project’s main aim is to develop techniques for enabling systems to survive in the top left corner of the energy-power state space depicted in Fig. 13.19. While there exist methods that support trajectories like T1 and T2 in this state space, approaches that cater for trajectories such as T3 and T4 are in their infancy. We hope that the ideas described in this chapter will contribute to this aim.

A list of ongoing research directions we are currently pursuing:

• More diversification: power and data processing paths intertwined, mixed digital and analogue fabrics, synchronous and asynchronous fabrics, multiple technology fabrics.

• New modelling and design approaches: models that capture multi-modal and multi-layer architectures; combining structure and behaviour in models; capturing overlay in functionality.

Fig. 13.19. Energy-power state space in the SAVVIE project (courtesy of B. Stark).

There is plenty to investigate on this path, and the research is already under way at Newcastle and in collaboration with our partners from Southampton, Imperial and Manchester under the programme grant PRiME, which will explore energy-reliability tradeoffs in designing future many-core embedded systems.


Acknowledgments

This chapter would not have been written without the combined effort of the author’s research team, the Microelectronics Systems Design group at Newcastle. The list of people actively working in this area can be found on the group’s web page: http://async.org.uk.

Our collaboration with Dr Bernard Stark’s team at Bristol in the SAVVIE project, and our recent collaboration with the Universities of Southampton and Bristol and with Imperial College London under the Holistic project (http://www.holistic.ecs.soton.ac.uk/), are acknowledged with deep gratitude.

References

1. J. M. Levine et al. Online Measurement of Timing in Circuits: for Health Monitoring and Dynamic Voltage & Frequency Scaling, in Proc. International Symposium on Field-Programmable Custom Computing Machines, pp. 109–116, 2012.

2. J. C. Knight and E. A. Strunk. Achieving Critical System Survivability through Software Architectures, in Proc. Architecting Dependable Systems II, pp. 51–78, 2004.

3. A. Yakovlev. Energy-Modulated Computing, in Proc. Design, Automation & Test in Europe Conference & Exhibition, pp. 1340–1345, 2011.

4. R. Ramezani et al. Energy-modulated Quality of Service: New Scheduling Approach, in Proc. Faible Tension Faible Consommation, pp. 1–4, 2012.

5. R. Ramezani et al. Voltage Sensing Using an Asynchronous Charge-to-Digital Converter for Energy-Autonomous Environments, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 3(1), 35–44, 2013.

6. D. Shang, F. Xia and A. Yakovlev. Wide-Range, Reference Free, On-chip Voltage Sensor for Variable Vdd Operations, in Proc. International Symposium on Circuits and Systems, pp. 37–40, 2013.

7. G. Tarawneh, T. Mak and A. Yakovlev. Intra-chip Physical Parameter Sensor for FPGAs using Flip-Flop Metastability, in Proc. International Conference on Field Programmable Logic and Applications, pp. 112–119, 2012.

8. I. Syranidis, F. Xia and A. Yakovlev. A Reference-free Voltage Sensing Method Based on Transient Mode Switching, in Proc. Conference on Ph.D. Research in Microelectronics and Electronics, pp. 1–4, 2012.

9. A. Baz et al. Self-timed SRAM for Energy Harvesting Systems, Journal of Low Power Electronics, 7(2), 274–284, 2011.

10. X. Zhang et al. A Hybrid Power Delivery Method for Asynchronous Loads in Energy Harvesting Systems, in Proc. International NEWCAS Conference, pp. 413–416, 2012.

11. L. Dai et al. Monitoring Circuit Based on Threshold for Fault-tolerant NoC, Electronics Letters, 46(14), 984–985, 2010.

12. R. Al-Dujaily et al. Embedded Transitive Closure Network for Runtime Deadlock Detection in Networks-on-Chip, IEEE Transactions on Parallel and Distributed Systems, 23(7), 1205–1215, 2012.

13. N. Dahir et al. Minimizing Power Supply Noise through Harmonic Mapping in Networks-on-Chip, in Proc. International Conference on Hardware/Software Codesign and System Synthesis, pp. 113–122, 2012.

14. F. Xia et al. Data Communication in Systems with Heterogeneous Timing, IEEE Micro, 22(6), 58–69, 2002.

15. R. Ramezani and A. Yakovlev. Capacitor Discharging through Asynchronous Circuit Switching, in Proc. International Symposium on Asynchronous Circuits and Systems, pp. 16–22, 2013.
