NARC: Network-Attached Reconfigurable Computing for High-performance, Network-based Applications

Chris Conger, Ian Troxel, Daniel Espinosa, Vikas Aggarwal, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida


  • #233 MAPLD 2005, Conger, slide 1

    NARC: Network-Attached Reconfigurable Computing for High-performance, Network-based Applications

    Chris Conger, Ian Troxel, Daniel Espinosa, Vikas Aggarwal, and Alan D. George

    High-performance Computing and Simulation (HCS) Research Lab, Department of Electrical and Computer Engineering

    University of Florida

  • #233 MAPLD 2005, Conger, slide 2

    Outline

    Introduction
    NARC Board Architecture, Protocols
    Case Study Applications
    Experimental Setup
    Results and Analysis
    Pitfalls and Lessons Learned
    Conclusions
    Future Work

  • #233 MAPLD 2005, Conger, slide 3

    Introduction

    Network-Attached Reconfigurable Computer (NARC) Project

    Inspiration: network-attached storage (NAS) devices
    Core concept: investigate challenges and alternatives for enabling direct network access to and control over reconfigurable computing (RC) devices
    Method: prototype a hardware interface and software infrastructure, and demonstrate a proof of concept for the benefits of network-attached RC resources

    Motivations for the NARC project include (but are not limited to) applications such as:

    Network-accessible processing resources
    Generic network RC resource, a viable alternative to server and supercomputer solutions
    Power and cost savings over server-based FPGA cards are key benefits

    No server needed to host the RC device
    Infrastructure provided for robust operation and interfacing with users

    A performance increase over existing RC solutions is not a primary goal of this approach
    Network monitoring and packet analysis

    Easy attachment; unobtrusive, fast traffic gathering and processing
    Network intrusion and attack detection, performance monitoring, active traffic injection
    Direct network connection of the FPGA can enable wire-speed processing of network traffic

    Aircraft and advanced munitions systems
    A standard Ethernet interface eases the addition and integration of RC devices in aircraft and munitions systems
    Low weight and power are also attractive characteristics of the NARC device for such applications

  • #233 MAPLD 2005, Conger, slide 4

    Envisioned Applications

    Aerospace & military applications

    Modular, low-power design lends itself well to military craft and munitions deployment
    FPGAs providing high-performance radar, sonar, and other computational capabilities

    Scientific field operations
    Quickly provide first-level estimates in the field for geologists, biologists, etc.

    Field-deployable covert operations
    Completely wireless device enabled through battery and WLAN
    Passive network monitoring applications
    Active network traffic injection

    Distributed computing
    Cost-effective, RC-enabled clusters or cluster resources
    Cluster NARC devices at a fraction of the cost, power, and cooling

    Cost-effective intelligent sensor networks
    Use FPGAs in close conjunction with sensors to provide pre-processing functions before network transmission

    High-performance network technologies
    Fast Ethernet may be replaced by any network technology
    Gig-E, InfiniBand, RapidIO, proprietary communication protocols

  • #233 MAPLD 2005, Conger, slide 5

    NARC Board Architecture: Hardware

    ARM9 network control with FPGA processing power (see Figure 1)

    Prototype design consists of two boards, connected via cable:
    Network interface board (ARM9 processor + peripherals)
    Xilinx development board(s) (FPGA)

    Network interface peripherals include:
    Layer-2 network connection (hardware PHY+MAC)
    External memory, SDRAM and Flash
    Serial port (debug communication link)
    FPGA control and data lines

    NARC hardware specifications:
    ARM-core microcontroller, 1.8V core, 3.3V peripheral

    32-bit RISC, 5-stage pipeline, in-order execution
    16KB data cache, 16KB instruction cache
    Core clock speed 180MHz, peripheral clock 60MHz
    On-chip Ethernet MAC layer with DMA

    External memory, 3.3V
    32MB SDRAM, 32-bit data bus
    2MB Flash, 16-bit data bus
    Port available for additional 16-bit SRAM devices

    Ethernet transceiver, 3.3V
    DM9161 PHY-layer transceiver
    100Mbps, full-duplex capable
    RMII interface to MAC

    Figure 1 – Block diagram of NARC device

  • #233 MAPLD 2005, Conger, slide 6

    NARC Board Architecture: Software

    ARM processor runs Linux kernel 2.4.19

    Provides TCP(UDP)/IP stack, resource management, threaded execution, and a Berkeley Sockets interface for applications
    Configured and compiled with drivers specifically for our board

    Applications written in C, compiled using the GCC cross-compiler for ARM (see Figure 2)
    NARC API: low-level driver function library for basic services (a hypothetical sketch follows the list below)

    Initialize and configure on-chip peripherals of the ARM-core processor
    Configure the FPGA (SelectMAP protocol)
    Transfer data to/from the FPGA, manipulate control lines
    Monitor and initiate network traffic
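    The slides do not give exact API signatures, so the header below is only a hypothetical sketch of what such a driver library might expose; every name here (narc_*) is invented for illustration.

    /* narc_api.h - hypothetical sketch of the NARC low-level driver API.
     * Function names and signatures are illustrative assumptions, not the
     * actual library described in the slides. */
    #ifndef NARC_API_H
    #define NARC_API_H

    #include <stddef.h>
    #include <stdint.h>

    int narc_init(void);                                 /* clocks, GPIO, MAC   */
    int narc_fpga_configure(const uint8_t *bitfile,
                            size_t len);                 /* SelectMAP emulation */
    int narc_fpga_write(const uint8_t *buf, size_t len); /* handshaked output   */
    int narc_fpga_read(uint8_t *buf, size_t len);        /* handshaked input    */

    #endif /* NARC_API_H */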

    NARC protocol for job exchange (from a remote workstation)
    The NARC board application and client application must follow standard rules and procedures for responding to user requests
    The user appends a small header onto the data (if any) containing information about the request before sending it over the network (see Figure 3 and the sketch below)
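    Figure 3's exact field layout is not reproduced in this text, so the fragment below is a minimal sketch of how a client might frame a request, assuming only a request-type field (RTYPE, as used on slide 7) and a payload length; the struct layout and helper are illustrative, not the actual protocol definition.

    /* Hypothetical client-side framing for the NARC job-exchange protocol.
     * Only RTYPE is documented (e.g., RTYPE = 0x01 requests FPGA
     * configuration); the rest of the layout is assumed for illustration. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>

    struct narc_request_hdr {
        uint8_t  rtype;    /* request type, e.g. 0x01 = configure FPGA */
        uint32_t length;   /* payload length in bytes, network order   */
    } __attribute__((packed));

    /* Send header followed by payload over a connected TCP socket. */
    static int narc_send_request(int sock, uint8_t rtype,
                                 const void *payload, uint32_t len)
    {
        struct narc_request_hdr hdr = { rtype, htonl(len) };
        if (send(sock, &hdr, sizeof hdr, 0) != (ssize_t)sizeof hdr)
            return -1;
        if (len > 0 && send(sock, payload, len, 0) != (ssize_t)len)
            return -1;
        return 0;
    }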

    Bootstrap software resides in on-board Flash and automatically loads and executes on power-up

    Configures clocks, memory controllers, I/O pins, etc.
    Contacts a TFTP server running on the network, downloads Linux and a ramdisk
    Boots Linux, then automatically executes the NARC board software contained in the ramdisk

    Optional serial interface through HyperTerminal for debugging/development

    Figure 3 – Request header field definitions

    Figure 2 – Software development process: main.c, util.c, and narc.h are built with arm-linux-gcc via a Makefile into the NARC board application, which is packaged with the Linux kernel into the RAMDISK for the NARC board; client.c is compiled with gcc into the client application on the user workstation

  • #233 MAPLD 2005, Conger, slide 7

    NARC Board Architecture: FPGA Interface

    Data communicated to/from the FPGA by means of unidirectional data paths

    8-bit input port, 8-bit output port, 8 control lines (Figure 4)
    Control lines manage data transfer and also drive configuration signals
    Data transferred one byte at a time; full-duplex communication possible
    Control lines include the following signals (a handshake sketch follows the list):

    Clock – software-generated signal to clock data on the data ports
    Reset – reset signal for the interface logic in the FPGA
    Ready – signal indicating the device is ready to accept another byte of data
    Valid – signal indicating the device has placed valid data on its port
    SelectMAP – all signals necessary to drive SelectMAP configuration
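    As a rough illustration of this handshake, the loop below bit-bangs one byte from the ARM to the FPGA using the a_valid/f_ready pairing shown in Figure 4; the gpio_* helpers and the pin constants are invented placeholders for the board's actual GPIO register accesses.

    /* Sketch of the software side of one ARM -> FPGA byte transfer.
     * gpio_read(), gpio_write(), gpio_write_bus(), and the pin names
     * are hypothetical placeholders, not a real driver interface. */
    static void fpga_send_byte(uint8_t byte)
    {
        while (!gpio_read(PIN_F_READY))     /* wait: FPGA ready for data */
            ;
        gpio_write_bus(PINS_OUT0_7, byte);  /* drive the 8-bit data port */
        gpio_write(PIN_A_VALID, 1);         /* flag the data as valid    */
        gpio_write(PIN_CLOCK, 1);           /* software-generated clock  */
        gpio_write(PIN_CLOCK, 0);
        gpio_write(PIN_A_VALID, 0);         /* complete the handshake    */
    }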

    Figure 4 – FPGA interface signal diagram: the ARM and FPGA each drive an 8-bit Out[0:7] port into the other's In[0:7] port, handshaking with the a_valid/f_ready and f_valid/a_ready signal pairs plus shared clock and reset; the SelectMAP port (D[0:7] with PROG, INIT, CS, WRITE, DONE) is also routed between the two devices

    FPGA configuration through the SelectMAP protocol
    Fastest configuration option for Xilinx FPGAs; the protocol is emulated using GPIO pins of the ARM (see the sketch at the end of this slide)
    NARC board enables remote configuration and management of the FPGA

    User submits a configuration request (RTYPE = 01), along with a bitfile and a function descriptor
    The function descriptor is an ASCII string: a formatted list of functions with associated RTYPE definitions
    The ARM halts and configures the FPGA, and stores the descriptor in a dedicated RAM buffer for user queries

    All FPGA designs must restrict use of the SelectMAP pins after configuration
    Some signals are shared between the SelectMAP port and the FPGA-ARM link
    Once configured, SelectMAP pins must remain tri-stated and unused
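    A minimal sketch of the emulated SelectMAP write, using the same hypothetical gpio_* placeholders as above; the real sequence also needs timeouts and INIT-based error checks, omitted here for brevity.

    /* Bit-banged SelectMAP configuration over ARM GPIO (sketch).
     * PROG, INIT, CS, and WRITE are active-low on Xilinx parts. */
    static int fpga_selectmap_config(const uint8_t *bitfile, size_t len)
    {
        gpio_write(PIN_PROG, 0);             /* pulse PROG_B: clear config */
        gpio_write(PIN_PROG, 1);
        while (!gpio_read(PIN_INIT))         /* wait for memory clear      */
            ;
        gpio_write(PIN_CS, 0);               /* select device, write mode  */
        gpio_write(PIN_WRITE, 0);
        for (size_t i = 0; i < len; i++) {   /* one byte per clock edge    */
            gpio_write_bus(PINS_D0_7, bitfile[i]);
            gpio_write(PIN_CCLK, 1);
            gpio_write(PIN_CCLK, 0);
        }
        gpio_write(PIN_CS, 1);
        gpio_write(PIN_WRITE, 1);
        for (int i = 0; i < 64 && !gpio_read(PIN_DONE); i++) {
            gpio_write(PIN_CCLK, 1);         /* extra clocks for startup   */
            gpio_write(PIN_CCLK, 0);
        }
        return gpio_read(PIN_DONE) ? 0 : -1; /* DONE high = configured     */
    }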

  • #233 MAPLD 2005, Conger, slide 8

    Results and Analysis: Raw Performance

    FPGA interface I/O throughput (Table 1)

    1KB of data transferred over the link, timed
    Measured using hardware methods

    Logic analyzer used to capture the raw link data rate: divide the data sent by the time from first clock edge to last clock edge (see Figure 9)

    Performance lower than desired for a prototype
    Handshake protocol may add unnecessary overhead

    Widening the data paths and optimizing the software routine will significantly improve FPGA I/O performance

    Network throughput (Table 2)
    Measured using the Linux network benchmark IPerf

    NARC board located on an arbitrary switch within the network; the application partner is a user workstation
    IPerf transfers as much data as possible in 10 seconds and calculates throughput as the data sent divided by 10 seconds

    Performed two experiments, with the NARC board serving as client in one run and as server in the other
    Both local and remote IPerf partners used (remote location ~400 miles away, at Florida State University)
    Network interface achieves reasonably good bandwidth efficiency (75.4 Mb/s locally vs. 78.9 Mb/s for the server-to-server baseline)

    External memory throughput (Table 3)
    4KB transferred to external SDRAM, both read and write
    Measurements again taken using the logic analyzer
    Memory throughput sufficient to provide wire-speed buffering of network traffic

    On-chip Ethernet MAC has DMA to this SDRAM
    Should help alleviate the I/O bottleneck between ARM and FPGA

    Table 1 – FPGA interface I/O performance

    Mb/s            Input   Output
    Logic Analyzer  6.08    6.12

    Table 2 – Network throughput

    Mb/s           Local Network   Remote Network (WAN)
    NARC-Server    75.4            4.9
    Server-Server  78.9            5.3

    Table 3 – External SDRAM throughput

    Mb/s            Read    Write
    Logic Analyzer  183.2   160

    Figure 9 – Logic analyzer timing

  • #233 MAPLD 2005, Conger, slide 9

    Results and Analysis: Raw Performance

    Reconfiguration speed
    Includes time to transfer the bitfile over the network, plus time to configure the device (transfer the bitfile from ARM to FPGA), plus time to receive an acknowledgement
    Our design currently completes a user-initiated reconfiguration request with a 1.2MB bitfile in 2.35 sec
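    For context, these two quoted numbers imply an effective end-to-end configuration rate of roughly (1.2 MB × 8 bits/byte) / 2.35 s ≈ 4.1 Mb/s, covering both the network transfer and the emulated SelectMAP write.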

    Area/resource usage of a minimal wrapper for the Virtex-II Pro FPGA
    Statistics on resource requirements for a minimal design providing the required link control and data transfer in an application wrapper are presented below:

    Design implemented on an older Virtex-II Pro FPGA
    Numbers below indicate requirements for the wrapper only; unused resources remain available for user applications
    Extremely small footprint!
    Footprint will be even smaller on a larger FPGA

    Device utilization summary:
    --------------------------------------------------------
    Selected Device : 2vp20ff1152-5

    Number of Slices:            143 out of  9280    1%
    Number of Slice Flip Flops:  120 out of 18560    0%
    Number of 4 input LUTs:      238 out of 18560    1%
    Number of bonded IOBs:        24 out of   564    4%
    Number of BRAMs:               8 out of    88    9%
    Number of GCLKs:               1 out of    16    6%

  • #233 MAPLD 2005, Conger, slide 10

    Case Study Applications

    Clustered RC Devices: N-Queens

    HPC application demonstrating the NARC board's role as a generic compute resource
    Application characterized by minimal communication and heavy computation within the FPGA
    NARC version of N-Queens adapted from a previously implemented application for the PCI-based Celoxica RC1000 board housed in a conventional server
    N-Queens is part of the DoD high-performance computing benchmark suite and representative of select military and intelligence processing algorithms

    Exercises the functionality of various developed mechanisms and protocols for job submission, data transfer, etc. on NARC
    User specifies a single parameter N; upon completion the algorithm returns the total number of possible solutions
    Purpose of the algorithm is to determine how many arrangements of N queens exist on an N × N chess board such that no queen may attack another (see Figure 5; a software sketch of the counting kernel follows the figure)
    Results are presented from both NARC-based and RC1000-based execution for comparison

    Figure 5 – Possible 8x8 solution (figure c/o Jeff Somers)
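    As a point of reference for what the FPGA core computes, below is a compact software version of the solution count using the classic bitmask backtracking technique (popularized by Jeff Somers); it is an illustrative stand-in, not the paper's hardware design.

    /* Count N-Queens solutions via bitmask backtracking (software model). */
    #include <stdio.h>

    static unsigned long count;

    static void solve(unsigned cols, unsigned ld, unsigned rd, unsigned all)
    {
        if (cols == all) {                        /* queen on every row    */
            count++;
            return;
        }
        unsigned avail = all & ~(cols | ld | rd); /* safe columns this row */
        while (avail) {
            unsigned bit = avail & (0u - avail);  /* lowest available slot */
            avail -= bit;
            solve(cols | bit, (ld | bit) << 1, (rd | bit) >> 1, all);
        }
    }

    int main(void)
    {
        int n = 8;                                 /* board dimension N    */
        solve(0u, 0u, 0u, (1u << n) - 1u);
        printf("N=%d: %lu solutions\n", n, count); /* prints 92 for N=8    */
        return 0;
    }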

  • #233 MAPLD 2005, Conger, slide 11

    Case Study Applications

    Network processing: Bloom Filter

    This application performs passive packet analysis through use of a classification algorithm known as a Bloom Filter

    Application characterized by constant, bursty communication patterns
    Most communication is Rx over the network, with transmission to the FPGA
    Filter may be programmed or queried

    The NARC device copies all received network frames to memory; the ARM parses the TCP/IP header and sends it to the Bloom Filter for classification

    User can send programming requests, which include a header and a string to be programmed into the Filter
    User can also send result collection requests, which cause a formatted results packet to be sent back to the user
    Otherwise, the application runs constantly, querying each header against the current Bloom Filter and recording match/header pair information

    The Bloom Filter works by applying multiple hash functions to a given bit string, each hash function producing indices into a separate bit vector (see Figure 6)

    To program, hash the input string and set the resulting bit positions to 1
    To query, hash the input string; if all resulting bit positions are 1, the string matches (a software sketch of both operations follows)
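    The fragment below is a small software model of these two operations, matching Figure 6's one-bit-vector-per-hash organization; the hash family (seeded FNV-1a), K_HASHES, and VECTOR_BITS are assumptions for illustration, not the parameters of the hardware design.

    /* Software model of Bloom Filter program/query (illustrative only). */
    #include <stddef.h>
    #include <stdint.h>

    #define K_HASHES    4                   /* number of hash functions   */
    #define VECTOR_BITS 4096                /* bits per hash's bit vector */

    static uint8_t bloom[K_HASHES][VECTOR_BITS / 8];

    static uint32_t hash_k(const uint8_t *s, size_t len, uint32_t seed)
    {
        uint32_t h = 2166136261u ^ seed;    /* FNV-1a, seeded per function */
        for (size_t i = 0; i < len; i++)
            h = (h ^ s[i]) * 16777619u;
        return h % VECTOR_BITS;
    }

    void bloom_program(const uint8_t *s, size_t len)
    {
        for (uint32_t k = 0; k < K_HASHES; k++) {
            uint32_t bit = hash_k(s, len, k);
            bloom[k][bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
    }

    int bloom_query(const uint8_t *s, size_t len)  /* 1 = probable match */
    {
        for (uint32_t k = 0; k < K_HASHES; k++) {
            uint32_t bit = hash_k(s, len, k);
            if (!(bloom[k][bit / 8] & (1u << (bit % 8))))
                return 0;                   /* any clear bit: no match   */
        }
        return 1;
    }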

    Implemented on a Virtex-II Pro FPGA
    Uses a slightly larger, but ultimately more effective, application wrapper (see Figure 7)
    Larger FPGA selected to demonstrate interoperability with any FPGA

    Figure 6 – Bloom Filter algorithmic architecture

    Figure 7 – Bloom Filter implementation architecture

  • #233 MAPLD 2005, Conger, slide 12

    Experimental Setup

    N-Queens: Clustered RC devices

    NARC device located on an arbitrary switch in the network
    User interfaces through a client application on a workstation and requests the N-Queens procedure

    Figure 8 illustrates the experimental environment
    Client application records the time required to satisfy a request
    Power supply measures the current draw of the active NARC device

    N-Queens also implemented on RC-enabled server equipped with Celoxica RC1000 board

    Client-side function call to the NARC board replaced with a function call to the RC1000 board in a local workstation, with the same timing measurement
    Comparison offered in terms of performance, power, and cost

    Figure 8 – Experimental environment: user workstation, RC-enabled servers, and NARC devices attached to an Ethernet network

    Bloom Filter: Network processing

    Same experimental setup as the N-Queens case study
    Software on the ARM co-processor captures all Ethernet frames

    Only packet headers (TCP/IP) are passed to the FPGA
    Data continuously sent to the FPGA as packets arrive over the network

    By attaching the NARC device to a switch, only limited packets can be captured
    Only broadcast packets and packets destined for the NARC device can be seen
    A dual-port device could be inserted in-line with a network link to monitor all flow-through traffic

  • #233 MAPLD 2005, Conger, slide 13

    Results and Analysis: N-Queens Case Study

    First, consider an execution time comparison between our NARC board and a PCI-based RC card (see Figures 10a and 10b)

    Both FPGA designs clocked at 50MHz
    Performance difference between the devices is minimal

    Being able to match performance of PCI-based card is a resounding success!

    Power consumption and cost of NARC devices are drastically lower than those of server-plus-RC-card combinations
    Multiple users may share a NARC device, whereas PCI-based cards are somewhat fixed in an individual server

    Power consumption calculated using the following method:

    Three regulated power supplies exist in the complete NARC device (network interface + FPGA board): 5V, 3.3V, 2.5V
    Current draw from each supply was measured
    Power consumption is calculated as the sum of the V×I products of all three supplies

    Figure 10 – Performance comparison between NARC board and PCI-based RC card on server: N-Queens execution time for NARC vs. RC-1000, (a) small board sizes (N = 5 to 10, execution times up to ~0.05 s) and (b) large board sizes (N = 11 to 14, execution times up to ~70 s)

  • #233 MAPLD 2005, Conger, slide 14

    Results and Analysis: N-Queens Case Study

    Figure 11 summarizes the performance ratio of N-Queens between the NARC and RC-1000 platforms
    Consider Table 4 for a summary of cost and power statistics

    Unit price shown excludes the cost of the FPGA
    The FPGA cost is common to both platforms being compared, so it offsets
    Price shown includes PCB fabrication and component costs

    Approximate power consumption is drastically less than a server + RC-card combination

    Power consumption of a server varies depending on the particular hardware
    Typical servers operate off of 200-400W power supplies

    See Figure 12 for example of approximate power consumption calculation

    Table 4 – Price and power figures for NARC device

    NARC Board
    Cost per unit (prototype):    $175.00
    Approx. power consumption:    3.28 W

    Figure 12 – Power consumption calculation:
    P = (5V)(I_5) + (3.3V)(I_33) + (2.5V)(I_25)
    I_5 ≈ 0.2A ; I_33 ≈ 0.49A ; I_25 ≈ 0.27A
    P = (5)(0.2) + (3.3)(0.49) + (2.5)(0.27) ≈ 3.28W

    Figure 11 – NARC / RC-1000 performance ratio vs. algorithm parameter N (N = 5 to 14, ratio 0 to 25; series: RATIO, Equivalency)

  • #233 MAPLD 2005, Conger, slide 15

    Results and Analysis: Bloom Filter

    Passive, continuous network traffic analysis

    Wrapper design was slightly larger than the previous minimal wrapper used with N-Queens
    Still a small footprint on chip; the majority of the FPGA remains for the application
    Maximum wrapper clock frequency of 183 MHz should not limit the application clock if in the same clock domain

    Packets received over the network link are parsed by the ARM, with the TCP/IP header saved in a buffer
    Headers are sent one at a time as query requests to the Bloom Filter (FPGA); when a query finishes, another header is de-queued if available

    User may query the NARC device at any time for a results update or to program a new pattern

    Device utilization summary:
    --------------------------------------------------------
    Selected Device : 2vp20ff1152-5

    Number of Slices:           1174 out of  9280   13%
    Number of Slice Flip Flops: 1706 out of 18560    9%
    Number of 4 input LUTs:     2032 out of 18560   11%
    Number of bonded IOBs:        24 out of   564    4%
    Number of BRAMs:               9 out of    88   10%
    Number of GCLKs:               1 out of    16    6%

    Figure 13 – Device utilization statistics for the Bloom Filter design

    Figure 13 shows resource usage for the Virtex-II Pro FPGA
    Maximum clock frequency of 113MHz

    Not affected by the wrapper constraint
    Significantly faster computation speed than the FPGA-ARM link communication speed

    FPGA-side buffer will not fill up; headers are processed before the next header is transmitted to the FPGA
    ARM-side buffer may fill up under heavy traffic loads

    32MB ARM-side RAM gives large buffer

  • #233 MAPLD 2005, Conger, slide 16

    Pitfalls and Lessons Learned

    FPGA I/O throughput capacity remains a persistent problem

    One motivation for designing custom hardware is to remove the typical PCI bottleneck and provide wire-speed network connectivity for the FPGA
    The under-provisioned data path between the FPGA and the network interface restricts the performance benefits of our prototype design
    Fortunately, this problem may be solved through a variety of approaches

    Wider data paths (16-bit, 32-bit) double or quadruple throughput, at the expense of a higher pin count
    Use of a higher-performance co-processor capable of faster I/O switching frequencies
    Optimized data transfer protocol

    Having a co-processor in addition to the FPGA to handle the network interface is vital to the success of our approach

    Required in order to permit initial remote configuration of the FPGA, as well as additional reconfigurations upon user request
    Offloading the network stack, basic request handling, and other maintenance-type tasks from the FPGA saves a significant number of valuable slices for user designs
    Drastically eases interfacing with the user application on a networked workstation
    Acts as an active co-processor for FPGA applications, e.g. parsing network packets as in the Bloom Filter application

  • #233 MAPLD 2005, Conger, slide 17

    Conclusions

    A novel approach to providing FPGAs with standalone network connectivity has been prototyped and successfully demonstrated

    Investigated issues critical to providing remote management of standalone NARC resources
    Proposed and demonstrated solutions to discovered challenges
    Performed a pair of case studies with two distinct, representative applications for a NARC device

    Network-attached RC devices offer potential benefits for a variety of applications
    Impressive cost and power savings over server-based RC processing
    Independent NARC devices may be shared by multiple users without relocating hardware
    Tightly coupled network interface enables the FPGA to be used directly in the path of network traffic for real-time analysis and monitoring

    Two issues that are proving to be a challenge to our approach:
    Data latency in FPGA communication
    Software infrastructure required to achieve a robust standalone RC unit

    While the prototype design achieves relatively good performance in some areas and limited performance in others, this is acceptable for a concept demonstration

    Fairly complex board design; architecture and software enhancements are in development
    As a proof of the "NARC" concept, an important goal of the project was achieved in demonstrating an effective and efficient infrastructure for managing NARC devices

  • #233 MAPLD 2005, Conger, slide 18

    Future Work

    Expansion of network processing capabilities

    Further development of the packet filtering application
    More specific and practical activity or behavior sought from network traffic
    Analyze streaming packets at or near wire-speed rates

    Expansion of the Ethernet link to a 2-port hub
    Permits transparent insertion of the device into a network path
    Provides easier access to all packets in a switched IP network

    Merging FPGA with ARM co-processor and network interface into one device

    Ultimate vision for the NARC device
    Will restrict the number of different FPGAs which may be supported, according to the FPGA socket/footprint chosen for the board
    Increased difficulty in PCB design

    Expansion to Gig-E and other network technologies
    Fast Ethernet was targeted for the prototyping effort and concept demonstration
    A true high-performance device should support Gigabit Ethernet
    Other potential technologies include (but are not limited to) InfiniBand and RapidIO

    Further development of the management infrastructure
    Need for more robust control/decision-making middleware
    Automatic device discovery, concurrent job execution, fault-tolerant operation
