National Center for Supercomputing Applications University of Illinois at Urbana-Champaign FPGA Data Ingest Processing for NARA Electronic Records Craig Steffen Innovative Systems Lab, NCSA [email protected]

FPGA Data Ingest Processing for NARA Electronic Records



Page 1: FPGA Data Ingest Processing for NARA Electronic Records

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

FPGA Data Ingest Processing for NARA Electronic Records

Craig Steffen
Innovative Systems Lab, NCSA
[email protected]

Page 2: FPGA Data Ingest Processing for NARA Electronic Records

Imaginations unbound

NARA/NSF OCI Grant

Innovative Systems and Software: Applications to NARA Research Problems

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Funded through the National Science Foundation Cooperative Agreement NSF OCI 05-25308 and Cooperative Support Agreements NSF OCI 04-38712 and NSF OCI 05-04064 by the National Archives and Records Administration

Peter Bajcsy, PI

Page 3: FPGA Data Ingest Processing for NARA Electronic Records

NCSA Innovative Systems Lab

• Rob Pennington and Mike Showerman at NCSA
• work with innovative systems for HPC use
• Expertise in:
  • FPGAs
  • Cell BE
  • Nvidia-based CUDA GPUs

Page 4: FPGA Data Ingest Processing for NARA Electronic Records

Genesis of FPGA at NARA project

NARA is creating the Electronic Records Archive (ERA). All federal records are to be digital by 2012

• ERA needs to ingest records and store with metadata

Page 5: FPGA Data Ingest Processing for NARA Electronic Records

Boundary Conditions of the Problem

• On records storage for 100 years: “You might be able to count on ASCII.”
• no programs can be assumed to be available
• Other people’s problems:
  • How is the data stored? (physical medium, format)
  • how is it retrieved?
• no compression

Page 6: FPGA Data Ingest Processing for NARA Electronic Records

What NARA Cares About

• Data storage
• Data retrieval
• Data integrity

Page 7: FPGA Data Ingest Processing for NARA Electronic Records

Field Programmable Gate Arrays

• Software-configurable chip
• Lower clock speed, higher specialization than ASIC
• Cores of routers (on-the-fly routing)
• prototype electronics
• front-end signal processing

Page 8: FPGA Data Ingest Processing for NARA Electronic Records

What’s Difficult About Using FPGAs?

• Slow clock speed (100 MHz typical) (30 to 1 DISadvantage against scalar CPU)
• Slave processors, must be activated by CPU program, no interrupts, dependent on CPU for data transfer (or at least data configuration)
• Typical programming: hardware design languages (user configures buffers, clocks, and wires in VHDL)
• Programming is particular to the target chip
  • different families of chips require re-design; basic logical blocks can differ
• FPGAs typically exist (philosophically and logically) in device space; to use as a processor, you must generate glue routines so that A talks to B

Page 9: FPGA Data Ingest Processing for NARA Electronic Records

FPGAs: Deep Parallelism

• Deep Parallelism: execution units can be arranged arbitrarily, including feeding each other
• If 1-clock operations A then B then C need to be performed on data items 1, 2, 3, 4, 5… (and 1, 2, 3, 4 etc. are independent)

  first clock:  1 A         B          C
  next clock:   2 A    1’ B          C
                3 A    2’ B    1’’ C
                4 A    3’ B    2’’ C    1’’’
                5 A    4’ B    3’’ C    2’’’  1’’’

one output per clock. This continues to be true for arbitrarily long instruction sequences. (Add one instruction per clock, still one output per clock)

Later on, when tools talk about pipelining, this is what they’re talking about

Limited by: sophistication of programming, data independence, logic complexity
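The clock-by-clock picture above can be modelled in ordinary C. This is only a software sketch of the idea, not Dime-C and not how the hardware is built: the operations `op_a`, `op_b`, `op_c` and the function `run_pipeline` are hypothetical names chosen for illustration. Each loop iteration stands for one clock, and a result is counted on the clock in which stage C executes.

```c
#include <stddef.h>

/* Hypothetical one-clock operations A, B and C. */
static int op_a(int x) { return x + 1; }
static int op_b(int x) { return x * 2; }
static int op_c(int x) { return x - 3; }

/*
 * Software model of a three-stage pipeline. Each loop iteration is one
 * clock: stage C consumes what B produced on the previous clock, B consumes
 * what A produced, and A accepts a new input item.
 * Returns the number of clocks needed to emit all n results.
 */
size_t run_pipeline(const int *in, int *out, size_t n)
{
    int reg_ab = 0, reg_bc = 0;      /* registers between the stages */
    int valid_ab = 0, valid_bc = 0;  /* does each register hold data? */
    size_t fed = 0, emitted = 0, clocks = 0;

    while (emitted < n) {
        /* Evaluate back to front so each stage sees last clock's value. */
        if (valid_bc)
            out[emitted++] = op_c(reg_bc);
        valid_bc = valid_ab;
        if (valid_ab)
            reg_bc = op_b(reg_ab);
        valid_ab = (fed < n);
        if (valid_ab)
            reg_ab = op_a(in[fed++]);
        clocks++;
    }
    return clocks;
}
```

In this model the first result appears on the third clock and one more result appears on every clock after that, which is the "one output per clock" behaviour described above.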

Page 10: FPGA Data Ingest Processing for NARA Electronic Records

FPGAs: Wide Pipelining

• A, B and C performed on data objects 1, 2, 3, 4, 5…

  Deep parallel only:
    5 A    4’ B    3’’ C    2’’’  1’’’

  Wide parallel:
    9 A    7’ B    5’’ C    3’’’  1’’’
    10 A   8’ B    6’’ C    4’’’  2’’’

• limited by: input and output throughput
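A rough way to see what width buys is to count clocks. The helper below is an illustrative assumption, not a measured model: `pipeline_clocks` is a hypothetical name, every stage is taken to cost one clock, and input and output are assumed to keep up with the pipeline (which the slide notes is the real limit).

```c
#include <stddef.h>

/*
 * Clocks needed to push n_items through a pipeline of the given depth
 * (number of one-clock stages) and width (items accepted per clock).
 * The count ends on the clock in which the last item's final stage runs.
 */
size_t pipeline_clocks(size_t n_items, size_t depth, size_t width)
{
    if (n_items == 0 || depth == 0 || width == 0)
        return 0;
    size_t issue_clocks = (n_items + width - 1) / width;  /* ceiling division */
    return issue_clocks + depth - 1;
}
```

For ten items and three stages this gives 12 clocks at width 1 and 7 clocks at width 2; for long streams the depth term becomes negligible and the width sets the throughput.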

Page 11: FPGA Data Ingest Processing for NARA Electronic Records

Tools: How Do We Design Algorithms

• ISL’s philosophy: scientific/algorithm practitioners must be part of the design process. We believe that algorithm design must not require hardware expertise (coding is Ok, but we never want to see a “clock” or a “buffer”)

• Compilers translate high-level structured language to machine design (vhdl)

• low-level libraries and glue logic take the vhdl and create a machine design

• practitioners should know Compiler and basic hardware design, possibly not libraries and glue logic

Page 12: FPGA Data Ingest Processing for NARA Electronic Records

Computational FPGA Vendors

• SRC Computers Inc.
  • hardware “MAP” processor, in integrated system
  • in-house Carte-C compiler language
  • glue logic is all integrated and invisible
• Nallatech
  • Hardware FPGA accelerator cards (coming out with Intel FSB-socket design)
  • in-house Dime-C C compiler (we’ll see this later)
  • glue logic is Dimetalk board design. We have written improvements. CPU-socket system comes with improved version

Hardware-only:
• Xtreme Data
  • AMD CPU-socket FPGA hardware
• DRC
  • AMD CPU-socket FPGA hardware
• AlphaData
  • embedded FPGA designs
• SGI
  • integrated modular FPGA Systems

Software-only:
• Impulse
  • “streaming” FPGA compiler and support software
• Mitrionics
  • Mitrion-C parallel language and Mitrion virtual soft-processor

Page 13: FPGA Data Ingest Processing for NARA Electronic Records

What Can We Contribute?

• FPGAs: long to activate, good for data-intensive algorithms
• Best used when every piece of data needs the same processing
• What information do we need to extract from every file that comes into the ERA?
• Problem: many various file types
• file types change, software versions change
  • (remember: tens of years)
• unpacking would be a major problem
• How do we deal with this diverging difficulty?
  • Ignore it

Page 14: FPGA Data Ingest Processing for NARA Electronic Records

Nallatech software stack

• Dime-C compiler
  • Dime-C is a mostly-ANSI-C language
  • arrays implemented as block rams
  • scalar variables are implemented as registers
  • compiler indicates deep pipelining states
  • wide pipelining done automatically
• Dime-Talk hardware network design tool
  • Hardware/bus/card design tool
  • implements internal switching network for data movement (saves designing explicit hardware buffers for moving data)
  • dime-talk is 32 bits wide by definition
  • too complex for programming use, requires wiring and components
  • Improvement is on the way! The new FSB system that is being ordered has the Nallatech Abstraction Layer, which promises to do away with the function of DimeTalk for the FSB system. The status with regard to older cards is unclear

Page 15: FPGA Data Ingest Processing for NARA Electronic Records

debugging via trace functions

• no debugger for Dime-C
• no behavioral simulation
• no printf with Dime-C code (on a separate hardware device)
• poor man’s printf: diagnostic array
  • bram array that controlling program reads out after running
  • array contains copies of input parameters
  • final values of loop variables (diagnose if array didn’t run, if it hit alarm condition)
  • be sure to initialize to prevent stale data from being read

Page 16: FPGA Data Ingest Processing for NARA Electronic Records

screenshot example of trace code

    i=0;
    output_count = 0;
    loop_count = 0;
    DataOut[0]=HostSetPacketSize;
    DataOut[1]=SndRcv;
    DataOut[2]=delay;
    DataOut[3]=wordstosend;
    DataOut[4]=direction;
    if(SndRcv==0){
        // we're receiving
        [deleted for clarity]
        my_count = DataOut[6] = (int)GetPipeCount(ChannelA.datain);
        i=0;
        while( (i<my_count) && (output_count<INT_RAMSIZE_0) ){
            DataReceived[output_count] = readFIFO(ChannelA.datain);
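The diagnostic-array idea can be sketched end to end in host-side C. Everything here is a hypothetical stand-in written for illustration: `kernel_copy` plays the role of the accelerator routine, and the slot names, `DIAG_STALE` sentinel and helper functions are assumptions, not part of Dime-C or any Nallatech library.

```c
#include <stddef.h>

#define DIAG_WORDS 8
#define DIAG_STALE 0xDEADBEEFu   /* sentinel meaning "slot never written" */

/* Hypothetical layout of the diagnostic array. */
enum { DIAG_IN_COUNT, DIAG_IN_LIMIT, DIAG_LOOP_FINAL, DIAG_OUT_FINAL, DIAG_ALARM };

/* Host side: overwrite every slot before a run so stale data is never read. */
void diag_reset(unsigned diag[DIAG_WORDS])
{
    for (size_t i = 0; i < DIAG_WORDS; i++)
        diag[i] = DIAG_STALE;
}

/* Stand-in for the accelerator routine: copies words and leaves a trace. */
void kernel_copy(const unsigned *src, unsigned count,
                 unsigned *dst, unsigned limit, unsigned diag[DIAG_WORDS])
{
    unsigned i = 0, out = 0;

    diag[DIAG_IN_COUNT] = count;          /* copies of the input parameters */
    diag[DIAG_IN_LIMIT] = limit;
    while (i < count && out < limit)
        dst[out++] = src[i++];
    diag[DIAG_ALARM] = (i < count);       /* output filled before input ended */
    diag[DIAG_LOOP_FINAL] = i;            /* final values of loop variables */
    diag[DIAG_OUT_FINAL] = out;
}

/* Host side: did the routine run at all since the last reset? */
int diag_kernel_ran(const unsigned diag[DIAG_WORDS])
{
    return diag[DIAG_LOOP_FINAL] != DIAG_STALE;
}
```

After a run, the controlling program reads the array back: a slot still holding the sentinel means the code never reached that point, and the alarm slot shows whether the loop stopped early.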

Page 17: FPGA Data Ingest Processing for NARA Electronic Records

Nnalli library

• Created NCSA/NALLatech Interface library in 2008
• holds state information for the running cards
• provides an API layer over the hardware-oriented FUSE software library from Nallatech

Page 18: FPGA Data Ingest Processing for NARA Electronic Records

nnalli library functions

• nnalli_init(int my_NUMDEVICES, char **my_devicefiles, DWORD **my_deviceinfo, int my_memory_map_0, DWORD default_timeout);
• int nnalli_run(nnalli_accelerator_t *my_accel);
• int nnalli_wait(nnalli_accelerator_t *my_accel);
• int nnalli_run_wait(nnalli_accelerator_t *my_accel);
• int nnalli_write_4byte_values(nnalli_accelerator_t *my_accel, void *source_pointer, int destination_address, int destination_node, int num_values);
• int nnalli_read_4byte_values(nnalli_accelerator_t *my_accel, void *destination_pointer, int source_address, int source_node, int num_values);

Page 19: FPGA Data Ingest Processing for NARA Electronic Records

Teaching students at CHPC in CT

• Craig Steffen was invited by Mike Inggs of the University of Cape Town to give a class in reconfigurable computing in December of 2008 at the now-forming Center for High Performance Computing

• Students were about 16 advanced undergraduates and graduate students

• CHPC has Nallatech hardware and software, so used Dime-C, Dime-Talk and Nnalli abstraction library

• Realized some of the limitations of the library for everyday use, more on this later

Page 20: FPGA Data Ingest Processing for NARA Electronic Records

Nallatech H101 card

• Nallatech’s first in their High-Performance Computing line of products

• Announced in 2007, based on the Xilinx Virtex-4 FPGA line

• PCI-X 133 bus (obsolete now)
• Each chip has .6 MB internal block RAM
• Each card has 4 banks of SRAM and 1 bank of SDRAM at 512 MB
• Each card has 4 high-speed serial links to communicate with other nearby cards

Page 21: FPGA Data Ingest Processing for NARA Electronic Records

Nallatech H101

Page 22: FPGA Data Ingest Processing for NARA Electronic Records

H101 limitations

• Dime-C/Dime-Talk designs run at 100 MHz
• Bus width for Dime-Talk network is 32 bits (hardware bus is 64)
• Block RAM can be arbitrarily configured up to .6 MB. (“infinite bandwidth” of FPGA for small data sets)
• SRAM dual-ported and pipelined, so up to 8 32-bit data references for collections from .6 MB up to 16 MB
• SDRAM quad-ported but NOT pipelined. Up to 4 non-pipelined memory references per clock

Page 23: FPGA Data Ingest Processing for NARA Electronic Records

H101 timing measurements

• Loading H101 Virtex-4 configuration: ~600 ms
• run-stop sequence (running function of 0 length): 200 µs
• Latency and throughput for memory transfer to/from SRAM and block RAM: latency: 23 µs and 290 MB/s

Page 24: FPGA Data Ingest Processing for NARA Electronic Records

FPGA design process

• Code written in Dime-C dialect and SDK
• Compiler takes seconds
• DimeTalk “build” includes place-and-route process on chip, which can take many hours or many tens of hours for a fairly full design
  • place-and-route is not a problem with a closed design time
  • place-and-route can fail (after 10 or 20 hours)
• SO
  • all software is developed first, in C and tested
  • parts of the software are ported over to the FPGA processor

Page 25: FPGA Data Ingest Processing for NARA Electronic Records

variable-re-use rule breaking pipelining

• Deep pipelining requires strict data-flow through the inner loop

• C is a sequential language, Dime-C is a parallel language. Statement execution can happen in ANY ORDER to satisfy pipelining requirements. Statements are put in parallel if they don’t depend on each other.

• therefore: a variable can be read multiple times, but only assigned once. Otherwise the order of the assignments matters, and pipelining breaks
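The assign-once rule can be illustrated in ordinary C. This is a sketch of the coding style only: the function names are hypothetical, and whether the Dime-C compiler accepts exactly these forms (for example the `const` qualifier) is an assumption not confirmed by the slides.

```c
#include <stddef.h>

/* Re-uses t: two assignments per iteration, so their order matters. */
void scale_reuse(const int *a, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int t = a[i] + 1;
        t = t * 2;              /* second assignment to the same variable */
        out[i] = t;
    }
}

/* Single assignment: every name is written once and then only read. */
void scale_single(const int *a, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const int t1 = a[i] + 1;
        const int t2 = t1 * 2;  /* depends on t1 only through data flow */
        out[i] = t2;
    }
}
```

Both functions compute the same values; the second expresses the loop body as a pure chain of data dependencies, which is the shape a pipelining compiler needs.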

Page 26: FPGA Data Ingest Processing for NARA Electronic Records

Text Parsing For Indexing

• ignore file format
• blindly search for strings of ASCII text characters
  • character map is arbitrarily configurable (by 255 character input array)
• blocks of “text” surrounded by “non-text” logged as “words”
  • non-text ignored
• raw product is unsorted list of words (or word hashes) (as 64-bit unsigned integers)
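A plain-C sketch of this kind of format-blind parser follows. The details are assumptions made for illustration: the map here has 256 entries so it can be indexed by any byte value, letters and digits are treated as text, and a word is packed into a 64-bit integer from its first eight characters. The slides do not say how the real implementation forms its 64-bit values.

```c
#include <stddef.h>
#include <stdint.h>

/* Example character map: 1 marks a "text" byte (letters and digits). */
void charmap_alnum(uint8_t map[256])
{
    for (int c = 0; c < 256; c++)
        map[c] = (uint8_t)((c >= '0' && c <= '9') ||
                           (c >= 'A' && c <= 'Z') ||
                           (c >= 'a' && c <= 'z'));
}

/*
 * Scan raw bytes with no knowledge of the file format. Every run of text
 * bytes bounded by non-text bytes is logged as one 64-bit word.
 * Returns the number of words stored (at most max_words).
 */
size_t parse_words(const uint8_t *buf, size_t len, const uint8_t map[256],
                   uint64_t *words, size_t max_words)
{
    size_t n = 0;
    uint64_t cur = 0;
    unsigned cur_len = 0;

    for (size_t i = 0; i <= len; i++) {   /* extra step flushes the last word */
        int is_text = (i < len) && map[buf[i]];
        if (is_text) {
            if (cur_len < 8) {            /* keep the first eight characters */
                cur = (cur << 8) | buf[i];
                cur_len++;
            }
        } else if (cur_len > 0) {
            if (n < max_words)
                words[n++] = cur;
            cur = 0;
            cur_len = 0;
        }
    }
    return n;
}
```

The output is an unsorted list in the order the words were encountered, which matches the raw product described above.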

Page 27: FPGA Data Ingest Processing for NARA Electronic Records

Raw text indexing scheme

• Everything must be trivially parseable
• Created index format using ASCII information
  • index file lists files indexed or other index files
  • rest of the files is an ordered list of words encountered
  • hierarchical file format with one master file at the top
• Prototype deployed on NARA demo system

Page 28: FPGA Data Ingest Processing for NARA Electronic Records

Text Search Infrastructure

• Text finding scheme searches through file hierarchy starting at a “top index”

• Follows branches of index tree where at least one file contains each search term

• common words are contained in all branches, so they have no effect (but are no detriment)
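The pruned descent can be sketched with an in-memory tree. The `index_node` structure and function names are hypothetical and stand in for the on-disk index files; the sketch assumes each node's hash list is sorted and covers everything beneath it. A level-0 node that passes the test is only a candidate, since its hash list is the union over several files.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one index file. */
typedef struct index_node {
    int level;                                 /* 0 = refers to data files */
    const uint64_t *hashes;                    /* sorted ascending */
    size_t n_hashes;
    const struct index_node *const *children;  /* used when level > 0 */
    size_t n_children;
} index_node;

/* Binary search for one word hash in a node's sorted list. */
static int has_hash(const index_node *node, uint64_t h)
{
    size_t lo = 0, hi = node->n_hashes;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (node->hashes[mid] == h)
            return 1;
        if (node->hashes[mid] < h)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/*
 * Count the level-0 indexes that contain every search term. A branch is
 * abandoned as soon as its index is missing any term. nodes_visited
 * reports how much of the tree was actually examined.
 */
size_t search_index(const index_node *node, const uint64_t *terms,
                    size_t n_terms, size_t *nodes_visited)
{
    (*nodes_visited)++;
    for (size_t t = 0; t < n_terms; t++)
        if (!has_hash(node, terms[t]))
            return 0;                          /* prune this whole branch */
    if (node->level == 0)
        return 1;

    size_t hits = 0;
    for (size_t c = 0; c < node->n_children; c++)
        hits += search_index(node->children[c], terms, n_terms, nodes_visited);
    return hits;
}
```

A term that is absent from the top index ends the search after a single node, while a term present everywhere prunes nothing.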

Page 29: FPGA Data Ingest Processing for NARA Electronic Records

Index File Example 1

    0
    ./109/h1607_ih.xml
    ./109/h1609_ih.xml
    ./109/h1610_ih.xml
    ./109/h1611_ih.xml
    ./109/h1612_ih.xml
    ./109/h1613_ih.xml
    ./109/h1614_ih.xml
    ./109/h1615_ih.xml
    ./109/h1616_ih.xml
    ./109/h1617_ih.xml
    ./109/h1618_ih.xml
    00151664
    00160447
    00200442
    00301762
    00500445
    00700667
    00714002

• level-0 file indicates index level (references individual files)
• file name list shows files referred to
• rest of the file is word hash list

Page 30: FPGA Data Ingest Processing for NARA Electronic Records

Index File Example 2

    1
    .index_00_0000.ind
    .index_00_0001.ind
    .index_00_0002.ind
    .index_00_0003.ind
    .index_00_0004.ind
    .index_00_0005.ind
    .index_00_0006.ind
    .index_00_0007.ind
    .index_00_0008.ind
    .index_00_0009.ind
    .index_00_0010.ind
    00001452
    00006235
    00014005
    00054001
    00064002
    00070676

• level-1 file is an index of level-0 index files
• referenced index file list
• rest of the file is word hash list for all referenced original files

Page 31: FPGA Data Ingest Processing for NARA Electronic Records

Index File Example 3

    4
    .index_03_0000.ind
    .index_03_0001.ind
    00000550
    00001452
    00006235
    00014005
    00014043
    00014044
    00014046
    00014047
    00015415
    00016234
    00017234

• level-4 “top” level index file
• just 2 child index files
• file hash list for entire collection (file is very large, but one monolithic file)

Page 32: FPGA Data Ingest Processing for NARA Electronic Records

Index Explanation of Philosophy

• Assume that files and data will one day be on archives with indexing information LOST

• With only the information on and about a data storage volume, determine if the volume might contain a certain set of files

• The master index of the volume can determine if the data is not in the volume, in which case the entire volume does not need to be mounted and read

• only a crude indexing system; this is not meant as a fine search tool

Page 33: FPGA Data Ingest Processing for NARA Electronic Records

Search Appliance Screen shot 1

Page 34: FPGA Data Ingest Processing for NARA Electronic Records

Search Appliance 2; Index Searchable

Page 35: FPGA Data Ingest Processing for NARA Electronic Records

Search Appliance Timing Examples

• Timing examples of search appliance:
• NARA-generated file hierarchy index
• Large file set:
  • SEC documents from 2005
  • 686,000 xml-annotated files total
  • 146 GB on disk (we could only put part of the collection on the machine, as it mostly fills up the hard drive)

Page 36: FPGA Data Ingest Processing for NARA Electronic Records

Search Application: more obscure words = less time

• The less common the word, the less of the tree has to be searched
• if a word does NOT exist at all in the top level index file, no file contains that word and the search ends. Very obscure words generate 0-time searches
• The more words and the more obscure words, the faster the search runs

Page 37: FPGA Data Ingest Processing for NARA Electronic Records

Text Index: Dropping Undesired Keywords

• Some words are not wanted (formatting tags, xml tags in xml documents and xml keywords)
• Raw text parser outputs these words as well as others
• Unwanted words can be eliminated at index storage time, which is after sorting
• xml tags are strictly grouped and are easy to eliminate (exists in hardware but not in software)

Page 38: FPGA Data Ingest Processing for NARA Electronic Records

Sorting on FPGA

• We have not implemented a sorting algorithm in the FPGA

• data is passed back to the CPU unsorted, in the order it was encountered

• sorting is a problem that scalar CPUs (fast, instruction based processors) are very good at

• not likely to be a win for FPGA

Page 39: FPGA Data Ingest Processing for NARA Electronic Records

Memory Bank Multiple Access

• arrays in Dime-C are directly implemented as literal banks of internal memory
• memory is single-ported from the Dime-C process: you may ONLY make ONE reference to a memory bank and still be pipelineable

    A=a[i];
    B=a[j]; // legal
    C=a[k]; // will not pipeline

Page 40: FPGA Data Ingest Processing for NARA Electronic Records

Temporary Variables to Help Compiler

    for(i=0…){
        A1 = a[i];
        A2 = a[i]; // will not pipeline
    }

instead

    for(i=0…){
        temp = a[i];
        A1 = temp;
        A2 = temp;
    }

“temp” is a register, and can be made into various “virtual” variables that can be read multiple times

Page 41: FPGA Data Ingest Processing for NARA Electronic Records

Bit Shifting Operations

• Bit-shifts and bit-masks are very low gate cost; they are implemented as wires inside the FPGA
• all numerical operations are implemented as fully general-purpose operators (taking lots of gates)
  • the lower the width, the lower the gate-count
  • avoid divides if possible
  • divide and multiply by powers of 2 by bit-shifting
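The power-of-two substitutions look like this in C. The helper names are hypothetical, and the sketch assumes unsigned 32-bit values and a shift count below 32 (shifting signed or negative values behaves differently).

```c
#include <stdint.h>

/* x * 2^k as a left shift (result wraps modulo 2^32). */
uint32_t mul_pow2(uint32_t x, unsigned k) { return x << k; }

/* x / 2^k as a right shift (rounds toward zero for unsigned x). */
uint32_t div_pow2(uint32_t x, unsigned k) { return x >> k; }

/* x % 2^k as a bit-mask of the low k bits. */
uint32_t mod_pow2(uint32_t x, unsigned k) { return x & ((1u << k) - 1u); }
```

For example, 100 divided by 8 becomes `div_pow2(100, 3)`, and the remainder is `mod_pow2(100, 3)`; neither needs a divider.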

Page 42: FPGA Data Ingest Processing for NARA Electronic Records

Dime-C Code Screenshot

Page 43: FPGA Data Ingest Processing for NARA Electronic Records

Dime-C Pipeline Display

Page 44: FPGA Data Ingest Processing for NARA Electronic Records

Dime-Talk System Design Screenshot

Page 45: FPGA Data Ingest Processing for NARA Electronic Records

Histogramming Discussion

• histogramming is a real computer science problem
• each update of histogram array consists of a read-location, increment value, write same location
• can only be pipelined if you can guarantee that the subsequent reads and writes won’t overlap
• histogramming in the FPGA is therefore non-pipelined, sequential-only code (each loop takes X clocks before the next one starts)

Page 46: FPGA Data Ingest Processing for NARA Electronic Records

Histogram Solution, Wide Pipeline Only

• pipelining can be done in non-deep-pipelined loops, with many different block rams to hold sub-histograms
• second loop at the end to add histograms together
• another solution: small number of bins (dozens or perhaps hundreds) can be histogrammed via hardware accumulators
• histogramming can be done
  • used to be able to trick compiler into it, but they updated the compiler and now that’s not allowed
  • can be done in hardware, just have to convince Dime-C to do it.
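The sub-histogram scheme can be sketched in plain C. The lane count, bin count and function name are assumptions for illustration; on the FPGA each lane's array would sit in its own block RAM so the lanes can update side by side, whereas this CPU version simply runs them in turn.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BINS  256   /* one bin per byte value */
#define LANES 4     /* independent sub-histograms (assumed count) */

/*
 * Histogram bytes using LANES separate sub-histograms. Bytes i, i+1, ...
 * go to different lanes, so no two updates in a group touch the same array.
 * A second loop at the end adds the sub-histograms together.
 */
void histogram_lanes(const uint8_t *data, size_t len, uint32_t total[BINS])
{
    uint32_t sub[LANES][BINS];
    size_t i = 0;

    memset(sub, 0, sizeof sub);
    for (; i + LANES <= len; i += LANES)
        for (int lane = 0; lane < LANES; lane++)
            sub[lane][data[i + lane]]++;
    for (; i < len; i++)                 /* leftover bytes go to lane 0 */
        sub[0][data[i]]++;

    for (int b = 0; b < BINS; b++) {     /* merge pass */
        total[b] = 0;
        for (int lane = 0; lane < LANES; lane++)
            total[b] += sub[lane][b];
    }
}
```

Each lane is still a read, increment, write sequence on its own memory, so this gives width rather than depth, in line with the slide title.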

Page 47: FPGA Data Ingest Processing for NARA Electronic Records

Bit Shifting For Input

• input file for text parsing is stored in 4-byte-wide array
• one read per clock is four bytes (4 characters)
• algorithm must parse through the file one character at a time to check for text boundaries
• split input and analyze the file 4-wide-parallel (file is processed in ¼ as many clocks as file has bytes)
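Splitting one 4-byte read into four characters needs only shifts and masks. In this sketch the function name, the 256-entry text map and the byte order (lowest-order byte is taken as the first character) are assumptions; the slides do not state how bytes are packed into the array.

```c
#include <stdint.h>

/*
 * Classify the four characters packed in one 32-bit word.
 * Returns a 4-bit mask: bit k is set when byte k is a text character,
 * with byte 0 being the lowest-order byte of the word.
 */
unsigned text_mask4(uint32_t word, const uint8_t map[256])
{
    unsigned b0 = word & 0xFFu;
    unsigned b1 = (word >> 8) & 0xFFu;
    unsigned b2 = (word >> 16) & 0xFFu;
    unsigned b3 = (word >> 24) & 0xFFu;

    return  (unsigned)(map[b0] != 0)
         | ((unsigned)(map[b1] != 0) << 1)
         | ((unsigned)(map[b2] != 0) << 2)
         | ((unsigned)(map[b3] != 0) << 3);
}
```

The four lookups are independent of one another, which is what lets the hardware evaluate them in the same clock.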

Page 48: FPGA Data Ingest Processing for NARA Electronic Records

Eliminating If-Then-Else

• if-then structure is a control change statement, cannot be done in a pipelined fashion (causes halts or bubbles in pipeline). Pipelined control is static
• numerical and logical operations can be pipelined
• better to AND two values together than check to see if it’s zero
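Two common ways of replacing an if-then-else with arithmetic are sketched below. The function names are hypothetical, and this is ordinary C: it shows the coding pattern, not what Dime-C generates, and a CPU compiler may still choose to emit branches for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Select a when cond is nonzero, else b, using a bit-mask, not a branch. */
uint32_t select_u32(uint32_t cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0); /* all 1s or all 0s */
    return (a & mask) | (b & ~mask);
}

/* Count text bytes by adding a 0-or-1 flag, instead of "if (...) count++". */
size_t count_text_bytes(const uint8_t *buf, size_t len, const uint8_t map[256])
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        count += (size_t)(map[buf[i]] != 0);
    return count;
}
```

In both cases every iteration performs the same operations regardless of the data, so the control flow stays static.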

Page 49: FPGA Data Ingest Processing for NARA Electronic Records

Input validation; File Characterization

• file bytes can be histogrammed to create basic imprint of file

• different file types will exhibit different groupings of bytes in files (text characters vs. control characters)

• files can be tested against a known character frequency map of files of that type for validation

• can be done as part of ingest processing
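A minimal sketch of this check is a byte histogram compared against a reference frequency map. The distance measure (sum of absolute differences between byte frequencies) and both function names are assumptions chosen for illustration; the slides do not specify how the comparison is scored.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how often each byte value occurs in the file contents. */
void byte_histogram(const uint8_t *buf, size_t len, uint32_t hist[256])
{
    for (int b = 0; b < 256; b++)
        hist[b] = 0;
    for (size_t i = 0; i < len; i++)
        hist[buf[i]]++;
}

/*
 * Distance between a file's byte frequencies and a reference profile
 * (ref[b] = expected fraction of bytes with value b, summing to 1).
 * 0.0 means identical distributions, 2.0 means no overlap at all.
 */
double profile_distance(const uint32_t hist[256], size_t total,
                        const double ref[256])
{
    double dist = 0.0;

    if (total == 0)
        return 2.0;                 /* treat an empty file as a mismatch */
    for (int b = 0; b < 256; b++) {
        double d = (double)hist[b] / (double)total - ref[b];
        dist += (d < 0.0) ? -d : d;
    }
    return dist;
}
```

An ingest step could flag a file when the distance to the profile of its claimed type exceeds some threshold; a suitable threshold would have to be found experimentally.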

Page 50: FPGA Data Ingest Processing for NARA Electronic Records

File Characterization: doc file

Page 51: FPGA Data Ingest Processing for NARA Electronic Records

File Characterization: Compressed File

Page 52: FPGA Data Ingest Processing for NARA Electronic Records

Can’t Determine The Type of Input File

• Too many file types, too many variations
• however…very good at spotting that it’s the WRONG type of file (pdf for doc or vice versa, for instance)
• Flag incoming data that doesn’t match expectations

Page 53: FPGA Data Ingest Processing for NARA Electronic Records

Images: High Data Re-Use

• Due to transfer and control latencies, FPGAs tend to be very good at tasks that have high data-reuse characteristics
• stencil operations on images have a 9x re-use factor
  • each pixel participates in the calculation of all its nearest neighbors and itself
• ring buffers ensure that each pixel only needs to be read once from the local memory (one pixel per clock, up to 4-byte pixels)
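The ring-buffer idea can be sketched as a 3x3 sum over an 8-bit grayscale image. This is a CPU model under stated assumptions: `box3x3`, `MAX_W` and the choice of a simple sum as the stencil are illustrative, and the static buffer makes the function non-reentrant.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_W 4096   /* widest image row the line buffers can hold (assumed) */

/*
 * 3x3 neighbourhood sum for every interior pixel of a w-by-h image.
 * Each source pixel is read from src exactly once; the two previous rows
 * are kept in a ring of three line buffers and re-used from there.
 * Border pixels of dst are left untouched. Returns 0 on success.
 */
int box3x3(const uint8_t *src, size_t w, size_t h, uint16_t *dst)
{
    static uint8_t ring[3][MAX_W];

    if (w < 3 || h < 3 || w > MAX_W)
        return -1;

    for (size_t y = 0; y < h; y++) {
        uint8_t *cur = ring[y % 3];
        for (size_t x = 0; x < w; x++)
            cur[x] = src[y * w + x];          /* the only read of src */
        if (y < 2)
            continue;                         /* need three rows buffered */

        const uint8_t *r0 = ring[(y - 2) % 3];
        const uint8_t *r1 = ring[(y - 1) % 3];
        for (size_t x = 1; x + 1 < w; x++)
            dst[(y - 1) * w + x] = (uint16_t)(
                r0[x - 1] + r0[x] + r0[x + 1] +
                r1[x - 1] + r1[x] + r1[x + 1] +
                cur[x - 1] + cur[x] + cur[x + 1]);
    }
    return 0;
}
```

Every pixel is used in up to nine output sums but fetched from the image only once, which is the 9x re-use factor mentioned above.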

Page 54: FPGA Data Ingest Processing for NARA Electronic Records

Image Analysis Conundrum: Too Much Data

• What information is useful and interesting to extract from an image?

• Histogramming problem aside; what pixel information combinations are useful to keep track of? (This information must take the place of reading the image itself to be of any use)

• meta-data could be saved as reduced version of image. 12k image is fairly impressive. Anything over 12k (3000 numbers) is redundant

Page 55: FPGA Data Ingest Processing for NARA Electronic Records

Measure Ratio of Image Imprints

• make a series of numbers that embody basic characteristic quantities in the image, like for bytes in files

• ratios of brightnesses, ratios of colors, level of hue and contrast over the whole image

• key is to give information that will allow you to decide the image is NOT what you’re looking for

Page 56: FPGA Data Ingest Processing for NARA Electronic Records

must have bmp images; don’t have unpacking

• unlike text, can’t work on raw compressed images
• FPGA algorithm only looks at BMP images, where each pixel is represented by an array of pixel values of a fixed length and pixel size
• will look into decompression of images on FPGA, but that algorithm is complicated

Page 57: FPGA Data Ingest Processing for NARA Electronic Records

Image Slice Analysis

• instead of image characterization by scalar numbers, try to “walk” across image and count certain features as a method of characterization

• simple and fast to do in FPGA
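One possible form of such a walk is sketched below. The slides do not say which features are counted, so this is purely an assumed example: it samples every `row_step`-th row of an 8-bit grayscale image and counts dark-to-bright transitions against a threshold. The function name and parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Walk across selected rows ("slices") of a w-by-h grayscale image and
 * count rising edges: places where a pixel at or above the threshold
 * follows a pixel below it (the start of a row counts as dark).
 */
size_t slice_rising_edges(const uint8_t *img, size_t w, size_t h,
                          size_t row_step, uint8_t threshold)
{
    size_t edges = 0;

    if (row_step == 0)
        return 0;
    for (size_t y = 0; y < h; y += row_step) {
        unsigned prev = 0;
        for (size_t x = 0; x < w; x++) {
            unsigned cur = img[y * w + x] >= threshold;
            edges += cur & (prev ^ 1u);   /* bright now, dark before */
            prev = cur;
        }
    }
    return edges;
}
```

The inner loop touches each sampled pixel once and uses only comparisons and bit operations, which is the kind of work that maps easily onto the FPGA.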

Page 58: FPGA Data Ingest Processing for NARA Electronic Records

slice analysis example 1

Page 59: FPGA Data Ingest Processing for NARA Electronic Records

slice example analysis 2

Page 60: FPGA Data Ingest Processing for NARA Electronic Records

Slice Example 2

Page 61: FPGA Data Ingest Processing for NARA Electronic Records

Slice Example 2 (fail)

Page 62: FPGA Data Ingest Processing for NARA Electronic Records

Slice 2 again

Page 63: FPGA Data Ingest Processing for NARA Electronic Records

raw timings of analysis of 1024x768 image on CPU vs. FPGA

• CPU timing
  • Full image scan on CPU takes 400 ms
  • sliced image scan takes 50 ms
• FPGA timing
  • Sliced image scan takes ~10 ms

Page 64: FPGA Data Ingest Processing for NARA Electronic Records

End-Point Goal: Multiple Simultaneous Algorithms

• Each of these analyses on the FPGA tends to be limited by the latencies involved
• due to place-and-route times, it’s best to build images that are lightly occupied (P&R takes under an hour)
• The end-point goal is single-pass ingest processing; all algorithms present in the FPGA at the same time
• The key to that is picking the algorithms that produce the best and most important meta-data for one-pass processing
• Integration of multiple algorithms is the final step

Page 65: FPGA Data Ingest Processing for NARA Electronic Records

Network-Based Ingest Test Environment

• Past tests are stand-alone prototypes
  • driven by local processing test harnesses
  • local disk
  • single machine, no competition for RAM or CPU time
• Real Electronic Records Archive
  • files come in on network, out to mass storage (presumably)
  • processing control and scheduling driven by external interface
  • any process is running in a heterogeneous environment
• Task this year is to use archive prototype system to get as close as possible to a real ingest behavior

Page 66: FPGA Data Ingest Processing for NARA Electronic Records

Transcontinental Persistent Archive Prototype

• Current technology of the Archive Prototype is the Storage Resource Broker. This is a distributed file system that makes multiple sites look like they share the same virtual disk
• New iRODS: Integrated Rule-Oriented Data System
  • distributed data storage
  • Rules-based server behavior implements functions
  • rules invoke modular “services”
  • services are modular and can be compiled-in on the server side

Page 67: FPGA Data Ingest Processing for NARA Electronic Records

Implemented iRODS FPGA Accelerator

• Used Nnalli interface to implement FPGA file processing as an iRODS service
• perform any of the previous analysis operations under the iRODS rules controls
• problem: Nnalli system designed for:
  • start (one latency)
  • lots of operations under local control
  • stop
• iRODS:
  • invoke rule set (FPGA service) on each file separately
  • each file takes > 600 ms

Page 68: FPGA Data Ingest Processing for NARA Electronic Records

Static Library for FPGA Re-execution

• upgraded Nnalli library that holds information about the FPGA configuration between calls

• the iRODS service request contains execution information and the identity (hash) of the requested FPGA execution image

• If the requested image matches the one the FPGA already contains, then configuration is skipped, letting the execution go through in 200 µs plus data movement time
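The skip-if-already-loaded logic amounts to a small cache keyed by the image hash. The sketch below is not the Nnalli implementation: `fpga_state`, `load_bitfile` and `ensure_configured` are hypothetical names, and the bitfile load is replaced by a stub that only counts how often it is called.

```c
#include <stdint.h>

/* State kept between service calls (hypothetical layout). */
typedef struct {
    int      configured;       /* has any image been loaded yet? */
    uint64_t loaded_hash;      /* identity of the image now in the FPGA */
    unsigned configure_calls;  /* how many slow loads have happened */
} fpga_state;

/* Stand-in for the slow (~600 ms) configuration step. */
static void load_bitfile(fpga_state *s, uint64_t hash)
{
    s->configured = 1;
    s->loaded_hash = hash;
    s->configure_calls++;
}

/*
 * Make sure the requested image is in the FPGA before execution.
 * Returns 1 if a reconfiguration was needed, 0 if it was skipped.
 */
int ensure_configured(fpga_state *s, uint64_t requested_hash)
{
    if (s->configured && s->loaded_hash == requested_hash)
        return 0;              /* same image: go straight to execution */
    load_bitfile(s, requested_hash);
    return 1;
}
```

A stream of requests for the same image then pays the configuration cost once, and only a change of image triggers another load.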

Page 69: FPGA Data Ingest Processing for NARA Electronic Records

Next Steps (2008-2009 Fiscal year)

• Connect local FPGA iRODS system to that of TPAP to transfer and share files
• Start taking timing tests for a variety of file types
• Make timing tests of files being brought in through an ingest process
• Aggregate the best algorithm designs into FPGA ingest processor design for H101 card

Page 70: FPGA Data Ingest Processing for NARA Electronic Records

Near-Term Improvements: Multiple Files

• text files are typically small and dominated by transfer and execution latency
• modify algorithms to operate on more than one file
• more parameters to pass, but latencies will be amortized over many more files per invocation
• test case: text parser with multiple input files, multiple output arrays
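A back-of-the-envelope model shows why batching helps. It uses the figures measured earlier in the deck for the H101 (200 µs run-stop, 23 µs transfer latency, 290 MB/s), but the model itself is an assumption: it takes 1 MB as 10^6 bytes, ignores compute time, and assumes transfers and execution do not overlap. The function name is hypothetical.

```c
/*
 * Estimated time in microseconds for one accelerator invocation that
 * moves total_bytes in the given number of separate transfers.
 */
double invocation_us(double total_bytes, unsigned transfers)
{
    const double run_stop_us   = 200.0;  /* run-stop sequence */
    const double latency_us    = 23.0;   /* per memory transfer */
    const double bytes_per_us  = 290.0;  /* 290 MB/s */

    return run_stop_us + latency_us * (double)transfers
         + total_bytes / bytes_per_us;
}
```

Under these assumptions, ten 29,000-byte files processed one per invocation (one transfer in, one out) cost about 10 × 346 µs = 3,460 µs, while the same data in a single invocation costs about 1,246 µs.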

Page 71: FPGA Data Ingest Processing for NARA Electronic Records

Near-Term Improvements: Double Buffering

• in a production environment, double-buffer used to hide the time penalty of data copying

• test case again text parsing

Page 72: FPGA Data Ingest Processing for NARA Electronic Records

Implementing Improvements

• to modify algorithms for double-buffer and multiple-file processing:
  • input in SRAM (more room)
  • all reads and writes have start offset
  • use nnalli_run() and nnalli_wait() functions instead of nnalli_run_wait()
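The control flow of double buffering with separate run and wait calls can be sketched as follows. The accelerator here is a mock written for this example (`mock_accel`, `mock_run`, `mock_wait` are hypothetical and are NOT the Nnalli API); the "work" is just summing a buffer, and in the mock it happens inside the wait call.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4   /* ints per buffer (assumed size) */

/* Hypothetical stand-in for an asynchronous accelerator. */
typedef struct {
    const int *pending;   /* buffer handed over by the last run call */
    size_t     pending_len;
    long       last_sum;  /* result available after wait */
} mock_accel;

/* Start work on a buffer and return immediately. */
static void mock_run(mock_accel *a, const int *buf, size_t len)
{
    a->pending = buf;
    a->pending_len = len;
}

/* Block until the started work is finished. */
static void mock_wait(mock_accel *a)
{
    long s = 0;
    for (size_t i = 0; i < a->pending_len; i++)
        s += a->pending[i];
    a->last_sum = s;
    a->pending = NULL;
}

/* Sum n_chunks chunks of CHUNK ints, filling one buffer while the other runs. */
long sum_double_buffered(const int *src, size_t n_chunks)
{
    int buf[2][CHUNK];
    mock_accel acc = {0};
    long total = 0;

    if (n_chunks == 0)
        return 0;
    memcpy(buf[0], src, sizeof buf[0]);            /* prime the first buffer */
    for (size_t c = 0; c < n_chunks; c++) {
        int cur = (int)(c & 1);
        mock_run(&acc, buf[cur], CHUNK);           /* start on current buffer */
        if (c + 1 < n_chunks)                      /* meanwhile fill the other */
            memcpy(buf[cur ^ 1], src + (c + 1) * CHUNK, sizeof buf[0]);
        mock_wait(&acc);
        total += acc.last_sum;
    }
    return total;
}
```

The important part is the ordering: the copy into the idle buffer sits between the run and the wait, so on real hardware the copy time would be hidden behind execution.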

Page 73: FPGA Data Ingest Processing for NARA Electronic Records

No 24-bit Histogramming Yet

• Peter Bajcsy has standard suite of histogramming images
• They are histogramming 24-bit values of color pixels
• there isn’t yet a practical way to do image histogramming of pixels in this way; the input space is too large. An internal histogram is limited to roughly 18 bits of input, with no parallel histograms
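The size limit can be checked with simple arithmetic. The helper below is an illustrative assumption: it asks how many input bits a histogram can have if all its bins must fit in a given amount of memory, for a given counter size. Taking .6 MB as 600,000 bytes and 2-byte counters are both assumptions made for the example.

```c
#include <stddef.h>

/*
 * Largest number of input bits b such that 2^b counters of counter_bytes
 * each fit in ram_bytes. Returns 0 if at most one counter fits.
 */
unsigned max_histogram_bits(size_t ram_bytes, size_t counter_bytes)
{
    unsigned bits = 0;

    if (counter_bytes == 0 || ram_bytes < counter_bytes)
        return 0;
    while (bits + 1 < 8 * sizeof(size_t) &&
           ((size_t)1 << (bits + 1)) <= ram_bytes / counter_bytes)
        bits++;
    return bits;
}
```

With 600,000 bytes and 2-byte counters the result is 18 bits, consistent with the figure above; a 24-bit histogram would need 2^24 counters, which is 32 MiB even at 2 bytes each.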

Page 74: FPGA Data Ingest Processing for NARA Electronic Records

One-Time Processing and Goals

• Goal: one-time processing of files ingested into TPAP or ERA-type environment via FPGA-based processing
• some file-processing schemes defined
  • text parsing for word indexing
  • file characterization for ingest verification
  • image stripe processing for object counting

Page 75: FPGA Data Ingest Processing for NARA Electronic Records

Embedded Bitfile Format

• FUSE and Nnalli embed the location of the FPGA bitfile at compile time
• meta-data about bitfile (interfaces and file location) embedded through network.h file
• Ok for deployment, very clumsy for testing bitfile versions
• created embedded bitfile container format that contains the bitfile itself and parsed version of network.h files
• bitfile container is invoked at the command line at run-time (the same executable can use multiple bitfiles at different times)
  • useful for side-by-side testing
  • useful for one piece of code (say, an iRODS service) existing on different machines with different FPGA hardware
• working on this now; current state has been submitted to SAAHPC

Page 76: FPGA Data Ingest Processing for NARA Electronic Records

Future FPGA improvements

• FPGAs:
  • low memory bandwidth in expansion slots
  • slow function activation
  • requires helper function to input data
• Three companies now sell CPU-socket-based FPGA devices
  • direct access to system RAM for lower latency and higher throughput
• We are in the process of purchasing a Nallatech front-side bus FPGA system