Profile-directed speculative optimization of reconfigurable floating point data paths

The Queen’s Tower, Imperial College London, South Kensington, SW7

27th Jan 2008 | Ashley Brown

Profile-directed speculative optimization of reconfigurable floating point data paths

Workshop on Reconfigurable Computing at 2008

Ashley Brown, 27th Jan 2008


Introduction

• Computational science requires reproducible and accurate results

• IEEE-754 is a compromise

– Broad range of values

– Many special cases

• Idea: use profiling to reduce range and remove special cases

Generate floating-point data-paths for FPGAs which are smaller and faster

• BUT KEEP RESULTS CONSISTENT WITH IEEE-754
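To make the "broad range" and "special cases" concrete, the sketch below splits a double-precision value into its IEEE-754 fields and names the case a full floating-point unit has to handle. It is only an illustration: the function names `decode_double` and `classify` are hypothetical and do not come from the talk.

```python
import struct

def decode_double(x: float):
    """Split an IEEE-754 binary64 value into (sign, biased exponent, fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)     # 52-bit stored fraction
    return sign, exponent, fraction

def classify(x: float) -> str:
    """Name the IEEE-754 case that x falls into."""
    _, exponent, fraction = decode_double(x)
    if exponent == 0x7FF:                 # all-ones exponent: infinity or NaN
        return "infinity" if fraction == 0 else "nan"
    if exponent == 0:                     # all-zeros exponent: zero or subnormal
        return "zero" if fraction == 0 else "subnormal"
    return "normal"
```

If profiling shows that a block only ever sees "normal" values in a narrow exponent range, the hardware for the other cases is a candidate for removal.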


Advantages of Smaller Floating Point

• Embedded Systems

– Do the same work for a lower cost

– Implement IEEE-754 compliant floating point where it may not have been possible before

• High performance

– Do more work with the same hardware

– Increase in parallel execution on FPGAs

– No need to sacrifice IEEE-754 compliance

Four Pictures to Explain: #1







[Figure: pre-optimisation vs. post-optimisation]


Optimisation Technique

• Remove features from the floating-point unit:

– Operand alignment

– Normalisation

– Operand swap

• If a removed feature turns out to be required, detect this and fall back to an alternative solution:

– Software-based, on an embedded/host processor

– Hardware-based full implementation, for larger designs
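The detect-and-fall-back idea can be sketched as a small behavioural model. This is not the talk's implementation: it covers only the operand-alignment limit (normalisation and operand swap are not modelled), it assumes zeros, infinities and NaNs are treated as removed special cases, and the names `needs_fallback`, `speculative_add`, `fast_add` and `full_add` are hypothetical.

```python
import math
import operator

def needs_fallback(a: float, b: float, max_align: int) -> bool:
    """Return True when the reduced adder in this model cannot handle (a, b)."""
    # Assumption: zeros, infinities and NaNs count as removed special cases.
    if a == 0.0 or b == 0.0 or not (math.isfinite(a) and math.isfinite(b)):
        return True
    # math.frexp returns (m, e) with x == m * 2**e; the exponent gap is the
    # number of bit positions the smaller operand must be shifted to align.
    gap = abs(math.frexp(a)[1] - math.frexp(b)[1])
    return gap > max_align

def speculative_add(a, b, max_align, fast_add=operator.add, full_add=operator.add):
    """Use the reduced path when it is safe, otherwise re-execute on the full path.

    Returns (result, used_fallback). Both paths default to Python's own
    addition, standing in for the reduced hardware and the full implementation.
    """
    if needs_fallback(a, b, max_align):
        return full_add(a, b), True
    return fast_add(a, b), False
```

For example, `speculative_add(1.0, 1.5, 4)` stays on the reduced path, while `speculative_add(1.0, 1e10, 4)` is flagged for re-execution because the operands are too far apart to align within four bit positions.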

Optimisation Options


The stages of optimisation

• Profile target application with training datasets

– Source usually FORTRAN or C

• Identify frequently-executed blocks

• Check for good value-locality

• Generate reduced-size floating point datapath

– Reduced operand alignment hardware

– Reduced normalisation hardware

• Error checking: execute with additional datasets, check error rates
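As an illustration of the "identify frequently-executed blocks" stage, the sketch below picks the smallest set of blocks that accounts for a chosen share of all floating-point operations. The input format (a mapping from block name to operation count), the 90% default, and the name `hot_blocks` are assumptions made for this example, not details from the talk.

```python
def hot_blocks(op_counts: dict, coverage: float = 0.9) -> list:
    """Return the most-executed blocks that together reach the coverage target.

    op_counts maps a block name to the number of floating-point operations
    the profiler attributed to it (hypothetical input format).
    """
    total = sum(op_counts.values())
    selected, covered = [], 0
    # Visit blocks from most to least executed.
    for block, count in sorted(op_counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        selected.append(block)
        covered += count
    return selected
```

Only the selected blocks would then go on to the value-locality check and data-path generation.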



FloatWatch Profiler

• Valgrind-based value profiler

• Can return some metrics of interest here:

– Floating point value ranges

– Ratio of floating point operands

• Each has uses for optimisation!
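The two metrics can be approximated offline from a trace of operand values. The sketch below is not FloatWatch's interface; it assumes the "ratio of operands" is the gap between the binary exponents of the two inputs to an operation, and the function names are hypothetical.

```python
import math
from collections import Counter

def exponent_histogram(values):
    """Count how often each binary exponent occurs among non-zero finite values."""
    hist = Counter()
    for v in values:
        if v != 0.0 and math.isfinite(v):
            hist[math.frexp(v)[1]] += 1   # frexp gives (m, e) with v == m * 2**e
    return hist

def operand_gap_histogram(pairs):
    """Count the exponent difference between the two operands of each operation."""
    hist = Counter()
    for a, b in pairs:
        if a != 0.0 and b != 0.0 and math.isfinite(a) and math.isfinite(b):
            hist[abs(math.frexp(a)[1] - math.frexp(b)[1])] += 1
    return hist
```

A narrow exponent histogram suggests the exponent range can be reduced; a gap histogram concentrated near zero suggests the alignment shifter can be shortened.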


VFLOAT Library

• VHDL variable-precision floating-point library

– Initially developed by Belanovic at Northeastern, continued development under the supervision of Leeser

• Allows basic customisation of precision, exponent bit widths

• Further customisations added for our optimisations:

– Operand alignment

– Normalisation

• Performance is lower than vendor-specific libraries
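To show what customising precision and exponent width means for a value, the sketch below rounds a Python float to a format with a chosen number of exponent and fraction bits. It is a software model only and is not the VFLOAT interface (VFLOAT is a VHDL library); it assumes round-to-nearest-even, does not model subnormals, and the name `quantize` is hypothetical.

```python
import math

def quantize(x: float, exp_bits: int, frac_bits: int) -> float:
    """Round x to a format with exp_bits exponent bits and frac_bits fraction bits.

    Raises OverflowError when the rounded value falls outside the normal
    exponent range of the target format (subnormals are not modelled).
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                        # x == m * 2**e, 0.5 <= |m| < 1
    # Keep frac_bits + 1 significant bits; round() rounds ties to even.
    scaled = round(m * 2 ** (frac_bits + 1))
    y = math.ldexp(scaled, e - (frac_bits + 1))
    bias = 2 ** (exp_bits - 1) - 1
    unbiased = math.frexp(y)[1] - 1             # exponent in 1.f * 2**unbiased form
    if not (1 - bias <= unbiased <= bias):
        raise OverflowError("value outside the normal range of the target format")
    return y
```

With `exp_bits=8, frac_bits=23` this reproduces single-precision rounding for normal values, which gives a simple way to sanity-check the model.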


Data-path Generator

• Takes user-selected data-path and generates VHDL implementation

• Assembles a modified version of the RPL library, customised to allow removal of various items

• Builds hardware/software integration layer

– C library for software

– VHDL for hardware

• Does not modify the software source automatically (yet)


Proof-of-Concept Testing

• Original application modified to call C library (usually from FORTRAN)

• Data sent to hardware, calculated, and returned

– Software waits for response

– No data-aggregation or hardware-side error detection occurs

• Software layer performs same calculation for verification

• Overall error rate reported
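The verification loop described above can be sketched as follows. The hardware is replaced by a hypothetical stand-in, `truncating_adder`, which simply loses the smaller operand when the exponent gap exceeds its alignment limit; a real reduced data-path may fail differently. All names here are assumptions for illustration.

```python
import math
import struct

def same_bits(x: float, y: float) -> bool:
    """Compare two doubles bit for bit, so a mismatch means a non-IEEE-754 result."""
    return struct.pack("<d", x) == struct.pack("<d", y)

def reference_add(a: float, b: float) -> float:
    """Software reference: the host's IEEE-754 addition."""
    return a + b

def truncating_adder(max_align: int):
    """Hypothetical model of a reduced adder with a limited alignment shifter."""
    def add(a: float, b: float) -> float:
        gap = abs(math.frexp(a)[1] - math.frexp(b)[1])
        if gap > max_align:
            return a if abs(a) > abs(b) else b   # smaller operand is lost
        return a + b
    return add

def error_rate(operand_pairs, hardware_add, software_add=reference_add) -> float:
    """Fraction of operations where the hardware result differs from software."""
    total = mismatches = 0
    for a, b in operand_pairs:
        total += 1
        if not same_bits(hardware_add(a, b), software_add(a, b)):
            mismatches += 1
    return mismatches / total if total else 0.0
```

Each mismatch corresponds to an operation that would need re-execution on the fall-back path once error detection is implemented.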


‘ydl_pij’

• ‘ydl_pij’ is an iterative solver for quantum mechanics, using the “Molecular Mechanics – Valence Bond” method

• Datasets of various sizes are available, allowing a variety of test cases to be used

• Initial profiling and testing use separate datasets


‘ydl_pij’: Profiling (Hot Code Section)

Narrow value ranges


‘ydl_pij’: Identification

• FloatWatch identifies the regions of code executing the most operations

• In this case, these show narrow value ranges

• Create optimised datapaths for testing

– Maximum operand alignment reduced to 2^n, where n is in the range [1, 6]

– Normalisation hardware modified similarly
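The choice of n can be explored from profile data before any hardware is built. The sketch below estimates, for each n in [1, 6], the share of additions whose operands are further apart than a 2^n-position alignment shifter can handle. It assumes the limit is a shift distance in bit positions and that operands are non-zero and finite; the name `reexecution_rates` is hypothetical.

```python
import math

def reexecution_rates(pairs, n_values=range(1, 7)) -> dict:
    """Map each n to the fraction of operand pairs needing more than 2**n shift."""
    pairs = list(pairs)
    rates = {}
    for n in n_values:
        limit = 2 ** n
        exceeded = sum(
            1 for a, b in pairs
            if abs(math.frexp(a)[1] - math.frexp(b)[1]) > limit
        )
        rates[n] = exceeded / len(pairs) if pairs else 0.0
    return rates
```

Smaller n means less hardware but a higher estimated rate; the trade-off is then confirmed by running the separate test datasets.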

‘ydl_pij’ Error Rate

Not profiled

‘ydl_pij’: Error Rate and Size

• 20% size reduction with negligible re-execution rate (< 0.5%)

• 27% size reduction with 3% re-execution rate

• Size reduction permits a ~40% increase in parallelism due to better space usage
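As a rough cross-check on the link between size and parallelism, the sketch below computes how many more units fit in the same area from the size reduction alone. It assumes area is the only constraint and that units pack perfectly, so it does not account for the placement effects behind the "better space usage" figure; `units_ratio` is a hypothetical name.

```python
def units_ratio(size_reduction: float) -> float:
    """Ideal ratio of units that fit in a fixed area after shrinking each unit."""
    return 1.0 / (1.0 - size_reduction)

# Size reductions quoted above: 20% and 27%.
ratio_20 = units_ratio(0.20)   # 1.25, i.e. 25% more units
ratio_27 = units_ratio(0.27)   # about 1.37, i.e. roughly 37% more units
```

On area alone, the 27% reduction gives roughly 37% more units, which is in the same region as the ~40% quoted.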

‘ydl_pij’: Area saving for one F.P. adder/subtractor


[Figure: pre-optimisation vs. post-optimisation]


Coming Soon

• Per-operation optimisations

– Currently only at data-path level

• Optimisation of operand-swap hardware

• Per-operation exponent customisation (size, bias)

• Performance evaluation using state-of-the-art FPGA accelerator hardware

• Implementation of error detection and re-execution

• Potential for even greater size reductions
