Examiners: Prof. Dr. Christian Plessl, Prof. Dr. Marco Platzner

Guide: Michael Laß

Presenter: Tasneem Filmwala

Study the effects of approximation on conjugate gradient algorithm and accelerate it on FPGA platform

Outline

➔ Problem Description

➔ Motivation

➔ Approximate Computing

➔ Algorithm Background

➔ Tools and Hardware Platform

➔ Design

➔ Approximation Techniques

➔ Precision Scaling in CG

➔ Error Analysis

➔ Evaluation

➔ Conclusion

Problem Description

Huge volumes of data available

Clusters to process them

Increasingly complex mathematical models and simulation environments

HPC Meets Approximate Computing

Motivation

● Computing nodes are already advanced
● Focus on optimizing algorithms
● Use of approximation for performance/resource benefits
● Iterative algorithms are promising targets in HPC
● Conjugate Gradient method is used in HPC for iteratively solving systems of linear equations
● Approximate the data paths of the algorithm
● Accelerate on FPGA

Approximate Computing

● Key Idea

● Error Resilient Domains: Image Processing, Machine Learning, Signal Processing

HPC Meets Approximation

Approximate Iterative Algorithms

Tolerate imperfect solutions

Algorithmic Background

Conjugate Gradient Method
➢ Iterative approach
➢ Optimization of a quadratic function

$F(x) = \tfrac{1}{2} x^T A x - B^T x + C$

$F'(x) = A x - B = 0$

➢ Solves large systems of linear equations

➢ Advanced variation of the steepest descent method

➢ Converges in a few iterations

$A x - B = 0$, where $A$ is the input (known) matrix, $B$ is the input (known) vector, and $x$ is the solution vector.
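The step from the quadratic function to the linear system can be written out explicitly. This short derivation assumes that A is symmetric positive definite, which is the standard requirement for CG and is not stated on the slide:

```latex
\begin{aligned}
F(x) &= \tfrac{1}{2}\, x^T A x - B^T x + C \\
\nabla F(x) &= \tfrac{1}{2}\,(A + A^T)\, x - B = A x - B
  \qquad \text{(using } A = A^T \text{)} \\
\nabla F(x) &= 0 \;\Longleftrightarrow\; A x = B
\end{aligned}
```

Because A is positive definite under this assumption, the stationary point is the unique minimizer of F, so minimizing F and solving A x = B are the same problem.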

➢ Move along A-conjugate search directions: $P_k^T A P_{k-1} = 0$
➢ Initial search direction is the negative of the gradient vector
➢ Next search direction ($P_{k+1}$) is a linear combination of the current gradient vector and the previous search direction

$X_{k+1} = X_k + \text{step-size} \cdot P_k$

$P_{k+1} = -\nabla F(X_{k+1}) + \beta \cdot P_k$

CG Algorithm

Initial guess: $x_0 = 0$

Compute: $r_0 = A x_0 - b$, $p_0 = -r_0$

For (k = 0, 1, 2, ... until convergence) {

$\alpha_k = \dfrac{r_k^T r_k}{p_k^T A p_k}$ (calculate step size)

$x_{k+1} = x_k + \alpha_k p_k$ (calculate x)

$r_{k+1} = r_k + \alpha_k A p_k$ (calculate residual)

$\beta_k = \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$ (calculate search step)

$p_{k+1} = -r_{k+1} + \beta_k p_k$ (calculate search direction)

}
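The five steps of the CG loop can be sketched in plain Python as follows. This is a minimal double-precision reference sketch, not the HLS implementation from the talk; the function names, the tolerance `tol`, and the 2x2 test system are illustrative assumptions.

```python
def matvec(A, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (assumed, not checked)."""
    n = len(b)
    x = [0.0] * n                       # initial guess x0 = 0
    r = [-b_i for b_i in b]             # r0 = A x0 - b = -b
    p = [-r_i for r_i in r]             # p0 = -r0
    rs_old = dot(r, r)
    iterations = 0
    for k in range(max_iter):
        if rs_old ** 0.5 < tol:         # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                         # step size
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]   # new x
        r = [r_i + alpha * q_i for r_i, q_i in zip(r, Ap)]  # new residual
        rs_new = dot(r, r)
        beta = rs_new / rs_old                              # search step
        p = [-r_i + beta * p_i for r_i, p_i in zip(r, p)]   # new direction
        rs_old = rs_new
        iterations = k + 1
    return x, iterations


# Hypothetical 2x2 symmetric positive definite test system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iters = conjugate_gradient(A, b)
```

For this system the exact solution is x = (1/11, 7/11). Note that the loop only ever sees the recursively updated residual, never A x - b recomputed from the current x.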

Tools and Hardware Platform

Tool Flow

Hardware Platform
● IBM POWER8 system with a Virtex-7 based FPGA
● Coherent access to host memory by the FPGA

Coherent Accelerator Processor Interface (CAPI)

Accelerator Function Unit (AFU)

● Components of CAPI

Main components of CAPI application and accelerator

CAPP: Coherent Accelerator Processor Proxy (proxy for the accelerator)

Links coherency protocol between CAPP and PSL

PSL: Power Service Layer (local cache for the accelerator)

Design
➢ Optimization of AFU
➢ Design Methodology
➢ Framework

Optimization of AFU
➢ Increase the data width of the AXI bus
➢ Partition the vector and matrix
➢ Pipeline and unroll operations
➢ Optimize the floating point MAC operation to achieve an initiation interval (II) of 1 by partial accumulation of products
➢ MAC operation achieved II=1 in HLS simulation but failed in hardware
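Partial accumulation can be illustrated with a small software model. A floating point adder takes several cycles, so a single running sum forces each addition to wait for the previous one; spreading the products over several independent partial sums removes that dependency, and the partial sums are reduced once at the end. The sketch below only models the arithmetic in Python; the function name and the lane count `lanes=8` are hypothetical and not taken from the talk.

```python
def mac_partial_accumulation(a, b, lanes=8):
    """Multiply-accumulate using `lanes` interleaved partial sums.

    Consecutive products go to different accumulators, so no addition
    depends on the result of the immediately preceding one.
    """
    partial = [0.0] * lanes
    for i, (a_i, b_i) in enumerate(zip(a, b)):
        partial[i % lanes] += a_i * b_i   # round-robin over the accumulators
    total = 0.0
    for s in partial:                      # final reduction of the partial sums
        total += s
    return total


# Small example whose products are all exactly representable.
a = [float(i) for i in range(1, 17)]
b = [2.0] * 16
result = mac_partial_accumulation(a, b)
```

Because floating point addition is not associative, the interleaved order can round differently from a strictly sequential sum; the example values are chosen so that both orders give exactly the same result.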

Design Methodology

Framework

[Figure: framework block diagram. Recoverable labels: Verilog top module for PSL signals, ha_pclock, CAPI Adapter, CAPI AXI Interconnect, AFU AXI Interconnect, Data Transfer, Clocking Wizard, clocks at 250 MHz and 125 MHz, resets at 250 MHz and 125 MHz (rst_Clock_250MHz, rst_Clock_125MHz)]

Approximation Techniques

➢ Loop perforation
➢ Inexact circuits
➢ Voltage over-scaling
➢ Over-clocking
➢ Skipping tasks and memory accesses
➢ Precision scaling (the technique used in this work)

Precision Scaling in CG

Approximate Storage

Approximate Computation

Error Analysis
➢ Residual not used to check error distance
➢ Residual propagates error from the previous residual
➢ Causes CG to converge to a wrong solution
➢ Use Euclidean error distance to study the error of the different approaches
➢ Study error distance for approximate storage
➢ Study error distance for approximate computation
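The Euclidean error distance compares the approximate solution vector directly against a reference solution, instead of trusting the residual that the approximate solver carries along. A minimal sketch follows; it assumes the reference is a higher-precision solution of the same system, which the slide does not specify, and the function name is hypothetical.

```python
import math


def euclidean_error_distance(x_approx, x_ref):
    """Euclidean (L2) distance between an approximate and a reference solution."""
    return math.sqrt(sum((a - r) ** 2 for a, r in zip(x_approx, x_ref)))


# Hypothetical values: a reference solution and a lower-precision result.
x_ref = [1.0 / 11.0, 7.0 / 11.0]
x_approx = [0.09, 0.64]
distance = euclidean_error_distance(x_approx, x_ref)
```

A distance of zero means the approximate design reproduced the reference solution exactly; larger values measure how far the approximation has moved the result, independent of what the solver's own residual reports.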

Approximate Storage

➢ Single precision storage
➢ Half precision storage
➢ Fixed point storage (using varied bit-widths)
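The effect of each storage format on a single value can be modelled in software. The sketch below rounds a double-precision value to IEEE 754 half and single precision with Python's standard `struct` module, and to a signed fixed point format by scaling and rounding. The helper names, the 16-bit width with 8 fractional bits, and the saturating overflow behaviour are assumptions for illustration, not the bit-widths or overflow mode used in the talk.

```python
import struct


def to_half(x):
    """Round x to IEEE 754 half precision (binary16) and back to a Python float."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def to_single(x):
    """Round x to IEEE 754 single precision (binary32) and back to a Python float."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def to_fixed(x, total_bits=16, frac_bits=8):
    """Round x to a signed fixed point value, saturating on overflow (assumed)."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    q = max(lowest, min(highest, round(x * scale)))  # integer code word
    return q / scale


value = 0.1
stored = {
    "half": to_half(value),
    "single": to_single(value),
    "fixed16_frac8": to_fixed(value),
}
```

With these settings, 0.1 is stored as 0.0999755859375 in half precision and as 0.1015625 in the fixed point format, which shows how a floating point format keeps relative accuracy for small values while fixed point accuracy depends on the chosen number of fractional bits. `struct.pack` raises `OverflowError` for values outside the half precision range.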

[Figure: error analysis for approximate storage, comparing half precision storage and single precision storage]

Approximate Computation

➢ Half Precision Computation

➢ Fixed point matrix-vector computation (using varied bit-widths)

➢ All operations calculated in fixed point (using varied bit-widths)
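A fixed point matrix-vector product can be modelled with integer arithmetic: the inputs are quantized to integers, the multiply-accumulate runs on integers, and only the result is converted back to floating point for the remaining CG operations. The sketch below is a software model under stated assumptions: the function names and `frac_bits=12` are hypothetical, and Python integers never overflow, so the finite accumulator width of a real design is not modelled.

```python
def quantize(x, frac_bits):
    """Convert a real value to a fixed point integer code with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def fixed_point_matvec(A, v, frac_bits=12):
    """Matrix-vector product computed on integers, returned as floats."""
    scale = 1 << frac_bits
    A_q = [[quantize(a, frac_bits) for a in row] for row in A]
    v_q = [quantize(x, frac_bits) for x in v]
    result = []
    for row in A_q:
        acc = 0                              # integer accumulator
        for a, x in zip(row, v_q):
            acc += a * x                     # each product carries 2 * frac_bits fractional bits
        result.append(acc / (scale * scale))  # convert back to a real value
    return result


# Example with values that are exactly representable in the chosen format.
A = [[4.0, 1.0], [1.0, 3.0]]
v = [0.5, 0.25]
y = fixed_point_matvec(A, v)
```

For inputs that are not exactly representable, the result differs from the floating point product by an amount that shrinks as `frac_bits` grows, which is the kind of trade-off that varying the bit-width explores.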

[Figure: error analysis for approximate computation, covering half precision computation and fixed point matrix-vector computation; series: HP Mat-Vec, FP Mat-Vec, Single Precision]

Designs Evaluated

Evaluation
➢ BRAM Utilization
➢ Latency Comparison
➢ DSP Utilization
➢ Hardware Evaluation

BRAM Utilization

➢ ~40-50% reduction in BRAM usage with half precision storage

Latency Comparison

➢ Speedup of around 2.0x for designs using half precision storage

DSP Utilization

Hardware Evaluation

➢ Designs evaluated on hardware
➢ Resources available on hardware
➢ Resource utilization results
➢ Performance results

Designs Evaluated on Hardware

➢ Single precision storage with fixed point matrix-vector multiplication

➢ Half precision storage with fixed point matrix-vector multiplication

Resources Available

Resource Utilization Report

➢ Half precision uses 63% less BRAM than single precision

➢ Half precision uses more DSPs than single precision

Performance Results

➢ FPGA implementations perform better than the software implementation

➢ Half precision gave a 1.5x speedup compared to single precision

➢ Half precision achieved a 2.0x speedup compared to the software implementation

Conclusion
➢ Proposed the use of approximate computing in the HPC domain

➢ Approximated the Conjugate Gradient method using precision scaling

➢ Built an IP core to connect the HLS design with CAPI

➢ Implemented CG using HLS, approximated it, and performed error analysis

➢ Evaluated 5 designs in terms of resources and performance

➢ Successfully ran 2 designs and compared performance/resources against a software implementation

➢ Gained speedup and resource benefits using approximation

Key Findings
➢ Floating point worked better for approximate storage
➢ Matrix-vector multiplication is more error resilient than the rest of the operations
➢ Residual calculation proved erroneous

Thank You

Any Questions?

Backup Slides

Pipeline

Residual Problem

Effect of condition number and matrix sizes

Rest Operations using Fixed Point

Fixed Point Computation: Residual Pattern

Framework

Precision Scaling in CG

➢ Approximate storage for saving BRAM
➢ Approximate computation for matrix-vector operations
➢ Use of custom HLS data types
➢ Use of the type casting feature of HLS
