GUSTO: General architecture design Utility and Synthesis Tool for Optimization
Qualifying Exam — Ali Irturk, University of California, San Diego
Slide 2
Thesis Objective: Design of a novel tool, GUSTO, for the automatic generation and optimization of application-specific matrix computation architectures from a given Matlab algorithm; demonstrating the tool's effectiveness through rapid architectural generation of various signal processing, computer vision, and financial computation algorithms.
Slide 3
Motivation. Matrix computations lie at the heart of most scientific computational tasks: wireless communication, financial computation, computer vision. Matrix inversion is required in equalization algorithms, to remove the effect of the channel on the signal; in the mean variance framework, to solve a constrained maximization problem; and in the optical flow computation algorithm, for motion estimation. (Diagram: QRD, A^-1.)
Slide 4
Motivation. A number of tools translate Matlab algorithms to a hardware description language; however, we believe that the majority of these tools take the wrong approach. We take a more focused approach, developing a tool that specifically targets matrix computation algorithms.
Slide 5
Computing Platforms: ASICs, DSPs, FPGAs, GPUs, CELL BE.
ASICs: exceptional performance, but long time to market and substantial costs.
DSPs: ease of development and fast time to market, but low performance.
FPGAs: ease of development, fast time to market, and ASIC-like performance.
Slide 6
Field Programmable Gate Arrays. FPGAs are ideal platforms: high processing power, flexibility, low non-recurring engineering (NRE) cost. If used properly, these features enhance performance and throughput significantly. BUT! Few tools exist that can aid the designer with the many system, architectural, and logic design choices.
Slide 7
GUSTO: General architecture design Utility and Synthesis Tool for Optimization. An easy-to-use tool for more efficient design space exploration and development. GUSTO's inputs: the algorithm, matrix dimensions, bit width, resource allocation, and mode; its output: the required HDL files. GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review, Transactions on Embedded Computing Systems.
Slide 8
Outline Motivation GUSTO: Design Tool and Methodology
Applications Matrix Decomposition Methods Matrix Inversion Methods
Mean Variance Framework for Optimal Asset Allocation Future Work
Publications 8
Slide 9
GUSTO Design Flow. Stages: Algorithm Analysis, Instruction Generation, Resource Allocation, Error Analysis, and Architecture Generation. Inputs: the algorithm, matrix dimensions, type and number of arithmetic resources (+, -, *, /), and data representation, drawing on a design library. Mode 1 produces a General Purpose Architecture (dynamic); after resource trimming and scheduling, Mode 2 produces an Application Specific Architecture (static). Both are taken through Xilinx and Mentor Graphics tools to obtain area, latency and throughput results as well as simulation results.
Slide 10
GUSTO Modes. Mode 1 generates a general purpose architecture and its datapath: it can be used to explore other algorithms, but it does not lead to high-performance results. Mode 2 creates a scheduled, static, application specific architecture: GUSTO simulates the Mode 1 architecture to collect scheduling information and to determine the usage of resources. (Architecture diagram: instruction controller, memory controller, and arithmetic units built from adders and multipliers.)
Slide 11
Matrix Multiplication Core Design (walkthrough of the GUSTO design flow; diagram as on Slide 9).
Slide 12
Matrix Multiplication Core Design (design flow diagram as on Slide 9).
Slide 13
Matrix Multiplication Core Design — Algorithm Analysis.

Hand-written loops:
for i=1:n
  for j=1:n
    for k=1:n
      Temp = A(i,k)*B(k,j);
      C(i,j) = C(i,j) + Temp;
    end
  end
end

Built-in function: C = A * B
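The Matlab loop nest on this slide translates directly into any imperative language; a plain-Python equivalent (illustrative only, not code generated by GUSTO):

```python
def matmul(A, B):
    """Triple-loop matrix multiply, mirroring the Matlab nest on this slide.

    A and B are square matrices given as lists of rows.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Temp = A(i,k)*B(k,j); C(i,j) = C(i,j) + Temp
                C[i][j] += A[i][k] * B[k][j]
    return C
```

The same accumulation structure is what GUSTO's algorithm analysis sees: n^3 multiply-accumulate operations with a regular access pattern.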
Slide 14
Matrix Multiplication Core Design — Instruction Generation (design flow diagram as on Slide 9).

Matrix Multiplication Core Design — Instructions: the generated instructions drive an architecture built from an instruction controller, a memory controller, and arithmetic units (adders and multipliers).
Slide 17
Matrix Multiplication Core Design (design flow diagram as on Slide 9).
Slide 18
Matrix Multiplication Core Design — Number of Arithmetic Units. (Architecture diagram: instruction controller, memory controller, and arithmetic units built from adders and multipliers.)
Slide 19
Matrix Multiplication Core Design — Error Analysis (design flow diagram as on Slide 9).
Slide 20
Matrix Multiplication Core Design — Error Analysis. GUSTO compares its fixed point arithmetic results (using variable bit widths) and floating point arithmetic results (single/double precision) against MATLAB, for user-defined input data. Error analysis metrics: 1) mean error, 2) peak error, 3) standard deviation of error, 4) mean percentage error.
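The four error metrics listed can be computed as follows; a sketch assuming simple lists of reference and fixed-point results (the names error_metrics and quantize are illustrative, not GUSTO's API):

```python
import statistics

def error_metrics(reference, fixed_point):
    """The four metrics named on the slide, comparing fixed-point
    results against a (double-precision) reference."""
    errors = [abs(r - f) for r, f in zip(reference, fixed_point)]
    return {
        "mean_error": sum(errors) / len(errors),
        "peak_error": max(errors),
        "std_of_error": statistics.pstdev(errors),
        "mean_pct_error": 100.0 * sum(
            e / abs(r) for e, r in zip(errors, reference)
        ) / len(errors),
    }

def quantize(x, frac_bits):
    """Round a value to a fixed-point grid with frac_bits fractional bits,
    modeling a variable-bit-width datapath."""
    scale = 1 << frac_bits
    return round(x * scale) / scale
```

Running a design's outputs through quantize and then error_metrics is the kind of sweep that lets the bit width be chosen per application.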
Matrix Multiplication Core Design — Architecture Generation (design flow diagram as on Slide 9).
Slide 23
Matrix Multiplication Core Design — Architecture Generation. General purpose architecture: dynamic scheduling, dynamic memory assignments, full connectivity between the instruction controller, memory controller, and arithmetic units.
Slide 24
Matrix Multiplication Core Design — Architecture Generation (design flow diagram as on Slide 9).
Slide 25
Matrix Multiplication Core Design — Architecture Generation. Application specific architecture: static scheduling, static memory assignments, and only the required connectivity.
Slide 26
GUSTO Trimming Feature. Simulation runs record which of the candidate connections (Out_A, Out_B, Out_mem1, Out_mem2) actually drive arithmetic unit A's inputs (In_A1, In_A2); the unused connections are trimmed away.
Slide 27
GUSTO Trimming Feature (continued). The same simulation-based analysis trims the connections driving arithmetic unit B's inputs (In_B1, In_B2).
Slide 28
Matrix Multiplication Core Results. Three generated designs, each with an instruction controller and memory controller: Design 1 (4 adders, 4 multipliers), Design 2 (2 adders, 4 multipliers), Design 3 (2 adders, 2 multipliers); compared by area (slices) versus throughput. Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009).
Slide 29
Hierarchical Datapaths. Unfortunately, this organization of the architecture does not provide a complete design space for exploring better design alternatives, and it does not scale well with the complexity of the algorithms (number of instructions, number of functional units, internal storage and communication, optimization performance). To overcome these issues, we incorporate hierarchical datapaths and heterogeneous architecture generation options into GUSTO.
Slide 30
Matrix Multiplication Core Results. Design 4: a single Core A_1 (instruction controller, memory controller, 4 adders, 4 multipliers). Design 5: 16 Core A_1 instances composed through a hierarchical datapath. (Citation as on Slide 28.)
Slide 31
Matrix Multiplication Core Results. Design 4: 1 Core A_1; Design 5: 16 Core A_1 cores; Design 6: 1 Core A_2; Design 7: 8 Core A_2 cores. (Citation as on Slide 28.)
Slide 32
Matrix Multiplication Core Results. Designs 4-9 scale the hierarchy: 16 Core A_1 cores (Design 5), 8 Core A_2 cores (Design 7), and 4 Core A_4 cores (Design 9), alongside the single-core Designs 4, 6 and 8. (Citation as on Slide 28.)
Matrix Multiplication Core Results. Designs 10-12 combine heterogeneous cores (Design 10: Core A_2 with Core A_4; Design 11: Core A_1 with Core A_2; Design 12: Core A_1 with Core A_4); all twelve designs are compared by throughput versus area (slices). (Citation as on Slide 28.)
Slide 35
Outline Motivation GUSTO: Design Tool and Methodology
Applications Matrix Decomposition Methods Matrix Inversion Methods
Mean Variance Framework for Optimal Asset Allocation Future Work
Publications 35
Slide 37
MATRIX DECOMPOSITIONS: QR, LU AND CHOLESKY
QR: given matrix = orthogonal matrix × upper triangular matrix.
LU: given matrix = lower triangular matrix × upper triangular matrix.
Cholesky: given matrix = unique lower triangular matrix (the Cholesky triangle) × transpose of that lower triangular matrix.
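In conventional notation, the three factorizations on this slide are (the symbols are the standard ones, not taken from the slide):

```latex
A = QR, \qquad Q^{T}Q = I,\ R \text{ upper triangular} \quad \text{(QR)}
A = LU, \qquad L \text{ lower},\ U \text{ upper triangular} \quad \text{(LU)}
A = GG^{T}, \qquad G \text{ lower triangular},\ A \text{ symmetric positive definite} \quad \text{(Cholesky)}
```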
Slide 38
MATRIX INVERSION. Given matrix × inverse matrix = identity matrix. Full matrix inversion is costly!
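Decomposition-based inversion replaces the costly direct inverse with triangular operations; for QR, the standard identity is:

```latex
AA^{-1} = I, \qquad A = QR \;\Rightarrow\; A^{-1} = R^{-1}Q^{T},
```

since Q^{-1} = Q^{T} for an orthogonal matrix, and R^{-1} is obtained cheaply by back substitution on an upper triangular matrix.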
Slide 39
Results — Inflection Point Analysis (runtime versus matrix size). Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2009.
Slide 40
Results — Inflection Point Analysis. Implementations: serial and parallel. Bit widths: 16, 32, 64 bits. Matrix sizes: 2×2, 3×3, …, 8×8.
Slide 41
Results — Inflection Point Analysis: Decomposition Methods.
Slide 42
Results — Inflection Point Analysis: Matrix Inversion. An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008.
Slide 43
Results — Finding the Optimal Hardware: Decomposition Methods. Decrease in area from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 94%, LU 83%, Cholesky 86%. Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009.
Slide 44
Results — Finding the Optimal Hardware: Decomposition Methods. Increase in throughput from the general purpose architecture (Mode 1) to the application specific architecture (Mode 2): QR 68%, LU 16%, Cholesky 14%.
Slide 45
Results — Finding the Optimal Hardware: Matrix Inversion (using QR). On average, a 59% decrease in area and a 3× increase in throughput.
Results — Comparison with Previously Published Work: Matrix Inversion.
Columns: Eilert et al. / Our Impl. A / Our Impl. B / Our Impl. C (Analytic), then Edman et al. / Karkooti et al. / Our Method (QR).
Bit width: 16 / 20 / 12 / 20
Data type: floating / fixed / floating / fixed
Device type: Virtex 4 / Virtex 2 / Virtex 4
Slices: 1561 / 2094 / 7021 / 4002 / 808 / 4400 / 9117 / 3584
DSP48s: 0 / 0 / 48 / 16 / NR / 22 / 12
BRAMs: NR / 0 / 0 / 0 / 1
Throughput (10^6 s^-1): 1.04 / 0.83 / 0.38 / 0.72 / 1.3 / 0.28 / 0.12 / 0.26
J. Eilert, D. Wu, D. Liu, Efficient Complex Matrix Inversion for MIMO Software Defined Radio, IEEE International Symposium on Circuits and Systems (2007). F. Edman, V. Öwall, A Scalable Pipelined Complex Valued Matrix Inversion Architecture, IEEE International Symposium on Circuits and Systems (2005). M. Karkooti, J.R. Cavallaro, C. Dick, FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm, Asilomar Conference on Signals, Systems and Computers (2005).
Slide 49
Results — Comparison with Previously Published Work: Matrix Inversion.
Columns: Edman et al. / Karkooti et al. / Our Method (QR / LU / Cholesky).
Bit width: 12 / 20
Data type: fixed / floating / fixed
Device type: Virtex 2 / Virtex 4
Slices: 4400 / 9117 / 3584 / 2719 / 3682
DSP48s: NR / 22 / 12
BRAMs: NR / 1 / 1 / 1
Throughput (10^6 s^-1): 0.28 / 0.12 / 0.26 / 0.33 / 0.25
(Full citations for Edman et al. and Karkooti et al. appear on the previous slide.)
Slide 50
Outline Motivation GUSTO: Design Tool and Methodology
Applications Matrix Decomposition Methods Matrix Inversion Methods
Mean Variance Framework for Optimal Asset Allocation Future Work
Publications 50
Slide 51
Asset Allocation. Asset allocation is the core part of portfolio management: an investor can minimize the risk of loss and maximize the return of a portfolio by diversifying its assets. Determining the best allocation requires solving a constrained optimization problem — Markowitz's mean variance framework.
Slide 52
Asset Allocation. Increasing the number of assets provides significantly more efficient allocations.
Slide 53
High Performance Computing. A higher number of assets and more complex diversification require significant computation. Adding FPGAs to existing high performance computers can boost application performance and design flexibility. Prior FPGA work: Zhang et al. and Morris et al., single option pricing; Kaganov et al., credit derivative pricing; Thomas et al., interest rate and value-at-risk simulations. We are the first to propose hardware acceleration of the mean variance framework using FPGAs. FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
Slide 54
THE MEAN VARIANCE FRAMEWORK (5 phases). Computation of required inputs (expected prices E{M}, expected covariance Cov{M}) → computation of the efficient frontier (candidate allocations plotted as expected return versus standard deviation, i.e. risk) → computation of the optimal allocation (the highest utility portfolio on the frontier).
Slide 57
Hardware Architecture for MVF Step 2. The random number generator feeds the Monte Carlo block, which produces market vector scenarios; multiplying an allocation ω = [ω1, ω2, …, ωNs] with the market vector (Ns multiplications) gives the objective value, which is scored by expected return and standard deviation (risk): is this allocation the best?
Slide 58
Hardware Architecture for MVF Step 2 58
Slide 59
Hardware Architecture for MVF Step 2 59
Slide 60
Hardware Architecture for MVF Step 2 — parallelism: Ns multipliers, Nm Monte Carlo blocks, Nm utility calculation blocks, and Np satisfaction function calculation blocks, all operating in parallel.
Slide 61
Results — Mean Variance Framework Step 2 (1000 runs; 100,000 scenarios and 50 portfolios). With 10 satisfaction blocks (1 Monte-Carlo block with 10 multipliers and 10 utility function calculator blocks): 151-221. With 10 satisfaction blocks (1 Monte-Carlo block with 20 multipliers and 20 utility function calculator blocks): 302-442. (Citation: FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Workshop on High Performance Computational Finance at SC08, November 2008.)
Slide 62
Outline Motivation GUSTO: Design Tool and Methodology
Applications Matrix Decomposition Methods Matrix Inversion Methods
Mean Variance Framework for Optimal Asset Allocation Future Work
Publications 62
Slide 63
Thesis Outline and Future Work
1. Introduction
2. Comparison of FPGAs, GPUs and CELLs — possible journal paper; GPU implementation of face recognition for a journal paper.
3. GUSTO Fundamentals
4. Super GUSTO — journal paper on hierarchical design and heterogeneous core design; employing different instruction scheduling algorithms and analyzing their effects on the implemented architectures.
5. Small-code applications of GUSTO — matrix decomposition cores (QR, LU, Cholesky) with different architectural choices; matrix inversion cores (Analytic, QR, LU, Cholesky) with different architectural choices; design of an adaptive weight calculation core.
6. Large-code applications using GUSTO — Mean Variance Framework Step 2 implementation; short preamble processing unit implementation; optical flow computation algorithm implementation.
7. Conclusions
8. Future Work
9. References
Slide 64
Outline Motivation GUSTO: Design Tool and Methodology
Applications Matrix Decomposition Methods Matrix Inversion Methods
Mean Variance Framework for Optimal Asset Allocation Future Work
Publications 64
Slide 65
Publications
[15] An Optimized Algorithm for Leakage Power Reduction of Embedded Memories on FPGAs Through Location Assignments, Shahnam Mirzaei, Yan Meng, Arash Arfaee, Ali Irturk, Timothy Sherwood, Ryan Kastner, working paper for IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[14] Xquasher: A Tool for Efficient Computation of Multiple Linear Expressions, Arash Arfaee, Ali Irturk, Ryan Kastner, Farzan Fallah, under review, Design Automation Conference (DAC 2009), July 2009.
[13] Hardware Implementation Trade-offs of Matrix Computation Architectures using Hierarchical Datapaths, Ali Irturk, Nikolay Laptev and Ryan Kastner, under review, Design Automation Conference (DAC 2009), July 2009.
[12] Energy Benefits of Reconfigurable Hardware for use in Underwater Sensor Nets, Bridget Benson, Ali Irturk, Junguk Cho, Ryan Kastner, under review, 16th Reconfigurable Architectures Workshop (RAW 2009), May 2009.
[11] Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC 2009), April 2009.
[10] FPGA Acceleration of Mean Variance Framework for Optimum Asset Allocation, Ali Irturk, Bridget Benson, Nikolay Laptev and Ryan Kastner, In Proceedings of the Workshop on High Performance Computational Finance at SC08 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2008.
[9] GUSTO: An Automatic Generation and Optimization Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, under review (2nd round of reviews), Transactions on Embedded Computing Systems.
[8] Automatic Generation of Decomposition based Matrix Inversion Architectures, Ali Irturk, Bridget Benson and Ryan Kastner, In Proceedings of the IEEE International Conference on Field-Programmable Technology (ICFPT), December 2009.
Slide 66
Publications
[7] Survey of Hardware Platforms for an Energy Efficient Implementation of Matching Pursuits Algorithm for Shallow Water Networks, Bridget Benson, Ali Irturk, Junguk Cho, and Ryan Kastner, In Proceedings of the Third ACM International Workshop on UnderWater Networks (WUWNet), in conjunction with ACM MobiCom 2008, September 2008.
[6] Design Space Exploration of a Cooperative MIMO Receiver for Reconfigurable Architectures, Shahnam Mirzaei, Ali Irturk, Ryan Kastner, Brad T. Weals and Richard E. Cagley, In Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2008.
[5] An FPGA Design Space Exploration Tool for Matrix Inversion Architectures, Ali Irturk, Bridget Benson, Shahnam Mirzaei and Ryan Kastner, In Proceedings of the IEEE Symposium on Application Specific Processors (SASP), June 2008.
[4] An Optimization Methodology for Matrix Computation Architectures, Ali Irturk, Bridget Benson, and Ryan Kastner, unsubmitted manuscript.
[3] FPGA Implementation of Adaptive Weight Calculation Core Using QRD-RLS Algorithm, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, unsubmitted manuscript.
[2] An Efficient FPGA Implementation of Scalable Matrix Inversion Core using QR Decomposition, Ali Irturk, Shahnam Mirzaei and Ryan Kastner, unsubmitted manuscript.
[1] Implementation of QR Decomposition Algorithms using FPGAs, Ali Irturk, MS Thesis, Department of Electrical and Computer Engineering, University of California, Santa Barbara, June 2007. Advisor: Ryan Kastner
MATRIX INVERSION (Ali Irturk, UC San Diego, SASP 2008). For analytic simplicity and computational convenience, use either decomposition methods (QR, LU, Cholesky, etc.) or the analytic method.
Slide 69
Matrix Inversion using QR Decomposition. Given matrix = orthogonal matrix × upper triangular matrix.
Slide 70
Matrix Inversion using QR Decomposition. Three different QR decomposition methods: Gram-Schmidt orthonormalization, Givens rotations, Householder reflections. (Diagram: the columns of the matrix; the entry at the intersection of the i-th row and j-th column; memory contents; the Euclidean norm.)
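Of the three QR methods listed, Gram-Schmidt is the simplest to sketch in software; a classical Gram-Schmidt QR in plain Python (illustrative only; the function name is an assumption, and the hardware versions use Givens or Householder variants as well):

```python
def qr_gram_schmidt(A):
    """Classical Gram-Schmidt QR of a square matrix A (list of rows).

    Returns (Q, R) with A = Q R, Q having orthonormal columns and
    R upper triangular.
    """
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]  # column views
    Q = []                                # orthonormal columns, built up
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(Q):
            # projection of column j onto the already-orthonormal q_i
            R[i][j] = sum(q[k] * cols[j][k] for k in range(n))
            v = [v[k] - R[i][j] * q[k] for k in range(n)]
        # Euclidean norm of the residual (the norm shown on the slide)
        R[j][j] = sum(x * x for x in v) ** 0.5
        Q.append([x / R[j][j] for x in v])
    # convert the list of columns back to row-major form
    Qm = [[Q[j][i] for j in range(n)] for i in range(n)]
    return Qm, R
```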
Slide 71
Matrix Inversion using the Analytic Method. The analytic method uses the adjoint matrix Adj(A) and the determinant det(A) of the given matrix: A^-1 = Adj(A) / det(A). (Example shown: the determinant of a 2×2 matrix.)
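For the 2×2 case mentioned on the slide, the analytic inverse is small enough to write in full; a sketch (illustrative only, not the FPGA datapath):

```python
def inverse_2x2(A):
    """Analytic inverse of a 2x2 matrix: inv(A) = Adj(A) / det(A)."""
    (a, b), (c, d) = A
    det = a * d - b * c          # determinant of the 2x2 matrix
    if det == 0:
        raise ValueError("singular matrix")
    # adjoint of [[a, b], [c, d]] is [[d, -b], [-c, a]]
    return [[d / det, -b / det], [-c / det, a / det]]
```

The 4×4 analytic inverse on the following slides is built the same way, with each cofactor assembled from 2×2 sub-determinants like this one.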
Slide 72
Adjoint Matrix Calculation — Cofactor Calculation Core. Each cofactor (e.g. C11) is assembled from 2×2 sub-determinants such as A33·A44 − A34·A43, A32·A44 − A34·A42, and A32·A43 − A33·A42.
Slide 73
Different Implementations of the Analytic Approach. Implementations A, B and C, all built around the cofactor calculation core.
Slide 74
Matrix Inversion using LU Decomposition. Given matrix = lower triangular matrix × upper triangular matrix.
Slide 75
Matrix Inversion using LU Decomposition (equation walkthrough, steps 1-9).
Slide 76
Matrix Inversion using LU Decomposition (equation walkthrough, steps 1-9, continued).
Slide 77
Matrix Inversion using Cholesky Decomposition. Given matrix = unique lower triangular matrix (the Cholesky triangle) × the transpose of that lower triangular matrix.
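The Cholesky triangle named on this slide can be computed with the textbook algorithm; a minimal pure-Python sketch (illustrative only, not the hardware schedule):

```python
def cholesky(A):
    """Cholesky decomposition A = G G^T for a symmetric positive-definite
    matrix A (list of rows); G is the lower-triangular Cholesky triangle."""
    n = len(A)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(G[i][k] * G[j][k] for k in range(j))
            if i == j:
                # diagonal entries need a square root
                G[i][j] = (A[i][i] - s) ** 0.5
            else:
                # off-diagonal entries need a division by the diagonal
                G[i][j] = (A[i][j] - s) / G[j][j]
    return G
```

The square root and division on the diagonal are the operations that dominate the fixed-point error analysis of this decomposition.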
Slide 78
Matrix Inversion using Cholesky Decomposition (equation walkthrough, steps 1-7).
Slide 79
Matrix Inversion using Cholesky Decomposition (equation walkthrough, steps 1-7, continued).
Slide 80
Matrix Inversion using Cholesky Decomposition (equation walkthrough, steps 1-7, continued).
Slide 81
THE MEAN VARIANCE FRAMEWORK (5 phases; overview diagram as on Slide 54).
Slide 82
THE MEAN VARIANCE FRAMEWORK — phases 1, 2, 3: computation of required inputs (expected prices E{M}, expected covariance Cov{M}), computation of the efficient frontier, computation of the optimal allocation (diagram as on Slide 54).
Slide 83
COMPUTATION OF REQUIRED INPUTS. Known data: publicly available data (prices, covariance), number of securities, reference allocation, investor objective, the time the investment is made, the investment horizon, and the estimation interval. Steps: 1) detect the invariants; 2) determine the distribution of the invariants; 3) project the invariants to the investment horizon; 4) compute the expected return and the covariance matrix; 5) compute the expected return and the covariance matrix of the market vector. Outputs: expected prices E{M}, expected covariance Cov{M}.
Slide 84
COMPUTATION OF REQUIRED INPUTS. Investor objectives: absolute wealth, relative wealth, net profits. An allocation α = [α1, α2, …, αNs] and the market vector M yield the objective value Ψ = α'M.
Slide 85
COMPUTATION OF REQUIRED INPUTS — STEP 5. The market vector M is a transformation of the market prices at the investment horizon: M ≡ a + B·P(T+τ). Standard investor objectives (Ψ = α'M) and their generalized forms:
Absolute wealth: Ψ = W(T+τ)(α); a ≡ 0, B ≡ I_N, so M = P(T+τ).
Relative wealth: Ψ = W(T+τ)(α) − γ·W(T+τ)(β); a ≡ 0, B ≡ K, so M = K·P(T+τ).
Net profits: Ψ = W(T+τ)(α) − w_T(α); a ≡ −p_T, B ≡ I_N, so M = P(T+τ) − p_T.
Allocation α = [α1, α2, …, αNs].
Slide 86
COMPUTATION OF REQUIRED INPUTS. Each step requires making assumptions: which invariants, the distribution of the invariants, and the estimation interval. Our assumptions: compounded returns of stocks as market invariants, 3 years of known data, a 1 week estimation interval, and a 1 year horizon. Phase 5 is a good candidate for hardware implementation.
Slide 87
THE MEAN VARIANCE FRAMEWORK (5 phases; overview diagram as on Slide 54).
Slide 88
MVF: STEP 1 — Computation of the Efficient Frontier. Inputs: expected prices E{M}, expected covariance Cov{M}, current prices, number of portfolios, number of securities, budget. For each target variance v ≥ 0, the frontier allocation is α(v) = argmax α'E{M} subject to the constraints and α'Cov{M}α = v; plotting the expected return E{Ψ} against the standard deviation (risk) from Var{Ψ} traces the efficient frontier.
Slide 89
MVF: STEP 1 — Computation of the Efficient Frontier. α(v) = argmax α'E{M} subject to the constraints and α'Cov{M}α = v. The frontier separates the achievable allocations from the unachievable risk-return space; an investor does NOT want to be in the region below the frontier!
Slide 90
THE MEAN VARIANCE FRAMEWORK (5 phases; overview diagram as on Slide 54; moving to computation of the optimal allocation).
Slide 91
MVF: STEP 2 — Computing the Optimal Allocation. Determination of the highest utility portfolio from: current prices, number of securities, number of portfolios, number of scenarios, and a satisfaction index. For each candidate, scored by expected return and standard deviation (risk): is this allocation the best?
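The decision loop on this slide ("is this allocation the best?") can be sketched as a Monte Carlo scoring of candidate allocations; the function and parameter names here are illustrative assumptions, not GUSTO's interface:

```python
def satisfaction(allocation, scenarios, utility):
    """Expected utility of an allocation over Monte Carlo market scenarios:
    a simple certainty-equivalent-style satisfaction index."""
    total = 0.0
    for m in scenarios:                         # m: one simulated market vector
        objective = sum(w * p for w, p in zip(allocation, m))  # psi = alpha' M
        total += utility(objective)
    return total / len(scenarios)

def best_allocation(allocations, scenarios, utility):
    """Score every candidate allocation and keep the one with the
    highest satisfaction index."""
    return max(allocations, key=lambda a: satisfaction(a, scenarios, utility))
```

The hardware version parallelizes exactly these two loops: the scenario loop inside satisfaction and the allocation loop inside best_allocation.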
Slide 92
MVF: STEP 2 — Computing the Optimal Allocation. Satisfaction indices represent all the features of a given allocation with one single number and quantify the investor's satisfaction. Classes of satisfaction indices: certainty-equivalent, quantile, and coherent indices. Certainty-equivalent satisfaction indices are represented by the investor's utility function and objective, u(Ψ); we use the Hyperbolic Absolute Risk Aversion (HARA) class of utility functions: exponential, quadratic, power, logarithmic, linear.
Slide 93
MVF: STEP 2 — Computing the Optimal Allocation. The Hyperbolic Absolute Risk Aversion (HARA) class of utility functions comprises specific forms of the Arrow-Pratt risk aversion model A(Ψ) = −u''(Ψ)/u'(Ψ). Utility functions in the class (standard forms):
Exponential utility (ζ > 0): u(Ψ) = −e^(−Ψ/ζ)
Quadratic utility: u(Ψ) = Ψ − Ψ²/(2ζ)
Power utility (γ ≥ 1): u(Ψ) = Ψ^(1−1/γ)
Logarithmic utility (limit γ → 1): u(Ψ) = ln(Ψ)
Linear utility (limit γ → ∞): u(Ψ) = Ψ
Slide 94
IDENTIFICATION OF BOTTLENECKS. In terms of computation time, the most important variables are: the number of securities, the number of portfolios, and the number of scenarios. (They enter all three stages: computation of required inputs, computation of the efficient frontier, and determination of the highest utility portfolio.)
Slide 95
IDENTIFICATION OF BOTTLENECKS. The number of securities dominates computation time over the number of portfolios. (# of scenarios = 100,000)
Slide 96
IDENTIFICATION OF BOTTLENECKS. The number of portfolios dominates computation time over the number of scenarios. (# of securities = 100)
Slide 97
IDENTIFICATION OF BOTTLENECKS (runtime chart: # of portfolios = 100, # of scenarios = 100,000).
Slide 98
IDENTIFICATION OF BOTTLENECKS (runtime chart: # of scenarios = 100,000, # of securities = 100).
Slide 99
IDENTIFICATION OF BOTTLENECKS (runtime chart: # of portfolios = 100, # of securities = 100).
Slide 100
THE MEAN VARIANCE FRAMEWORK (5 phases; overview diagram as on Slide 54).
Slide 101
Generation of Required Inputs — Phase 5. The Market Vector Calculator IP core computes M from P(T+τ) using the K building block and p_T, selecting a ∈ {0, −p_T} and B ∈ {I_N, K} via control inputs: cntrl_a/cntrl_b = 0/0 for absolute wealth (M = P(T+τ)), 1/0 for relative wealth (M = K·P(T+τ)), and 0/1 for net profits (M = P(T+τ) − p_T).
Slide 102
Generation of Required Inputs — Phase 5 (datapath detail).
THE MEAN VARIANCE FRAMEWORK (5 phases; overview diagram as on Slide 54).
Slide 108
Hardware Architecture for MVF Step 1. α(v) = argmax α'E{M} subject to the constraints and α'Cov{M}α = v. A popular approach to solving such constrained maximization problems is the Lagrangian multiplier method.
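As an illustration of the Lagrangian multiplier method the slide mentions, here is a minimal sketch (a deliberate simplification, not GUSTO's actual solver): the classic closed form for the minimum-variance portfolio under only a budget constraint, w = Σ⁻¹1 / (1'Σ⁻¹1), which falls out of setting the Lagrangian gradient to zero, worked for the 2×2 case:

```python
def min_variance_weights(cov):
    """Minimum-variance weights for two assets via the Lagrangian closed
    form w = inv(S) 1 / (1' inv(S) 1), budget constraint w'1 = 1 only.

    cov is a 2x2 covariance matrix as a list of rows; the 2x2 inverse is
    written analytically.  Simplified illustration, not GUSTO's solver.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    ones = [1.0, 1.0]
    s_inv_1 = [sum(inv[i][j] * ones[j] for j in range(2)) for i in range(2)]
    norm = sum(s_inv_1)                 # 1' inv(S) 1, the Lagrangian scale
    return [x / norm for x in s_inv_1]
```

The closed form shows why matrix inversion sits on the critical path of Step 1: the Lagrangian solution is built from Σ⁻¹.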
Slide 109
Hardware Architecture for MVF Step 1. A number of functions proportional to the number of securities must be computed to determine the efficient allocation for a given risk.
THE MEAN VARIANCE FRAMEWORK (5 phases; overview diagram as on Slide 54).
Slide 112
Hardware Architecture for MVF Step 2. The random number generator feeds the Monte Carlo block; each allocation ω1 = [ω11, ω12, …, ω1Ns] requires Ns multiplications. Utility functions implemented: exponential, u(Ψ) = −e^(−Ψ/ζ), and quadratic, u(Ψ) = Ψ − Ψ²/(2ζ).
Slide 113
Hardware Architecture for MVF Step 2 113
Slide 114
Hardware Architecture for MVF Step 2 114
Slide 115
Hardware Architecture for MVF Step 2 — parallelism: Ns multipliers, Nm Monte Carlo blocks, Nm utility calculation blocks, and Np satisfaction function calculation blocks, all operating in parallel.
Slide 116
Results — Generation of Required Inputs, Phase 5 (1000 runs; Ns arithmetic resources in parallel): 6 - 9.6; 629 (for 50 securities).
Slide 117
Results — Mean Variance Framework Step 2 (repeated from Slide 61).
Slide 118
Conclusion. The Mean Variance Framework's inherent parallelism makes it an ideal candidate for an FPGA implementation: we are bound by hardware resources rather than by the parallelism the framework offers. However, there are many different architectural choices for implementing the framework's steps.