
IN DEGREE PROJECT INFORMATION AND COMMUNICATION TECHNOLOGY,
SECOND CYCLE, 60 CREDITS
STOCKHOLM, SWEDEN 2017

FPGA Hardware Acceleration of Inception Style Parameter Reduced Convolution Neural Networks

KALLE NGO

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

© Kalle Ngo, December 2016

Abstract

Some researchers have noted that the growth rate in the number of network parameters of many recently proposed state-of-the-art CNN topologies is placing unrealistic demands on hardware resources and limits the practical applications of Neural Networks. This is particularly apparent when considering that many of the projected applications (IoT, autonomous vehicles, etc.) utilize embedded systems with even greater restrictions on computation and memory bandwidth than the typical research-class computer cluster that the CNN was designed on.

The GoogLeNet CNN in 2014 proposed a new level of organization (the "Inception Module") that was demonstrated in competition to achieve similar or better performance, while using an order of magnitude fewer network parameters than the other competing topologies.

This thesis explores the characteristics of the new GoogLeNet inception modules and the implications they present to current CNN accelerator architectures. A custom FPGA accelerator is proposed to offset the inception module's increased need to buffer large intermediate convolution arrays, through array partitioning and cascading two convolution operations into a single pipeline pass.

A Xilinx Artix-7 FPGA was used to implement the architecture, where it was able to continuously supply data to the 331 utilized DSP blocks (approximately half of the total available), while using only a quarter of the DDR bandwidth to achieve a peak throughput of 9.11 GFLOPS. The low utilization of the DDR bandwidth suggests that with some optimization, the design can be scaled up to better utilize the available resources and increase throughput.


Contents

1 Introduction
  1.1 Problem Description
  1.2 Research Objectives
  1.3 Structure of the Thesis

2 Background
  2.1 The Perceptron
    2.1.1 Activation Function
  2.2 Artificial Neural Networks
    2.2.1 Multilayer Perceptron
    2.2.2 Convolutional Neural Networks
  2.3 Convolutional Neural Networks Topology
    2.3.1 AlexNet
    2.3.2 OverFeat
    2.3.3 GoogLeNet

3 Related Works
  3.1 State of Neural Networks Today
  3.2 Architecture
    3.2.1 Core Processing Element
      3.2.1.1 Time-Division Processing
      3.2.1.2 Matrix Processing
      3.2.1.3 Systolic Array

4 Architectural Design
  4.1 Design Exploration
    4.1.1 GoogLeNet Inception Module
    4.1.2 Memory Structure
    4.1.3 Data Dependency
    4.1.4 Hardware Reuse
  4.2 Proposed Architecture
    4.2.1 Data Partitioning
      4.2.1.1 Depth Slicing
      4.2.1.2 Static Width
    4.2.2 Convolution Cascade
    4.2.3 Processing Topology

5 Implementation and Analysis
  5.0.1 Processing Elements
    5.0.1.1 Streaming Convolution
    5.0.1.2 Square Convolution
  5.0.2 Accelerator Control
  5.0.3 Data Transfer

6 Results
  6.1 Testing Limitations
    6.1.1 Performance
      6.1.1.1 Memory Bandwidth
      6.1.1.2 Floating Point Throughput
    6.1.2 Efficiency

7 Conclusions
  7.1 Conclusion
  7.2 Future Work

Bibliography

A Vivado Reports

List of Figures

2.1 Mathematical Model of a Neuron
2.2 Multi-Layer Perceptron
2.3 Example CNN Network
2.4 Convolution
2.5 CNN downsampling with MaxPooling and stride 2
2.6 Convolution Filter Kernels [1]
2.7 AlexNet Topology from [1] (image cropping found on original)
2.8 GoogLeNet Topology from [2]
2.10 Inception Module
3.1 Common Inner-product structures
3.2 Example Systolic Array [3]
4.1 Modular Topology
4.2 Inception Module
4.3 Network Parameters of Inception Modules
4.4 Next stage is delayed until output array is complete (blue layer)
4.5 Partitioned sub-array requires neighboring elements to complete convolution
4.6 Streaming Convolution Array Access Pattern
4.7 Conceptual Parallel Processing Element Layout
5.1 Processing Element Architecture
5.2 Utilization Report: Multiply Unit (HLS Version)
5.3 Utilization Report: Accumulator Unit (HLS Version)
5.4 Convolution Buffer Array
5.5 Shift Register Buffer
5.6 Utilization Report: 5x5 Convolution with two core molecules
5.7 Hardware Accelerator System View
A.1 Power Estimation
A.2 Test System Utilization
A.3 Test System Utilization
A.4 Implementation Topology

List of Tables

2.1 Common Activation Functions
4.1 GoogLeNet Network Topology [2]
4.2 Array Memory Usage
4.3 Partitioning: 3 New Static Array Sizes
6.1 Pipeline Latency for different DDR access patterns
6.2 Measured Throughput

Chapter 1

    Introduction

Conventional computers have demonstrated near undisputed superiority over the human brain in highly structured logic tasks and mathematical computation. Intrinsically, digital logic's dependence on rigid rules and absolute order limits its ability to interpret information from an analog world. The Artificial Neural Network (ANN), in contrast to modern computers, excels in non-absolute domains such as pattern recognition, classification, and interpretation of incomplete data.

The capability of a particular ANN, although dependent on many factors, can be fundamentally simplified to a proportional relationship to the network's complexity (network nodes and connections). Consequently, as complexity increases, training (akin to programming the computer) of ANNs suffers from the "Curse of Dimensionality": the time to derive the network weights by iterative exploration exhibits a factorial time complexity O(n!) and can quickly exceed the age of the known universe (even with small networks). Modern research emphasis has been largely concentrated on learning algorithm development and network topology design.

Current applications of neural networks have largely been academic, with relatively limited real world implementations such as character recognition in postal centres, medical diagnosis, and image processing/recognition by search engines. Most of these real world neural network implementations are utilized in specialized industrial installations, adapting computer graphics processors or networked cloud computing.

1.1 Problem Description

The current technological advance towards smart, context-aware systems and autonomous vehicles only accelerates the transition of ANNs from research supercomputers to applications in embedded systems, and further emphasizes the



need for a capable, cost-conscious hardware solution.

While FPGA implementation is ideal due to its compatibility with the highly parallel characteristics of ANNs, there is always the problem of limited fast on-chip memory, necessitating the movement of data across physical chip boundaries and causing bandwidth bottlenecks that limit FPGA parallelism.

The majority of current solutions are variations that are specifically optimized for the limited resources found on embedded systems. Even with these optimizations, many of the current proposed architectures are unable to cope with large data sizes without resorting to using the off-chip storage as the working (scratch-pad) memory.

1.2 Research Objectives

The research objectives pursued in this thesis include:

  • Determine the forward-pass resource requirements of a typical "research class" artificial neural network. Leading neural network designs will be referenced from the ILSVRC challenges¹ to ensure an accurate representation of current state-of-the-art research networks.

  • Develop a suitable architecture to meet the computation and memory demands from the established requirements, with a modular design to allow for scaling to implement larger networks.

  • Design and test a hardware reference prototype utilizing a Xilinx Artix-7 FPGA to implement the core CNN processing architecture.

¹ The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual event that evaluates algorithms for object detection and image classification and provides a uniform dataset for training and evaluation.

1.3 Structure of the Thesis

This exploration is based in the field of Electronic Systems but spans to include Artificial Neural Networks (a field of Machine Learning), with emphasis on exploring hardware solutions for FPGA-based embedded systems.

Due to the combined research fields, this report has been written to include readers from both disciplines; therefore some sections may seem overly detailed and can be skipped at the reader's discretion.

  • Chapter 1 describes the problem and its context.


  • Chapter 2 provides a simplified background on Artificial Neural Networks and their operations, and only covers the depth necessary to understand the problem in the context of this thesis.

  • Chapter 3 contains an exploration of the current state of both fields, as well as architectural solutions commonly applied towards neural network implementation.

  • Chapter 4 establishes the requirements and is formatted as an ongoing discussion in reaching the proposed solution.

  • The following two chapters involve the physical realization of the principles described in Chapter 4 and a critical evaluation of the implemented design.

  • Chapter 7 offers final conclusions and suggestions on future work and direction.

Chapter 2

    Background

2.1 The Perceptron

The Neuron is the basic building block of Artificial Neural Networks (ANNs) and can be likened to the logic gate of digital logic circuits. First described by Frank Rosenblatt in his 1958 paper [4], its design was inspired by the arrangement of neurons found in brain tissue. This artificial neuron is often referred to as a Perceptron, and consists of one or more inputs, a "processor" node and an output. The value of the output is determined by taking the weighted sum of the inputs and evaluating it against an activation function.

Figure 2.1: Mathematical Model of a Neuron (inputs x1, x2, x3 with weights w1, w2, w3 and a bias b feed the summation Σ; an activation function f produces the output y)

Information incoming to the neuron is multiplied by a weight value that is exclusive to that input and neuron. These weighted inputs are then summed together along with an optional "bias" value. The bias acts as a magnitude offset for the summation, and regulates the neuron's "switching" region and sensitivity. The "programming" of an Artificial Neural Network (ANN) structure consists of the weight values and bias of each neuron; these are the values that are adjusted during the training phase.

A single Perceptron is a linear classifier: the output value determines the category. By increasing the number of Perceptron nodes, the network can distinguish between more classes, up to the number of unique output combinations.
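To make the neuron model of Figure 2.1 concrete, the following C sketch computes a single Perceptron output; it is purely illustrative (the function name, array interface, and the step threshold at zero are assumptions, not details taken from this thesis).

#include <stddef.h>

/* Weighted sum of the inputs plus bias, followed by a step activation.
 * Returns 1 if the neuron "fires", 0 otherwise. */
static int perceptron_forward(const float *x, const float *w, size_t n, float bias)
{
    float sum = bias;                 /* the bias offsets the switching region */
    for (size_t i = 0; i < n; i++)
        sum += x[i] * w[i];           /* each input has its own exclusive weight */
    return (sum >= 0.0f) ? 1 : 0;     /* step activation function */
}

A multi-class layer would simply replicate this computation, one weight vector (and bias) per Perceptron node.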

2.1.1 Activation Function

The activation function performs the thresholding operation that determines the neuron output value; in practice it is usually a step function, a logistic or tanh sigmoid function, or the linear rectifier commonly used in deep neural networks [5].

Step Function:            y(v_i) = 0 for v_i < 0, 1 for v_i >= 0
Logistic Function:        y(v_i) = (1 + e^{-v_i})^{-1}
Tanh Function:            y(v_i) = tanh(v_i)
Linear Rectifier (ReLU):  y(v_i) = max(0, v_i)

Table 2.1: Common Activation Functions


    2.2 Artificial Neural Networks

2.2.1 Multilayer Perceptron

The Multilayer Perceptron (MLP) is a modification of the single layer Perceptron that enables classification of non-linearly separable data, and is commonly referred to as the "FC" (Fully Connected) layer when used in conjunction with other network topologies. MLPs always consist of 2 or more layers, where the "Hidden Layers" refers to the neurons between the input and output layers.

Figure 2.2: Multi-Layer Perceptron (input layer I1..In, hidden layer H1..Hn, output layer O1..On)

The MLP is the most basic configuration of the Feed-Forward class of neural networks, where the propagation of data only travels from the input to the output layers during implementation (Recall Phase). This is in contrast to other classes such as Recurrent Neural Networks (RNNs), which have internal states that allow them to exhibit temporal behaviours. RNNs are used in applications where events are distinguished in a serial manner (either by location or time), such as in speech and handwriting recognition of words or letters.

2.2.2 Convolutional Neural Networks

Like MLPs, Convolutional Neural Networks (CNNs) belong to the feed-forward class of ANNs, and are mainly used for vision applications. This network structure was motivated by studies of the animal visual cortex, where individual neurons were found to be responsive to small overlapping regions of the visual field; in contrast, an MLP style structure would require every neuron in the input layer to receive information from every pixel of the visual field.

Figure 2.3: Example CNN Network

CNNs perform rudimentary information extraction by leveraging the spatial structure found in vision-derived data. The convolutions serve as preprocessing to filter and abstract the pixels of the visual field into logical "features" that are fractions of the original data volume. This reduction allows far fewer FC neuron layers to analyze the data for recognition and classification.

As an analogy, when asking a friend over the phone to help identify an object in a photograph, describing the position and colour of every pixel is a poor approach; instead, it is more efficient to describe the distinguishing features, such as handle bars, 2 wheels, a seat, pedals, etc. In this way, much of the information in the photo, such as sky colour, road texture, etc., is unnecessary for the purpose of identifying the bicycle.

The 2D Convolution is the core operation that performs the preprocessing abstraction. Heavily utilized in the field of signal processing, it is an adaptation of the mathematical discrete convolution into 2D space. The function fundamentally calculates the cross-correlation between 2 matrices, and outputs a scalar value that corresponds to how similar the input is to the kernel.

y[m,n] = b + \sum_{k=0}^{K-1} \sum_{l=0}^{L-1} w[k,l] \, x[m+k, n+l]    (2.1)

In the convolution equation 2.1, y[m,n] is the output map of the same dimensions as the input map x, m and n are the coordinates of the "center" pixel of the region of interest, and K, L denote the kernel dimensions. The convolution kernel "slides" across the input until the entire image has been traversed, as depicted in Figure 2.4a.
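As a small worked instance of Equation 2.1 (with an assumed 2x2 kernel and b = 0, purely for illustration), the output at (m, n) is simply the sum of element-wise products between the kernel and the input patch it currently overlaps:

y[m,n] = w[0,0] x[m,n] + w[0,1] x[m,n+1] + w[1,0] x[m+1,n] + w[1,1] x[m+1,n+1]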

The convolution kernel values stay constant throughout the frame traversal, with the output map containing correlation values of how strongly a particular input region displays the kernel feature. Each layer of the CNN distinguishes multiple features with independent convolutions, each with a distinct kernel. Each of these convolutions is stored in a unique buffer, forming the multiple feature maps of Figure 2.3.

Figure 2.4: Convolution. (a) Sliding Window Kernel moving across the input map to produce the output map; (b) Three Channel RGB Input Map with a k×l×n convolution kernel spanning the width, height and n channels

A common simplification used to learn basic CNN operation is to consider the input image and kernels as two-dimensional arrays, but in practice multichannel inputs, such as RGB or the combined output feature maps, form three-dimensional data volumes that utilize similarly three-dimensional convolution kernels. This enables CNNs to distinguish higher order correlations such as colour, textures, and numerical relations.

Figure 2.5: CNN downsampling with MaxPooling and stride 2 (RGB input map → sliding window convolution → feature map → pooling with stride 2 → downsampled feature map)

In most CNN networks, such as the fictional example of Figure 2.3, it is typical to utilize Pooling layers combined with a stride. These Pooling layers combine the output of neuron clusters to improve generalization and also perform non-linear down-sampling.

The stride dictates the number of pixels the kernel advances per iteration; a stride greater than one results in a dimension reduction of the output by the same factor. Down-sampling is not restricted to Pooling layers: a stride can also be specified on convolution layers to perform dimensional down-sampling without generalizing.

There are several pooling functions, with MaxPooling being the most common; a MaxPool function searches for the maximum value in the pooling kernel and subsequently passes only that value to the output map buffer (Figure 2.5).
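The sketch below shows the MaxPool operation in C for a 2x2 window with stride 2 (the window size, function name and row-major layout are illustrative assumptions), which halves the width and height of a feature map as in Figure 2.5.

/* MaxPooling with a 2x2 window and stride 2: each output element keeps only
 * the maximum of its 2x2 input window. Assumes in_h and in_w are even. */
static void maxpool_2x2_stride2(const float *in, float *out, int in_h, int in_w)
{
    int out_w = in_w / 2;
    for (int r = 0; r < in_h; r += 2) {
        for (int c = 0; c < in_w; c += 2) {
            float m = in[r * in_w + c];
            if (in[r * in_w + c + 1] > m)       m = in[r * in_w + c + 1];
            if (in[(r + 1) * in_w + c] > m)     m = in[(r + 1) * in_w + c];
            if (in[(r + 1) * in_w + c + 1] > m) m = in[(r + 1) * in_w + c + 1];
            out[(r / 2) * out_w + (c / 2)] = m;   /* only the maximum is passed on */
        }
    }
}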

    Figure 2.6: Convolution Filter Kernels [1]

The convolution kernels from AlexNet (discussed in Section 2.3.1) are visualized in Figure 2.6, and correspond to the 96 filters of 11x11x3 pixels in the first layer. The patterns found in the visualization are the features each convolution looks for in the input image. The filters from deeper layers become increasingly abstract and cannot be accurately visualized, as they are believed to represent higher order relationships such as numerical counts, orientation, etc. These combined feature maps can be thought of as an extension of the RGB colour channels to represent higher order features.

    2.3 Convolutional Neural Networks Topology

2.3.1 AlexNet

In 2012, Krizhevsky et al. presented a network design that significantly outperformed the second runner-up in ILSVRC 2012 (top-5 error of 16.4% compared to the runner-up with 25.8% error). Their paper [1] has since been widely cited by subsequent research, and has led the field to shift toward the widespread adoption and research of deep CNNs for computer vision tasks [6]. Although the topology was similar to the pioneering 1989 LeNet [7], AlexNet was much larger and most notably featured multiple convolution layers consecutively stacked on top of each other, when other CNNs commonly had only one convolution layer followed by a Pooling layer.

Figure 2.7: AlexNet Topology from [1] (image cropping found on original)

The success of their neural network can be attributed to several factors, but is mainly focused on allowing for the efficient training of deep neural networks. They employed a novel regularization method termed dropout to reduce the overfitting¹ that plagues deep neural network training with finite datasets. The dropout method allowed them to train a large network able to distinguish RGB colour with a massive 60 million parameters.

The AlexNet CNN (Figure 2.7) consists of 8 weight layers, with the first 5 being convolutional, followed by 3 fully connected classification layers. The splits found between the layers were a consequence of the limited memory requiring the network to reside on two separate GPUs. Training the network with two high-end GTX 580 3GB GPUs took 5 to 6 days, with the GPU memory and the training time limiting the network size.

¹ When a statistical model is excessively complex and describes random error/noise instead of the underlying relationship; it occurs due to iterative network training over a finite dataset.

2.3.2 OverFeat

OverFeat [8] participated in ILSVRC 2013 with a CNN network based on the previous year's winner, AlexNet. This CNN somewhat represents a scaled-up derivative with more than double the network weights: 125 million compared to the 60 million of AlexNet.

This doubling in network weights pushed the classification error rate down to 14.2%, although it is important to note that this network participated in a new object localization challenge introduced that year; this additional capability may affect the performance gains of scaling neural networks.

The increased parameters result in 580MB of network weights that must be read during a single forward pass in order to classify a single 221x221 pixel image frame.

For real-time video processing at 30 fps, this represents a massive computational load, with memory bandwidth requirements of 17.5GB/s just for the network weight transfers, not including storage and retrieval of any intermediate results.
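A quick back-of-the-envelope check of that figure: streaming the full weight set once per frame at 30 frames per second gives

580 MB/frame × 30 frames/s ≈ 17.4 GB/s,

which matches the quoted requirement of roughly 17.5 GB/s for weight traffic alone.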

    2.3.3 GoogLeNet

    Figure 2.8: GoogLeNet Topology from [2]

During the ILSVRC 2014 challenge, there was a noticeable shift towards topologies belonging to a category dubbed Very Deep Convolutional Neural Networks, which displayed much higher network layer counts, doubling the typical number of layers of 2013. These emerging Very Deep networks, such as the 2014 runner-up VGGNet [9], contained 19 layers, but the total neuron connections stayed approximately the same as the 8-layer AlexNet. This is in part due to using smaller kernel sizes such as 3x3, and shifting those resources to constructing the network deeper.

The GoogLeNet network by Szegedy et al. [2] introduced a novel approach by integrating several Inception modules together to form a Network-in-Network topology similar to that described in [10]. The resulting network consisted of 22 layers with the winning top-5 error rate of 6.66%. The significance of their topology is apparent when considering they were able to achieve this accuracy with an order of magnitude fewer parameters (4M, compared to AlexNet with 60M and VGGNet with 143M), resulting in a weight file of only 27.2MB.

The dramatic reduction in parameters is due to the introduction of Inception modules, and in part to replacing the fully connected layers with an AveragePooling layer to create the output classification.

Szegedy et al. noted that the recent growth trend of network size came with a corresponding increase in network weight parameters; they speculated that if this trend continued, it could diminish neural networks into a field of purely academic curiosity. Based on this observation, they developed GoogLeNet while balancing accuracy against computational cost and memory bandwidth.

Figure 2.9: (a) Filter depth optimized for locality; (b) Simplified Module

The Inception module from Szegedy et al. exploits a tendency for local correlations in images. By allocating the number of filters accordingly (Figure 2.9a), it is possible to capture a high number of features and cover a larger spatial area without a dramatic increase in resource utilization [11].

Figure 2.10: Inception Module

Figure 2.9b depicts a simplified module where a staggered series of 1x1, 3x3 and 5x5 convolutions is performed and combined to produce the module output. Due to the depth of the network, trivial 5x5 convolutions can be prohibitively expensive when performed on an input with a large number of feature maps [12].

To alleviate this, Szegedy et al. used 1x1 sized convolutions to reduce the dimensions before the expensive 3x3 and 5x5 filters. The 1x1 operations act as a form of compression, with a 1x1 convolution kernel that can be trained to work in conjunction with the following 3x3 or 5x5 filter operations. This allows GoogLeNet to chain 9 Inception modules in series while maintaining a reasonable 1.5 billion multiply-accumulates for a forward recall pass.

The network successfully demonstrated that smaller convolutions of varying sizes are able to differentiate features just as well as a typical single large contiguous filter, but with the benefit of significantly reducing the necessary network weight parameters. This reduction of parameters does come at a cost: due to the multiple stages of back-to-back convolutions, the network topology produces many large intermediate arrays that consume buffer space, and this will be investigated to determine whether the reduced parameter count translates to a total reduction in requirements.

Chapter 3

    Related Works

Many earlier proposed systems were faced with the challenges of limited FPGA gate capacities, leading to compromises in arithmetic precision [13] or, in the case of [14], requiring 30 Xilinx XC3090 FPGAs to implement a relatively small (by today's standards) MLP network. To make matters worse, the most direct known method of improving neural network performance is to simply increase network size [2]; consequently, even with the modern increased capacity of FPGAs, specialized DSP blocks, and runtime reconfiguration, there will always be an unbridgeable divide between theoretical network design and realizable capacity.

The most straightforward CNN implementation with ideal performance is to simply map every neuron and connection directly onto the FPGA. This would be prohibitively expensive and realistically infeasible for nearly any modern large artificial neural network. The finite density of FPGAs has led to solutions that break down the neural network into smaller, manageable portions for computation.

While this divide-and-conquer methodology provides greater functional density, it comes at a cost; by storing the unused network portions off chip, memory bandwidth is fast becoming the limiting factor, unable to keep up with the raw computational power of modern FPGAs.

The majority of CNN solutions take inspiration from the SIMD vector processors of the GPUs commonly used for ANN development, leveraging the GPU architecture's reliance on program counters and instruction issue to emulate any ANN by breaking the topology down into an ambiguous, massive series of floating point calculations.

Sankaradas et al. [15] employed this principle on a Xilinx Virtex5 to build a general purpose FPGA CNN accelerator with a custom pipeline optimized for floating point multiply-accumulate. Their prototype was tested to under-perform the theoretical potential by a large margin (3.37 vs. 11.5 GMAC/s), which was attributed to the bottleneck in data access to off-chip memory. A similar conclusion is found by [16], where a custom pipeline ANN (requiring 3 man-years for the researchers to complete) is compared to an IP-derived vector coprocessor. They found both processors to yield similar performance when memory becomes the bottleneck.

CNNs characteristically exhibit deterministic data access patterns; by exploiting this property with loop optimization techniques [17] and careful cache design [18], it is possible to maximize data reuse and reduce off-chip transfers. In [19], Zhang et al. evaluate these techniques along with loop nest optimization using the polytope method¹, using the roofline model [21] to address the issue of finding an efficient balance of the above optimizations to improve performance and reduce FPGA area.

Although their prototype posted high throughput, it was still memory bound to 61.62 GFLOPS of their calculated 100 GFLOPS computational roof. Also, the choice to only use layers 1-5 (out of 8) of their test network [1] creates an artificially high CTC (computation to communication) ratio by explicitly avoiding the communication-intensive MLP classifier (layers 6 to 8). Similar testing of only layers 1-5 was also observed in [22], but the author describes the platform as a CNN accelerator, implying the fully connected layers would be processed by the host. The prototype of [19] also displayed 50% BRAM utilization, indicating that the addition of DMA pre-fetching capabilities such as [23] and [24] could potentially help to reduce overall congestion during periods of peak memory access.

A similar result is presented in [25]: despite using a much smaller network, memory bandwidth was also found to be the limiting factor in their attempt to bring CNNs to a more mobile platform. Jin et al. note the current lack of mobile platforms capable of applying CNNs, and that many of the proposed CNN hardware processing solutions were actually PC accelerators that relied on a PCI connection with the host computer. Although their solution was also a co-processor, it was a single-chip FPGA solution designed around the ARM core of the Xilinx Zynq-7000, a stark contrast to the other proposals that utilized expensive high-end FPGAs.

¹ A formal mathematical framework for extracting legal loop transformations, described by [20].

3.1 State of Neural Networks Today

In order to derive the resource reference for an embedded platform targeted at current "state of the art" neural networks, it is prudent to first determine what criteria would allow a neural network to claim such a credited title. As neural networks are an actively researched field, there are countless variations and factors that make a fair comparative evaluation of performance difficult.

Even when networks of duplicate topology and learning methodology are trained for an identical task, the performance will differ depending on the training data. Organized events such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and the PASCAL Visual Object Classes Recognition Challenge (PASCAL VOC) provide a reliable performance evaluation by providing consistent training and test datasets. ILSVRC is an annually held challenge of accuracy in object classification, localization and detection for still images (video was introduced for 2015). A fully annotated training dataset of 1.2 million images is released approximately 3 months prior to the submission deadline. The images are collected from the internet and are hand-labeled with the presence or absence of 1000 object categories.

It is important to note that these events are oriented towards the area of Machine Learning research, a subfield of computational science; therefore publications from the participant teams are primarily focused on discussions involving training algorithms, methodology and network topology. Consequently, little concern is placed on evaluating computational effort or data utilization, and these are not usually directly discussed in the papers (with the notable exception of [2]). Therefore, the requirements stated in this section were derived through analysis of the published network topologies.

The computation cost estimation used here is related to the Connections-per-Second (CPS) metric described in [26], where a multiply and accumulate operation counts as a single connection. The sum of the multiply-accumulates of a single forward pass should proportionally reflect the total computational effort of a network topology because of the overwhelming prevalence of the convolution operator. The fully connected layer neurons are conceptually identical to 1x1 convolutions and, as such, are included in the effort calculation.

In an analysis of the ILSVRC (2010 to 2014), [6] provides an examination of improvements and the current state of the art in computer vision. In 2012, a massive leap in accuracy was observed in object classification by the AlexNet deep CNN [1]; this marked a paradigm shift towards the CNN as the dominating neural network for computer vision.

In relating the performance of neural networks against humans in object classification, Russakovsky et al. find that the current (2014) best model, GoogLeNet, only slightly under-performs (by 1.6%) against humans in object recognition, although the study notes that due to the length and repetitiveness of the tests, human performance could be improved with significant effort and determination.


    3.2 Architecture

3.2.1 Core Processing Element

In [27], Hu et al. state that among the many challenges of implementing ANNs, mapping of the algorithm to achieve higher computation power is particularly critical due to the sheer frequency of inner-product calculations. These structures form the core of the processing element and can fundamentally be broken down in two ways, chain-configuration and tree-configuration, but in large scale implementations a combination of both is common.

Figure 3.1: Common inner-product structures. (a) Chain Configuration [27]; (b) Tree Configuration [27]

These fundamental computation structures must be shared between neurons due to finite logic density [28]. Many solutions have been published, and a comparative evaluation between the complete proposed systems is difficult, as varying test hardware and optimizations that are specific to the chosen test CNN may skew the results for a particular computation structure. Therefore the proposals will be categorized by their processing methodology in order to better understand the distinguishing characteristics.

    3.2.1.1 Time-Division Processing

The basic solution for calculating the inner-product of the convolution operation follows a form of time-division multiplexing, as stated by [29], where all the inputs of an individual neuron share a single multiply-accumulate (MACC) processing element, such as in [30]. These structures utilize a repeated loop where the inputs are multiplied by the weights and added to a running sum; upon completion of the loop, the neuron's weighted sum is passed to the activation function, which in turn records the result into the output feature map buffer.
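Conceptually, such a time-division processing element reduces to the C loop sketched below (the names and the function-pointer activation are illustrative assumptions, not the thesis implementation); inter-neuron parallelism is obtained by instantiating several copies of this element, while the serial loop itself leaves the intra-neuron parallelism untouched.

/* One time-shared MACC processing element: all inputs of a single neuron
 * pass serially through the same multiply-accumulate unit. */
static float macc_pe(const float *inputs, const float *weights, int count,
                     float bias, float (*activation)(float))
{
    float acc = bias;
    for (int i = 0; i < count; i++)   /* one multiply-accumulate per iteration,
                                         so any kernel size is just more iterations */
        acc += inputs[i] * weights[i];
    return activation(acc);           /* result is written to the output feature map */
}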

The strengths of this method include the flexibility to accommodate any size of convolution kernel by simply increasing loop iterations, low logic count, fine-grained control and increased "general-purpose" compatibility. While inter-neuron (between different neurons) parallelism can be simply exploited by replicating the processing units, by definition this method is not able to exploit the intra-neuron parallelism inherent in Convolutional Neural Networks.

Adaptations utilizing vector processing cores follow this ideology in order to localize data and reduce data transfer between cores. A scalable custom pipeline variation is presented in [24] and [31], along with a memory network that allows individual data streaming to each core. This removes the need for the controller to explicitly manage the cores and allows for simpler scaling of processing elements.

    3.2.1.2 Matrix Processing

Most platforms exhibit a variation of the above to exploit the intra-neuron parallelism by instantiating processing elements consisting of NxN convolution MACC units, such as in [32], [25], [33]. The NxN convolution is usually equal to or greater than the largest network filter kernel size. Inefficiencies are unavoidable during layers where the convolution kernel is smaller than NxN, due to the unused multiplication units.

The energy and utilization penalty for instantiating arbitrarily large processing elements usually confines this design to implementations where the network topology is known before hardware design. When a neural network is created specifically to satisfy these limitations, this arrangement can be very efficient in terms of performance per watt, as demonstrated by [33] (using fixed point notation).

Bypassing the fixed kernel size limitation is possible with additional computation overhead, as in [32], a CNN coprocessor accelerator that utilizes host pre-processing to decompose larger kernels into NxN convolutions.

FPGA runtime reconfigurability has been leveraged to adapt the convolution kernel size to the specific needs of each layer. This adaptable architecture is demonstrated in both [34] and [35], with the latter paper observing a resulting speed-up of 1.2x to 3.5x over the fixed architecture. In the paper the total compute time for a forward pass is listed, but there was no discussion of the impact of reconfiguration time versus compute time.

While runtime reconfiguration may improve efficiency and density limitations, it also places an additional periodic load on the memory subsystem, possibly coinciding with the ANN data access patterns and further aggravating the limited memory bandwidth. Multiplexing the FPGA layout also requires that the productive compute time is sufficiently longer than the reconfiguration time in order to overcome the overhead of reconfiguration. Guccione et al. [36] provide a relationship for the break-even point, M = R / (N - 1), where R is the time (in cycles) to perform reconfiguration and N is the total compute time between reconfigurations.


    Figure 3.2: Example Systolic Array [3]

    3.2.1.3 Systolic Array

The name of systolic arrays comes from the likeness of the dataflow pattern to the systolic contraction phase of a beating heart: a wave-like propagation through a homogeneous array of processing elements. The architecture is organized such that each processing element only performs a single predefined function, then repeats after passing the output to the next node in the array. This eliminates the need to store and coordinate data transfers between processing elements, resulting in reduced on-chip memory utilization.

These characteristics are confirmed by several studies [37], where a comparison against an earlier dynamically reconfigurable CNN implementation [38] found that the systolic array was significantly faster (5x), with fewer memory accesses, while utilizing the same silicon area.

The trade-off of this approach comes from the fixed nature of the systolic array, which constrains the convolution operation to the instantiated dimensions while also limiting the number of concurrent convolutions that can be implemented; this is due to the area cost of the inherently large granularity requiring a complete structure for proper function.

A proposal for a "Super-Systolic" architecture in [39] and [40] describes a bit-level systolic convolution purported to utilize FPGA resources with greater efficiency and increased performance. Results from implementations show a significant reduction in FPGA slice utilization, requiring about a third of the resources of a conventional word-level systolic array. While this is a large reduction of resources, the kernel depth limitation still remains and will need to be investigated to determine the suitable architecture for implementing very large networks.

Chapter 4

    Architectural Design

    4.1 Design Exploration

4.1.1 GoogLeNet Inception Module

The GoogLeNet neural network is derived from an emerging class of "Network in Network" type topologies. The network entered into the ImageNet challenge comprises 9 "Inception Modules", each of which contains several differently sized convolution and maxpool operations organized over two internal layers. The utilization of smaller 1x1 convolutions as a form of abstraction pre-processing allows the more expensive 3x3/5x5 convolutions to require fewer filters to extract the desired information, and is the major contributing factor in reducing the required network parameters by a factor of 12 (as compared to AlexNet). The GoogLeNet development team also attempted to keep the computational budget at 1.5 billion multiply-adds, approximately similar to AlexNet.

    Figure 4.1: Modular Topology



    Figure 4.2: Inception Module

Type            | size/stride | Output Size | Depth | #1x1 | #3x3Rdc | #3x3 | #5x5Rdc | #5x5 | pool proj | Params | Ops
convolution     | 7x7/2       | 112x112x64  | 1     |      |         |      |         |      |           | 2.7K   | 34M
max pool        | 3x3/2       | 56x56x64    | 0     |      |         |      |         |      |           |        |
convolution     | 3x3/1       | 56x56x192   | 2     |      | 64      | 192  |         |      |           | 112K   | 360M
max pool        | 3x3/2       | 28x28x192   | 0     |      |         |      |         |      |           |        |
inception (3a)  |             | 28x28x256   | 2     | 64   | 96      | 128  | 16      | 32   | 32        | 159K   | 128M
inception (3b)  |             | 28x28x480   | 2     | 128  | 128     | 192  | 32      | 96   | 64        | 380K   | 304M
max pool        | 3x3/2       | 14x14x480   | 0     |      |         |      |         |      |           |        |
inception (4a)  |             | 14x14x512   | 2     | 192  | 96      | 208  | 16      | 48   | 64        | 364K   | 73M
inception (4b)  |             | 14x14x512   | 2     | 160  | 112     | 224  | 24      | 64   | 64        | 437K   | 88M
inception (4c)  |             | 14x14x512   | 2     | 128  | 128     | 256  | 24      | 64   | 64        | 463K   | 100M
inception (4d)  |             | 14x14x528   | 2     | 112  | 144     | 288  | 32      | 64   | 64        | 580K   | 119M
inception (4e)  |             | 14x14x832   | 2     | 256  | 160     | 320  | 32      | 128  | 128       | 840K   | 170M
max pool        | 3x3/2       | 7x7x832     | 0     |      |         |      |         |      |           |        |
inception (5a)  |             | 7x7x832     | 2     | 256  | 160     | 320  | 32      | 128  | 128       | 1072K  | 54M
inception (5b)  |             | 7x7x1024    | 2     | 384  | 192     | 384  | 48      | 128  | 128       | 1388K  | 71M
avg pool        | 7x7/1       | 1x1x1024    | 0     |      |         |      |         |      |           |        |
dropout (40%)   |             | 1x1x1024    | 0     |      |         |      |         |      |           |        |
linear          |             | 1x1x1000    | 1     |      |         |      |         |      |           |        | 1M
softmax         |             | 1x1x1000    | 0     |      |         |      |         |      |           |        |
Total           |             |             |       |      |         |      |         |      |           | 1.80M  | 1.502B

Table 4.1: GoogLeNet Network Topology [2]

While at a cursory glance the relatively small 25MB parameter size of GoogLeNet may seem to indicate a viable topology for memory constrained systems, the extra intermediate stages between layers introduced by the 1x1 convolutions shift the burden from storing static network weights to the storage of working data. The very technique that enables a reduction in parameter size also extends variable lifetime and further increases the cache memory burden through multiple intermediate stages.

Within the Inception module, the computation and memory expensive 3x3 and 5x5 convolutions are preceded by a set of 1x1 convolution operations that reduce the depth of the input matrix (and in turn, reduce the necessary kernel parameters).


The first Inception module, "inception (3a)", with an input matrix of 28x28 and a depth of 192, would normally require a 5x5 convolution kernel with a depth of 192, but by passing through the reduction operation, the depth is reduced down to only 16.
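A rough weight count for that single 5x5 branch illustrates the saving (the bookkeeping below is my own illustration, ignoring biases and using the inception (3a) filter counts of Table 4.1: 16 reduced channels and 32 output maps):

direct 5x5 on full depth:  5 x 5 x 192 x 32                       = 153,600 weights
with 1x1 reduction:        (1 x 1 x 192 x 16) + (5 x 5 x 16 x 32) = 3,072 + 12,800 = 15,872 weights

roughly a tenfold reduction for this branch alone.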

The problem emerges when applying conventional techniques of minimizing off-chip access through memory reuse.

4.1.2 Memory Structure

Current embedded implementations of neural networks universally operate on small datasets with single producer/consumer topologies, where the variable lifetime of the input data expires once the output product is ready.

These attributes are ideal for architectures using ping-pong type memory structures to swap the input/output buffers between each stage; combined with the relatively smaller datasets, this allows for on-chip buffering of source and destination, reducing off-chip memory access to only the retrieval of the kernels and biases for each stage.

Inception 3b     Width (x)  Height (y)  Depth (z)  Size (KB)   Peak Memory

Branch 1
  3b/input          28         28         256       802.8      Simple: 2.71 MB (161.6%)
  3b/1x1            28         28         128       401.4      out DDR: 1.20 MB (71.8%)

Branch 2
  3b/input          28         28         256       802.8      Simple: 3.31 MB (197.6%)
  3b/3x3Rdc         28         28         128       401.4      out DDR: 4.01 MB (23.9%)
  3b/3x3            28         28         192       602.1

Branch 3
  3b/input          28         28         256       802.8      Simple: 2.71 MB (161.6%)
  3b/5x5Rdc         28         28          32       100.4      out DDR: 1.20 MB (71.8%)
  3b/5x5            28         28          96       301.1

Branch 4
  3b/input          28         28         256       802.8      Simple: 3.31 MB (197.6%)
  3b/pool           28         28         256       802.8      out DDR: 1.51 MB (89.8%)
  3b/poolproj       28         28          64       200.7

Table 4.2: Array Memory Usage


In the case of GoogLeNet's Inception module, a single input matrix is processed by 4 blocks (three 1x1 convolutions and a max pooling), requiring the production of 4 output matrices before the original input buffer can be freed.

Three of the intermediate products then become inputs for another set of convolution operations before being concatenated together to form the final Inception module output. Table 4.2 features a memory analysis of the second GoogLeNet module, "Inception (3b)", which was found to be the limiting design case with the highest utilization of on-chip memory.

    Figure 4.3: Network Parameters of Inception Modules

From the chart in Figure 4.3, it is quickly apparent that the conventional approach of keeping the working data within the chip boundary (even when kernel constants are in external SDRAM) already far exceeds (at 220%) the available on-chip BRAM of even the largest Artix-7 FPGA.

The next obvious adaptation is to traverse each internal branch individually, offloading the final output to external RAM before sequentially tackling the other three branches. This approach trades throughput for a significant reduction in the resources normally allocated to data storage by reusing the intermediate stage buffer.

A major design hurdle is the varying sizes of both the consumed and produced matrices found throughout the network. During the early stages of the network, the input size begins at nearly a 1:1 ratio to the kernel coefficients, while the kernel size significantly increases in the latter half (Figure 4.3). This size variation complicates simple approaches that attempt to divide the available BRAM into discrete memory structures directly attached to the input/output ports of the computation module.

In hardware, reallocating memory between physical memory structures is not a trivial affair, since it requires advanced techniques such as runtime reconfiguration. It may be tempting to follow a software-inspired approach and implement dynamic memory allocation by using registers to store "virtual" buffer array offsets that address a location in a single large contiguous address space (as suggested by a colleague). From a hardware perspective, however, this would require combining the entire BRAM resource into a single structure, which will certainly result in poor concurrency, as a large quantity of data (possibly the entire computational payload) would be constrained behind the BRAM's dual-port interface.

Optionally, by carefully partitioning the data array to ensure all concurrently accessed elements are placed in separate BRAM interfaces, and by using multiplexers to handle the addressing, it is possible to increase the number of access ports to a single address space when the data access pattern is predictable, as is the case for most matrix array accesses. This technique must be used in moderation, as it became apparent that the large-width MUXs required to select a particular BRAM structure cause massive congestion during place and route.

    4.1.3 Data Dependency

Figure 4.4: Next stage is delayed until the output array is complete (blue layer)

The dependency between (back-to-back) convolution operators limits the next convolution stage to begin only after the last activation map¹ is written to the output array. This is depicted in Figure 4.4, where each output element (ball) requires calculation with the entire input depth covered by the kernel (left array). Consequently, the last feature filter (blue) needs to complete its convolution (blue layer) before passing the output array to the next convolution stage.

¹ An Activation Map is the convolution output resulting from a single kernel filter, with a depth of 1 (length × width × 1).


To pipeline multiple convolution stages in a single accelerator pass, normally the entire output product of the prior stage must be available. Therefore, it has been common among all recently reviewed works to process only one network layer at a time to full completion before tackling the next, while using the external memory to assemble the individual activation maps that form the output array for the next pass through the accelerator. This type of architecture benefits from simpler data traffic and management of a large number of concurrent matrix multipliers.

Unfortunately, adapting this flow for large, modern, deeply layered networks such as GoogLeNet leads to poor throughput due to several compounding factors, and culminates in poor data reuse and massive amounts of data transfer across the chip boundary to external memory, almost degrading to a situation where the external SDRAM acts as the accelerator's scratchpad memory.

A contributing factor is that the larger number of layers found in deep networks naturally leads to greater total latency from increased input array loads from external memory, but the largest impact comes from the tendency of modern network topologies to incorporate multiple back-to-back convolution operations in a single layer/module together with extremely large datasets, which further delay the start of the next convolution stage.

To produce the n slices of W × H that make up the output array's depth (n = 4 in Figure 4.4), the entire input array must be referenced n times, once per operation block. Without the capacity to completely store the input array on chip, this amounts to a transfer of the same input data for every activation map to be calculated.

The first Inception module has an input array of 28×28×256, totalling 803KB (32-bit float), which needs to be referenced in its entirety by a convolution operation to produce a single Activation Map of 28×28×1. This operation is repeated for each of the 128 Feature Maps (kernels), stacking the output Activation Maps to eventually produce the output array of 28×28×128. This totals approximately 102MB of input reads, and considering that this output array is only 1 of 6 convolution operators within a single Inception module, which in turn is only 1 of 9 Inception modules, it is not surprising that working directly off SDRAM is detrimental to performance.
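The arithmetic behind those numbers (assuming 4-byte floats, as stated):

28 × 28 × 256 × 4 B ≈ 803 KB per full input read
803 KB × 128 activation maps ≈ 102.8 MB of input traffic for this one operator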

4.1.4 Hardware Reuse

The performance of a GoogLeNet (or, more generally, network-within-network topology) accelerator will largely depend on finding an efficient manner to split the operations while balancing concurrency, data reuse, and on-chip cache space.

A prerequisite to extracting and leveraging parallelism through hardware design involves careful analysis of the algorithm to determine a pattern of data execution that meets the FPGA resource budget while enabling the largest portion of computational work to be completed in a single loop iteration.

The limited FPGA resources require thoughtful allocation to accelerate only structures of high utilization; ideally, the derived execution pattern needs to be general enough that a single instantiation of the CNN accelerator can be reused for all 9 Inception modules of the network.

Typically this generalization is achieved through a trivial application of a loop counter and limiting concurrency to operations within a single slice of the input array depth (LxWx1). By setting the loop control register, various calculation depths can then be handled.

A consequence of this method is that it prevents the chaining of multiple operators with varying depths (loop iterations) without resorting to intermediate data buffers. Another factor complicating hardware reuse is that the module and its internal operators must also be compatible across 4 different input array widths in addition to the variable depth.

The variation along all three input dimensions would conventionally indicate that the only suitable elements for concurrent processing are the fixed exit conditions of the innermost two loops of the typical convolution (Listing 4.1), which can be thought of as representing a slice of the convolution kernel (NxNx1). Subsequently, following through with these methods of large intermediate buffers and loops, the analysis will begin to lead back to the extremely poor performance architecture discussed in Section 4.1.3.

    4.2 Proposed Architecture

4.2.1 Data Partitioning

4.2.1.1 Depth Slicing

When dealing with large datasets and limited on-chip cache, the emphasis shifts towards finding a suitable pattern of memory access that is simple (to reduce state logic), repeatable, and compatible with the DDR row/bank access limitations. The partition method proposed below tackles the aforementioned complications of hardware/data reuse and varying input/convolution sizes; also proposed is a method of chaining convolution operators for increased throughput per accelerator pass.

Current implementations fundamentally retain the "hierarchy" defined by the network layers and operational blocks. This adherence simplifies the design of CNN dataflow architectures and additionally guarantees the logical correctness of the implementation, but in this case becomes an unnecessary limitation when scaled up to research-class neural networks.


Due to the scope and complexity of dissolving the abstraction boundaries that inherently preserve the temporal sequences of all dependencies, and the time cost of RTL development, it becomes necessary to have the ability to evaluate the logical correctness of the developing algorithm before RTL entry.

By leveraging known loop optimization techniques, it should be possible to ensure the functional correctness of the new modified processing flow if:

  • the original pseudocode description of the function, plus legal loop transformations,
  • results in the pseudocode of the modified (partitioned) function.

Listing 4.1 represents a typical square kSize×kSize convolution and will be used in the following section to provide clarity to the proposed data manipulations.

1 for(feat=0; feat < numKerns; feat++){
2   for(chnl=0; chnl < kSizeZ; chnl++){
3     for(row=0; row < dSizeY; row++){
4       for(col=0; col < dSizeX; col++){
5         for(i=0; i < kSize; i++){
6           for(j=0; j < kSize; j++){
7             out_Map[feat][row][col] += inArr[chnl][row+i][col+j] * kernel[feat][chnl][i][j];
8 }}}}}}

Listing 4.1: 2D Convolution Pseudocode

    By partitioning the large input matrices (and their respective kernels) acrossthe depth dimension to produce kSizeZ/N block slices of dSizeX × dSizeY ×Nwhere N is selected to prioritize fitting the full input width and height dimensionsof the input array. This results in a flow where every kSizeZ/N iteration of loop 2(line 2) of Listing 5.1, only requires data available within chip boundaries. Thisworks by reducing the variable size and lifetime to N, shifting the requirement offull array depth for loop processing, and replacing with the less resource intensiverequirement of shared access to the output array.

    This technique allows a large input array to be convolved quickly in smaller portions, requiring only a per-element summation to assemble the completed output array (sketched after Listing 4.2 below).

    1 for(feat=0; feat < numKerns; feat++){
    2   for(partn=0; partn < kSizeZ/N; partn++){
    3     for(chnl=0; chnl < N; chnl++){
    4       for(row=0; row < dSizeY; row++){
    5         for(col=0; col < dSizeX; col++){
    6           for(i=0; i < kSize; i++){
    7             for(j=0; j < kSize; j++){
    8               out_Map[feat][row][col] += inArr[chnl+(partn*N)][row+i][col+j] *
                                               kernel[feat][chnl+(partn*N)][i][j];
    9 }}}}}}}

    Listing 4.2: Partitioned 2D Convolution
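
    To make the summation property concrete, the following sketch (plain C, not the thesis RTL; the array dimensions and the partial[] buffer are illustrative assumptions) computes each depth partition into its own partial output map and then assembles the final result purely by per-element addition, independent of the order in which the partitions were processed.

    #include <string.h>

    #define N_PART   4      /* assumed number of depth partitions (kSizeZ/N)    */
    #define DSIZE    7      /* assumed output width/height for the illustration */
    #define NUM_FEAT 8      /* assumed number of feature filters                */

    /* Each partition p produces its own partial output map, computed entirely
     * from the on-chip slice of the input (convolution itself not shown). */
    static float partial[N_PART][NUM_FEAT][DSIZE][DSIZE];
    static float out_map[NUM_FEAT][DSIZE][DSIZE];

    /* Assemble the full convolution result: a pure per-element summation. */
    void assemble_output(void)
    {
        memset(out_map, 0, sizeof(out_map));
        for (int p = 0; p < N_PART; p++)
            for (int f = 0; f < NUM_FEAT; f++)
                for (int r = 0; r < DSIZE; r++)
                    for (int c = 0; c < DSIZE; c++)
                        out_map[f][r][c] += partial[p][f][r][c];
    }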

    This principle will be combined with stream processing (Section 4.2.2) to produce partially convolved data blocks feeding the (3x3/5x5) convolution operators. This combination functionally enables full input depth processing by leveraging a unique characteristic inherent in 1x1 convolutions.

    4.2.1.2 Static Width

    To enable efficient optimization and the reuse of a single hardware structure for the different inception modules, the variable loop exit conditions must be made static, or reordered such that the variable bounds are at the top of the nested loop structure. With loop bounds known at synthesis time (and no data dependencies), it becomes possible to “unroll” the inner loops and schedule their iterations for concurrent execution.

    Fixing the convolution kernel to 5x5 was done for the prototype implementation to further reduce the design to a single square convolution operator. A 3x3 convolution can be performed with the 5x5 operator simply by centering the 3x3 kernel and setting the border elements to zero. With this optimization, the 3x3 convolution operator essentially becomes free, as the 5x5 structure is required irrespective of the 3x3 operation.
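
    A minimal sketch of that embedding, assuming a simple row-major kernel layout (illustrative C, not the implemented hardware):

    /* Embed a 3x3 kernel in the fixed 5x5 operator by centering it and
     * zero-filling the border. The surrounding zeros contribute nothing
     * to the multiply-accumulate, so the 5x5 hardware produces the 3x3
     * result. */
    void embed_3x3_in_5x5(const float k3[3][3], float k5[5][5])
    {
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 5; j++)
                k5[i][j] = 0.0f;

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                k5[i + 1][j + 1] = k3[i][j];
    }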

    Module           Original     Partitioned Size
    Inception (3a)   28x28x256    (256/N)*28x28xN
    Inception (3b)   28x28x480    (480/N)*28x28xN
    Inception (4a)   14x14x512    (512/N)*14x14xN
    Inception (4b)   14x14x512    (512/N)*14x14xN
    Inception (4c)   14x14x512    (512/N)*14x14xN
    Inception (4d)   14x14x528    (528/N)*14x14xN
    Inception (4e)   14x14x832    (832/N)*14x14xN
    Inception (5a)   7x7x832      (832/N)*7x7xN
    Inception (5b)   7x7x1024     (1024/N)*7x7xN

    Table 4.3: Partitioning: 3 New Static Array Sizes

    Employing only the above partitioning optimizations still results in 3 array sizes, which complicates further optimization by requiring either looping of small parallel structures, or separate structures to enable compatibility. To enable hardware pipelining of lines 4 and 5 in Listing 4.2 without further optimization, the variable loop limits controlling the width and height would necessitate either:

    • A single, many-stage (high latency) pipeline designed for the largest control variable(s) (28*28), with multiplier zero filling and stage pass-through to force compatibility with the other 2 input sizes

    • Three separate pipelines each specialized for their respective input size

    Both options result in an enormously poor allocation of resources; therefore, by fixing these loop variables to a static value, resources (such as DSP FP multipliers) can be better concentrated to increase simultaneous multiplications and reduce the total overall loop iterations. The logical choice for the fixed width is arrays of 7x7xN, as the other 2 sizes can be implemented with multiple 7x7xN structures in parallel.

    Figure 4.5: Partitioned sub-array requires neighboring elements to complete convolution. (a) Input array split into 4 partitions (P1, P2, P3, P4); (b) Kernel at the edge of a partition boundary.

    Splitting the input to the convolution operator cannot be performed through simple truncation without violating the boundary condition of the mathematical operator. When depicted as a sliding window, the region surrounding the output element is weighted and summed; therefore, when the window reaches an artificial boundary produced by the partitioning method, the values of the extra elements (i.e. 2 elements for a 5x5 kernel) in the next partition must also be made available to the operator (Figure 4.5).

    To correctly rearrange the elements for the convolution buffer, a control state machine must recognize which partition it is currently responsible for, and either pad the extra elements with zeros when a “true” boundary exists, or populate them with elements from the neighbouring partition.
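
    The selection logic can be summarised with the following illustrative C fragment; the function name, the HALO constant and the flat row buffer are assumptions for demonstration rather than the implemented state machine:

    #define HALO 2   /* extra columns needed on each side for a 5x5 kernel */

    /* Fill the 'halo' columns to the right of a partition row.
     * true_boundary: the partition edge coincides with the real image edge.
     * neighbour:     pointer to the first elements of the neighbouring
     *                partition's row (only valid when !true_boundary).     */
    void fill_right_halo(float *row_buf, int row_len,
                         const float *neighbour, int true_boundary)
    {
        for (int k = 0; k < HALO; k++) {
            if (true_boundary)
                row_buf[row_len + k] = 0.0f;          /* zero padding       */
            else
                row_buf[row_len + k] = neighbour[k];  /* neighbour elements */
        }
    }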

    4.2.2 Convolution Cascade

    The Inception module utilizes 1x1 convolutions to perform a depth reduction before the main 3x3 or 5x5 convolution. Because these kernels have a width of only one, the combination of stream processing and data reordering for convolution can be performed without additional on-chip memory utilization (beyond kernel storage).

    Figure 4.6: Streaming Convolution Array Access Pattern (1×1×n convolution kernels, input array of depth n, f filter kernels, multiply-accumulate buffer).

    Significantly, these characteristics will be heavily exploited to bypass the innate data dependencies of cascading 2 convolution stages and allow the second stage to begin before the completion of the first.

    By interweaving the computation of the different feature filters and traversing “depth before breadth” through the input array, the output data is produced in an order that (combined with the above partitioning methods) allows a smaller partial block to be processed by the second stage. This is possible because this smaller block is identical to the kSizeZ/N partitioned block, therefore requiring only a summation of the final output blocks to produce the full convolution output.
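
    The reordered 1×1 stage can be sketched as follows (illustrative C; the loop bounds, buffer names and the assumption that the caller zero-initializes partial_block are not taken from the thesis RTL). The essential point is that the filter loop is interleaved inside the per-pixel depth loop, so a partition's partial 1×1 outputs complete pixel by pixel and can feed the square convolution stage early.

    #define N_SLICE  16   /* assumed depth-slice size                 */
    #define F_FILT   16   /* assumed number of 1x1 feature filters    */
    #define P_ROWS    7   /* assumed partition height                 */
    #define P_COLS    7   /* assumed partition width                  */

    /* in_stream delivers the input "depth before breadth": all N_SLICE
     * channel values of one pixel arrive before the next pixel.
     * partial_block must be zero-initialized by the caller.           */
    void stream_1x1(const float *in_stream,
                    const float w1x1[F_FILT][N_SLICE],
                    float partial_block[F_FILT][P_ROWS][P_COLS])
    {
        int idx = 0;
        for (int r = 0; r < P_ROWS; r++) {
            for (int c = 0; c < P_COLS; c++) {
                for (int ch = 0; ch < N_SLICE; ch++, idx++) {
                    float v = in_stream[idx];
                    /* interleave the feature filters: every incoming value
                     * updates all F_FILT accumulators for this pixel      */
                    for (int f = 0; f < F_FILT; f++)
                        partial_block[f][r][c] += w1x1[f][ch] * v;
                }
                /* after N_SLICE channels this pixel's partial 1x1 outputs
                 * are complete for the current depth slice and can feed
                 * the square convolution stage                            */
            }
        }
    }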

    4.2.3 Processing Topology

    The above principles are combined to produce a modular Processing Element where a streaming input removes the need for large input buffering in the on-chip memory. This in turn enables multiple processing elements to handle a parallel, pipelined computation of single (1x1) and square (3x3/5x5) convolutions such as those of the GoogLeNet Inception modules (Figure 4.7).

    Figure 4.7: Conceptual Parallel Processing Element Layout (stream in from DMA, split across partial convolution PEs conv0 … convn, concatenate, stream out to DMA).

    Chapter 5

    Implementation and Analysis

    5.0.1 Processing Elements

    Each Processing Element contains a pipelined architecture consisting of a streaming convolution unit and an intermediate Array Block Buffer, followed by the larger ”square” convolution operator.

    Figure 5.1: Processing Element Architecture

    The intermediate Array Block Buffer serves to convert from the dataflow domain of a single data word per cycle (of the 1x1 convolution unit) into an array data packet for the variable-cycle computation (control flow) of the square convolution operator. This combined architecture maximizes data reuse by enabling register control of the numKerns variable (line 1 in Listing 4.1). Control of this register is necessary because the ratio of 1x1 to 3x3 (or 1x1 to 5x5) kernels varies between different inception modules; without this control, some inception module branches would require some of the DDR bandwidth to re-access input values to compute the unfinished 3x3 or 5x5 feature kernels.

    Since the majority (5 of 9) of the Inception modules use 14×14 as the input array size, the processing element was implemented with 2 convolution modules with an input array size of 7×7 that share a single buffer control state machine and operate similarly to a SIMD architecture, but at a higher (convolution module) level.



    5.0.1.1 Streaming Convolution

    On every cycle, the streaming convolution unit is able to accept a new 32-bit float value (initiation interval = 1), perform 16 multiplications, and pass the results to the accumulate block through a 512-bit wide (16*32-bit) FIFO implemented with Xilinx Shift-Register-Logic. This ensures the greatest possible throughput (to match the DMA) and avoids any foreseeable bottlenecks due to the serial communication and the high-input/low-reuse nature of the data.

    In Figure 5.1, separating the Multiply and Accumulate blocks was not necessary and was only done to simplify pipeline development while remaining conceptually identical to a single pipelined Multiply-Accumulate unit.

    The three-cycle latency of the Xilinx DSP48E floating point multiply can be ”hidden” by staggering the use over three sets of 16 DSP multipliers, effectively reducing the initiation interval to 1 cycle. Unfortunately, using the same method to create a single-cycle (initiation interval) pipelined accumulate operation would require a staggering 112 DSP units due to the approximately seven cycle delay¹ for floating point addition.
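
    The DSP counts quoted above follow directly from multiplying the 16 parallel lanes by the operator latency that must be hidden to reach an initiation interval of one:

    \[
    16 \times 3 = 48\ \text{DSP48E slices (staggered multiply)},
    \qquad
    16 \times 7 = 112\ \text{DSP48E slices (naively staggered accumulate)}
    \]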

    By describing the above multiply-accumulate unit to Vivado HLS, it was able to achieve the same initiation interval with only 32 DSP48E units (Figure 5.3). This led to the exploration and eventual adoption of the Vivado HLS generated multiply unit as well; although DSP utilization remained the same (48 DSP48E slices), it required 33.9% fewer FFs and 33.2% fewer LUTs than the initial design.

    The one-third reduction in LUT and FF utilization was eventually traced back (in part) to clever/complex utilization of the dedicated DSP interconnect resources [41] to pass values between stages instead of the general FPGA fabric.²

    1 The Xilinx DSP User Guide [41] describes advanced topologies that can reduce this delay at the cost of more DSP resource utilization.
    2 The operation of the DSP48 slice is extremely complex, with multiple modes of operation and case-specific optimization techniques; due to time restrictions, exhaustive analysis is beyond the scope of this thesis.


    Stream 16xMultiplier
    +-----------------+---------+-------+--------+--------+
    | Name            | BRAM_18K| DSP48E|   FF   |  LUT   |
    +-----------------+---------+-------+--------+--------+
    |DSP              |        -|      -|       -|       -|
    |Expression       |        -|      -|       0|      74|
    |FIFO             |        -|      -|       -|       -|
    |Instance         |        -|     48|    2288|    2240|
    |Memory           |        -|      -|       -|       -|
    |Multiplexer      |        -|      -|       -|      41|
    |Register         |        -|      -|    1131|       1|
    +-----------------+---------+-------+--------+--------+
    |Total            |        0|     48|    3419|    2356|
    +-----------------+---------+-------+--------+--------+
    |Available        |      730|    740|  269200|  129000|
    +-----------------+---------+-------+--------+--------+
    |Utilization (%)  |        0|      6|       1|       1|
    +-----------------+---------+-------+--------+--------+

    Figure 5.2: Utilization Report: Multiply Unit (HLS Version)

    16xAccumulator
    +-----------------+---------+-------+--------+--------+
    | Name            | BRAM_18K| DSP48E|   FF   |  LUT   |
    +-----------------+---------+-------+--------+--------+
    |DSP              |        -|      -|       -|       -|
    |Expression       |        -|      -|       0|     517|
    |FIFO             |        -|      -|       -|       -|
    |Instance         |        -|     32|    5952|    4816|
    |Memory           |        -|      -|       -|       -|
    |Multiplexer      |        -|      -|       -|      49|
    |Register         |        -|      -|    2061|       1|
    +-----------------+---------+-------+--------+--------+
    |Total            |        0|     32|    8013|    5383|
    +-----------------+---------+-------+--------+--------+
    |Available        |      730|    740|  269200|  129000|
    +-----------------+---------+-------+--------+--------+
    |Utilization (%)  |        0|      4|       2|       4|
    +-----------------+---------+-------+--------+--------+

    Figure 5.3: Utilization Report: Accumulator Unit (HLS Version)


    5.0.1.2 Square Convolution

    Figure 5.4: Convolution Buffer Array. (a) Padded 7×7 partition (P1, P2) becomes 11×11; (b) Elements for one output row (11×5 shift-register buffer, 5×5 kernel, 7×1 output buffer).

    The square convolution operation demands the largest portion of the total computation effort; therefore, prioritizing on-chip data reuse helps to maximize calculation throughput by keeping the DSP multipliers continuously fed with data while reducing off-chip memory fetching.

    As determined in Section 4.2.1.2, a square 5×5 convolution with an input array size of 7×7 allows for the design of a single fundamental calculation core molecule that is capable of:

    • performing both 3×3 and 5×5 convolution operations

    • compatibility with all Inception module input array sizes through replication or time multiplexing

    • easy replication and control (SIMD style operation)

    The convolution core molecule uses values buffered in a shift register array of size 11×5 (each element 32 bits wide). As depicted in Figure 5.4, the array is specifically dimensioned to hold all the additional padding values required for the convolution of an input array row, and correspondingly produces a seven-element row vector.

    The buffer was designed as an array of Serial-In Parallel-Out (SIPO) shift registers in an attempt to utilize Xilinx’s 7-Series SRL (Shift Register using LUTs) optimization [42] to reduce FPGA flip-flop utilization, but current attempts could not get the Vivado HLS tool to recognize the SRL optimization.


    Figure 5.5: Shift Register Buffer (input array rows are shifted into the 11×5 shift-register buffer: shift row in, shift up, shift out).

    A shift register topology is ideal because after performing the seven convolution operations corresponding to a horizontal row, only eleven new elements¹ need to be shifted into the buffer in order to restart the row convolution (Figure 5.5). The 5×5 kernel constants are held in a set of 25 registers, as they are constantly reused for the entire 7×7 frame before the next set of 25 constants is required. This shifting window pattern reuses 44 of the 55 buffer elements to vastly reduce non-buffer memory accesses, while maintaining a simple symmetric control flow to allow SIMD operation.

    1 A padded 7×7 row becomes eleven elements wide.
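
    A behavioural sketch of that row-reuse pattern is shown below (plain C rather than the HLS source; the buffer dimensions follow the 11×5 description, while the function names are assumptions). Convolving one output row touches only the buffered window, and advancing to the next row shifts in a single new padded 11-element row while the remaining four rows are reused.

    #define BUF_W 11   /* padded row width for a 7-wide partition and 5x5 kernel */
    #define BUF_H  5   /* kernel height                                          */
    #define OUT_W  7   /* output elements produced per buffered row              */

    /* Shift one new padded row into the 11x5 buffer, discarding the oldest
     * row; 44 of the 55 elements are reused between consecutive output rows. */
    void shift_in_row(float buf[BUF_H][BUF_W], const float new_row[BUF_W])
    {
        for (int r = 0; r < BUF_H - 1; r++)
            for (int c = 0; c < BUF_W; c++)
                buf[r][c] = buf[r + 1][c];
        for (int c = 0; c < BUF_W; c++)
            buf[BUF_H - 1][c] = new_row[c];
    }

    /* Produce one 7-element output row from the buffered window and the
     * 25 kernel constants held in registers.                               */
    void convolve_row(const float buf[BUF_H][BUF_W],
                      const float kern[5][5], float out_row[OUT_W])
    {
        for (int c = 0; c < OUT_W; c++) {
            float acc = 0.0f;
            for (int i = 0; i < 5; i++)
                for (int j = 0; j < 5; j++)
                    acc += buf[i][c + j] * kern[i][j];
            out_row[c] = acc;
        }
    }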

    The core molecule operates synchronously and without need for inter-module communication because each internal buffer array is already padded with the neighboring partition elements. This was achieved with simple mux logic that sends overlapping data to each module during the buffer’s shift-in read stage. The intention of using mux logic was to allow for the dynamic allocation of the processing modules into two different operating modes for better efficiency over a range of Inception module attributes.

    The currently implemented operating mode allows multiple computation core molecules to be ”Stitched” together by loading them with identical kernel coefficients and overlapping padding to handle larger input array sizes (i.e. 2 modules for 14×14, and 4 modules for 28×28). The anticipated drawback of this mode is reduced efficiency during the later stages of GoogLeNet, where partitioning is not required because the raw input size drops to 7×7; this creates a situation where it is beneficial to allocate the molecules to each tackle a different set of kernel filters, and consequently they will require identical (non-padded) data instead.

    Due to limited time and the heavy resource utilization of the unoptimized shift register array, the convolution module was limited to 2 concurrent computation core instances operating in ”Stitched” mode¹. This 2-core configuration will be utilized for testing and is reflected in the utilization and performance results presented in this chapter.

    1 For up to 14×14 input size.

    +-----------------+---------+-------+--------+--------+
    | Name            | BRAM_18K| DSP48E|   FF   |  LUT   |
    +-----------------+---------+-------+--------+--------+
    |DSP              |        -|      -|       -|       -|
    |Expression       |        -|      -|       0|    1182|
    |FIFO             |        -|      -|       -|       -|
    |Instance         |        -|    251|   28471|   27768|
    |Memory           |       22|      -|    4544|    5952|
    |Multiplexer      |        -|      -|       -|   15994|
    |Register         |        -|      -|   31209|       5|
    +-----------------+---------+-------+--------+--------+
    |Total            |       22|    251|   64224|   50901|
    +-----------------+---------+-------+--------+--------+
    |Available        |      730|    740|  269200|  129000|
    +-----------------+---------+-------+--------+--------+
    |Utilization (%)  |        3|     33|      23|      39|
    +-----------------+---------+-------+--------+--------+

    Figure 5.6: Utilization Report: 5x5 Convolution with two core molecules

    5.0.2 Accelerator Control

    The proposed accelerator was originally conceived for stand-alone operation with minimal control oversight, and therefore contains all the data manipulation control states within the processing element module. The control interface of the accelerator is a series of values describing the topology of the network, from which the accelerator calculates and sets the internal registers for the necessary loops and iterations. This interface was intended to allow a small topology descriptor header file to be provided along with the network weights on portable media such as an SD card.

    Due to time limitations, the topology header read module remains unfinished. In its current implementation, the control registers are memory mapped and exposed onto an AXI-Lite bus for hardware accelerator operation as defined in this thesis.

    For simplicity, a Xilinx Microblaze microprocessor was used to initialize the topology control registers in the accelerator through the use of the following control data structure:


    Figure 5.7: Hardware Accelerator System View

    enum ConvType{CONV_ONE = 0, CONV_THREE, CONV_FIVE};

    struct cmd
    {
        ConvType mode;
        short in_dimen;     // Input Array width = height
        short in_depth;     // Input Array Depth
        short K_1x1_feat;   // # of Feature Filters 1x1
        short K_5x5_depth;  // # of Feature Filters 3x3 or 5x5

        u32 K_1x1_addr;     // DDR_offset: first 1x1 Kernel
        u32 B_1x1_addr;     // DDR_offset: first 1x1 Bias Vector
        u32 K_5x5_addr;     // DDR_offset: first 3x3/5x5 Kernel
        u32 B_5x5_addr;     // DDR_offset: first 3x3/5x5 Bias Vector
        u32 out_addr;       // DDR_offset: Output Array
        short stride;       // Convolution stride for downsampling
        short batches;      // # of 3x3/5x5 Feature Maps = numKerns
    };

    Listing 5.1: Accelerator Control
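
    As a usage illustration only, and assuming the declarations of Listing 5.1 are in scope, one 3x3/5x5 inception branch might be described to the accelerator roughly as follows; every address offset and dimension below is a placeholder, not a value from the implemented system:

    /* Hypothetical command block for a single 5x5 inception branch;
     * all offsets and counts are illustrative placeholders.          */
    struct cmd branch_5x5 = {
        .mode        = CONV_FIVE,
        .in_dimen    = 14,          /* 14x14 input frame                      */
        .in_depth    = 512,         /* input array depth                      */
        .K_1x1_feat  = 32,          /* # of 1x1 reduction filters             */
        .K_5x5_depth = 64,          /* # of 3x3/5x5 feature filters           */
        .K_1x1_addr  = 0x00100000,  /* DDR offsets (placeholders)             */
        .B_1x1_addr  = 0x00110000,
        .K_5x5_addr  = 0x00120000,
        .B_5x5_addr  = 0x00130000,
        .out_addr    = 0x00200000,
        .stride      = 1,           /* no downsampling                        */
        .batches     = 64,          /* # of 3x3/5x5 feature maps (numKerns)   */
    };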

    5.0.3 Data Transfer

    In order to efficiently stream data to the accelerator’s 1x1 convolution blocks, a Xilinx AXI DMA controller was selected over the simpler option of using a Microblaze processing core to directly regulate data transfers. The use of a processing core would enable simpler control over the ordering and flow of data to the accelerator by directly interfacing to the processor pipeline through 16 dedicated 32-bit AXI-Stream links. While an enticing interface option, early exploration of its feasibility for applications involving large datasets and frequent constant retrieval from external memory quickly saturated the 32-bit bus width limitation of the Microblaze. Neural networks intrinsically place a large emphasis on retrieving pre-calculated values (network weights); consequently, nearly every value to be forwarded to the accelerator from the Microblaze pipeline must arrive externally through the data bus.

    Employing a DMA controller rather than a processing core to direct data flow not only frees the core for other duties outside the occasional DMA instruction issue, but also avoids the throughput costs of the prerequisite register load from external memory prior to transmitting on the pipeline stream links. Additionally, the DMA read/write bus can be expanded up to 1024 bits wide, far beyond the limitation of the Microblaze. This attribute is significant even on platforms with seemingly “low” DDR widths, such as the 16-bit DDR width on the development board used for this study (Digilent Nexys Video).

    Assuming a data burst from the same row and bank of the SDRAM, operating at a DDR frequency equivalence of 800MHz, the 16-bit DDR width transfers a 32-bit word every 2 DDR data beats. With the memory controller and design logic frequency at one fourth (100MHz) of the memory, and under ideal burst conditions, this can generate up to 128 bits of data for every processing cycle.
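
    Spelling out that ideal-burst calculation:

    \[
    16\ \text{bits} \times 8\ \text{beats per 100 MHz cycle} = 128\ \text{bits/cycle}
    \;\Rightarrow\;
    128\ \text{bits/cycle} \times 100\ \text{MHz} = 1.6\ \text{GB/s peak}
    \]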

    Out of the three Xilinx DMA IPs, the AXI DMA core is the only controller that supports vectored I/O by enabling the Scatter-Gather (SG) engine. While more FPGA resource intensive, this feature is vital for efficiently facilitating the patterned memory access described earlier in Section 4.2.2 for the streaming 1x1 convolution.

    The Scatter-Gather engine enables logic that, through the instantiation of a control structure consisting of a linked list of “Buffer Descriptors”, supports the collection (Gather) of data from multiple buffers and writing to a stream, or, in reverse, writing portions of the incoming stream to multiple buffers (Scatter). Essentially, it can be thought of as a FIFO list of register commands that can be constructed by the processor and passed to hardware for execution, with each descriptor able to specify a unique address offset and transfer length.

    Combining the SG engine with Multichannel mode adds three new descriptor fields to describe simple 2D memory access patterns: the HSIZE parameter defines the length in bytes of each burst, a STRIDE field expresses the offset, and VSIZE dictates the number of iterations. This capability is ideal for tasks featuring numerous repetitions of non-sequential single-word transfers separated by a fixed offset, a common pattern in two-dimensional array accesses. Without these features enabled, the processing core would be continuously interrupted to prepare DMA commands with burst lengths of only 4 bytes.
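
    To make the descriptor fields concrete, the fragment below sketches how the 2D pattern for one input-array tile could be expressed; the struct is a simplified stand-in for the AXI DMA multichannel buffer-descriptor fields, not Xilinx driver code, and all values are illustrative:

    #include <stdint.h>

    /* Simplified view of the three multichannel descriptor fields used to
     * express a 2D access pattern (not the actual Xilinx driver structures). */
    struct sg_2d_pattern {
        uint32_t hsize;   /* bytes transferred per burst (one row segment)   */
        uint32_t stride;  /* byte offset between the start of two bursts     */
        uint32_t vsize;   /* number of bursts (rows) to repeat               */
    };

    /* Example: stream a 7x7 tile of 32-bit words out of a 28-word-wide
     * row-major array stored in DDR (all values illustrative).              */
    struct sg_2d_pattern tile_7x7 = {
        .hsize  = 7 * 4,    /* 7 words of 4 bytes per row                    */
        .stride = 28 * 4,   /* full array row pitch in bytes                 */
        .vsize  = 7,        /* repeat for 7 rows                             */
    };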

    Chapter 6

    Results

    The original intention was to instantiate and test 2 Processing Elements, but LUT utilization, including the support framework (DMA, DDR memory controller, Microblaze, etc.), pushed beyond the available resources in the Xilinx Artix-7 A200 to 120% utilization.

    It should be noted that during the design of the square convolution kernel buffer, as a matter of convenience, it was implemented with the same capabilities as the Input Array Buffer, including several expensive large-width MUX structures to perform line shifting (not required for the kernel buffer). Although changing the kernel buffers to use only simple registers would save some resources, the savings would not be significant enough to vastly reduce the utilization, and Place & Route may still fail timing due to congestion.

    Additionally, if 2 processing elements were used for testing, the utilization would leave no room for the Xilinx Debug core to gather results. Therefore, in the following sections only a single processing element set up for a 14×14 input size will be used (technically, two 7×7 convolution units working in unison).

    6.1 Testing Limitations

    As a consequence of limited time, the MaxPool operator in the Inception module remains incomplete, which precludes a complete evaluation of the accelerator over the entire GoogLeNet network topology. The exclusion of MaxPool capabilities is actually not uncommon in convolution accelerator designs due to the low-calculation (purely comparative) nature of the MaxPool procedure, and it may be more resource efficient to implement it with the Microblaze.



    Access Pattern                  Cycles
    Frame Skipping                 238,289
    Sequential (Test Case)          37,290
    Rotated Array (Impl. Design)    37,663

    Table 6.1: Pipeline Latency for different DDR access patterns

    6.1.1 Performance

    6.1.1.1 Memory Bandwidth

    Under initial testing, it was observed that the data stream from DMA to accelerator was unable to sustain the transfer of a new 32-bit value per clock cycle beyond the first 6 vectors containing 192 floating point values. This pattern is consistent with an empty DMA buffer induced by inefficient non-sequential reads from DDR memory; this was confirmed by temporarily modifying the DMA access pattern (HSIZE, VSIZE, STRIDE) and observing the results.

    The difference between sequential and non-sequential access is approx. 6 cycles per word, which is in line with the DDR memory timing [43] of 11-11-11 (CL-tRCD-tRP), indicating that the latency between reads on different rows requires 33 cycles (tRP + tRCD + CL).

    Factoring in the fabric-to-memory-controller clock ratio of 1:4 (1:8 DDR equivalence) gives a latency of approx. 4 cycles between transfers (5 cycles per completed word transfer), which confirms that the observed delays are due to non-sequential reads¹ that require SDRAM rows to be opened and closed for each word.

    Considering that the processing pipeline has been designed to accept new values every cycle, this latency severely underutilizes the resources allocated to the accelerator. This poor performance from non-sequential DDR reads was anticipated during design and can be resolved by simply ”rotating” the storage array such that depth becomes width, resulting in sequential reads to produce the DMA stream.

    The slight increase in cycles for the modified rotated array, as compared to the earlier artificial test scenario, is due to the periodic kernel buffering accesses. These memory reads were sequential before rotating the storage pattern, but have minimal negative impact because the high data reuse greatly reduces the frequency of kernel loads. The extra latency cycles could be completely eliminated if the rotated storage pattern were applied only to the input/output arrays, but due to time restrictions this was not implemented in the test architecture.

    1 The approx. 1 cycle difference between measured and calculated (Table 6.1) is due to including square convolution stage cycles.


    6.1.1.2 Floating Point Throughput

    Comparing FLOPS throughput is not a comprehensive performance measurement, as it conveys no information about memory bandwidth utilization or the efficiency of completing a task. It does, however, provide some insight into how well the design utilizes the allocated DSP resources: a major difference in throughput may indicate that resources are poorly allocated to structures that are mostly idle.

    When the implemented processing element was scaled up to a full size input, it completed computations for a 16×16×192 input against 16 1×1 feature filters (each of size 1×1×192) and a 5x5 square convolution on 16×16×16 over 8 feature maps (each of size 5×5×16) in approx. 80,000 cycles.

    Out of the 80K cycles, the second stage sat idle for nearly 52K cycles, indicating that the DMA transfer of 1 value per clock cycle (16×16×192 = 49K values) was the bottleneck. By doubling the transfer width to 64 bits, the total latency drops to 55K cycles. This forms a more balanced pipeline, but demonstrates the inherent inefficiency of adapting to allow for any input depth: because the first stage is streaming based, its duration depends on the input depth.

    Actual throughput (Table 6.2), when the streaming width is sufficiently wide, is improved by the pipelined architecture, with a pipeline interval of 28,492 cycles.

    Stage                    FP Operations    Interval (Cycles)   Latency (Cycles)
    1x1 Convolution          1.573 M FP-Op
    5x5 Convolution          1.254 M FP-Op
    Total (1 partial pass)   2.827 M FP-Op    28,492              55,190

    FP Throughput @ 100MHz   9.92 GFLOPS
    Total DDR Transfer       227KB in 55,190 Cycles               411.3MB/s

    Table 6.2: Measured Throughput
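
    As a consistency check on the reported figures (using the operation counts, interval and transfer size from Table 6.2):

    \[
    \frac{2.827 \times 10^{6}\ \text{FP-Op}}{28{,}492\ \text{cycles}} \times 100\ \text{MHz} \approx 9.92\ \text{GFLOPS},
    \qquad
    \frac{227\ \text{KB}}{55{,}190\ \text{cycles} / 100\ \text{MHz}} \approx 411\ \text{MB/s}
    \]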

    The work of [19] from 2015 presents a neural network accelerator solution that is claimed to ”Outperform previous approaches significantly”, achieving 61.62 GFLOPS at a 100MHz operating frequency. This required 2240 DSP48E1 slices on a Virtex-7 VX485T, several times more than are available on the Artix-7 (740 DSP48E1); scaling their result down to the range of Artix-7 resources should provide an indication of the expected order of FLOPS in a well designed architecture:

    \[
    \frac{331\ \text{DSP48E1}}{2240\ \text{DSP48E1}} \times 61.62\ \text{GFLOPS} = 9.11\ \text{GFLOPS}
    \]


    The architecture in this thesis achieved 9.92 GFLOPS using 331 DSP slices, which is comparable to (slightly greater than) the scaled 9.11 GFLOPS expected from a comparable FPGA implementation of [19], where the emphasis was placed on memory optimizations to maximize DSP loading (minimize DSP idling while waiting for data). This comparative result indicates that the solution proposed in this thesis is effective in continuously providing data to the computation units.

    In its current prototype configuration (including the support framework), the design uses only 48.5 of the 365 Block RAM tiles (13.3% utilization). This leaves room to alleviate memory access contention between the DMA and the convolution block buffers that may arise when the number of processing elements is increased. Allocating more memory to the kernel buffers would reduce the number of reads issued by the convolution block.

    6.1.2 Efficiency

    The major benefit of this architecture is that it enables the piece-wise processing of very large network datasets while decreasing the volume of data crossing the on/off-chip boundary, by serializing two normally conflicting operations into a single pass.

    Additionally, this architecture incorporates the ability to separate the second stage square convolution from the first stage streaming operator in order to reuse the intermediate buffered values for the next set of 3x3/5x5 feature filter kernels, rather than re-fetching values for every set of filter kernels as would be required in a typical ”single producer, single consumer” dataflow architecture.

    The throughput efficiency of the architecture is inherently related to the input depth: an Inception module specified with an input array of significant depth may cause the second, square convolution stage to sit idle for short periods. This is unavoidable due to the variable input/output sizes found throughout GoogLeNet. If the network sizes were set to a fixed ratio (as is typical when adapting ANNs to embedded systems), then peak efficiency could be guaranteed for every layer in the network. This modification would, however, reduce the usability of the accelerator to implementing only specially adapted network topologies (a typical limitation of embedded ANNs).

    From Table 6.2, it is also evident that there is still a large amount of unused external DDR memory bandwidth, enough to accommodate an increase to 64-bit wide DMA transfers. Doubling the words per DMA transfer and instantiating two first-stage streaming multiply-accumulate units would allow the accelerator to handle network topologies with high-depth inputs. This is a viable modification because the first stage requires only a fraction of the DSP units of the second, square convolution stage (80 vs. 251 respectively).

    Chapter 7

    Conclusions

    7.1 Conclusion

    As noted by Szegedy et al. [2], the recent trend of using ever greater numbers of network parameters to improve performance risks relegating Neural Networks to a realm of ”purely academic curiosity”. Their method was demonstrated in ILSVRC2014 to not only better all other entries in top-5 accuracy, but to achieve this with an order of magnitude fewer network parameters. While their method vastly reduced the weight parameters, it also introduced new attributes, requiring the exploration of new architectures and optimizations for efficient implementation.

    The architecture proposed in this thesis presents a series of techniques that combine to enable FPGA-based embedded platforms to leverage the new advances in Convolutional Neural Network topologies, specifically networks that rely on similar principles (proven effective in GoogLeNet) of using smaller (1x1) convolutions as a form of dimensional reduction/abstraction before the larger main convolution operation.

    From a product development standpoint, it still remains to be seen, whetherspecially designed hardware structures such as the architecture proposed here (inthis t