Building and Using Virtual FPGA Clusters in Data Centers
by
Naif Tarafdar
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2017 by Naif Tarafdar
Abstract
Building and Using Virtual FPGA Clusters in Data Centers
Naif Tarafdar
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017
This thesis presents a framework for creating network FPGA clusters in a heterogeneous
cloud data center. Our main objective is to abstract away the details of creating inter-
FPGA fabrics, by automating the FPGA network connections and the networking to
connect multiple FPGA clusters together. The FPGA clusters are created using a logical
kernel description describing how a group of FPGA kernels are to be connected, and an
FPGA mapping file. This work lastly looks at acquiring FPGAs as virtual resources from
the data center using the cloud management software OpenStack. This work first partitions
the user circuit onto multiple FPGAs using a user-specified mapping, creates the FPGA
fabric for inter-FPGA connection, generates the OpenStack calls to reserve the compute
devices, creates the network connections, generates the bitstreams, programs the devices,
and configures the devices with the appropriate MAC addresses, creating a ready-to-use
network device that can interact with any other network device in the data center.
Acknowledgements
I would like to thank my supervisor Professor Paul Chow. The completion of this thesis
and the many valuable life skills I have acquired in the process can be attributed to his
guidance, and his patience to work with me. I have learnt the value of humility through
the good times and determination and discipline through hard times.
I would also like to thank my Krav Maga instructor Steven Tierney who has taught
me the valuable lesson of persevering through practice and training through his famous
saying: "If you train like a cupcake you will fight like a cupcake."
I would like to thank my friends (in alphabetical order) Alvi Salahuddin, Ankita
Sinha, Cassandra Kardos, Owais Khan, Rajsimman Ravichandiran, Sara Chung, Thanus
Mohanarajan and Vanessa Courville. Over the past couple of years, especially the last few
months while I wrapped up my thesis, I have been very busy, and I want to thank you
for your patience, support and love through these times.
My parents, Shafique Tarafdar and Tasnin Tarafdar. Many of the lessons you
taught me growing up have shaped who I am today, and I would not be here today
if not for that.
Also to my wonderful colleagues (also in alphabetical order) Andrew Shorten, Charles
Lo, Daniel Rozkho, Daniel Ly-Ma, Ehsan Ghasemi, Eric Fukuda, Fernando Martin Del
Campo, Jasmina Vasiljevic, Jin Hee Kim, Joy Chen, Joshua San Miguel, Julie Hsiao,
Justin Tai, Karthik Ganesan, Mario Badr, Nariman Eskandari, Roberto Dicecco, Sanket
Pandit, Shehab Elsayed, Vincent Mirian and Xander Chin. You have all helped con-
tribute to the wonderful work environment in PT-477. This thesis would not be possible
without all of you.
I would also like to thank the SAVI team who has helped me a lot over the years.
Professor Alberto Leon-Garcia, Hadi Bannazadeh and Thomas Lin. You have all contributed
a large part to this work and I look forward to continuing to work with all of you
in the years to come.
I would like to thank Kenneth Samuel. You have been like family over the past few
years. We have gone through the hardships of engineering and have travelled the world.
Throughout it all you have kept me honest while always encouraging me to reach my full
potential.
Lastly I would like to thank my sister Nawar Tarafdar. Over the past 18 years you
have been my best friend, and none of this would be possible without you. You helped
me focus when I needed to but also helped take my mind off the stresses of life when I
needed it the most. Thank you.
Contents
1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Contributions
  1.4 Overview
2 Background
  2.1 Field-Programmable Gate Arrays
  2.2 Cloud Computing and Data Centers
  2.3 Network Stack
  2.4 Software-Defined Networking
  2.5 Internet-of-Things
  2.6 Smart Applications on Virtualized Infrastructure (SAVI) Testbed
    2.6.1 OpenStack
  2.7 Related Work
    2.7.1 FPGA Virtualization
    2.7.2 Cloud Cluster Management Tools
  2.8 Level of Abstraction
3 Base Infrastructure: Cloud Resources and FPGA Platform
  3.1 SAVI Infrastructure Modifications
    3.1.1 OpenStack Resource Manager
    3.1.2 PCIe Passthrough and OpenStack Image
    3.1.3 Networking Backend
  3.2 Xilinx SDAccel Platform
    3.2.1 OpenCL
    3.2.2 FPGA Hypervisor
  3.3 Design Flow for FPGA Development in the Cloud
    3.3.1 Extended Design Flow for Multi-FPGA Applications
4 Design Alternatives
  4.1 SnuCL
  4.2 Modifications for SnuCL OpenStack Support
  4.3 Cluster Orchestration
  4.4 Results
5 FPGA Network Cluster Infrastructure
  5.1 Logical View of Kernels
    5.1.1 Sub-Clusters
  5.2 Physical Mapping of the Kernels
  5.3 FPGA Infrastructure
  5.4 SDAccel Platform Modifications
    5.4.1 FPGA Application Region
    5.4.2 Input Module
    5.4.3 Output Module
  5.5 Scaling up FPGA Clusters
  5.6 FPGA Software Drivers
  5.7 Tool Flow
  5.8 Limitations of the Infrastructure
6 Evaluation
  6.1 Resource Overhead
    6.1.1 Microbenchmarks
    6.1.2 Micro-experiment Setup
    6.1.3 Application Case-study
    6.1.4 Query Implementation Details
    6.1.5 Case Study Evaluation
7 Conclusion
  7.1 Future Work
    7.1.1 Physical Infrastructure Upgrades
    7.1.2 Scalability and Reliability
    7.1.3 FPGA Cluster Debugging
    7.1.4 True FPGA Virtualization
Bibliography
Chapter 1
Introduction
Big data and data center computing have evolved into a multi-billion dollar industry [1].
This involves the use of many compute elements on a large scale (thousands or more)
for large-scale compute problems with many terabytes of data. Computation problems
that were once simple scale exponentially in complexity at the data center scale [2]. The
complexity is due to the large amount of data, the communication between compute nodes,
and the computation power required. At such a scale, considerations of power consumption
and heat dissipation are as important as computation power, as these become the
dominating variables in our cost calculations.
Cloud computing allows the sharing of data center resources among multiple tenants.
This is done by setting up infrastructure to multiplex these resources in time and in
space. A common method to do this is with virtualization, which abstracts away physical
details and maps a virtual machine onto a physical server. Similar data center resources
for standard compute CPUs are commercially available in services such as Amazon Web
Services and Microsoft Azure [3, 4].
1.1 Motivation
Field-Programmable Gate Arrays (FPGAs) have recently proven to be a good computation
alternative in data centers due to their compute capabilities and power efficiency.
One example is the Microsoft Catapult project, where FPGAs were deployed in the Bing
search engine [5]. With a 10% power increase they were able to see a 95% performance
increase. FPGAs allow users to create customized circuitry for their application, and the
performance and power savings multiply at data center scale. Provisioning
FPGA resources from a shared cloud, similar to the provisioning of CPUs, can
be very useful to allow many other users to create their own FPGA computing clusters.
This is a problem some have investigated, but what remains is a thorough
implementation of provisioning an FPGA cluster within a fully heterogeneous environment,
where it can communicate with any other network device (be it a CPU, another FPGA
cluster, or an Internet-of-Things device).
1.2 Goal
Our goal is to provide an easy way to orchestrate large FPGA clusters from a large pool of
heterogeneous cloud resources. Our two main objectives are ease of use and performance.
Ease of use requires us to abstract away the details of connecting large clusters. We
investigated using familiar programming models to connect these large clusters, such as
the accelerator model, which has a CPU offloading computation to multiple accelerators.
Through our investigation we noticed that this has performance limitations, and thus we
opted to create a model in which multiple FPGAs are connected together directly as one
accelerator, rather than having individual FPGAs as their own accelerators.
Our model allows users to have a uniform view of their entire circuit (which can
span multiple FPGAs) and design their large circuit at a logical level, where they are
not concerned with physical mappings of their circuit onto FPGAs or with the connecting
infrastructure between FPGAs. Our model also allows users to easily scale up their
designs by specifying with a simple pragma the number of times to replicate a sub-circuit
or the number of times to replicate the entire circuit.
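The replication idea above can be illustrated with a small sketch. This is a hypothetical example, not the thesis's actual file format: kernel names, the dictionary layout, and the "replicate" key are all assumptions made for illustration.

```python
# Hypothetical sketch: expanding a logical kernel description in which a
# "replicate" pragma requests N copies of a kernel. The dictionary format
# and key names are illustrative assumptions, not the thesis's actual syntax.

def expand_kernels(kernels):
    """Return a flat list of kernel instances, honouring 'replicate' pragmas."""
    instances = []
    for k in kernels:
        copies = k.get("replicate", 1)
        for i in range(copies):
            # Each replica gets a unique instance name.
            instances.append({"name": f"{k['name']}_{i}", "type": k["type"]})
    return instances

logical = [
    {"name": "producer", "type": "src"},
    {"name": "worker", "type": "compute", "replicate": 4},
]
print([k["name"] for k in expand_kernels(logical)])
# ['producer_0', 'worker_0', 'worker_1', 'worker_2', 'worker_3']
```

The point of the sketch is that scaling happens entirely at the logical level; the user never names individual FPGAs.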
1.3 Contributions
My contributions allow a user of a cloud computing system to provision a ready-to-use,
easy-to-scale FPGA cluster. The contributions can be broken down as follows:
1. Shown that a lightweight, low-overhead protocol is critical to have efficient coordination
of applications using multiple FPGAs.
2. Shown that low-latency direct interconnects between FPGAs provide a
significant performance improvement compared to having communications through
host CPUs.
3. Created infrastructure to provision a non-network-connected FPGA with a single
virtual machine.
4. Created a design flow to program cloud FPGAs, with and without network connectivity, in an
efficient manner.
5. Investigated an FPGA cluster model that uses a distributed OpenCL model to
connect multiple FPGAs in a single environment.
6. Created an FPGA Cluster Generation tool that creates FPGA network clusters by
connecting network FPGAs. This contribution can be divided into the following
sub-contributions:
(a) Created a script to translate a logical description of a circuit with no notion
of FPGA mappings, along with an FPGA mapping file, into physically partitioned
FPGAs. This contains extra logic to handle the networking between FPGAs.
(b) Created a script to assign unique network MAC addresses to FPGAs in the
data center.
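Contribution (b) can be sketched as follows. This is a minimal illustration, not the thesis's actual script: the prefix and numbering scheme are assumptions, though using locally administered unicast addresses (second-least-significant bit of the first octet set) is the conventional way to mint MACs that will not collide with vendor-assigned ones.

```python
# Illustrative sketch of assigning unique MAC addresses to FPGAs. The base
# prefix and the index-to-suffix mapping are assumptions for illustration;
# 0x02 in the first octet marks a locally administered unicast address.

def fpga_mac(index, prefix=(0x02, 0x00, 0x00, 0x00)):
    """Derive a unique locally administered MAC from an FPGA index."""
    if not 0 <= index < 1 << 16:
        raise ValueError("index out of range for a 16-bit suffix")
    octets = prefix + ((index >> 8) & 0xFF, index & 0xFF)
    return ":".join(f"{o:02x}" for o in octets)

print(fpga_mac(0))    # 02:00:00:00:00:00
print(fpga_mac(258))  # 02:00:00:00:01:02
```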
1.4 Overview
The rest of this thesis is organized as follows:
- Chapter 2 introduces background on FPGAs, data centers, cloud computing,
and FPGA virtualization.
- Chapter 3 describes the backend data center infrastructure used and the FPGA
infrastructure.
- Chapter 4 describes an FPGA cluster generation tool that uses a distributed
OpenCL environment, and its limitations.
- Chapter 5 introduces the infrastructure of our final design, from top-level
software scripts to low-level FPGA modules.
- Chapter 6 evaluates the infrastructure with microbenchmarks and a large case
study.
- Chapter 7 provides future work and concludes the thesis.
Chapter 2
Background
This chapter introduces some background information on Field Programmable Gate Ar-
rays, their use in data centers, cloud-computing and the back-end data center environ-
ment that is used in the work of this thesis.
2.1 Field-Programmable Gate Arrays
This thesis revolves around provisioning Field-Programmable Gate Array (FPGA) clusters
for a user from a resource pool managed by a cloud resource manager. FPGAs
provide a fine-grained, latency-sensitive computing alternative to the standard CPU
environment.
An FPGA is a silicon chip with a programmable switching fabric that allows the
formation of customized circuits [6]. In contrast to the standard CPU environment, where
the circuitry stays constant and performs actions based on instructions, an
FPGA changes its circuitry depending on the application.
This is implemented with the use of Look-up Tables (LUTs), which can implement
various logic functions (such as Boolean AND, OR, and NOT operations). Furthermore, these
LUTs are combined with memory elements (flip-flops) and grouped into logic blocks
for more complex applications that require memory. On top of logical functions there are
hardwired heterogeneous DSP blocks, memory blocks, and external components (Ethernet,
JTAG, USB) that can also be incorporated into the user-implemented circuitry. FPGA
CAD tools first map the user's circuit into logical hardware blocks, and then place these
logical hardware blocks onto the physical resources available [7].
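A LUT can be modeled simply: a k-input LUT stores a 2^k-entry truth table indexed by the input bits. The following sketch, with an assumed bit ordering (input 0 as the least-significant index bit), shows how the same hardware implements any 2-input function just by changing the stored table.

```python
# Minimal model of a k-input LUT: a 2^k-entry truth table indexed by the
# input bits. The bit ordering (input 0 as the least-significant index bit)
# is an assumption for illustration.

def make_lut(truth_table):
    """Return a function computing the logic function stored in truth_table."""
    def lut(*inputs):
        # Pack the input bits into an index into the truth table.
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return truth_table[index]
    return lut

# A 2-input AND is the table [0, 0, 0, 1]; OR would be [0, 1, 1, 1].
and2 = make_lut([0, 0, 0, 1])
print(and2(1, 1))  # 1
print(and2(1, 0))  # 0
```

This is exactly why reprogramming an FPGA changes its circuitry: the CAD tools simply load different truth tables (and routing) into the same fabric.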
FPGAs are conventionally programmed with a low-level hardware description language
that describes low-level physical circuitry, such as Verilog or VHDL [8, 9]. The
low-level design work is difficult and has a niche market, making it difficult for new users
to adopt. To mitigate these costs there have been pushes in high-level synthesis (HLS),
which translates high-level languages such as C and C++ into physical circuit descriptions.
Examples of these HLS compilers include the Vivado HLS tools (C, C++ to HDL)
[10], LegUp (C to HDL) [11], Xilinx SDAccel (OpenCL to HDL) [12], and the Altera OpenCL
SDK (OpenCL to HDL) [13]. Furthermore, the OpenCL environments include platform
architectures where many of the interfaces are abstracted away from the FPGA developer,
such as the PCIe interface, the Ethernet interface and the off-chip DRAM.
2.2 Cloud Computing and Data Centers
Data centers are large clusters of many compute devices [14], which can scale to the
order of thousands of servers. Traditionally these have been large CPU farms, used for
a multitude of applications that require large amounts of data and computation. These
data centers allow for the provision of large-scale services that process large amounts
of data, such as social media services, email, search engines [15], etc. The large scale
of storage, compute and networking resources allows companies to service a significant
number of users, but at the same time many challenges arise. Computation problems that
were once simple scale exponentially in complexity at the data center scale [2]. This is
mainly due to communication complexity across multiple nodes; these complexities include
the reliability of nodes, the reliability of messages, and the consistency of data across multiple
nodes. On top of computation and communication complexities, a big expense in the
data center is the energy required to run servers at that large a scale [16]. For
example, the Lakeside Technology Center in Chicago requires 180 MW of power, making it
the second-largest power customer of Commonwealth Edison (second only to Chicago's
O'Hare Airport) [17].
Data centers require a large capital investment, which is not a problem for companies
such as Microsoft, Google, or Facebook. However, smaller companies that would like to
use compute resources on a large scale may not be able to afford and maintain their own
data centers. Cloud computing provides these resources as a service to third parties [18].
The benefit is the sharing of infrastructure such as storage, computing and networking.
NIST defines cloud computing by the following characteristics:
1. On-Demand Self-Service: resources can be provisioned at any time.
2. Broad Network Access: all devices can communicate with any other device on the
network.
3. Rapid Elasticity: cluster sizes of devices can be changed easily.
4. Resource Pooling: resources are organized into pools for multiple clients.
5. Measured Service: metrics and tools are in place to measure usage.
2.3 Network Stack
The communication through networks is done through layered partitions, where each
layer provides a service to the layer above [19]. Figure 2.1 shows the layers within the
network stack. The Transport Layer provides full end-to-end transmission between a
host and destination on the Internet. This layer is not concerned with the path a packet
may take on the network, only the start and end points. The Network Layer similar to
the Transport Layer also is only concerned with the host and destination of a network
Chapter 2. Background 8
path, speci�ed by an IP address. The Data Link Layer is concerned with the local hops
a packet must take within the network, where each hop is speci�ed by a MAC address.
The physical layer is responsible for the physical transmission (e.g optic �bre, Ethernet
cable) of the information between links.
Figure 2.1: This illustrates the network stack from the transport layer and below.
A network can consist of many switches and hosts. Typically the translation between
an IP address (the end-to-end path description) and the MAC address (where to go on
the next hop) is done on intermediate network switches [19]. An example multi-switch
network is shown in Figure 2.2.
Figure 2.2: This shows an example of a small network connected by switches (S) and hosts (H). When two hosts wish to communicate they specify each other's IP address. The switch receiving a packet will decide the next hop by matching the IP address of the destination to the next-hop address specified by a MAC address (which determines which switch to go to next).
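The per-switch lookup described above can be sketched as a small forwarding table. Real switches use longest-prefix matching and protocols such as ARP; this exact-match table, with made-up addresses, is a deliberate simplification for illustration.

```python
# Sketch of the per-switch decision described above: the destination IP of a
# packet is matched against a forwarding table to pick the MAC address of the
# next hop. Exact matching and the addresses themselves are illustrative
# simplifications; real switches use longest-prefix matching.

forwarding_table = {  # destination IP -> next-hop MAC (made-up values)
    "10.0.0.2": "02:aa:00:00:00:01",
    "10.0.1.7": "02:aa:00:00:00:02",
}

def next_hop(dst_ip, default_mac="02:aa:00:00:ff:ff"):
    """Return the MAC address of the next hop for a destination IP."""
    return forwarding_table.get(dst_ip, default_mac)

print(next_hop("10.0.0.2"))  # 02:aa:00:00:00:01
```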
2.4 Software-Defined Networking
Software-Defined Networking (SDN) is a concept that enables programmatic control of
entire networks via an underlying software abstraction [20]. This is achieved by the
separation of the network control plane from the data plane, as shown in Figure 2.3. SDN
opens the door for users to test custom network protocols and routing algorithms, and
furthermore, it allows the creation, deletion, and configuration of network connections to
be dynamic. The current de facto standard protocol for enabling SDN is OpenFlow [21].
In OpenFlow, the control plane is managed by a user program running on a CPU that
leverages APIs exposed by an SDN controller. The SDN controller, often referred to as
the "network operating system", abstracts away network details from the user programs.
The controller manages the data plane and creates configurations in the form of flows.
Figure 2.3: System diagram of an SDN, where user-defined control programs manage network switches.
The control plane is generally responsible for managing the data plane, and creates
configurations in the form of flows. These flows describe the overall behaviour of the
network, and can be used to specify custom paths through the network based on packet
headers, or even to specify operations on the packets themselves (e.g., drop packets, modify
headers, etc.). While the switches in the data plane can handle simple header matching
and modification of header fields, more complicated features, such as pattern matching
within the payload or modifying the payload data, require the packets to be forwarded
up to the control plane for processing in software. Per-packet software-based processing
often incurs significant latencies and reduces line rate. The switches in the data plane
can handle simple matching of flows; however, if a packet does not match a flow, it is
either handled by a default flow or forwarded up to the control plane for the routing to
be handled in software.
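The match-then-default behaviour just described can be sketched as a tiny flow table. The field names and action strings here are illustrative assumptions, not OpenFlow syntax; the point is only the structure: match fields, an action, and a default action standing in for forwarding to the control plane.

```python
# Simplified model of flow matching: each flow pairs a match on header fields
# with an action. A packet matching no flow falls through to a default action
# ("controller"), mirroring the forward-to-control-plane behaviour described
# above. Field names and actions are illustrative, not OpenFlow syntax.

flows = [
    ({"dst_port": 80}, "forward:switch2"),
    ({"dst_port": 443}, "forward:switch3"),
]

def apply_flows(packet, default_action="controller"):
    """Return the action of the first flow whose match fields all agree."""
    for match, action in flows:
        if all(packet.get(field) == value for field, value in match.items()):
            return action
    return default_action

print(apply_flows({"dst_port": 80}))  # forward:switch2
print(apply_flows({"dst_port": 22}))  # controller
```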
This creates an opportunity for FPGAs: FPGAs can combine the best of both worlds,
with the reconfigurable nature of software programs in the control plane and the low
latency of the switches in the data plane. An example of a project using FPGAs in SDN
can be seen in [22]. This project was implemented with virtualized FPGAs in a data
center, where two virtualized FPGAs were inserted into the data path of a network flow.
Packets that normally would have been sent to the control plane for custom processing
were instead redirected to the FPGAs for processing. Using this approach, the throughput
of the packets is the same as a direct path through a switch; whereas when the
packets were handled by software running in the control plane, only half the expected
throughput was observed.
2.5 Internet-of-Things
Internet-of-Things (IOT) introduces the idea that "things" not restricted to standard
computation tools can connect to the Internet [23]. These include sensors measuring
traffic, heat, pollution, etc. These are used to create a smart environment, allowing us
to gather information and make control decisions accordingly [24]. One example is
the installation of sensors at traffic lights to detect the presence of vehicles waiting at
the light. The connection of these devices also brings forth a large amount of data that
otherwise would not be available. This data can be used for analytics such as the analysis
of pollution levels within a city.
2.6 Smart Applications on Virtualized Infrastructure
(SAVI) Testbed
The SAVI testbed is a Canada-wide multi-tier heterogeneous testbed, as seen in
Figure 2.4 [25]. This testbed contains various heterogeneous resources such as
FPGAs, GPUs, network processors, IOT sensors and conventional CPUs. The
virtualization of these resources is still being researched (our work investigates the
FPGA platforms). Previous virtualization work on this testbed includes the work by
Byma et al. [26], which provides partial FPGA regions as OpenStack resources. Other
resources such as GPUs and network processors are given to the user either by providing
the entire machine without virtualization or with the use of PCIe passthrough. PCIe
passthrough is when the hypervisor allows a virtual machine to have complete access to
a PCIe device. Once a virtual machine acquires this device, no other virtual machine
can reserve that device.
Figure 2.4: System diagram of the SAVI multi-tier architecture, which has a CORE with many CPU compute servers and Edges physically dispersed around Canada. Each Edge is made up of compute CPUs and other heterogeneous devices (e.g., FPGAs, GPUs, IOT sensors).
The multi-tier property refers to the network architecture of SAVI. SAVI can be
seen as multiple cloud networks. The core network consists of a large number of CPUs
that provide the backbone of the data center. This core network is then connected to
several edges dispersed around Canada. Each of these edges is a miniature cloud network
that also contains the heterogeneous devices. Many of these heterogeneous devices are
connected directly to the network through high-performance 10G switches. These devices
are treated the same way any CPU would be treated, as many of them are assigned
network ports with valid MAC and IP addresses. These devices are addressable by any
other node (CPU or other device) on the network once they are registered to the network.
This allows, for example, an IOT sensor in Toronto to send data to an
FPGA cluster in Victoria and then have the data be accessible by a CPU cluster in
Calgary. Furthermore, the multi-tier architecture allows a lot of the processing to be
done on the edge network, close to the heterogeneous devices, before data is sent to the
large CORE where we have more compute resources.
2.6.1 OpenStack
OpenStack is the cloud management tool used by SAVI [27]. It is divided into several
services. The two main OpenStack services that we employ in our platform are Nova and
Neutron, which are typically interfaced with from a client machine. Nova is responsible for the
deployment of compute infrastructure from the platform. This involves the generation of
virtual machines on physical machines [28]. The client machine, when requesting a virtual
machine, specifies two fields: a software image and a flavor. The software image refers to
all the software that is to be installed on the virtual machine; this includes the operating
system and any other applications that we want to initialize our virtual machine with.
These images are typically kept in a repository and can be updated by users of the
testbed. The flavor refers to the physical specifications of the virtual machine, such as the
number of CPU cores, RAM, and hard drive space.
Neutron is responsible for the provisioning of network resources [29]. We can create
network ports within our cluster, and these ports are assigned MAC addresses and IP
addresses that are valid within the cluster. When creating virtual machines these
ports are created implicitly, but we can explicitly create additional ports for non-virtual
devices or non-CPU devices.
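The two requests just described can be sketched as the JSON bodies a client sends to the Nova and Neutron REST APIs. The field names (`imageRef`, `flavorRef`, `network_id`) follow the public OpenStack APIs, but the identifiers and names below are made up for illustration; a real client would POST these with an authentication token.

```python
# Sketch of the two OpenStack requests described above, as the JSON bodies a
# client would POST to Nova and Neutron. The UUIDs and names are placeholders.

def nova_boot_body(name, image_ref, flavor_ref):
    """Body for Nova's POST /servers: boot a VM from an image with a flavor."""
    return {"server": {"name": name,
                       "imageRef": image_ref,
                       "flavorRef": flavor_ref}}

def neutron_port_body(network_id, name):
    """Body for Neutron's POST /v2.0/ports: an explicit port, e.g. for an
    FPGA or other non-VM device; Neutron assigns the MAC and IP."""
    return {"port": {"network_id": network_id, "name": name}}

vm = nova_boot_body("fpga-host", "ubuntu-image-uuid", "flavor-uuid")
port = neutron_port_body("tenant-net-uuid", "fpga0-port")
print(vm["server"]["name"], port["port"]["name"])  # fpga-host fpga0-port
```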
2.7 Related Work
In this section we describe previous work on virtualized FPGAs and other cluster
management tools in the cloud.
2.7.1 FPGA Virtualization
There has been previous academic work on providing FPGAs as virtualized resources
within the cloud management tool OpenStack. The work presented by Byma et al. proposes
that FPGA resources sitting directly on the network be allocated as OpenStack resources
[26]. The hypervisor is programmed into hardware and communicates with the OpenStack
controller via the network. Furthermore, the FPGA application region in this case is split
into four smaller regions, allowing multiple users to share a single FPGA device. This
also requires modifying OpenStack to communicate with the hardware hypervisor in the
FPGA.
Another important work, and the one most similar to ours, is the Leap project [30].
The focus of Leap is to provide an operating system for an FPGA. They abstract
away many details such as memory and I/O by providing analogous system calls to
interact with physical FPGA hardware. They provide multi-FPGA support through a
mapping file that they use to describe a multi-FPGA cluster. This requires the user
to first physically connect multiple FPGAs using any communication medium. Once
the user creates this cluster, they then create a configuration file describing the
physical connections and their mediums in the cluster. This cluster is then seen as a
large computation device that is ready to be programmed by the user through their
abstraction layer.
The work proposed by Chen et al. also virtualizes FPGAs in OpenStack but moves
away from FPGAs sitting directly on the network [31]. They propose implementing the
hypervisor in software by modifying KVM, which is a popular Linux hypervisor [32].
Instead of sitting directly on the network, the FPGA is coupled with a virtual
machine. Similar to the previous work, this also requires modifying OpenStack to
communicate with the software hypervisor.
Several industrial pursuits have started investigating provisioning FPGA resources
from a cloud. One example is the Maxeler MPC-X project [33]. This project provides
a virtualized FPGA resource to a user that can be implemented with a variable number
of FPGAs. The user first allocates resources for the given cluster of FPGAs in the
virtualized FPGA resource. Once the cluster has been made, the details are abstracted
from the user during application run-time.
IBM's SuperVessel looks at providing an FPGA as a cloud resource that shares memory
(through CAPI) with a CPU, also provisioned with OpenStack [34]. In this model a
single FPGA is provisioned to the user as an accelerator, to which the user can upload
FPGA code to be compiled and run on the FPGA. This simplifies the process of
provisioning an FPGA and running code to be accelerated on the FPGA, but it works with a
single FPGA. The user in this model can also use pre-uploaded FPGA applications as
services, which can be provided by companies or other users of the infrastructure.
Microsoft has also continued their work with data center FPGAs with the second
iteration of Catapult [35]. The model here looks at providing a backbone infrastructure
for multiple FPGAs to be connected together through a high-performance network
switch. CPUs are tightly coupled with FPGAs, and the FPGAs are connected to the
switch. FPGAs communicate amongst each other through a low-overhead custom
transport layer. Microsoft's view of the multi-FPGA fabric looks at the problem at an FPGA
granularity, where the user divides their large circuit across multiple FPGAs and the
user's circuits are aware of FPGA boundaries.
Lastly, Amazon AWS has recently announced that they are introducing Xilinx
UltraScale+ VU9P FPGAs to their cloud resource pool, connected to VMs via a virtual JTAG
connection and dedicated PCIe x16 connections [36]. They provide two flavors of
FPGAs with their CPUs: one is a single FPGA accelerator, and the other is an
8-FPGA ring. The 8-FPGA flavor is connected via a 400 Gbps bidirectional low-latency
network.
2.7.2 Cloud Cluster Management Tools
Another aspect of this project is to provide orchestration of clusters within our cloud
environment. Heat is a component of OpenStack that can orchestrate clusters using an
orchestration template, which describes the virtual machines and networking within your
cluster [37]. This allows the creation of interesting network topologies within your own
cluster. Heat can be combined with user applications that modify these clusters
using other metrics such as performance, resource utilization, and CPU usage.
Other tools exist that combine orchestration and load balancing using the aforementioned
metrics. The usual workflow for these tools is as follows. The tool first reserves
a set of resources from a larger pool of compute nodes for a certain application. The
allocated resources are then connected for the application and monitored. The monitoring
is used for user statistics as well as fault tolerance within the cluster.
These tools are helpful for getting optimal, reliable performance on a cluster as well
as for debugging a cluster. Debugging a cluster can be a daunting task as there are many
variables within the cluster. These tools monitor events to gauge the status of different
processes within an application and present problems to the user in an easy-to-understand
representation.
Most of these tools currently work for CPU clusters (e.g., Apache Mesos, Slurm) and
GPU clusters (e.g., the NVIDIA Management Library) [38, 39, 40]. Our challenge is to expand
clustering capabilities to FPGAs by developing our own orchestration tool and then to
investigate monitoring and updating our clusters using FPGA metrics, which will differ
from the CPU and GPU metrics that current tools use.
Comparison of our Cluster Generator to Other Tools
In our work, we develop a cluster generation tool that takes as input a number of computation
kernels and their connections. It allows us to easily create large multi-FPGA
clusters. We can compare this to the Microsoft Catapult project. The first iteration of the
Catapult project has statically connected FPGAs in a fixed torus and lacks flexibility [5].
Their second version of the project has a network-connected model similar to our design,
where all FPGAs are connected to a network switch [35]. A key difference between
our project and the Catapult project is the model in which we describe our problem. The
Catapult project breaks the problem into FPGA boundaries, and it requires the user to
think in terms of physical FPGAs. In our model the user is not concerned with FPGA
boundaries, and designs kernels independent of FPGAs. Our model also allows for easy
scalability, where we can scale up our designs with the simple use of a pragma.
We can also compare our work to the Leap FPGA project. Similar to our Catapult
comparison, we provide easy scaling that is not available in the Leap project. Furthermore,
our tools sit on top of a cloud managing tool that can create arbitrary FPGA
connections. In the Leap project, the user has to physically connect the FPGAs in a
user-specified topology. With respect to topology our work is more flexible. However, the
connection medium in Leap is flexible, whereas in our design we assume network-connected
FPGA clusters.
2.8 Level of Abstraction
Our work looks at using OpenStack to provision FPGA network clusters. This is similar
to the other OpenStack works cited, but on a larger scale, as we are looking at
multi-FPGA clusters. The physical layout of our FPGAs is similar to that of Catapult,
with the FPGAs as network-connected devices, but in our environment these clusters
are provisioned with OpenStack. Furthermore, our backend data center is a large pool
of heterogeneous resources where not only FPGAs are network connected but they are
connected to CPUs and IoT devices (receivers, sensors, etc.). Our FPGA cluster is seen
as any other network device with a MAC address and IP address, and any
network-connected device in the data center can communicate with it. Unlike the
Amazon EC2 F1 project, our work provides FPGAs as part of the network backend where
they can communicate directly with any virtual machine or other network device
in the network. Lastly, our work builds on top of this infrastructure by providing simple
cluster provisioning tools that communicate with OpenStack to generate the infrastructure.
This infrastructure request uses a logical description file that describes the user kernels
and how they are connected. This logical description file is FPGA independent and also
provides methods of scaling up nodes within a cluster, introducing schedulers, or even
replicating an entire cluster. An FPGA mapping file is also provided that maps each
kernel specified in the logical file to a particular FPGA.
Our level of abstraction is demonstrated in Figure 2.5. This work is not true virtualization;
instead it provides the infrastructure needed for true virtualization. Our work
easily creates FPGA infrastructure from a pool of cloud resources by using an FPGA-independent
description of a circuit and an FPGA-dependent mapping of the circuit.
This gets translated into a physically partitioned FPGA circuit with FPGA network
interconnections automatically generated from the cloud. However, this still requires some
user specification of where to place kernels, which means this is not true virtualization,
as the physical specifications of the FPGA are not hidden from the user. We can build
virtualization on top of this by creating a virtual FPGA that can be made out of many
FPGAs in the cloud. True virtualization will be able to characterize the number of physical
resources required given the user specification, and then our tools can be invoked to
create the physical cluster out of the resources available in the cloud. We explore this in
Section 7.1.
Figure 2.5: This illustrates the level of abstraction stack that we provide and where we believe true virtualization should exist.
Chapter 3
Base Infrastructure: Cloud
Resources and FPGA Platform
This tool provides a high-level abstraction to acquire FPGA clusters from a virtualized
environment. We define our infrastructure stack in Figure 3.1.
Figure 3.1: Our infrastructure stack. We provide APIs at each layer and abstract away most of this stack from the user. The user supplies the top layer and we return a fully connected FPGA cluster.
In Chapters 4 and 5 we present the implementation of our FPGA cluster generation
tools. In this work we define an FPGA cluster as an environment that has multiple
FPGAs connected in a manner that makes inter-FPGA communication easier. The first
model looks at the FPGA cluster as a CPU connected to multiple FPGA accelerators.
Coordination between accelerators is handled by the CPU and thus requires inter-FPGA
communication to happen through the CPU, which became the bottleneck of our design.
The second design model we looked at connects multiple FPGAs using Ethernet on the
FPGA and allows FPGAs to communicate directly with one another, eliminating the
CPU bottleneck in communication. The two design alternatives explored in Chapters 4
and 5 utilize virtual CPUs tightly coupled with FPGAs, as OpenStack is used to provision
FPGAs to the user. This chapter explores the modifications made to allow SAVI to
support the provisioning of a single FPGA, which corresponds to the OpenStack Compute
Commands, OpenStack Network Commands and Cloud Network Port Registration parts
of our infrastructure stack.
3.1 SAVI Infrastructure Modifications
The SAVI infrastructure as explained in Section 2 includes the physical servers, the
heterogeneous devices and the networking capabilities.
3.1.1 OpenStack Resource Manager
OpenStack is the virtualized resource manager that is used by the SAVI infrastructure.
This includes physical servers managed with hypervisors connected to high-performance
network switches that are also managed with software-defined networking tools.
When a user requests a virtual machine from SAVI, the request specifies the physical
specifications of the virtual machine (the flavor) and the software image of the virtual
machine. Figure 3.2 shows what a virtual machine request looks like.
Figure 3.2: A standard OpenStack virtual machine request.
Each physical server has an agent. An agent is a program running on the server
that is responsible for communicating with OpenStack. The agent is sent requests to
make/remove virtual machines with certain specifications and software images, and requests
for access to physical heterogeneous devices available in the physical server. Past
work in FPGA virtualization has looked into creating custom agents to manage FPGA
virtual machines [26]. Our approach is different, as we wish to keep these modifications
to a minimum.
The only changes we made were to include flavors supporting PCIe FPGA devices, and a
few configurations on the KVM server to support passthrough of a specific PCIe device.
The other approach would be to modify OpenStack to support our FPGA environment,
but that would make adoption in other OpenStack environments more difficult.
3.1.2 PCIe Passthrough and OpenStack Image
First, we provide the FPGA as part of a VM using PCIe passthrough, which is when
the VM is given full access to a PCIe device on the physical server. OpenStack notifies
the software hypervisor on the physical server of the VM parameters using the flavor
discussed in Section 2.6.1. These parameters also include information about any PCIe
devices required by the user. This involves configuring the hypervisor to pass control of
the PCIe device to a specific VM by adding the PCIe vendor and device ID of the FPGA
to the OpenStack configuration script on the physical server. The cloud management
system then provisions the VM including the requested PCIe device(s). Figure 3.3 shows
two example VMs with PCIe-connected FPGAs. Once a virtual machine is assigned a
PCIe device, it is given full access to the device, and the device cannot be shared with
another virtual machine.
Figure 3.3: This figure illustrates an example of two virtual machines on a single server. One virtual machine has one PCIe FPGA and the other has two PCIe FPGAs.
Secondly, we have created multiple OpenStack flavors corresponding to the PCIe devices.
Each flavor describes the configurations of the desired VM. These configurations
include the number and type (specified by the device ID and vendor ID) of PCIe devices.
We made two flavors, one lightweight flavor and another for a full development
environment. The lightweight flavor, which consists of only two CPU cores and 2 GB of
memory, is intended for the CPU on the VM to act as a mere controller for the FPGA.
The full development environment, which consists of four CPU cores and 8 GB of memory,
provides a complete environment to create and test FPGA designs as well as control
the FPGA. The specifications of these VMs are shown in Table 3.1 and Table 3.2.

Table 3.1: Physical specifications of virtual machine used within FPGA clusters
    Number of Cores: 2
    Disk Space: 10 GB
    RAM: 1 GB

Table 3.2: Physical specifications of standalone FPGA design station
    Number of Cores: 4
    Disk Space: 40 GB
    RAM: 8 GB
Next we made a software image for our virtual machine that will host the FPGA.
This is the base software image. The cluster designs described later in this section add
more software support to the base software image. The base software image contains the
Xilinx SDAccel 2015.3 tools and PCIe driver. The lightweight image contains a subset of
this tool-chain and is limited to only the PCIe driver and programmer. Virtual machines
using this image cannot generate bitstreams but can program FPGAs and communicate
with the FPGA via a software driver. In later clusters we require at least one machine
to have the full tool-chain, as this machine will be used to develop the bitstreams that
will then be distributed amongst the cluster.
3.1.3 Networking Backend
Physical compute servers, FPGAs and IoT devices are physically connected directly to
network switches. These network switches are managed by SAVI's network manager
Janus [41]. Devices attached to the network switch need to have network ports
registered with Janus. The registering of these ports requires the port number and an
IP address along with a MAC address. Once registered, Janus uses OpenFlow to route all
traffic destined for a specific IP address and MAC address to the registered port. Janus
also ensures that all traffic that has an invalid destination or source (not registered) is
dropped within the network. The registration of a port first requires the creation of a
virtual port in OpenStack. The OpenStack tool Neutron is used to create a new virtual
port that has a new MAC address and IP address. Once these have been registered,
all packets destined for our device must use the IP address and/or the MAC address in
their header field. Also, all packets that do not have a destination field matching any
destination in the virtual network are dropped. These requirements can be bypassed
using custom networking flows that can be programmed onto the switch. The type of
device is independent of the network port, which allows us to use the same mechanism
to assign IP and MAC addresses to FPGAs and IoT devices within our network.
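The registration behaviour described above can be modeled as a simple lookup: traffic whose destination (MAC, IP) pair is registered is forwarded to the registered switch port, and anything else is dropped. The addresses, port numbers, and dictionary-based model below are illustrative only; Janus itself installs OpenFlow rules on the physical switch.

```python
# Toy model of Janus port registration and forwarding. A (MAC, IP) pair
# registered through Neutron maps to a physical switch port; unregistered
# destinations are dropped (modeled here as returning None).
registered = {("fa:16:3e:00:00:01", "10.0.0.5"): 7}  # (MAC, IP) -> switch port

def route(dst_mac, dst_ip):
    """Return the switch port for a registered destination, else None (drop)."""
    return registered.get((dst_mac, dst_ip))

print(route("fa:16:3e:00:00:01", "10.0.0.5"))  # forwarded to port 7
print(route("fa:16:3e:00:00:99", "10.0.0.9"))  # None: dropped
```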
3.2 Xilinx SDAccel Platform
All the design alternatives explored here use the Xilinx SDAccel platform [12] (or a
modified version of the platform). This platform provides the user a set of APIs to
program an FPGA, send the FPGA data, and read data processed by the FPGA. This
platform can be seen as an FPGA hypervisor, as it is responsible for managing the
FPGA interface around the user application. This is explained in Section 3.2.2.
3.2.1 OpenCL
OpenCL is a heterogeneous programming platform that allows a user to communicate
with devices via a host application [42]. These devices include GPUs, CPUs, and most
recently, FPGAs. Interactions between the host and the devices are called OpenCL
events. OpenCL events can be profiled and synchronized, even between devices, which
becomes even more challenging when these devices are on the network. Figure 3.4 shows
the heterogeneous environment OpenCL provides. The host is linked with OpenCL host
libraries that the host can use to interact with the devices. Furthermore, the code running
on the devices, known as kernels, is written in the OpenCL language. This is a language
very similar to C, but with more parallel constructs.
Figure 3.4: The heterogeneous environment provided by OpenCL
Each OpenCL vendor provides an interface to their device called an Installable Client
Driver (ICD). The ICD provides the mapping from standard OpenCL API calls to
specific device-driver implementations of the OpenCL API. A multi-platform OpenCL
application loads the vendor devices by traversing a list of files that specify the
vendor-specific ICD implementations. The ICDs are loaded and then subsequent OpenCL
host API calls are redirected to the ICD for the specific device. More information on the
OpenCL specifications can be found in [42].
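The file-traversal step of ICD discovery can be sketched as follows. On Linux, OpenCL loaders conventionally scan a vendors directory (commonly /etc/OpenCL/vendors) whose .icd files each name a vendor ICD library; a temporary directory stands in for it here so the sketch is self-contained, and the file and library names are illustrative.

```python
import os
import tempfile

def discover_icds(vendors_dir):
    """Collect the vendor ICD library named by each .icd file in the directory."""
    libs = []
    for name in sorted(os.listdir(vendors_dir)):
        if name.endswith(".icd"):
            with open(os.path.join(vendors_dir, name)) as f:
                libs.append(f.read().strip())  # path of the vendor ICD library
    return libs

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "xilinx.icd"), "w") as f:
        f.write("libxilinxopencl.so\n")
    print(discover_icds(d))  # ['libxilinxopencl.so']
```

Once each named library is loaded, subsequent OpenCL host calls for that vendor's devices are dispatched to it, as described above.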
3.2.2 FPGA Hypervisor
In our design we use the Xilinx SDAccel [12] platform as an FPGA hypervisor, where
the hypervisor is used to provide some basic services. The FPGA in this model is a
PCIe-connected device, and the platform first provides a driver to communicate with the
FPGA. This is done through OpenCL, which provides the API to communicate with and
manage devices.
OpenCL is both a programming language for heterogeneous devices and a programming
API for a host application (conventionally run on a CPU) to manage and communicate
with OpenCL-compatible devices [42]. This environment gathers all the OpenCL
devices connected to the processor, usually locally via PCIe. In the SDAccel platform,
as shown in Figure 3.5, the OpenCL API communicates with a driver provided by Xilinx
called the Hardware Abstraction Layer (HAL) that provides driver calls to send/receive
data from the FPGA and program the Application Region in the FPGA. The Application
Region is programmed using partial reconfiguration, and the region around the
Application Region is the Hypervisor in our model. In this platform the kernels within
the Application Region can be OpenCL kernels, Vivado HLS kernels, or even hand-coded
Verilog/VHDL kernels. The PCIe Module is a master to a DMA engine that reads/writes
off-chip DRAM. This is used to communicate data to the Application Region. The PCIe
Module is also a master to an ICAP module (not shown) responsible for programming
the partial reconfiguration region with a bitstream sent from the user in software. The HAL
driver provides an API that abstracts away the addresses required to control the various
slaves of the PCIe master.
Figure 3.5: System diagram of the SDAccel platform
3.3 Design Flow for FPGA Development in the Cloud
In Chapter 4 we describe the design flow for the development of large scalable FPGA
clusters. The infrastructure described in this chapter, however, does present us with a
new design flow for FPGA development on a small scale. We deployed our FPGA cloud
service in May 2015. Since then it has been used by students within the University of
Toronto as part of their own FPGA development environment. Our infrastructure lays
the groundwork for a new design flow that helps utilize and share the FPGAs effectively.
This is done through the use of software simulation of FPGAs. The software tools
provided within the SDAccel environment allow for simulating the Application Region
completely in software, with no change to the user software application that is calling
the application. The simulated Application Region is wrapped to provide the exact same
interface for the Hardware Abstraction Layer as is done in the actual hardware. In
this way the same HAL can be used during software simulation to transfer data to and
from the simulated Application Region. This is supported by the standard SDAccel tool
provided by Xilinx.
Our environment gives the user flexibility to provision a VM containing the FPGA
development tools with and without a physical FPGA. This creates a new design flow as
follows:
1. The user develops their application on a VM without an FPGA. The user requests
a VM with a flavor that does not have the FPGA and the software image containing
the FPGA software tools. The user tests their design using the software-simulated
FPGA.
2. Once the user is ready to migrate their work to a physical FPGA, they save a
snapshot of their VM. This is done through an OpenStack API to save the state of
a VM.
3. The snapshot is then uploaded to the OpenStack software image repository. The
user then requests a new VM with a flavor that has the FPGA and the software
image snapshot saved in Step 2.
4. Now the user can test their application on a physical FPGA. After testing, they
can migrate their application back to a VM without an FPGA. They once again
will save a snapshot of their VM, but this time migrate to a machine without an
FPGA.
This design flow allows for easy sharing of the FPGA. Cloud managers can track
usage of the physical FPGAs by using monitoring functions provided by OpenStack.
This also has further implications for the re-usability of FPGA applications as functions.
Similar to software applications, we can create FPGA applications as virtualized
resources, upload the application to OpenStack and have it available as a software
image readily available to everyone.
3.3.1 Extended Design Flow for Multi-FPGA Applications
Multi-FPGA applications can also be deployed with the infrastructure described in
this chapter. Communication between FPGAs can be done through either the virtual
machine or directly through the network. The integration of the software simulation of
FPGAs along with actual physical implementations on an FPGA allows for an incremental
design flow. The design flow of multi-FPGA applications in our environment is as follows:
1. Implement all parts of the multi-FPGA application design as a chain of software-simulated
FPGAs (using an OpenStack image with the software tools and a flavor
that does not have the FPGA).
2. Implement and test each individual network function as an FPGA-offloaded design.
3. Incrementally, as we complete each part of the multi-FPGA application, swap the
software-based function with the FPGA-based implementation.
4. If the multi-FPGA application remains functionally correct, then repeat Steps 2
and 3 for the next part of the application. Repeat until the whole application is
implemented using FPGAs.
Chapter 4
Design Alternatives
This chapter introduces our first iteration of a cluster generation tool. We use our
own cluster generation tool to create an MPI software cluster. On top of the software
cluster we use an OpenCL network environment tool called SnuCL [43] to create an
OpenCL platform out of FPGA-connected virtual machines (SnuCL was modified for
FPGA support). We will first introduce SnuCL, then highlight the additions we had
to make to our OpenStack environment, describe our cluster generation tool and lastly
describe and analyze the results observed.
4.1 SnuCL
SnuCL provides a single OpenCL environment for a host device communicating with a
cluster of network-connected CPUs and GPUs. The communication between the host and
devices within the network cluster is implemented using the Message Passing Interface
(MPI). In MPI, a single application is split into a set of processes that can then be run
on top of a cluster of network devices. The processes communicate with each other using
messages, as there is no shared memory between the processes. The physical locations
of these processes are independent of the MPI implementation of the application. The
underlying physical infrastructure is specified to the MPI run-time (and not when the MPI
application is being compiled or implemented).
In SnuCL, the host and each of the devices are executed in separate processes. Traditionally
in OpenCL, the host will use the device-specific ICD on the same machine to
implement OpenCL functions; however, in SnuCL the host process first sends a message
to the particular device process. The device process is responsible for using the ICD to
relay information back and forth to the device, and then to relay the information back to
the host process. The underlying communication between the host and device processes
as well as the network architecture are handled by MPI and hidden from the user. Thus,
this can provide a shared view of the OpenCL environment to a user, abstracting away
the locations of the devices. Figure 4.1 highlights how SnuCL works in a cluster and
then gets logically transformed into Figure 3.4. More details on the implementation of
SnuCL can be found in [43].
Figure 4.1: Simplified SnuCL cluster; this gets logically translated into Figure 3.4
4.2 Modi�cations for SnuCL OpenStack Support
To support this in OpenStack we had to make our own OpenStack flavor and disk image.
A flavor can specify the type of PCIe device as well as the number of PCIe devices of
that specific type (e.g., a flavor for 1 FPGA, 2 FPGAs, 1 GPU, 2 GPUs, etc.). In addition to
PCIe devices, a flavor also defines other machine specifications such as memory and hard-disk
space. Once these flavors are created, they can be used to create multiple virtual
machines described by the specifications of the flavor. In our current implementation,
our flavor grants virtual machines 40 GB of hard disk space and 8 GB of RAM; however,
we can shrink this requirement as most of the processing is done on the FPGA.
SnuCL was modified to work with the Xilinx SDAccel environment. SnuCL was
previously tested on CPU and GPU clusters and required slight modifications to work
with FPGA devices. Our virtual machine disk image is implemented using the CentOS 6.6
image, as this supports the SDAccel driver. The following software is installed onto the
CentOS image:
1. Xilinx SDAccel 2015.1. This version has ICD support, which is needed for SnuCL.
2. OpenMPI 1.6.4, which is needed for SnuCL.
3. SnuCL modified for FPGA support.
Once the software tools are installed onto the CentOS image, a snapshot of the
virtual machine is taken. A snapshot refers to the creation of a new virtual machine disk
image that includes everything installed on a running virtual machine at the moment
the snapshot was taken. This snapshot can now be used to build new virtual machines
that come pre-packaged with our custom software tools. With the appropriate flavor
and virtual machine disk image, we can make fully functioning virtual machines that are
ready to launch distributed FPGA OpenCL applications.
4.3 Cluster Orchestration
This section goes over the automation of clusters within our environment. Our cluster
orchestration takes a Cluster Generation File (CGF) as input, acquires the requested
resources and forms a cluster. OpenStack is used to make virtual machines with the
flavor that includes the FPGA resources, using the virtual machine disk image that
has the necessary software tools.
There are several flavors that correspond with the FPGA device, with the flavors
differing by the number of FPGA devices. The flavor with the largest number of available
FPGAs less than or equal to the number of requested FPGAs is used. This is repeated until
the total number of FPGAs requested is reserved, to ensure the highest degree of locality
possible. Our orchestration system uses SnuCL to allow FPGAs in different physical
machines (thus different virtual machines) to be combined within one OpenCL
environment. However, if we wanted a small number of FPGAs that are available in one
physical machine, we could use standard OpenCL by reserving the flavor associated with
the number of FPGAs.
In our cloud SnuCL environment, after we reserve virtual machines for our devices,
another virtual machine is created to represent the host machine. This is of a different
flavor than the rest of the cluster, as this machine does not require a PCIe device.
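The greedy flavor-selection step described in this section can be sketched as follows. The available flavor sizes and the fallback to the smallest flavor when nothing fits exactly are assumptions for illustration, not the deployed tool's exact logic.

```python
# Hedged sketch of greedy flavor selection: repeatedly pick the flavor with
# the largest FPGA count that is less than or equal to what is still needed,
# until the whole request is reserved.

def select_flavors(requested_fpgas, flavor_sizes):
    """Return the list of flavor sizes (in FPGAs) chosen for the request."""
    sizes = sorted(flavor_sizes, reverse=True)
    chosen = []
    remaining = requested_fpgas
    while remaining > 0:
        # Largest flavor not exceeding the remaining request; if none fits,
        # fall back to the smallest flavor (slightly over-allocating).
        fit = next((s for s in sizes if s <= remaining), sizes[-1])
        chosen.append(fit)
        remaining -= fit
    return chosen

# e.g. flavors with 1, 2 and 4 FPGAs; a request for 7 FPGAs
print(select_flavors(7, [1, 2, 4]))  # [4, 2, 1]
```

Picking the largest fitting flavor first keeps as many of the requested FPGAs as possible on the same physical machine, which is the locality goal stated above.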
(a) The client requests a cluster with a cluster generation file (CGF), which is translated to the appropriate OpenStack commands.
(b) The cluster generated with OpenStack is prepared by connecting the appropriate nodes and preparing the nodes file required for MPI.
Figure 4.2: Demonstrates the two steps when orchestrating a cluster. First the cluster is reserved using OpenStack and second the cluster is prepared and connected for SnuCL.
Once the cluster is formed, the nodes are connected so that SnuCL works between
them. This involves modifying the firewall between these nodes and ensuring that there
is ssh access between the host node and the cluster nodes (OpenMPI spawns processes
on other nodes by executing them through ssh). A nodes file is then generated specifying
the IP addresses of the other nodes in the cluster and moved to the host node. The
SnuCL cluster application can now run on the host node by specifying the nodes file
to MPI. The generation and connection of the nodes and the preparation of the nodes file
are all done automatically with the cluster generation tool. The user, after requesting the
cluster from the tool, would then just have to log in to the host virtual machine and run
their cluster application. Figure 4.2 shows the formation of clusters using the cluster
generation tool.
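The nodes-file preparation can be sketched as follows. The IP addresses, file location and launch command shown are illustrative, not the tool's actual values; the one-IP-per-line layout matches the common OpenMPI hostfile convention.

```python
import os
import tempfile

def write_nodes_file(node_ips, path):
    """Write one cluster-node IP per line, as OpenMPI hostfiles expect."""
    with open(path, "w") as f:
        f.write("\n".join(node_ips) + "\n")

# Illustrative cluster-node addresses; the real tool fills these in from
# the virtual machines it reserved through OpenStack.
nodes_path = os.path.join(tempfile.gettempdir(), "nodes")
write_nodes_file(["10.0.0.11", "10.0.0.12"], nodes_path)

# The host node would then launch the SnuCL application with something like:
#   mpirun --hostfile nodes ./snucl_host_app
```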
4.4 Results
We implemented simple video processing kernels in OpenCL and ran the kernels with and
without SnuCL, both versions on the FPGA. The video kernels perform object tracking
and recognition on the FPGA.
We averaged the execution time per frame over the execution of 25 frames in our
environment. In the SnuCL environment the execution time per frame is averaged over
20 executions of 25 frames at a time. The average execution time of the kernel in the SnuCL
library is 224 ms, while it is 2.23 ms directly on the FPGA virtual machine. This is a
100-fold slowdown when going to SnuCL.
This experiment highlights that there is a lot of future work required in the communication
protocol of our SnuCL environment. SnuCL was available and easy to use,
but the overhead introduced by this system leaves room for improvement. SnuCL implements
its communication through MPI, which is readily available; however, a more
light-weight protocol could be investigated to replace or enhance SnuCL. On top of a communication
protocol, direct communication to the FPGA between compute kernels could
also prove to be beneficial. In Chapters 5 and 6, we explore our final design alternative,
which uses direct FPGA communication without MPI software overhead.
Chapter 5
FPGA Network Cluster
Infrastructure
This chapter addresses our second design alternative. This design alternative provides
a cluster of network-connected FPGAs to the user, given a description of what a cluster
of kernels will look like. The work in this chapter is based on the paper [44]. Thomas Lin
helped with the networking back-end required, and Eric Fukuda helped with the
application case study.
This alternative builds on our first design alternative by allowing users to work at
a high level with the cloud client. The user provides a description of their desired FPGA
cluster. This description is on a logical level and describes how different FPGA kernels
are to be connected together. Along with the logical description, the user provides an
FPGA mapping. This FPGA mapping specifies the number of FPGAs the user requires
and places the kernels on the appropriate FPGAs. Kernel connections across FPGAs are
implemented via Ethernet. Furthermore, kernels may also fan out to schedulers instead
of making direct kernel connections. The intricacies of the network connections and
schedulers are discussed later in Section 5.5.
In this work we define a logical cluster description as a cluster description without a
notion of an FPGA mapping; a physical cluster description is what results after the logical
cluster is partitioned and placed onto the appropriate physical FPGAs.
5.1 Logical View of Kernels
The kernels in this system are streaming kernels and they use the AXI stream protocol
for input and output. The AXI stream interface our system uses has the following fields
(a subset of all the fields offered by the protocol):
- 32-bit data field. Stores the data of each transfer.
- 32-bit dest field. Stores the destination of each transfer. The destination corresponds
to an address of each kernel on the FPGA.
- 1-bit last field. For a packet with multiple transfers, this is asserted on the last
transfer of the packet.
- 1-bit ready field. This is asserted downstream to notify the stream that it is ready
for input.
- 1-bit valid field. This is asserted on a valid transfer.
These bit fields correspond to a single flit of a transfer; an AXI stream packet can
correspond to multiple flits, where the concluding flit will have the last field asserted.
This is the protocol that we use within our module. However, when we transfer packets
over the Ethernet we do not have a dest field, as the Ethernet module does not use a
dest field. We append the 32-bit dest field as part of the header of our Ethernet packet.
For simplicity we currently use a 32-bit dest field because this easily aligns to a 32-bit
word boundary. This is the case because packets read from the Ethernet module are
read 32 bits at a time. This creates a significant overhead when there are packets being
distributed within an FPGA, as large multi-flit packets will transmit 32-bit dest fields for
each flit transfer. To save on wires on the FPGA we can look to shrink the destination
overhead.
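As a rough sketch of the framing just described (not the actual on-wire format), the 32-bit dest word can be prepended to the 32-bit payload words of a packet before it enters the Ethernet module; the field values below are illustrative.

```python
import struct

def pack_packet(dest, payload_words):
    """Prepend the 32-bit dest word to a packet's 32-bit payload words."""
    words = [dest] + list(payload_words)
    return b"".join(struct.pack(">I", w) for w in words)

# A packet addressed to kernel 2 carrying two 32-bit payload words.
pkt = pack_packet(0x00000002, [0xDEADBEEF, 0x00000001])
assert len(pkt) == 12  # one dest word + two payload words, 4 bytes each
```

The sketch also makes the overhead noted above concrete: every packet carries a full 4-byte dest word regardless of payload length.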
All kernel inputs to the system are addressed by a specific dest entry. Logically
speaking, unless otherwise stated, any kernel output can connect to any input. This can
be seen as all kernels being connected to a large logical switch. These kernels may be
mapped to the same FPGA or to different FPGAs. Furthermore, these kernels can be
replicated with directives in the input scripts, and they can be scheduled in different ways
with the use of schedulers.
Figure 5.1: The simple logical view of a kernel cluster. In this situation all the kernels output to a switch, and their inputs are addressed through the switch.
Figure 5.2 shows the XML file the user specifies, corresponding to the logical cluster
in Figure 5.1. Each kernel is assigned an address that corresponds to the address of its
input port. There is also a replication field that specifies the number of times we wish
to replicate the kernel.
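The exact schema is shown in Figure 5.2; since that figure is not reproduced here, the sketch below invents plausible element and attribute names purely to illustrate the idea of a kernel list with per-kernel address and replication fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical-cluster description; the element and attribute
# names here are invented for illustration, not the exact schema.
CLUSTER_XML = """
<cluster>
  <kernel name="A" address="0x2" replication="1"/>
  <kernel name="B" address="0x3" replication="1"/>
  <kernel name="C" address="0x4" replication="2"/>
</cluster>
"""

def parse_cluster(text):
    """Map each kernel name to its (input-port address, replication count)."""
    root = ET.fromstring(text)
    return {k.get("name"): (int(k.get("address"), 16), int(k.get("replication")))
            for k in root.findall("kernel")}
```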
5.1.1 Sub-Clusters
In Figure 5.1 we show three kernels connected via one logical switch. All kernels are
connected to each other in a fully connected network. Edges can be removed if we
directly connect kernels. Figure 5.3 shows four kernels with direct connections between
some of the kernels. Such sub-clusters are then connected to the logical switch.

Figure 5.2: Example logical cluster XML file.
We can also have our own schedulers, where the output of a kernel might not be
connected to all the other kernel inputs but to a subset of kernel inputs arbitrated by a
scheduler. This type of sub-cluster is shown in Figure 5.4 and explained in more detail
in Section 5.5. Figure 5.5 shows how multiple sub-clusters can be connected to the same
logical switch.

Figure 5.3: An example of a directly connected sub-cluster that would be connected to the logical switch.
Figure 5.4: An example of a sub-cluster where a kernel fans out to a local scheduler that arbitrates between three kernels within the sub-cluster.
5.2 Physical Mapping of the Kernels
Each kernel in the logical topology is mapped to a physical FPGA. More than one kernel
can be mapped to an FPGA. Direct kernel connections on the same FPGA are simply
connected within the FPGA. Kernels with connections that cross an FPGA boundary
are wrapped with logic to help with the crossing. Figure 5.6 shows a sample mapping
file that our infrastructure takes as input.
Figure 5.5: How the sub-clusters fit with the logical switch.
Figure 5.6: Example FPGA mapping file.
When connections on the large logical switch are divided across multiple FPGAs,
the logical switch is implemented as physical switches on each of the FPGAs. Figure 5.1
shows three kernels fully connected with a logical switch. Now let's consider the following
scenario: Kernels A and B are on FPGA 1 and Kernel C is on FPGA 2. The physical
mapping is shown in Figure 5.7.
Figure 5.7 shows the logical switch split into two physical switches. The inputs to
the respective kernels on the two FPGAs always come from the physical switch on the
same FPGA. The first FPGA sends all packets addressed to Kernel C to the switch on
the second FPGA, and the second FPGA's switch sends all packets destined for Kernels
A and B to the first FPGA. The output of each kernel feeds into the physical switch on
that FPGA, and the physical switch determines the destination FPGA of each packet.

Figure 5.7: This figure translates the logical cluster specified in Figure 5.1 into a physical cluster with two FPGAs.
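The forwarding decision each physical switch makes can be derived from the kernel-to-FPGA mapping alone. The snippet below is a behavioral sketch of the scenario above; the addresses and FPGA names are illustrative.

```python
# Kernel address -> hosting FPGA, for the scenario above:
# Kernels A (0x2) and B (0x3) on FPGA 1, Kernel C (0x4) on FPGA 2.
MAPPING = {0x2: "fpga1", 0x3: "fpga1", 0x4: "fpga2"}

def route(local_fpga, dest):
    """Physical-switch decision: deliver locally or forward to the remote FPGA."""
    target = MAPPING[dest]
    return "local" if target == local_fpga else target
```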
For edges between kernels that are not connected to the large logical switch (sub-
clusters), the direct connections must also be facilitated between FPGAs.
5.3 FPGA Infrastructure
To facilitate the connection of FPGAs in the network we need specific hardware and
software modules. The hardware that we use is the SDAccel framework, modified to
include Ethernet capabilities. This platform does not support the high-level OpenCL
calls; instead we directly use the HAL to communicate with the FPGA. Figure 5.8 shows
the experimental version of the SDAccel shell from Xilinx with Ethernet capabilities,
before we make modifications to support our infrastructure.
Figure 5.8: The experimental version of the SDAccel platform before modifications were made to support our infrastructure.
This experimental version of the shell does not include an application region. Instead
it has a MicroBlaze soft processor that is configured with a program to send packets
through the Ethernet. The MicroBlaze reads from a certain address in off-chip memory;
this address contains the packet the user wishes to send over the Ethernet. The off-chip
memory is populated by the software application, which uses the HAL to send the packet
via PCIe to the FPGA.
5.4 SDAccel Platform Modifications
Figure 5.9 shows the modified Ethernet platform. The modifications to the base platform
are as follows:

• An application region was added. However, unlike the default non-Ethernet version
of SDAccel (as seen in Figure 3.5), the Application Region is not part of a partially
reconfigurable region.

• The processor was removed from the critical path for sending packets to the Ethernet.
This is necessary for the application region to process packets at line rate, as the
processor introduces too much overhead. The application region can now stream
packets directly to and from the Ethernet.

• The processor was kept in the shell for debugging purposes as well as for the
configuration of some hardware blocks.

• The PCIe module can also drive signals in the application region, which is used
for the configuration of hardware blocks from the software driver.
With the modifications, our system-level multi-FPGA system includes many
lightweight virtual CPUs that are coupled with FPGAs. The CPUs are responsible for
configuring certain hardware modules within the application region required for the
networking of FPGAs. The network interfaces of the FPGAs are physically connected to
a network switch. With the help of specific hardware modules and the networking
backend SAVI provides, we can connect the FPGAs in our own specific topologies as
specified by the user. Figure 5.10 shows the multi-FPGA system.
The virtual machines with FPGAs are generated with an OpenStack flavor consisting
of a lightweight CPU and a single FPGA device. The software image is a stripped-down
version of the Xilinx Vivado tools that only has FPGA programming capabilities. The
FPGA software driver waits to receive a bitstream over the network. Once a bitstream
is received, the FPGA is programmed and the FPGA hardware modules are configured
with the appropriate network metadata. The machine without an FPGA is generated
with an OpenStack flavor that has more CPU cores and memory; its software image has
the complete Xilinx tools to build the bitstreams.

Figure 5.9: The experimental version of the SDAccel platform after modifications were made to support our infrastructure.
5.4.1 FPGA Application Region
The FPGA Application Region includes helper modules for the User Kernel to interface
directly with the network through the Ethernet interface. The helper modules are
responsible for filtering packets, formatting packets, and arbitrating for the network port.
The Application Region is shown in Figure 5.11.
Figure 5.10: How a multi-FPGA system is situated in our environment.
The configuration bus is used to configure the input and the output modules. These
signals are driven by the PCIe Module on the FPGA, which receives signals from the
PCIe-connected virtual CPU.
5.4.2 Input Module
All the packets that the FPGA receives via the Ethernet are forwarded to the Input
Module. The packets that are received at the network port follow the Ethernet packet
convention with a 14-byte header. On top of this we add our own protocol by appending
two bytes (the Kernel Address) to specify the destination kernel for the packet, as we
may have multiple kernels on the FPGA that are requesting input packets.
Figure 5.12 shows the protocol details used by our FPGA infrastructure. Each FPGA
in our infrastructure is assigned a MAC address within the SAVI infrastructure; the
process by which we obtain the MAC address is discussed in Section 5.6. The destination
MAC address should match the MAC address assigned to the particular FPGA. The
source MAC address is that of the FPGA, or of the virtual machine within SAVI that
is sending the FPGA data. The next two bytes, according to the Ethernet frame protocol,
are the ether-type, which we hardcode to 0x7400, and the last field is the address of the
kernel within the FPGA.

Figure 5.11: The details of the application region. The input and output modules are both configured by the configuration bus.

Figure 5.12: The Ethernet protocol plus our custom protocol to differentiate the kernels.
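Packing this header can be written down concretely. The following sketch builds a frame in the format just described; the MAC values used in examples are placeholders.

```python
import struct

ETHERTYPE = 0x7400  # hardcoded ether-type used by our protocol

def build_frame(dst_mac, src_mac, kernel_addr, payload):
    """14-byte Ethernet header + 2-byte kernel address, then the payload."""
    header = struct.pack("!6s6sHH", dst_mac, src_mac, ETHERTYPE, kernel_addr)
    return header + payload
```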
The Input Module consists of an Input Bridge and an Input Demultiplexer. The
Input Bridge is configured after the FPGA is programmed with the bitstream and before
the application can run. The Input Bridge acts as a firewall and converts an Ethernet
packet into an AXI stream packet. The Input Bridge's firewall is configured with the
MAC address assigned to the FPGA. The Input Bridge also drops the Ethernet header
and adds a dest field as part of the AXI stream, where the dest field corresponds to
the Kernel Address specified within the header. The Input Demultiplexer either outputs
to kernels on this FPGA that are expecting Ethernet input, or to kernels on a different
FPGA; in the latter case all packets matching the corresponding dest field are sent
straight to the Output Module. The input to the switch comes from both the Ethernet
module and all other user kernels that can output to any other kernel on the FPGA. An
example of an Input Module is shown in Figure 5.13. For details refer to Section 5.2.
Figure 5.13: The Input Module, consisting of the Input Bridge (labeled IB) and the Input Demultiplexer (labeled ID). In this example the dest fields 0x2 and 0x3 feed into different User Kernels on this FPGA, and 0x4 feeds into another FPGA by going through the Output Module.
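The bridge-plus-demultiplexer behavior can be summarized in a few lines. This Python sketch mirrors the description above, modeling only the data path, not the AXI handshaking:

```python
def input_bridge(frame, my_mac):
    """Firewall + Ethernet-to-AXI-stream conversion (simplified model)."""
    if frame[0:6] != my_mac:                     # firewall: wrong destination MAC
        return None                              # drop the frame
    dest = int.from_bytes(frame[14:16], "big")   # kernel address from the header
    return dest, frame[16:]                      # (AXI dest field, payload)

def input_demux(dest, payload, local_dests):
    """Route to a local kernel, or straight to the Output Module otherwise."""
    target = "kernel" if dest in local_dests else "output"
    return target, dest, payload
```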
5.4.3 Output Module
This module receives streams from the User Kernels and from the Input Demultiplexer.
The Output Module consists of Packet Formatters (PF) and an Output Switch. Each
stream (whether from the User Kernels or from the Input Module) needs a Packet
Formatter before it can be sent out to the network. Each stream is formatted with the
appropriate MAC headers: the source MAC address is that of the FPGA, and the
destination MAC address is that of the destination FPGA or virtual machine. The
ether-type is 0x7400, as it was in the input stream, and the dest of the stream is appended
to the header of the packet. All the Packet Formatters feed into an Output Switch that
arbitrates using the last field of the AXI stream; the Output Switch uses a round-robin
scheduling algorithm. The Output Module is shown in Figure 5.14. The input to the
Packet Formatter is an AXI stream with a dest field; the formatter uses the dest field
as the kernel address when outputting to the network.
Figure 5.14: The Output Module for two streams, consisting of a Packet Formatter (labeled PF) for each stream that needs to be output.
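The round-robin arbitration on the last flag can be modeled in a few lines. This sketch treats each formatted stream as a list of (data, last) flits and interleaves streams only at packet boundaries:

```python
from collections import deque

def output_switch(streams):
    """Round-robin over streams, switching only at packet boundaries
    (after a flit with the AXI-stream `last` flag has been sent)."""
    queues = [deque(s) for s in streams]
    out = []
    while any(queues):
        for q in queues:
            while q:                 # drain one whole packet from this stream
                data, last = q.popleft()
                out.append(data)
                if last:             # `last` asserted: yield to the next stream
                    break
    return out
```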
5.5 Scaling up FPGA Clusters
Nodes within the cluster can also be replicated without replicating the entire cluster.
Replicating a node within the cluster requires all nodes that fan in to that specific
node to include a Scheduler. The Schedulers currently support any-cast, which uses
a round-robin scheduler, or broadcast. Figure 5.15 shows how a node is replicated within
a cluster and where a Scheduler is inserted.
The Schedulers are also FPGA kernels. If the replicated kernels span multiple FPGAs,
the Scheduler is placed on the FPGA with the most replications of that kernel, to reduce
latency for the more common case. For example, in Figure 5.15, if two of three replications
are on FPGA 1 and the other is on FPGA 2, then the script will place the Scheduler on
FPGA 1. The script will then create connections from the Scheduler to the replicated
nodes and one connection to the Output Module on FPGA 1. The remaining replicated
kernel will be connected to the Input Module on FPGA 2. Figure 5.16 illustrates this
scenario.

Figure 5.15: The replication of Node 2. The replicated nodes are Node 2 1, Node 2 2 and Node 2 3. Node 1 has a Scheduler that fans out to the replicated nodes.
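The two scheduling policies can be sketched as follows. This is a software model of the Scheduler's dispatch decision, not the hardware kernel itself; the replica names are illustrative.

```python
import itertools

class Scheduler:
    """Model of a replication Scheduler: any-cast round-robins packets
    across the replicas, while broadcast copies each packet to all of them."""
    def __init__(self, replicas, mode="any-cast"):
        self.replicas = list(replicas)
        self.mode = mode
        self._rr = itertools.cycle(self.replicas)

    def dispatch(self, packet):
        """Return the list of (replica, packet) deliveries for one packet."""
        if self.mode == "broadcast":
            return [(r, packet) for r in self.replicas]
        return [(next(self._rr), packet)]
```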
5.6 FPGA Software Drivers
Each virtual machine with an FPGA is responsible for sending control signals to the
FPGA. These control signals configure the Input Module and the Output Module with
the appropriate MAC addresses. We use software to configure the Input and Output
Modules because the alternative, encoding the MAC addresses in hardware, would
require resynthesizing FPGA bitstreams for different physical FPGAs when replicating
the cluster. Our approach gives us the option to generate our cluster with one set of
FPGAs and then replicate the cluster onto more FPGAs with the same bitstreams.

Figure 5.16: The physical configuration if Node 1, Node 2 1 and Node 2 2 are on FPGA 1 and Node 2 3 is on FPGA 2.
The software drivers can configure the Input Bridge and the Packet Formatters in
the hardware because the PCIe module in the hardware is a master (a driver of signals)
to these modules. This means that writing to a certain address on the PCIe module
can be used to send data to the Input Bridge or a Packet Formatter. We can write
to different addresses of the PCIe module with the HAL driver provided in the SDAccel
tool kit. When a virtual machine with an FPGA is booted, the software driver accepts
bitstreams; once a bitstream is received, the FPGA is programmed with the HAL, and
the Input Bridge and Packet Formatters are also configured by the HAL. Our justification
for providing the Packet Formatters as software-configurable blocks is scalability: if we
wish to scale up our cluster with more networked FPGAs, the MAC address of each
FPGA can be configured by software instead of synthesizing bitstreams on a per-FPGA
basis.
Each FPGA obtains a network connection by first receiving a network port from
the OpenStack networking service, Neutron. Each network port consists of a MAC
address and an IP address. This port is then registered with the physical port on the
network switch that has the FPGA connection. Our scripts can determine the physical
switch port of a particular FPGA connection by observing which physical server hosts
the virtual machine containing the PCIe-connected FPGA. In our setup we have one
FPGA per physical server; if this were to change we would need a new mechanism to
infer the physical network port of a particular FPGA. Once the port returned by Neutron
is registered with the physical port, the FPGA is accessible on the network from any
other device in the SAVI data center, including other virtual CPUs, IoT devices and
FPGA clusters.
5.7 Tool Flow
We summarize the use of our system by describing the tool flow. First, the user submits
a logical cluster description and an FPGA mapping file to a global FPGA parser;
eventually, these could be generated by a higher-level framework or application.
OpenStack calls are generated to create virtual machines: lightweight CPU virtual
machines connected to an FPGA, and one virtual machine dedicated to synthesizing
bitstreams. Subsequent OpenStack calls are generated to create network ports, each
with valid MAC and IP addresses. These ports are registered with the SAVI switch,
after which all packets sent to these addresses are forwarded to the right switch port.
After all the OpenStack calls are generated, the individual FPGA designs are synthesized
on the large virtual machine dedicated to synthesizing bitstreams. Once the bitstreams
are synthesized they are forwarded to the individual FPGAs to be programmed. Once
programmed, the Packet Formatters are configured by the FPGA software driver running
on the lightweight CPU attached to the FPGA via PCIe. After the user submits the
initial cluster description files, the rest of the calls are generated automatically by our
infrastructure.
5.8 Limitations of the Infrastructure
The main limitation of this version of the SDAccel platform is the lack of the partially
reconfigurable region offered in the default platform. Due to this limitation, each new
application region requires programming the entire FPGA. This reconfiguration turns off
the PCIe interface momentarily and, on a physical machine, requires a hard reboot to
be visible. In the context of an FPGA allocated to a virtual machine, rebooting the
entire physical server is not feasible, as there may be other virtual machines on that
physical server. Future SDAccel platforms will include the Ethernet in their standard
base and thus will have the application region within a partially reconfigurable region.
Our workaround is that instead of programming the FPGA via the software driver,
we have a separate machine dedicated to managing FPGA bitstreams. This machine
is physically connected via JTAG to each of the machines; this is necessary because
the partial reconfiguration flow is not available through the HAL. The bitstream server
is responsible for programming the FPGA. Furthermore, without rebooting the physical
machine, the PCIe interface is not available to configure the Input Bridge and Packet
Formatters, so this is also done via the JTAG/UART connection on the server. This is
a temporary workaround until partial reconfiguration is available for the SDAccel
application region in the platform.
Chapter 6
Evaluation
This chapter explores our results. First we quantify the resource overhead, latency
and throughput of our FPGA infrastructure. We then test a full application: a database
acceleration application. The designs are implemented on the Alpha Data 7V3 card,
which has the following specifications: a Xilinx Virtex 7 XC7VX690TFFG-1157 FPGA
(433200 LUTs, 866400 flip-flops, 1470 BRAM tiles), two 8 GB ECC-SODIMMs for
memory speeds up to 1333 MT/s, and dual SFP+ cages for high-speed optical
communication, including 10 Gigabit Ethernet.
Our network infrastructure connects the 10 GbE SFP+ ports to a network switch using
10 GbE to 1 GbE transceivers. The switch can support 10 GbE links, but due to the
1 GbE FPGA core in our FPGA hypervisor we have to use a 1 GbE link. The goal of
the evaluation is to demonstrate that our FPGA network modules add little throughput
overhead and very little latency overhead. The absolute latency and throughput numbers
are limited by the 1 GbE network connection, but the infrastructure we have built can
be used on 10 GbE, or better, systems, where we would expect these numbers to improve.
We also wish to highlight the scalability of our infrastructure with a case study,
demonstrating that by simply changing a directive in the script, our clusters can be
replicated with the throughput scaling accordingly.
6.1 Resource Overhead
The resource overhead from our infrastructure is shown in Table 6.1. Absolute numbers
are given with the percentage of the device total shown in brackets.
Table 6.1: Resource Overhead of our System

Hardware Setup                          LUTs             Flip-Flops       BRAM
SDAccel Base                            53346 (12.3 %)   64550 (7.45 %)   228 (15.5 %)
SDAccel Base with Ethernet Support      62344 (14.4 %)   76124 (8.79 %)   228 (15.5 %)
Input Module
  Input Bridge                          87 (0.02 %)      170 (0.019 %)    2 (1.36 %)
  Input Demultiplexer (16 outputs)      82 (0.019 %)     124 (0.014 %)    0 (0 %)
Output Module
  Ethernet FIFO Controller              26 (0.006 %)     12 (0.014 %)     2 (1.36 %)
  Output Switch (16 inputs)             517 (0.119 %)    138 (0.016 %)    0 (0 %)
  Packet Formatter (one per network
  output stream)                        230 (0.053 %)    252 (0.029 %)    2 (1.36 %)
Total Available                         433200           866400           1470
The SDAccel Base refers to the standard SDAccel environment that has no network
connection for the FPGA. The SDAccel Base with Ethernet Support includes a 1 Gb
Ethernet port. We can see that the addition of the Ethernet port requires only 2.1% of
the resources of the whole device. The Input Module is divided into a firewall (the Input
Bridge) and the input switch (the Input Demultiplexer). The size of the firewall is
independent of the number of network input streams, while the size of the input switch
depends on the number of streams; Table 6.1 shows the overhead corresponding to a
16-port switch. The Output Module is divided into the Ethernet FIFO Controller, the
Output Switch and the Packet Formatter. The Ethernet FIFO Controller overhead is
independent of the number of output streams. The Output Switch size, analogous to the
input switch size, depends on the number of output streams, as does the number of
Packet Formatters on the FPGA. It can be seen that the resource usage of the firewall,
the input and output switches and the Packet Formatters is small relative to the device.
6.1.1 Microbenchmarks
Our microbenchmarks consist of an application that is a direct connection between the
Input Module and the Output Module of an Application Region. The goal is to
show the overhead of our Input and Output Modules and to show that they can handle
packets at line rate, as all of the modules have single-cycle latency.
6.1.2 Micro-experiment Setup
For Microbenchmark 0 the CPU is directly connected to the FPGA. The CPU sends
packets to the raw network interface and the FPGA echoes them back. The packets
traverse the Input Module and the Application Region FIFO, and exit through the
Output Module back to the CPU. The CPU for this data point is not a virtual machine;
its specifications are as follows: Intel Xeon E5-2637 CPU at 3.5 GHz, four cores with
hyperthreading, 32 GB RAM.
Latency
The round-trip latencies are shown in Figure 6.2. There is no switch latency and no
virtualization overhead for Microbenchmark 0. After that point we notice a linear
progression as we increase the number of FPGAs: each extra FPGA on the path requires
two trips to the switch. When we compare this to the second iteration of the Microsoft
Catapult project, which also used network-connected FPGAs, our latency is on the
order of 20 times worse than their network [35]. This is mainly due to the 1 Gb/s
network module used in our current infrastructure.
Figure 6.1: (a) Microbenchmark 0 is a CPU directly connected to an FPGA (not through a network switch). (b) Microbenchmark 1 is a CPU connected through a network switch to an FPGA chain of length 1. (c) Microbenchmark 2 is a CPU connected through a network switch to an FPGA chain of length 2. (d) Microbenchmark 3 is a CPU connected through a network switch to an FPGA chain of length 3. Microbenchmarks 1 to 3 have a network hop (NH); each network hop travels to the network switch connected to all the FPGAs. Microbenchmark 0 does not use a virtualized CPU, whereas the others use virtual CPUs provisioned in SAVI.
An example path for a single FPGA is as follows:
1. Virtual CPU to switch
2. Switch to FPGA 1
3. FPGA 1 to switch
4. Switch to Virtual CPU
Figure 6.2: Round-trip latency observed across the microbenchmarks.
Throughput
Figure 6.3 shows the throughput for the different microbenchmarks. The red line is the
bandwidth limit of the network cable. The throughputs of Microbenchmarks 0 to 3 are
measured with the iperf tool [45], a network tool used to measure the throughput of
network connections. A CPU directly connected to the FPGA (CPU + FPGA) saturates
the network link, showing that our FPGA infrastructure can keep up at line rate. Next
we look at connecting a virtual machine to an FPGA chain of one, two and three FPGAs
(VM + n FPGA). We notice a drop in throughput because the virtual machine is a
weaker CPU than the directly connected CPU, and because of some virtualization
overhead. Since the FPGA is not the bottleneck, the throughput remains the same as
we increase the length of the FPGA chain from one to two to three. To further
demonstrate that the FPGA is not the bottleneck we look at two additional data points.
The first data point is two virtual machines connected in the SAVI network (VM + VM).
The throughput observed between two virtual machines is half that of a virtual machine
connected to an FPGA chain, because the data enters the software and network stack
twice (once on each machine). The second data point is the calculated throughput within
the FPGA (Internal FPGA B/W). The internal FPGA bandwidth is 4 Gb/s, much
higher than the network link rate. The internal FPGA throughput is calculated by
multiplying the bus width, which is 4 bytes wide, by the clock speed, which is 125 MHz.
The network switch is designed to switch at 4G rates and therefore is not the bottleneck
of our system.
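The internal bandwidth figure follows directly from the bus width and clock:

```python
bus_width_bytes = 4        # 32-bit internal AXI stream data path
clock_hz = 125_000_000     # 125 MHz fabric clock

internal_bw_bits_per_s = bus_width_bytes * 8 * clock_hz
print(internal_bw_bits_per_s / 1e9, "Gb/s")  # 4.0 Gb/s
```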
Both the Input and Output Modules work with single-cycle latency. The Input Module
needs a four-cycle warm-up period before it bursts the rest of the packet, and the
Output Module requires a five-cycle warm-up period. These warm-up periods are
accommodated with additional FIFOs, which add to the latency but do not affect the
throughput.
Figure 6.3: Throughput observed across the microbenchmarks.
6.1.3 Application Case-study
Our application case study is a database query accelerator. Several works, such as [46, 47],
have shown FPGAs are a good target for such applications, as they can provide
low-latency, high-throughput processing. Furthermore, frameworks such as Apache Drill
have shown that distributed clusters are a good way to accelerate database services [48].
The combination of these observations suggests that a distributed FPGA cluster is ideal
for a database query accelerator.

The application we have built is a naive implementation of a query, broken down
into several sub-queries. Even though it is a naive implementation, the purpose of the
infrastructure is to show that laying out the circuit is easy, and so is replicating that
circuit (changing one number in the logical cluster file).
6.1.4 Query Implementation Details
The query is composed of five streaming components connected as a chain:

1. SQL Read: This component is responsible for reading SQL columns and outputting
the data in a format that enables the rest of the components to process the data.

2. SQL Where: This operation is used to match column predicates and values with
respect to a boolean operation (equal, greater than, less than, etc.).

3. SQL Like: This operation is used on string column data to match a string using
a substring.

4. SQL Group: This operation aggregates different records using a grouping operation,
such as counting.

5. SQL Write: This component is responsible for separating the stream coming out
of SQL Group into columns.
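To make the chain concrete, here is a toy software model of the five stages as Python generator and aggregation steps. The real engines are streaming hardware kernels; the column names and predicates below are invented, and SQL Write is folded into returning the grouped columns.

```python
import operator

def sql_read(rows):                 # 1. format raw column data for the chain
    yield from rows

def sql_where(rows, col, op, val):  # 2. boolean predicate on a column
    return (r for r in rows if op(r[col], val))

def sql_like(rows, col, sub):       # 3. substring match on a string column
    return (r for r in rows if sub in r[col])

def sql_group_count(rows, col):     # 4./5. group by a column, count per group
    counts = {}
    for r in rows:
        counts[r[col]] = counts.get(r[col], 0) + 1
    return counts

def query(rows):
    s = sql_read(rows)
    s = sql_where(s, "qty", operator.gt, 1)
    s = sql_like(s, "name", "wid")
    return sql_group_count(s, "name")
```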
Figure 6.4 shows how the streaming components are connected to form a single query
engine. Our infrastructure allows us to easily replicate the query processing engines,
even across multiple FPGAs. When considering the number of processing engines, we
first observe the resource usage of one replication of this processing engine, shown in
Table 6.2.
Figure 6.4: The sub-components chained together as one query processing engine.

Table 6.2: Resource Overhead of a single Query Processing Engine

Characteristic   Total amount   Percentage of FPGA
LUTs             11561          2.669 %
Flip Flops       17176          1.982 %
Block RAM        504            34.286 %

The Block RAM utilization limits our replication, so we are limited to two query
processing engines per FPGA. In our logical FPGA cluster file we would specify this
as six replications (a maximum of two replications per FPGA, with three FPGAs), and
in our FPGA mapping we would divide the kernel nodes onto three FPGAs. We do the
replication with a Scheduler. The Scheduler is located on one FPGA and forwards the
data either to the replicated engines on the same FPGA or to another FPGA: the user
sends all the data to one destination, and the Scheduler is responsible for forwarding
the data to the appropriate query processing engine. The first FPGA has the Scheduler
connected to two replicated query processing engines. The second and third FPGAs
also have two replicated query processing engines each, connected directly to the Input
Module rather than to a Scheduler. The Scheduler on the first FPGA is responsible for
scheduling work to all six replicated query processing engines across the three FPGAs.
This makes it simpler for the user, since they do not have to change their interface to
the cluster as they change the number of replications.
This Scheduler-based approach is the model that we used in our experiment, as the
user application remained the same from one replication all the way up to six replications
across three FPGAs. The first FPGA in this cluster is shown in Figure 6.6; the second
and third FPGAs look like the FPGA in Figure 6.5.
Figure 6.5: One FPGA with two entire clusters replicated.
Figure 6.6: One FPGA with two entire clusters replicated and the scheduler.
6.1.5 Case Study Evaluation
Our evaluation compares the throughput of one replication versus six replications across
three FPGAs. As expected, Figure 6.7 shows that the throughput increases as the
replications increase, and we expect it to continue increasing until it reaches the
maximum of the FPGA chains observed earlier, at about 240 Mb/s. This would be at
about 12 replications, which would require six FPGAs. The throughput limit of 240 Mb/s
is due to the speed of the CPU inputting table data into the FPGA chain. With a faster
CPU we could theoretically saturate the network cable's limit of 1 Gb/s, which in turn
could be raised with a faster network.
Figure 6.7: Throughput of a query processing engine
Chapter 7
Conclusion
The ability to provision FPGA clusters will become essential if systems like the Microsoft
Catapult project are to become more generally accessible. Our infrastructure provides
a lightweight cluster provisioning tool that, given a logical cluster description and an
FPGA mapping, can generate scalable clusters from a heterogeneous cloud. Moreover,
these clusters are connected to the network as network devices ready to interact with
other network devices. Our infrastructure makes it easy to scale up: with a simple
pragma we saw throughput scale almost linearly from one to six replicated processing
units in our database acceleration case study. With this success, our approach is seen to
work, but there is much that can be done to improve this first step.
7.1 Future Work
This section describes the future work that we plan to explore. This includes short-term
goals such as physical infrastructure upgrades and reliability protocol upgrades, and
lastly the implementation of true virtualization.
7.1.1 Physical Infrastructure Upgrades
The limitations of our experiments �rst come from physical infrastructure limitations. A
few infrastructure upgrades that we plan to address in the short-term are:
1. Upgrade the 1G physical Ethernet links to 10G. This will involve upgrading the FPGA IP
in the SDAccel shell from the 1G core to the 10G core, which will in turn result in
better latency and throughput in our applications. The additional infrastructure we
introduced should scale to these cores, as all of our cores have a single-cycle
latency.
2. Add more physical FPGAs to the network. This will include more of the same
FPGA as well as other types of FPGAs. Our infrastructure should port easily to other
platforms, as our input and output modules use simple AXI streams and should
be able to interface with any Ethernet module that uses an AXI stream interface.
7.1.2 Scalability and Reliability
This subsection explores how to enforce reliability on a network with more nodes and
nodes that are many network hops away. This includes reliability on the network and
reliability for the compute nodes.
Networking Scalability and Reliability
Our infrastructure builds on top of raw Ethernet frames with an additional two bytes
to address specific FPGA kernels within an FPGA. This is a lightweight transmission
protocol, and it is suitable in our small environment where all the FPGAs are connected
to the same network switch. There is at most one network hop between FPGAs, and
the CPU virtual machines that we communicate with are on the same network edge
within the SAVI infrastructure, so there is also at most one network hop between VMs and FPGAs.
At a larger scale, however, reliability becomes a concern: we can expect corrupted,
dropped, duplicated, or out-of-order packets. Furthermore, the user has to limit
packet sizes to less than 1536 bytes, the physical limit for packets in the data-link
layer (the raw Ethernet frame limit). These issues can be alleviated by building on
top of the network stack.
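To make the kernel-addressing scheme concrete, the following Python sketch shows how a sender might assemble one of these raw frames. Only the 0x7400 ether-type, the two kernel-address bytes, and the 1536-byte frame limit come from the system described above; the function and constant names are hypothetical.

```python
import struct

ETHERTYPE_KERNEL = 0x7400   # ether-type hard-coded by our framework
MAX_FRAME = 1536            # data-link layer (raw Ethernet) frame limit

def build_kernel_frame(dst_mac: bytes, src_mac: bytes,
                       kernel_id: int, payload: bytes) -> bytes:
    """Build a raw Ethernet frame addressed to one FPGA kernel.

    The two bytes after the ether-type select the kernel within the
    destination FPGA (the "kernel layer" of Figure 7.1).
    """
    frame = (dst_mac + src_mac
             + struct.pack("!H", ETHERTYPE_KERNEL)   # ether-type, big-endian
             + struct.pack("!H", kernel_id)          # kernel-address bytes
             + payload)
    if len(frame) > MAX_FRAME:
        raise ValueError("payload exceeds raw Ethernet frame limit")
    return frame
```

A transport layer that fragments large payloads, as discussed next, would remove the need for the size check at the end.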
Figure 7.1: The proposed network stack that builds on top of the standard network stack.
Our current system implements most of the stack shown in Figure 7.1. The physical
layer currently consists of 1G Ethernet cables, which we wish to upgrade to 10G cables once we
upgrade the FPGA core. The data-link and network layers are currently handled by a
combination of OpenStack and SAVI's network registration system. Our call to OpenStack
(through OpenStack's networking API, Neutron) gives us the IP and MAC address, which
we then register to a physical port on the network switch in SAVI. This uses Software
Defined Networking to update the routing tables within our network so that packets
addressed to the IP and MAC address returned by Neutron are routed to the registered
physical port. Our custom layer, the kernel layer, refers to the extra two bytes that are
used to address a particular kernel within the FPGA. The layer that we are missing
is the transport layer, which is where we can implement network reliability as we scale
to larger networks. Figure 7.2 shows where the transport-layer module would fit. In this
example we support three transport-layer protocols, but this can be modified depending
on the application and the amount of FPGA resources we wish to use. The
ether-type within a raw Ethernet frame (currently hard-coded to 0x7400) would
multiplex the packet to the appropriate transport layer.
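In software terms, the multiplexing decision the input module would make can be sketched as a lookup on the ether-type field. The handler names and the set of supported ether-types below are illustrative; only 0x7400 is fixed by the current system.

```python
# Hypothetical handler table: which transport-layer module should
# receive a frame, keyed by its ether-type. Only 0x7400 (the raw
# kernel protocol) is fixed by the current system; the rest are
# placeholders for future transport layers.
HANDLERS = {
    0x7400: "raw-kernel",  # current lightweight kernel protocol
    0x0800: "ip",          # IPv4 frames carrying TCP or UDP segments
}

def dispatch(frame: bytes) -> str:
    """Return the transport-layer handler for a raw Ethernet frame.

    Bytes 12-13 of an Ethernet frame hold the ether-type; frames
    with an unknown ether-type are dropped.
    """
    ethertype = int.from_bytes(frame[12:14], "big")
    return HANDLERS.get(ethertype, "drop")
```

On the FPGA this dispatch would be a small comparator in the input module rather than a table lookup in software.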
Figure 7.2: The input module modified to include a transport layer.

    Characteristic   Total amount   Percentage of FPGA
    LUTs             36419          8.41 %
    Flip Flops       35588          4.11 %
    Block RAM        392            26.7 %

Table 7.1: Resource overhead of the TCP transport layer on the FPGA

    Characteristic   Total amount   Percentage of FPGA
    LUTs             74             0.017 %
    Flip Flops       72             0.00831 %
    Block RAM        0              0 %

Table 7.2: Resource overhead of the UDP transport layer on the FPGA

TCP and UDP are two transport layers used by many networking applications.
Both will fragment large packets for the user as needed; this removes the
small-packet-size restriction imposed by using raw Ethernet frames directly
within our network. TCP provides reliable transmission, as it handles the
retransmission of packets on a packet drop, whereas UDP is connectionless and does not
retransmit, so it does not provide the reliable connection that TCP does. We have
example TCP and UDP cores implemented on the FPGA, and their overheads are shown
in Tables 7.1 and 7.2.
Implementing TCP and/or UDP will allow the FPGA clusters to interface directly
with distributed applications that use the same transport layer. Examples
are adding a node to a distributed file system that uses TCP (e.g., the Hadoop Distributed
File System [49]), or using UDP for multimedia applications such as Voice over IP [50].
These are two transport layers that are popular in distributed applications,
but we are not limited to them. There has also been research into implementing custom
transport layers for data centers. This involves exploiting assumptions about the
data-center environment to provide a lightweight (at least relative to
TCP) reliable transport layer. Some examples of such work are [51, 52, 35].
Scalability and Reliability of Compute Nodes
Upon scaling the cluster, we should expect compute nodes to fail [15]. In the
conventional CPU domain this can happen for many reasons, such as CPU power failures, disk
failures, and memory failures. Failures can also be due to network link failures,
router failures, or network congestion. Some of these failures can be addressed with the
implementation of a reliable transport layer, but not all. For example, reliability in the
transport layer guarantees the delivery of a packet as long as a path to the node
exists; however, if a network path is destroyed and no alternate path exists, the
resulting failure must be handled by the application. Our FPGA clusters
can experience failure for many reasons, such as bitstream corruption and memory failure,
and sometimes an FPGA can be stuck in an unforeseen state.
To ensure reliability we need to monitor our FPGA clusters. This will require an agent
process that alerts our provisioning system about the "health" of each FPGA. The agent
process can run on the FPGA Hypervisor and send heart-beats to our provisioning
system, notifying the cloud system manager of the FPGA's health. This is analogous to how CPU
servers are managed with OpenStack [27]. These heart-beats can help the cloud-managing
software determine whether an FPGA is ready to be provisioned. However, failures
can also occur after provisioning: when an unrecoverable failure happens to
an FPGA in the cluster, the application can fail. To mitigate this in a
distributed CPU application, redundant compute devices can be used to replace failed
compute nodes [53]. We can introduce redundancy in an FPGA cluster as well by
over-provisioning FPGA devices in the cluster and running compute tasks in parallel on the
redundant nodes. This can, however, be wasteful, as the extra redundant resources are
provisioned by the cloud even when they are unnecessary. An example of redundant provisioning
is shown in Figure 7.3.
(a) The original clusters before failure. (b) The cluster after node 2 fails: the traffic is redirected to the redundant instantiation. Note that the outgoing traffic is also sent to the redundant FPGA 3 to maintain FPGA 3's state.

Figure 7.3: How traffic is duplicated to a redundant cluster to maintain its state, and how traffic is redirected from a node in the original cluster to a node in the redundant cluster upon failure.
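A minimal sketch of such a monitoring agent is shown below, assuming a UDP channel to the provisioning system and a hypothetical `check_health()` hook into the FPGA Hypervisor; none of these names or message fields come from the actual system.

```python
import json
import socket
import time

def check_health() -> bool:
    # Placeholder: a real agent would query the FPGA shell or driver
    # for bitstream, memory, and link status.
    return True

def make_heartbeat(fpga_id: str, healthy: bool) -> bytes:
    """Encode one heart-beat message for the provisioning system."""
    return json.dumps({"fpga": fpga_id,
                       "healthy": healthy,
                       "time": time.time()}).encode()

def run_agent(fpga_id: str, manager_addr, period_s=5.0, max_beats=None):
    """Send periodic heart-beats over UDP from the FPGA Hypervisor."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    while max_beats is None or sent < max_beats:
        sock.sendto(make_heartbeat(fpga_id, check_health()), manager_addr)
        sent += 1
        time.sleep(period_s)
```

The provisioning system would mark an FPGA as failed after missing several consecutive heart-beats, mirroring how OpenStack tracks CPU server liveness.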
An alternative to over-provisioning FPGA clusters is to provision the extra compute
nodes after a failure. This first requires monitoring FPGA health, and furthermore
requires saving the context on the failed FPGA and migrating that context to a new FPGA.
The migration has a time cost for provisioning and programming the FPGA, which
over-provisioning minimizes. This trade-off between over-provisioning and provisioning
on demand has yet to be investigated.
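As a first-order way to reason about this trade-off, one can compare the expected downtime under each policy; the model and all the numbers below are purely illustrative, not measurements from our system.

```python
def expected_downtime_s(failure_rate_per_hr: float, hours: float,
                        recovery_s: float) -> float:
    """Expected downtime from node failures over one run.

    Over-provisioning makes recovery near-instant (traffic is simply
    redirected to the redundant node), while on-demand provisioning
    pays the full provision-and-program latency on every failure.
    """
    expected_failures = failure_rate_per_hr * hours
    return expected_failures * recovery_s

# Illustrative only: a 10-hour job, one failure per 100 hours, and
# roughly 5 minutes to provision and program a replacement FPGA.
on_demand = expected_downtime_s(0.01, 10, 300)  # pay reprogramming cost
redundant = expected_downtime_s(0.01, 10, 1)    # near-instant redirect
```

Against this downtime saving one would weigh the cost of the extra FPGAs held idle by the redundant cluster, which is where the trade-off lies.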
7.1.3 FPGA Cluster Debugging
The standard FPGA design flow includes simulating individual circuits, integrating the
circuits into larger systems, simulating the larger systems with a testbench, and then
implementing the design on the FPGA. Even then there can still be problems within the
circuit, so probes are inserted into the FPGA design; these interface with specific
debugging tools that notify the user when a certain signal in the circuit takes a certain
value, or when a Boolean function of several signals does. This allows the user to debug
these values in real time.
The design flow in our environment should be similar, as we do not modify the circuits
provided by the user. Assuming these circuits are fully simulated, we can then integrate
them into our multi-FPGA cluster. However, once a circuit is in the cluster we do not have the
unified debugging view we would have on a single FPGA. One alternative for now is to
run the probing tool (Altera's SignalTap or Xilinx's ChipScope [54, 55]) on each FPGA
in the cluster. This can be cumbersome, especially for very large clusters. Furthermore, a
user would have to navigate through a lot of automatically generated hardware that was
initially abstracted away (inter-FPGA connections, schedulers, switches).
Another area of future work is to provide a unified debugging interface for such
clusters. One possible implementation is to forward local probing information
from the individual FPGA environments to a centralized view of the cluster. The user
can be given the option to view the cluster as a logical cluster or even as a physical cluster.
The logical-cluster view would abstract away all the automatically generated hardware and the
FPGA mapping of the kernels. This possible implementation is shown in Figure 7.4. In
this example the user has a global view of the logical cluster; the global debugger
presents information to the user that it gathers from the local FPGA debugging tools
on the individual FPGAs.
Figure 7.4: An example multi-FPGA cluster that is attached to a debugger.
7.1.4 True FPGA Virtualization
Section 2.8 highlights our level of abstraction and the differences between true
virtualization and what we provide. Our level of abstraction does not hide the physical
details of the hardware, as we require an FPGA mapping. Our first step towards
true virtualization is to abstract away these mappings by generating them
automatically. This would allow the user to provision a cluster purely from logical FPGA kernel
connections. This is a multi-FPGA placement problem. Several works have looked into
placing circuits across multiple FPGAs [56, 57]. These works consider FPGAs on the
same die or on the same board. They model the inter-FPGA I/O, which the FPGA
placement and routing tools then take into consideration while placing and routing a
user circuit. We can take a similar approach, but we would have to model the
network connection. This involves modeling the FPGA hardware blocks (the input and
output modules) that we append to allow multiple-FPGA connections, as well as the network
switches in the data center. Such a tool would, for example, try to place FPGA kernels that are
tightly coupled on the same FPGA, or on FPGAs as close together as possible within the
data-center network. This level of service provisioning is quite analogous to the Software
as a Service model provided by cloud managers: the user requests a software application,
and the underlying physical hardware is provisioned and managed by a cloud manager.
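As a simple illustration of automatic mapping, the greedy heuristic below assigns kernels to FPGAs while preferring devices that already hold a kernel's neighbours. A real tool would also weight the network distance between FPGAs and the cost of the input and output modules; this sketch and all of its names are hypothetical, not part of our infrastructure.

```python
def greedy_place(kernels, edges, capacity):
    """Greedily assign kernels to FPGAs.

    Prefers the FPGA that already holds the most neighbours of a
    kernel, so tightly coupled kernels stay on one device and off
    the data-center network. New FPGAs are opened on demand.
    """
    placement = {}  # kernel -> fpga index
    loads = {}      # fpga index -> number of kernels placed
    for k in kernels:
        # Candidate FPGAs: every opened device plus one fresh one.
        candidates = list(loads.items()) + [(len(loads), 0)]
        best, best_score = None, -1
        for fpga, load in candidates:
            if load >= capacity:
                continue  # this FPGA is full
            # Count already-placed neighbours of k on this FPGA.
            score = sum(1 for a, b in edges
                        if (a == k and placement.get(b) == fpga) or
                           (b == k and placement.get(a) == fpga))
            if score > best_score:
                best, best_score = fpga, score
        placement[k] = best
        loads[best] = loads.get(best, 0) + 1
    return placement
```

With two kernel pairs and a capacity of two kernels per FPGA, each pair lands on its own device, keeping the coupled connections off the network.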
Another model of virtualization that can be built on top of this infrastructure is
Infrastructure as a Service. This is similar to the CPU provisioning provided by OpenStack,
where the size of the processor and its peripherals are specified and then
mapped onto a physical resource. In our infrastructure, we can provide different flavors
of FPGA sizes that abstract away the fact that these are actually multi-FPGA
clusters. Our goal here is to hide the physical implementation of the logical FPGA the
user requests: both the number of FPGAs actually used to create the
logical FPGA and the type of FPGA. We can create heterogeneous clusters, stitching
different kinds of FPGAs together into large clusters that form large logical
FPGAs. Once we provide the large FPGA (comprised of multiple FPGAs) to the user,
we will have a problem similar to the Software as a Service model, where we have to
map the kernels onto the physical FPGAs beneath the virtual FPGA. Furthermore, if the
kernels do not fit, kernels will have to be context-switched and swapped in and
out of the FPGAs.
Bibliography
[1] InformationWeek. Big Data, Analytics Market To Hit $203 Bil-
lion In 2020. https://www.informationweek.com/big-data/
big-data-analytics-market-to-hit-$203-billion-in-2020-/d/d-id/
1327092, 2016.
[2] ApCon. The Case for Scalability in Large Enterprise Data Centers.
https://www.apcon.com/sites/default/files/Resources%20for%20Download/
apcon_ebook_4_april_2014.pdf, 2014.
[3] Amazon Web Services Inc. Amazon Web Services (AWS). http://aws.amazon.com,
2014.
[4] Microsoft Inc. Microsoft Azure. https://azure.microsoft.com, 2015.
[5] Andrew Putnam et al. A Reconfigurable Fabric for Accelerating Large-scale Datacenter
Services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International
Symposium on, pages 13–24. IEEE, 2014.
[6] Ian Kuon, Russell Tessier, and Jonathan Rose. FPGA architecture: Survey and
challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–253,
2008.
[7] Jason Luu, Ian Kuon, Peter Jamieson, Ted Campbell, Andy Ye, Wei Mark Fang,
Kenneth Kent, and Jonathan Rose. VPR 5.0: FPGA CAD and architecture exploration
tools with single-driver routing, heterogeneity and process scaling. ACM
Transactions on Reconfigurable Technology and Systems (TRETS), 4(4):32, 2011.

[8] IEEE Standard for Verilog Hardware Description Language. IEEE Std 1364-2005,
pages 1–560, 2006.

[9] IEEE Standard for VHDL Language Reference Manual. IEEE Std 1076-2008,
pages c1–626, 2009.
[10] Xilinx Inc. Vivado High Level Synthesis. https://www.xilinx.com/products/
design-tools/vivado/integration/esl-design.html, 2016.
[11] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona,
Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-level
Synthesis for FPGA-based Processor/Accelerator Systems. In International Symposium
on Field Programmable Gate Arrays, FPGA '11, pages 33–36, New York, NY,
USA, 2011. ACM.
[12] Xilinx Inc. SDAccel Development Environment. https://www.xilinx.com/
products/design-tools/software-zone/sdaccel.html, 2016.
[13] Intel Inc. Intel FPGA SDK. https://www.altera.com/products/
design-software/embedded-software-developers/opencl/overview.htmll,
2016.
[14] SAP Data Center. How a Data Center Works. http://www.sapdatacenter.com/
article/data_center_functionality/, 2016.
[15] Albert Greenberg, James Hamilton, David A. Maltz, and Parveen Patel. The cost
of a cloud: research problems in data center networks. ACM SIGCOMM Computer
Communication Review, 39(1):68–73, 2008.
[16] Tech Republic. How Power Works in a Data Center: What
you Need to know. http://www.techrepublic.com/article/
how-power-works-in-a-data-center-what-you-need-to-know/, 2014.
[17] Data Center Knowledge. World's Largest Data
Centers. http://www.datacenterknowledge.com/
special-report-the-worlds-largest-data-centers/
worlds-largest-data-center-350-e-cermak/, 2016.
[18] IBM Inc. What is Cloud Computing. https://www.ibm.com/cloud-computing/
learn-more/what-is-cloud-computing, 2016.
[19] Alberto Leon-Garcia and Indra Widjaja. Communication networks. McGraw-Hill,
Inc., 2003.
[20] Nick McKeown. Software-Defined Networking. INFOCOM Keynote Talk, 2009.
[21] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson,
Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: Enabling Innovation
in Campus Networks. ACM SIGCOMM Computer Communication Review,
38(2):69–74, 2008.
[22] Stuart Byma, Naif Tarafdar, Talia Xu, Hadi Bannazadeh, Alberto Leon-Garcia,
and Paul Chow. Expanding OpenFlow Capabilities with Virtualized Reconfigurable
Hardware. In FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pages 94–97, 2015.
[23] Feng Xia, Laurence T Yang, Lizhe Wang, and Alexey Vinel. Internet of things.
International Journal of Communication Systems, 25(9):1101, 2012.
[24] Jesse M. Shapiro. Smart cities: quality of life, productivity, and the growth effects
of human capital. The Review of Economics and Statistics, 88(2):324–335, 2006.
[25] Joon-Myung Kang et al. SAVI Testbed: Control and Management of Converged Virtual
ICT Resources. In IFIP/IEEE International Symposium on Integrated Network
Management, pages 664–667. IEEE, 2013.
[26] Stuart Byma et al. FPGAs in the Cloud: Booting Virtualized Hardware Accelerators
with OpenStack. In Field-Programmable Custom Computing Machines (FCCM).
IEEE, 2014.
[27] Omar Sefraoui et al. OpenStack: Toward an Open-Source Solution for Cloud Computing.
In International Journal of Computer Applications, 2012.
[28] OpenStack Inc. Welcome to Nova's developer documentation! http://docs.
openstack.org/developer/nova/, 2016.
[29] OpenStack Inc. OpenStack Networking (neutron). http://docs.
openstack.org/icehouse/install-guide/install/apt/content/
basics-networking-neutron.html, 2016.
[30] K. Fleming, Hsin-Jung Yang, M. Adler, and J. Emer. The LEAP FPGA operating
system. In Field Programmable Logic and Applications (FPL), pages 1–8, 2014.
[31] Fei Chen et al. Enabling FPGAS in the Cloud. In Computing Frontiers, 2014.
[32] KVM. Kernel Virtual Machine. http://www.linux-kvm.org, 2015.
[33] Maxeler Technologies. MPC-X Series. https://www.maxeler.com/products/
mpc-xseries, 2015.
[34] IBM Research. OpenPOWER Cloud: Accelerating Cloud Computing. https://
www.research.ibm.com/labs/china/supervessel.html, 2016.
[35] Adrian Caulfield et al. A Cloud-Scale Acceleration Architecture. In Proceedings of
the 49th Annual IEEE/ACM International Symposium on Microarchitecture, October
2016.
[36] Amazon. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/
instance-types/f1/, 2016.
[37] OpenStack Inc. OpenStack Orchestration. http://docs.openstack.org/wiki/
Heat/, 2016.
[38] Apache Software Foundation. Apache Mesos. https://mesos.apache.org, 2015.
[39] Andy Yoo, Morris Jette, and Mark Grondona. SLURM: Simple Linux Utility for
Resource Management. In Job Scheduling Strategies for Parallel Processing, pages
44–60. Springer Berlin Heidelberg, 2003.
[40] NVidia Inc. NVidia Cuda Zone, Cluster Management Library. https://developer.
nvidia.com/cluster-management, 2015.
[41] Joon-Myung Kang, T. Lin, H. Bannazadeh, and A. Leon-Garcia. Software-Defined
Infrastructure and the SAVI Testbed. In TRIDENTCOM, 2014.
[42] The Khronos Group. OpenCL Standard. https://www.khronos.org/opencl/,
2015.
[43] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee.
SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In Proceedings
of the 26th ACM International Conference on Supercomputing, pages 341–352. ACM,
2012.
[44] Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia,
and Paul Chow. Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud
Data Center. In International Symposium on Field-Programmable Gate Arrays.
ACM, February 2017. To appear.
[45] Iperf. Iperf – The TCP/UDP Bandwidth Measurement Tool. https://iperf.fr,
2014.
[46] Christopher Dennl, Daniel Ziener, and Jürgen Teich. On-the-fly composition of
FPGA-based SQL query accelerators using a partially reconfigurable module library.
In Field Programmable Custom Computing Machines (FCCM), pages 45–52, 2012.

[47] Christopher Dennl et al. Acceleration of SQL Restrictions and Aggregations
through FPGA-Based Dynamic Partial Reconfiguration. In Field Programmable
Custom Computing Machines (FCCM), pages 25–28, 2013.

[48] Michael Hausenblas and Jacques Nadeau. Apache Drill: interactive ad-hoc analysis
at scale. Big Data, 1(2):100–104, 2013.
[49] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The
Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST), pages 1–10. IEEE, 2010.

[50] Sachin Garg and Martin Kappes. An experimental study of throughput for UDP
and VoIP traffic in IEEE 802.11b networks. In Wireless Communications and
Networking, 2003 (WCNC 2003), volume 3, pages 1748–1753. IEEE,
2003.

[51] Sang-Woo Jun, Ming Liu, Shuotao Xu, et al. A transport-layer network for distributed
FPGA platforms. In 2015 25th International Conference on Field Programmable
Logic and Applications (FPL), pages 1–4. IEEE, 2015.

[52] David Sidler, Zsolt István, and Gustavo Alonso. Low-latency TCP/IP stack for data
center applications. In Field Programmable Logic and Applications (FPL), 2016 26th
International Conference on, pages 1–4. EPFL, 2016.
[53] Mirantis. Understanding your options: Deployment topologies for High
Availability (HA) with OpenStack. https://www.mirantis.com/blog/
understanding-options-deployment-topologies-high-availability-ha-openstack/,
2012.
[54] Altera Corporation. SignalTap II Embedded Logic Analyzer, 2006.

[55] Xilinx Inc. ChipScope Pro 11.1 Software and Cores User Guide, April 2009.
[56] Kalapi Roy-Neogi and Carl Sechen. Multiple FPGA partitioning with performance
optimization. In Proceedings of the 1995 ACM Third International Symposium on
Field-Programmable Gate Arrays, pages 146–152. ACM, 1995.

[57] Nam Sung Woo and Jaeseok Kim. An efficient method of partitioning circuits for
multiple-FPGA implementation. In Proceedings of the 30th International Design
Automation Conference, pages 202–207. ACM, 1993.