Building and Using Virtual FPGA Clusters in Data Centers
by
Naif Tarafdar
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2017 by Naif Tarafdar
Abstract
Building and Using Virtual FPGA Clusters in Data Centers
Naif Tarafdar
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017
This thesis presents a framework for creating network FPGA clusters in a heterogeneous
cloud data center. Our main objective is to abstract away the details of creating inter-
FPGA fabrics, by automating the FPGA network connections and the networking to
connect multiple FPGA clusters together. The FPGA clusters are created using a logical
kernel description describing how a group of FPGA kernels are to be connected, and an
FPGA mapping file. This work lastly looks at acquiring FPGAs as virtual resources from
the data center using the cloud management software OpenStack. This work first partitions
the user circuit onto multiple FPGAs using a user-specified mapping, creates the FPGA
fabric for inter-FPGA connection, generates the OpenStack calls to reserve the compute
devices, creates the network connections, generates the bitstreams, programs the devices,
and configures the devices with the appropriate MAC addresses, creating a ready-to-use
network device that can interact with any other network device in the data center.
Acknowledgements
I would like to thank my supervisor Professor Paul Chow. The completion of this thesis
and the many valuable life skills I have acquired in the process can be attributed to his
guidance, and his patience to work with me. I have learnt the value of humility through
the good times and determination and discipline through hard times.
I would also like to thank my Krav Maga instructor Steven Tierney who has taught
me the valuable lesson of persevering through practice and training through his famous
saying: "If you train like a cupcake you will fight like a cupcake."
I would like to thank my friends (in alphabetical order) Alvi Salahuddin, Ankita
Sinha, Cassandra Kardos, Owais Khan, Rajsimman Ravichandiran, Sara Chung, Thanus
Mohanarajan and Vanessa Courville. Over the past couple of years, especially the last few
months while I wrapped up my thesis, I have been very busy, and I want to thank you
for your patience, support and love through these times.
My parents, Shafique Tarafdar and Tasnin Tarafdar. Many of the lessons you
taught me growing up have shaped who I am today, and I would not be here today
if not for that.
Also to my wonderful colleagues (also in alphabetical order) Andrew Shorten, Charles
Lo, Daniel Rozkho, Daniel Ly-Ma, Ehsan Ghasemi, Eric Fukuda, Fernando Martin Del
Campo, Jasmina Vasiljevic, Jin Hee Kim, Joy Chen, Joshua San Miguel, Julie Hsiao,
Justin Tai, Karthik Ganesan, Mario Badr, Nariman Eskandari, Roberto Dicecco, Sanket
Pandit, Shehab Elsayed, Vincent Mirian and Xander Chin. You have all helped con-
tribute to the wonderful work environment in PT-477. This thesis would not be possible
without all of you.
I would also like to thank the SAVI team who has helped me a lot over the years.
Professor Alberto Leon-Garcia, Hadi Bannazadeh and Thomas Lin. You have all contributed
a large part to this work and I look forward to continuing to work with all of you
in the years to come.
I would like to thank Kenneth Samuel. You have been like family over the past few
years. We have gone through the hardships of engineering and have travelled the world.
Throughout it all you have kept me honest while always encouraging me to reach my full
potential.
Lastly I would like to thank my sister Nawar Tarafdar. Over the past 18 years you
have been my best friend, and none of this would be possible without you. You helped
me focus when I needed to but also helped take my mind off the stresses of life when I
needed it the most. Thank you.
Contents
1 Introduction
  1.1 Motivation
  1.2 Goal
  1.3 Contributions
  1.4 Overview
2 Background
  2.1 Field-Programmable Gate Arrays
  2.2 Cloud Computing and Data Centers
  2.3 Network Stack
  2.4 Software-Defined Networking
  2.5 Internet-of-Things
  2.6 Smart Applications on Virtualized Infrastructure (SAVI) Testbed
    2.6.1 OpenStack
  2.7 Related Work
    2.7.1 FPGA Virtualization
    2.7.2 Cloud Cluster Management Tools
  2.8 Level of Abstraction
3 Base Infrastructure: Cloud Resources and FPGA Platform
  3.1 SAVI Infrastructure Modifications
    3.1.1 OpenStack Resource Manager
    3.1.2 PCIe Passthrough and OpenStack Image
    3.1.3 Networking Backend
  3.2 Xilinx SDAccel Platform
    3.2.1 OpenCL
    3.2.2 FPGA Hypervisor
  3.3 Design Flow for FPGA Development in the Cloud
    3.3.1 Extended Design Flow for Multi-FPGA Applications
4 Design Alternatives
  4.1 SnuCL
  4.2 Modifications for SnuCL OpenStack Support
  4.3 Cluster Orchestration
  4.4 Results
5 FPGA Network Cluster Infrastructure
  5.1 Logical View of Kernels
    5.1.1 Sub-Clusters
  5.2 Physical Mapping of the Kernels
  5.3 FPGA Infrastructure
  5.4 SDAccel Platform Modifications
    5.4.1 FPGA Application Region
    5.4.2 Input Module
    5.4.3 Output Module
  5.5 Scaling up FPGA Clusters
  5.6 FPGA Software Drivers
  5.7 Tool Flow
  5.8 Limitations of the Infrastructure
6 Evaluation
  6.1 Resource Overhead
    6.1.1 Microbenchmarks
    6.1.2 Micro-experiment Setup
    6.1.3 Application Case-study
    6.1.4 Query Implementation Details
    6.1.5 Case Study Evaluation
7 Conclusion
  7.1 Future Work
    7.1.1 Physical Infrastructure Upgrades
    7.1.2 Scalability and Reliability
    7.1.3 FPGA Cluster Debugging
    7.1.4 True FPGA Virtualization
Bibliography
Chapter 1
Introduction
Big data and data center computing have evolved into a multi-billion dollar industry [1].
This involves the use of many compute elements on a large scale (thousands or more)
for large-scale compute problems with many terabytes of data. Computation problems
that were once simple scale exponentially in complexity at the data center scale [2]. The
complexity is due to the large amount of data, the communication between compute nodes,
and the computation power required. At such a scale, considerations of power consumption
and heat dissipation are as important as computation power, as these become the
dominating variables in our cost calculations.
Cloud computing allows the sharing of data center resources among multiple tenants.
This is done by setting up infrastructure to multiplex these resources in time and in
space. A common method to do this is with virtualization, which abstracts away physical
details and maps a virtual machine onto a physical server. Similar data center resources
for standard compute CPUs are commercially available in services such as Amazon Web
Services and Microsoft Azure [3, 4].
1.1 Motivation
Field-Programmable Gate Arrays (FPGAs) have recently proven to be a good computation
alternative in data centers due to their compute capabilities and power efficiency.
One example is the Microsoft Catapult project, where FPGAs were deployed in the Bing
search engine [5]. With a 10% power increase they were able to see a 95% performance
increase. FPGAs allow users to create customized circuitry for their application, and the
performance and power savings multiply at data center scale. Provisioning
FPGA resources from a shared cloud, similar to the provisioning of CPUs, can
be very useful to allow many other users to create their own FPGA computing clusters.
This is a problem some have investigated, but what remains is a thorough
implementation of provisioning an FPGA cluster within a fully heterogeneous environment,
where it can communicate with any other network device (be it a CPU, another FPGA
cluster, or an Internet-of-Things device).
1.2 Goal
Our goal is to provide an easy way to orchestrate large FPGA clusters from a large pool of
heterogeneous cloud resources. Our two main objectives are ease of use and performance.
Ease of use requires us to abstract away the details of connecting large clusters. We
investigated using familiar programming models to connect these large clusters, such as
the accelerator model, which has a CPU offloading computation to multiple accelerators.
Through our investigation we noticed that this has performance limitations, and thus we
opted to create a model in which multiple FPGAs are connected together directly as one
accelerator, rather than having individual FPGAs as their own accelerators.
Our model allows users to have a uniform view of their entire circuit (which can
span multiple FPGAs) and design their large circuit at a logical level, where they are
not concerned with physical mappings of their circuit onto FPGAs or with the connecting
infrastructure between FPGAs. Our model also allows users to easily scale up their
designs by specifying with a simple pragma the number of times to replicate a sub-circuit
or the number of times to replicate the entire circuit.
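The replication idea above can be illustrated with a small sketch. This is a hypothetical example, not the thesis's actual file format: kernel names, the dictionary layout, and the "replicate" key are all assumptions made for illustration.

```python
# Hypothetical sketch: expanding a logical kernel description in which a
# "replicate" pragma requests N copies of a kernel. The dictionary format
# and key names are illustrative assumptions, not the thesis's actual syntax.

def expand_kernels(kernels):
    """Return a flat list of kernel instances, honouring 'replicate' pragmas."""
    instances = []
    for k in kernels:
        copies = k.get("replicate", 1)
        for i in range(copies):
            # Each replica gets a unique instance name.
            instances.append({"name": f"{k['name']}_{i}", "type": k["type"]})
    return instances

logical = [
    {"name": "producer", "type": "src"},
    {"name": "worker", "type": "compute", "replicate": 4},
]
print([k["name"] for k in expand_kernels(logical)])
# ['producer_0', 'worker_0', 'worker_1', 'worker_2', 'worker_3']
```

The point of the sketch is that scaling happens entirely at the logical level; the user never names individual FPGAs.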
1.3 Contributions
My contributions allow a user of a cloud computing system to provision a ready-to-use,
easy-to-scale FPGA cluster. The contributions can be broken down as follows:
1. Shown that a lightweight, low-overhead protocol is critical to have efficient coordination
of applications using multiple FPGAs.
2. Shown that low-latency direct interconnects between FPGAs provide a
significant performance improvement compared to having communications through
host CPUs.
3. Created infrastructure to provision a non-network-connected FPGA with a single
virtual machine.
4. Created a design flow to program cloud FPGAs, with and without network connectivity, in an
efficient manner.
5. Investigated an FPGA cluster model that uses a distributed OpenCL model to
connect multiple FPGAs in a single environment.
6. Created an FPGA Cluster Generation tool that creates FPGA network clusters by
connecting network FPGAs. This contribution can be divided into the following
sub-contributions:
(a) Created a script to translate a logical description of a circuit with no notion
of FPGA mappings, along with an FPGA mapping file, into physically partitioned
FPGAs. This contains extra logic to handle the networking between FPGAs.
(b) Created a script to assign unique network MAC addresses to FPGAs in the
data center.
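Contribution (b) can be sketched as follows. This is a minimal illustration, not the thesis's actual script: the prefix and numbering scheme are assumptions, though using locally administered unicast addresses (second-least-significant bit of the first octet set) is the conventional way to mint MACs that will not collide with vendor-assigned ones.

```python
# Illustrative sketch of assigning unique MAC addresses to FPGAs. The base
# prefix and the index-to-suffix mapping are assumptions for illustration;
# 0x02 in the first octet marks a locally administered unicast address.

def fpga_mac(index, prefix=(0x02, 0x00, 0x00, 0x00)):
    """Derive a unique locally administered MAC from an FPGA index."""
    if not 0 <= index < 1 << 16:
        raise ValueError("index out of range for a 16-bit suffix")
    octets = prefix + ((index >> 8) & 0xFF, index & 0xFF)
    return ":".join(f"{o:02x}" for o in octets)

print(fpga_mac(0))    # 02:00:00:00:00:00
print(fpga_mac(258))  # 02:00:00:00:01:02
```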
1.4 Overview
The rest of this thesis is organized as follows:
- Chapter 2 introduces background on FPGAs, data centers, cloud computing,
and FPGA virtualization.
- Chapter 3 describes the backend data center infrastructure used and the FPGA
infrastructure.
- Chapter 4 describes an FPGA cluster generation tool that uses a distributed
OpenCL environment, and its limitations.
- Chapter 5 introduces the infrastructure of our final design, from top-level
software scripts to low-level FPGA modules.
- Chapter 6 evaluates the infrastructure with microbenchmarks and a large case
study.
- Chapter 7 provides future work and concludes the thesis.
Chapter 2
Background
This chapter introduces some background information on Field Programmable Gate Ar-
rays, their use in data centers, cloud-computing and the back-end data center environ-
ment that is used in the work of this thesis.
2.1 Field-Programmable Gate Arrays
This thesis revolves around provisioning Field-Programmable Gate Array (FPGA) clusters
for a user from a resource pool managed by a cloud resource manager. FPGAs
provide a fine-grained, latency-sensitive computing alternative to the standard CPU
environment.
An FPGA is a silicon chip with a programmable switching fabric that allows the
formation of customized circuits [6]. In contrast to the standard CPU environment, where
the circuitry stays constant and performs actions based on instructions, an
FPGA changes its circuitry depending on the application.
This is implemented with the use of Look-up Tables (LUTs), which can implement
various logic functions (such as Boolean AND, OR, and NOT operations). Furthermore, these
LUTs are combined with memory elements (flip-flops) and grouped into logic blocks
for more complex applications that require memory. On top of logical functions there are
hardwired heterogeneous DSP blocks, memory blocks, and external components (Ethernet,
JTAG, USB) that can also be incorporated into the user-implemented circuitry. FPGA
CAD tools first map the user's circuit into logical hardware blocks, and then place these
logical hardware blocks onto the physical resources available [7].
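A LUT can be modeled simply: a k-input LUT stores a 2^k-entry truth table indexed by the input bits. The following sketch, with an assumed bit ordering (input 0 as the least-significant index bit), shows how the same hardware implements any 2-input function just by changing the stored table.

```python
# Minimal model of a k-input LUT: a 2^k-entry truth table indexed by the
# input bits. The bit ordering (input 0 as the least-significant index bit)
# is an assumption for illustration.

def make_lut(truth_table):
    """Return a function computing the logic function stored in truth_table."""
    def lut(*inputs):
        # Pack the input bits into an index into the truth table.
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return truth_table[index]
    return lut

# A 2-input AND is the table [0, 0, 0, 1]; OR would be [0, 1, 1, 1].
and2 = make_lut([0, 0, 0, 1])
print(and2(1, 1))  # 1
print(and2(1, 0))  # 0
```

This is exactly why reprogramming an FPGA changes its circuitry: the CAD tools simply load different truth tables (and routing) into the same fabric.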
FPGAs are conventionally programmed with a low-level hardware description language
that describes low-level physical circuitry, such as Verilog or VHDL [8, 9]. The
low-level design work is difficult and has a niche market, making it difficult for new users
to adopt. To mitigate these costs there have been pushes in high-level synthesis (HLS),
which translates high-level languages such as C and C++ into physical circuit descriptions.
Examples of these HLS compilers include the Vivado HLS tools (C, C++ to HDL)
[10], LegUp (C to HDL) [11], Xilinx SDAccel (OpenCL to HDL) [12], and the Altera OpenCL
SDK (OpenCL to HDL) [13]. Furthermore, the OpenCL environments include platform
architectures where many of the interfaces are abstracted away from the FPGA developer,
such as the PCIe interface, the Ethernet interface and the off-chip DRAM.
2.2 Cloud Computing and Data Centers
Data centers are large clusters of many compute devices [14], which can scale to the
order of thousands of servers. Traditionally these have been large CPU farms, used for
a multitude of applications that require large amounts of data and computation. These
data centers allow for the provision of large-scale services that process large amounts
of data, such as social media services, email, search engines [15], etc. The large scale
of storage, compute and networking resources allows companies to service a significant
number of users, but at the same time many challenges arise. Computation problems that
were once simple scale exponentially in complexity at the data center scale [2]. This is
mainly due to communication complexity across multiple nodes; these complexities include
the reliability of nodes, the reliability of messages, and the consistency of data across multiple
nodes. On top of computation and communication complexities, a big expense in the
data center is the energy required to run servers at that large a scale [16]. For
example, the Lakeside Technology Center in Chicago requires 180 MW of power, making it
the second-largest power customer of Commonwealth Edison (second only to Chicago's
O'Hare Airport) [17].
Data centers require a large capital investment, which is not a problem for companies
such as Microsoft, Google, or Facebook. However, smaller companies that would like to
use compute resources on a large scale may not be able to afford and maintain their own
data centers. Cloud computing provides these resources as a service to third parties [18].
The benefit is the sharing of infrastructure such as storage, computing and networking.
NIST defines cloud computing by the following characteristics:
1. On-Demand Self-Service: resources can be provisioned at any time.
2. Broad Network Access: all devices can communicate with any other device on the
network.
3. Rapid Elasticity: cluster sizes of devices can be changed easily.
4. Resource Pooling: resources are organized into pools for multiple clients.
5. Measured Service: metrics and tools are in place to measure usage.
2.3 Network Stack
The communication through networks is done through layered partitions, where each
layer provides a service to the layer above [19]. Figure 2.1 shows the layers within the
network stack. The Transport Layer provides full end-to-end transmission between a
host and destination on the Internet. This layer is not concerned with the path a packet
may take on the network, only the start and end points. The Network Layer similar to
the Transport Layer also is only concerned with the host and destination of a network
Chapter 2. Background 8
path, speci�ed by an IP address. The Data Link Layer is concerned with the local hops
a packet must take within the network, where each hop is speci�ed by a MAC address.
The physical layer is responsible for the physical transmission (e.g optic �bre, Ethernet
cable) of the information between links.
Figure 2.1: This illustrates the network stack from the transport layer and below.
A network can consist of many switches and hosts. Typically the translation between
an IP address (the end-to-end path description) and the MAC address (where to go on
the next hop) is done on intermediate network switches [19]. An example multi-switch
network is shown in Figure 2.2.
Figure 2.2: This shows an example of a small network connected by switches (S) and hosts (H). When two hosts wish to communicate they specify each other's IP address. The switch receiving a packet will decide the next hop by matching the IP address of the destination to the next-hop address specified by a MAC address (which determines which switch to go to next).
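The per-switch lookup described above can be sketched as a small forwarding table. Real switches use longest-prefix matching and protocols such as ARP; this exact-match table, with made-up addresses, is a deliberate simplification for illustration.

```python
# Sketch of the per-switch decision described above: the destination IP of a
# packet is matched against a forwarding table to pick the MAC address of the
# next hop. Exact matching and the addresses themselves are illustrative
# simplifications; real switches use longest-prefix matching.

forwarding_table = {  # destination IP -> next-hop MAC (made-up values)
    "10.0.0.2": "02:aa:00:00:00:01",
    "10.0.1.7": "02:aa:00:00:00:02",
}

def next_hop(dst_ip, default_mac="02:aa:00:00:ff:ff"):
    """Return the MAC address of the next hop for a destination IP."""
    return forwarding_table.get(dst_ip, default_mac)

print(next_hop("10.0.0.2"))  # 02:aa:00:00:00:01
```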
2.4 Software-Defined Networking
Software-Defined Networking (SDN) is a concept that enables programmatic control of
entire networks via an underlying software abstraction [20]. This is achieved by the
separation of the network control plane from the data plane, as shown in Figure 2.3. SDN
opens the door for users to test custom network protocols and routing algorithms, and
furthermore, it allows the creation, deletion, and configuration of network connections to
be dynamic. The current de facto standard protocol for enabling SDN is OpenFlow [21].
In OpenFlow, the control plane is managed by a user program running on a CPU that
leverages APIs exposed by an SDN controller. The SDN controller, often referred to as
the "network operating system", abstracts away network details from the user programs.
The controller manages the data plane and creates configurations in the form of flows.
Figure 2.3: System diagram of an SDN, where user-defined control programs manage network switches.
The control plane is generally responsible for managing the data plane, and creates
configurations in the form of flows. These flows describe the overall behaviour of the
network, and can be used to specify custom paths through the network based on packet
headers, or even to specify operations on the packets themselves (e.g., drop packets, modify
headers, etc.). While the switches in the data plane can handle simple header matching
and modification of header fields, more complicated features, such as pattern matching
within the payload or modifying the payload data, require the packets to be forwarded
up to the control plane for processing in software. Per-packet software-based processing
often incurs significant latencies and reduces line rate. The switches in the data plane
can handle simple matching of flows; however, if a packet does not match a flow, it is
either handled by a default flow or forwarded up to the control plane for the routing to
be handled in software.
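The match-then-default behaviour just described can be sketched as a tiny flow table. The field names and action strings here are illustrative assumptions, not OpenFlow syntax; the point is only the structure: match fields, an action, and a default action standing in for forwarding to the control plane.

```python
# Simplified model of flow matching: each flow pairs a match on header fields
# with an action. A packet matching no flow falls through to a default action
# ("controller"), mirroring the forward-to-control-plane behaviour described
# above. Field names and actions are illustrative, not OpenFlow syntax.

flows = [
    ({"dst_port": 80}, "forward:switch2"),
    ({"dst_port": 443}, "forward:switch3"),
]

def apply_flows(packet, default_action="controller"):
    """Return the action of the first flow whose match fields all agree."""
    for match, action in flows:
        if all(packet.get(field) == value for field, value in match.items()):
            return action
    return default_action

print(apply_flows({"dst_port": 80}))  # forward:switch2
print(apply_flows({"dst_port": 22}))  # controller
```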
This creates an opportunity for FPGAs: FPGAs can combine the best of both worlds,
with the reconfigurable nature of software programs in the control plane and the low
latency of the switches in the data plane. An example of a project using FPGAs in SDN
can be seen in [22]. This project was implemented with virtualized FPGAs in a data
center, where two virtualized FPGAs were inserted into the data path of a network flow.
Packets that normally would have been sent to the control plane for custom processing
were instead redirected to the FPGAs for processing. Using this approach, the throughput
of the packets is the same as a direct path through a switch; whereas when the
packets were handled by software running in the control plane, only half the expected
throughput was observed.
2.5 Internet-of-Things
Internet-of-Things (IOT) introduces the idea that "things" not restricted to standard
computation tools can connect to the Internet [23]. These include sensors measuring
traffic, heat, pollution, etc. These are used to create a smart environment, allowing us
to gather information and make control decisions accordingly [24]. One example is
the installation of sensors at traffic lights to detect the presence of vehicles waiting at
the light. The connection of these devices also brings forth a large amount of data that
otherwise would not be available. This data can be used for analytics such as the analysis
of pollution levels within a city.
2.6 Smart Applications on Virtualized Infrastructure
(SAVI) Testbed
The SAVI testbed is a Canada-wide multi-tier heterogeneous testbed, as seen in
Figure 2.4 [25]. This testbed contains various heterogeneous resources such as
FPGAs, GPUs, network processors, IOT sensors and conventional CPUs. The
virtualization of these resources is still being researched (our work investigates the
FPGA platforms). Previous virtualization work on this testbed includes the work by
Byma et al. [26], which provides partial FPGA regions as OpenStack resources. Other
resources such as GPUs and network processors are given to the user either by providing
the entire machine without virtualization or with the use of PCIe passthrough. PCIe
passthrough is when the hypervisor allows a virtual machine to have complete access to
a PCIe device. Once a virtual machine acquires this device, no other virtual machine
can reserve that device.
Figure 2.4: System diagram of the SAVI multi-tier architecture, which has a CORE with many CPU compute servers and Edges physically dispersed around Canada. Each Edge is made up of compute CPUs and other heterogeneous devices (e.g., FPGAs, GPUs, IOT sensors).
The multi-tier property refers to the network architecture of SAVI. SAVI can be
seen as multiple cloud networks. The core network consists of a large number of CPUs
that provide the backbone of the data center. This core network is then connected to
several edges dispersed around Canada. Each of these edges is a miniature cloud network
that also contains the heterogeneous devices. Many of these heterogeneous devices are
connected directly to the network through high-performance 10G switches. These devices
are treated the same way any CPU would be treated, as many of them are assigned
network ports with valid MAC and IP addresses. These devices are addressable by any
other node (CPU or other device) on the network once they are registered to the network.
This allows, for example, an IOT sensor in Toronto to send data to an
FPGA cluster in Victoria and then have the data be accessible by a CPU cluster in
Calgary. Furthermore, the multi-tier architecture allows a lot of the processing to be
done on the edge network, close to the heterogeneous devices, before data is sent to the
large CORE where we have more compute resources.
2.6.1 OpenStack
OpenStack is the cloud management tool used by SAVI [27]. It is divided into several
services. The two main OpenStack services that we employ in our platform are Nova and
Neutron, which are typically interfaced with from a client machine. Nova is responsible for the
deployment of compute infrastructure from the platform. This involves the generation of
virtual machines on physical machines [28]. The client machine, when requesting a virtual
machine, specifies two fields: a software image and a flavor. The software image refers to
all the software that is to be installed on the virtual machine; this includes the operating
system and any other applications that we want to initialize our virtual machine with.
These images are typically kept in a repository and can be updated by users of the
testbed. The flavor refers to the physical specifications of the virtual machine, such as the
number of CPU cores, RAM, and hard drive space.
Neutron is responsible for the provisioning of network resources [29]. We can create
network ports within our cluster, and these ports are assigned MAC addresses and IP
addresses that are valid within the cluster. When creating virtual machines these
ports are created implicitly, but we can explicitly create additional ports for non-virtual
devices or non-CPU devices.
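The two requests just described can be sketched as the JSON bodies a client sends to the Nova and Neutron REST APIs. The field names (`imageRef`, `flavorRef`, `network_id`) follow the public OpenStack APIs, but the identifiers and names below are made up for illustration; a real client would POST these with an authentication token.

```python
# Sketch of the two OpenStack requests described above, as the JSON bodies a
# client would POST to Nova and Neutron. The UUIDs and names are placeholders.

def nova_boot_body(name, image_ref, flavor_ref):
    """Body for Nova's POST /servers: boot a VM from an image with a flavor."""
    return {"server": {"name": name,
                       "imageRef": image_ref,
                       "flavorRef": flavor_ref}}

def neutron_port_body(network_id, name):
    """Body for Neutron's POST /v2.0/ports: an explicit port, e.g. for an
    FPGA or other non-VM device; Neutron assigns the MAC and IP."""
    return {"port": {"network_id": network_id, "name": name}}

vm = nova_boot_body("fpga-host", "ubuntu-image-uuid", "flavor-uuid")
port = neutron_port_body("tenant-net-uuid", "fpga0-port")
print(vm["server"]["name"], port["port"]["name"])  # fpga-host fpga0-port
```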
2.7 Related Work
In this section we describe previous work on virtualized FPGAs and other cluster
management tools in the cloud.
2.7.1 FPGA Virtualization
There has been previous academic work on providing FPGAs as virtualized resources
within the cloud management tool OpenStack. The work presented by Byma et al. proposes
that FPGA resources sitting directly on the network be allocated as OpenStack resources
[26]. The hypervisor is programmed into hardware and communicates with the OpenStack
controller via the network. Furthermore, the FPGA application region in this case is split
into four smaller regions, allowing multiple users to share a single FPGA device. This
also requires modifying OpenStack to communicate with the hardware hypervisor in the
FPGA.
Another important work, and the one most similar to ours, is the Leap project [30].
The focus of Leap is to provide an operating system for an FPGA. They abstract
away many details such as memory and I/O by providing analogous system calls to
interact with physical FPGA hardware. They provide multi-FPGA support through a
mapping file that they use to describe a multi-FPGA cluster. This requires the user
to first physically connect multiple FPGAs using any communication medium. Once
the user creates this cluster, they then create a configuration file describing the
physical connections and their mediums in the cluster. This cluster is then seen as a
large computation device that is ready to be programmed by the user through their
abstraction layer.
The work proposed by Chen et al. also virtualizes FPGAs in OpenStack but moves
away from FPGAs sitting directly on the network [31]. They propose implementing the
hypervisor in software by modifying KVM, which is a popular Linux hypervisor [32].
Instead of sitting directly on the network, the FPGA is coupled with a virtual
machine. Similar to the previous work, this also requires modifying OpenStack to
communicate with the software hypervisor.
Several industrial pursuits have started investigating provisioning FPGA resources
from a cloud. One example is the Maxeler MPC-X project [33]. This project provides
a virtualized FPGA resource to a user that can be implemented with a variable number
of FPGAs. The user first allocates resources for the given cluster of FPGAs in the
virtualized FPGA resource. Once the cluster has been made, the details are abstracted
from the user during application run-time.
IBM's SuperVessel looks at providing an FPGA as a cloud resource that shares memory
(through CAPI) with a CPU, also provisioned with OpenStack [34]. In this model a
single FPGA is provisioned to the user as an accelerator, to which the user can upload
FPGA code to be compiled and run on the FPGA. This simplifies the process of
provisioning an FPGA and running code to be accelerated on the FPGA, but it works with a
single FPGA. The user in this model can also use pre-uploaded FPGA applications as
services, which can be provided by companies or other users of the infrastructure.
Microsoft has also continued their work with data center FPGAs with the second
iteration of Catapult [35]. The model here looks at providing a backbone infrastructure
for multiple FPGAs to be connected together through a high-performance network
switch. CPUs are tightly coupled with FPGAs, and the FPGAs are connected to the
switch. FPGAs communicate amongst each other through a low-overhead custom
transport layer. Microsoft's view of the multi-FPGA fabric looks at the problem at an FPGA
granularity, where the user divides their large circuit across multiple FPGAs and the
user's circuits are aware of FPGA boundaries.
Lastly, Amazon AWS has recently announced that they are introducing Xilinx
UltraScale+ VU9P FPGAs to their cloud resource pool, connected to VMs via a virtual JTAG
connection and dedicated PCIe x16 connections [36]. They provide two flavors of
FPGAs with their CPUs: one is a single FPGA accelerator, and the other is an
8-FPGA ring. The 8-FPGA flavor is connected via a 400 Gbps bidirectional low-latency
network.
2.7.2 Cloud Cluster Management Tools
Another aspect of this project is to provide orchestration of clusters within our cloud
environment. Heat is a component of OpenStack that can orchestrate clusters using an
orchestration template, which describes the virtual machines and networking within your
cluster [37]. This allows the creation of interesting network topologies within your own
cluster. Heat can be combined with user applications that modify these clusters
using other metrics such as performance, resource utilization, and CPU usage.
Other tools exist that combine orchestration and load balancing using the aforementioned
metrics. The usual workflow for these tools is as follows. The tool first reserves
a set of resources from a larger pool of compute nodes for a certain application. The
allocated resources are then connected for the application and monitored. The monitoring
is used for user statistics as well as fault tolerance within the cluster.
These tools are helpful for getting optimal, reliable performance on a cluster as well
as for debugging a cluster. Debugging a cluster can be a daunting task as there are many
variables within the cluster. These tools monitor events to gauge the status of different
processes within an application and present problems to the user in an easy-to-understand
representation.
Most of these tools currently work for CPU clusters (e.g., Apache Mesos, Slurm) and
GPU clusters (e.g., the NVIDIA Management Library) [38, 39, 40]. Our challenge is to expand
clustering capabilities to FPGAs by developing our own orchestration tool and then to
investigate monitoring and updating our clusters using FPGA metrics, which will differ
from the CPU and GPU metrics that current tools use.
Comparison of our Cluster Generator to Other Tools
In our work, we develop a cluster generation tool that takes as input a number of computation
kernels and their connections. It allows us to easily create large multi-FPGA
clusters. We can compare this to the Microsoft Catapult project. The first iteration of the
Catapult project has statically connected FPGAs in a fixed torus and lacks flexibility [5].
Their second version of the project has a network-connected model similar to our design,
where all FPGAs are connected to a network switch [35]. A key difference between
our project and the Catapult project is the model in which we describe our problem. The
Catapult project breaks the problem into FPGA boundaries, and it requires the user to
think in terms of physical FPGAs. In our model the user is not concerned with FPGA
boundaries, and designs kernels independent of FPGAs. Our model also allows for easy
scalability, where we can scale up our designs with the simple use of a pragma.
We can also compare our work to the Leap FPGA project. Similar to our Catapult
comparison, we provide easy scaling that is not available in the Leap project. Furthermore,
our tools sit on top of a cloud managing tool that can create arbitrary FPGA
connections. In the Leap project, the user has to physically connect the FPGAs in a
user-specified topology. With respect to topology our work is more flexible. However, the
connection medium in Leap is flexible, whereas in our design we assume network-connected
FPGA clusters.
2.8 Level of Abstraction
Our work looks at using OpenStack to provision FPGA network clusters. This is similar
to the other OpenStack works cited, but on a larger scale, as we are looking at
multi-FPGA clusters. The physical layout of our FPGAs is similar to that of Catapult,
with the FPGAs as network-connected devices, but in our environment these clusters
are provisioned with OpenStack. Furthermore, our backend data center is a large pool
of heterogeneous resources where not only FPGAs are network connected but they are
connected to CPUs and IoT devices (receivers, sensors, etc.). Our FPGA cluster is seen
as any other network device with a MAC address and IP address, and any
network-connected device in the data center can communicate with it. Unlike the
Amazon EC2 F1 project, our work provides FPGAs as part of the network backend where
they can communicate directly with any virtual machine or other network device
in the network. Lastly, our work builds on top of this infrastructure by providing simple
cluster provisioning tools that communicate with OpenStack to generate the infrastructure.
This infrastructure request uses a logical description file that describes the user kernels
and how they are connected. This logical description file is FPGA independent and also
provides methods of scaling up nodes within a cluster, introducing schedulers, or even
replicating an entire cluster. An FPGA mapping file is also provided that maps each
kernel specified in the logical file to a particular FPGA.
Our level of abstraction is demonstrated in Figure 2.5. This work is not true virtualization;
instead it provides the infrastructure needed for true virtualization. Our work
easily creates FPGA infrastructure from a pool of cloud resources by using an FPGA-independent
description of a circuit and an FPGA-dependent mapping of the circuit.
This gets translated into a physically partitioned FPGA circuit with FPGA network
interconnections automatically generated from the cloud. However, this still requires some
user specification of where to place kernels, which means this is not true virtualization,
as the physical specifications of the FPGA are not hidden from the user. We can build
virtualization on top of this by creating a virtual FPGA that can be made out of many
FPGAs in the cloud. True virtualization will be able to characterize the number of physical
resources required given the user specification, and then our tools can be invoked to
create the physical cluster out of the resources available in the cloud. We explore this in
Section 7.1.
Figure 2.5: This illustrates the level of abstraction stack that we provide and where we believe true virtualization should exist.
Chapter 3
Base Infrastructure: Cloud
Resources and FPGA Platform
This tool provides a high-level abstraction to acquire FPGA clusters from a virtualized
environment. We define our infrastructure stack in Figure 3.1.
Figure 3.1: Our infrastructure stack. We provide APIs at each layer and abstract away most of this stack from the user. The user supplies the top layer and we return a fully connected FPGA cluster.
In Chapters 4 and 5 we present the implementation of our FPGA cluster generation
tools. In this work we define an FPGA cluster as an environment that has multiple
FPGAs connected in a manner that makes inter-FPGA communication easier. The first
model looks at the FPGA cluster as a CPU connected to multiple FPGA accelerators.
Coordination between accelerators is handled by the CPU and thus requires inter-FPGA
communication to happen through the CPU, which became the bottleneck of our design.
The second design model we looked at connects multiple FPGAs using Ethernet on the
FPGA and allows FPGAs to communicate directly with one another, eliminating the
CPU bottleneck in communication. The two design alternatives explored in Chapters 4
and 5 utilize virtual CPUs tightly coupled with FPGAs, as OpenStack is used to provision
FPGAs to the user. This chapter explores the modifications made to allow SAVI to
support the provisioning of a single FPGA, which corresponds to the OpenStack Compute
Commands, OpenStack Network Commands and Cloud Network Port Registration parts
of our infrastructure stack.
3.1 SAVI Infrastructure Modifications
The SAVI infrastructure as explained in Section 2 includes the physical servers, the
heterogeneous devices and the networking capabilities.
3.1.1 OpenStack Resource Manager
OpenStack is the virtualized resource manager that is used by the SAVI infrastructure.
This includes physical servers managed with hypervisors connected to high-performance
network switches that are also managed with software-defined networking tools.
When a user requests a virtual machine from SAVI, the request specifies the physical
specifications of the virtual machine (the flavor) and the software image of the virtual
machine. Figure 3.2 shows what a virtual machine request looks like.
Figure 3.2: A standard OpenStack virtual machine request.
Each physical server has an agent. An agent is a program running on the server
that is responsible for communicating with OpenStack. The agent is sent requests to
make/remove virtual machines with certain specifications and software images, and requests
for access to physical heterogeneous devices available in the physical server. Past
work in FPGA virtualization has looked into creating custom agents to manage FPGA
virtual machines [26]. Our approach is different, as we wish to keep these modifications
to a minimum.
The only changes we made were to include flavors supporting PCIe FPGA devices, and a
few configurations on the KVM server to support passthrough of a specific PCIe device.
The other approach would be to modify OpenStack to support our FPGA environment,
but that would make adoption in other OpenStack environments more difficult.
3.1.2 PCIe Passthrough and OpenStack Image
First, we provide the FPGA as part of a VM using PCIe passthrough, which is when
the VM is given full access to a PCIe device on the physical server. OpenStack notifies
the software hypervisor on the physical server of the VM parameters using the flavor
discussed in Section 2.6.1. These parameters also include information about any PCIe
devices required by the user. This involves configuring the hypervisor to pass control of
the PCIe device to a specific VM by adding the PCIe vendor and device ID of the FPGA
to the OpenStack configuration script on the physical server. The cloud management
system then provisions the VM including the requested PCIe device(s). Figure 3.3 shows
two example VMs with PCIe-connected FPGAs. Once a virtual machine is assigned a
PCIe device, it is given full access to the device, and the device cannot be shared with
another virtual machine.
Figure 3.3: This figure illustrates an example of two virtual machines on a single server. One virtual machine has one PCIe FPGA and the other has two PCIe FPGAs.
Secondly, we have created multiple OpenStack flavors corresponding to the PCIe devices.
Each flavor describes the configurations of the desired VM. These configurations
include the number and type (specified by the device ID and vendor ID) of PCIe devices.
We made two flavors, one lightweight flavor and another for a full development
environment. The lightweight flavor, which consists of only two CPU cores and 2 GB of
memory, is intended for the CPU on the VM to act as a mere controller for the FPGA.
The full development environment, which consists of four CPU cores and 8 GB of memory,
provides a complete environment to create and test FPGA designs as well as control
the FPGA. The specifications of these VMs are shown in Table 3.1 and Table 3.2.

Table 3.1: Physical specifications of virtual machine used within FPGA clusters
    Number of Cores: 2
    Disk Space: 10 GB
    RAM: 1 GB

Table 3.2: Physical specifications of standalone FPGA design station
    Number of Cores: 4
    Disk Space: 40 GB
    RAM: 8 GB
Next we made a software image for our virtual machine that will host the FPGA.
This is the base software image. The cluster designs described later in this section add
more software support to the base software image. The base software image contains the
Xilinx SDAccel 2015.3 tools and PCIe driver. The lightweight image contains a subset of
this tool-chain and is limited to only the PCIe driver and programmer. Virtual machines
using this image cannot generate bitstreams but can program FPGAs and communicate
with the FPGA via a software driver. In later clusters we require at least one machine
to have the full tool-chain, as this machine will be used to develop the bitstreams that
will then be distributed amongst the cluster.
3.1.3 Networking Backend
Physical compute servers, FPGAs and IoT devices are physically connected directly to
network switches. These network switches are managed by SAVI's network manager
Janus [41]. Devices attached to the network switch need to have network ports
registered with Janus. The registering of these ports requires the port number and an
IP address along with a MAC address. Once registered, Janus uses OpenFlow to route all
traffic destined for a specific IP address and MAC address to the registered port. Janus
also ensures that all traffic that has an invalid destination or source (not registered) is
dropped within the network. The registration of a port first requires the creation of a
virtual port in OpenStack. The OpenStack tool Neutron is used to create a new virtual
port that has a new MAC address and IP address. Once these have been registered,
all packets destined for our device must use the IP address and/or the MAC address in
their header field. Also, all packets that do not have a destination field matching any
destination in the virtual network are dropped. These requirements can be bypassed
using custom networking flows that can be programmed onto the switch. The type of
device is independent of the network port, which allows us to use the same mechanism
to assign IP and MAC addresses to FPGAs and IoT devices within our network.
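The registration behaviour described above can be modeled as a simple lookup: traffic whose destination (MAC, IP) pair is registered is forwarded to the registered switch port, and anything else is dropped. The addresses, port numbers, and dictionary-based model below are illustrative only; Janus itself installs OpenFlow rules on the physical switch.

```python
# Toy model of Janus port registration and forwarding. A (MAC, IP) pair
# registered through Neutron maps to a physical switch port; unregistered
# destinations are dropped (modeled here as returning None).
registered = {("fa:16:3e:00:00:01", "10.0.0.5"): 7}  # (MAC, IP) -> switch port

def route(dst_mac, dst_ip):
    """Return the switch port for a registered destination, else None (drop)."""
    return registered.get((dst_mac, dst_ip))

print(route("fa:16:3e:00:00:01", "10.0.0.5"))  # forwarded to port 7
print(route("fa:16:3e:00:00:99", "10.0.0.9"))  # None: dropped
```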
3.2 Xilinx SDAccel Platform
All the design alternatives explored here use the Xilinx SDAccel platform [12] (or a
modified version of the platform). This platform provides the user a set of APIs to
program an FPGA, send the FPGA data, and read data processed by the FPGA. This
platform can be seen as an FPGA hypervisor, as it is responsible for managing the
FPGA interface around the user application. This is explained in Section 3.2.2.
3.2.1 OpenCL
OpenCL is a heterogeneous programming platform that allows a user to communicate
with devices via a host application [42]. These devices include GPUs, CPUs, and most
recently, FPGAs. Interactions between the host and the devices are called OpenCL
events. OpenCL events can be profiled and synchronized, even between devices, which
becomes even more challenging when these devices are on the network. Figure 3.4 shows
the heterogeneous environment OpenCL provides. The host is linked with OpenCL host
libraries that the host can use to interact with the devices. Furthermore, the code running
on the devices, known as kernels, is written in the OpenCL language. This is a language
very similar to C, but with more parallel constructs.
Figure 3.4: The heterogeneous environment provided by OpenCL
Each OpenCL vendor provides an interface to their device called an Installable Client
Driver (ICD). The ICD provides the mapping from standard OpenCL API calls to
specific device-driver implementations of the OpenCL API. A multi-platform OpenCL
application loads the vendor devices by traversing a list of files that specify the
vendor-specific ICD implementations. The ICDs are loaded and then subsequent OpenCL
host API calls are redirected to the ICD for the specific device. More information on the
OpenCL specifications can be found in [42].
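The file-traversal step of ICD discovery can be sketched as follows. On Linux, OpenCL loaders conventionally scan a vendors directory (commonly /etc/OpenCL/vendors) whose .icd files each name a vendor ICD library; a temporary directory stands in for it here so the sketch is self-contained, and the file and library names are illustrative.

```python
import os
import tempfile

def discover_icds(vendors_dir):
    """Collect the vendor ICD library named by each .icd file in the directory."""
    libs = []
    for name in sorted(os.listdir(vendors_dir)):
        if name.endswith(".icd"):
            with open(os.path.join(vendors_dir, name)) as f:
                libs.append(f.read().strip())  # path of the vendor ICD library
    return libs

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "xilinx.icd"), "w") as f:
        f.write("libxilinxopencl.so\n")
    print(discover_icds(d))  # ['libxilinxopencl.so']
```

Once each named library is loaded, subsequent OpenCL host calls for that vendor's devices are dispatched to it, as described above.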
3.2.2 FPGA Hypervisor
In our design we use the Xilinx SDAccel [12] platform as an FPGA hypervisor, where
the hypervisor is used to provide some basic services. The FPGA in this model is a
PCIe-connected device, and the platform first provides a driver to communicate with the
FPGA. This is done through OpenCL, which provides the API to communicate with and
manage devices.
OpenCL is both a programming language for heterogeneous devices and a programming
API for a host application (conventionally run on a CPU) to manage and communicate
with OpenCL-compatible devices [42]. This environment gathers all the OpenCL
devices connected to the processor, usually locally via PCIe. In the SDAccel platform,
as shown in Figure 3.5, the OpenCL API communicates with a driver provided by Xilinx
called the Hardware Abstraction Layer (HAL) that provides driver calls to send/receive
data from the FPGA and program the Application Region in the FPGA. The Application
Region is programmed using partial reconfiguration, and the region around the
Application Region is the Hypervisor in our model. In this platform the kernels within
the Application Region can be OpenCL kernels, Vivado HLS kernels, or even hand-coded
Verilog/VHDL kernels. The PCIe Module is a master to a DMA engine that reads/writes
off-chip DRAM. This is used to communicate data to the Application Region. The PCIe
Module is also a master to an ICAP module (not shown) responsible for programming
the partial reconfiguration region with a bitstream sent from the user in software. The HAL
driver provides an API that abstracts away the addresses required to control the various
slaves of the PCIe master.
Figure 3.5: System diagram of the SDAccel platform
3.3 Design Flow for FPGA Development in the Cloud
In Chapter 4 we describe the design flow for the development of large scalable FPGA
clusters. The infrastructure described in this chapter, however, does present us with a
new design flow for FPGA development on a small scale. We deployed our FPGA cloud
service in May 2015. Since then it has been used by students within the University of
Toronto as part of their own FPGA development environment. Our infrastructure lays
the groundwork for a new design flow that helps utilize and share the FPGAs effectively.
This is done through the use of software simulation of FPGAs. The software tools
provided within the SDAccel environment allow for simulating the Application Region
completely in software, with no change to the user software application that is calling
the application. The simulated Application Region is wrapped to provide the exact same
interface for the Hardware Abstraction Layer as is done in the actual hardware. In
this way the same HAL can be used during software simulation to transfer data to and
from the simulated Application Region. This is supported by the standard SDAccel tool
provided by Xilinx.
Our environment gives the user flexibility to provision a VM containing the FPGA
development tools with and without a physical FPGA. This creates a new design flow as
follows:
1. The user develops their application on a VM without an FPGA. The user requests
a VM with a flavor that does not have the FPGA and the software image containing
the FPGA software tools. The user tests their design using the software-simulated
FPGA.
2. Once the user is ready to migrate their work to a physical FPGA, they save a
snapshot of their VM. This is done through an OpenStack API to save the state of
a VM.
3. The snapshot is then uploaded to the OpenStack software image repository. The
user then requests a new VM with a flavor that has the FPGA and the software
image snapshot saved in Step 2.
4. Now the user can test their application on a physical FPGA. After testing, they
can migrate their application back to a VM without an FPGA. They once again
will save a snapshot of their VM, but this time migrate to a machine without an
FPGA.
This design flow allows for easy sharing of the FPGA. Cloud managers can track
usage of the physical FPGAs by using monitoring functions provided by OpenStack.
This also has further implications for the re-usability of FPGA applications as functions.
Similar to software applications, we can create FPGA applications as virtualized
resources, upload the application to OpenStack and have it available as a software
image readily available to everyone.
3.3.1 Extended Design Flow for Multi-FPGA Applications
Multi-FPGA applications can also be deployed with the infrastructure described in
this chapter. Communication between FPGAs can be done through either the virtual
machine or directly through the network. The integration of the software simulation of
FPGAs along with actual physical implementations on an FPGA allows for an incremental
design flow. The design flow of multi-FPGA applications in our environment is as follows:
1. Implement all parts of the multi-FPGA application design as a chain of software-simulated
FPGAs (using an OpenStack image with the software tools and a flavor
that does not have the FPGA).
2. Implement and test each individual network function as an FPGA-offloaded design.
3. Incrementally, as we complete each part of the multi-FPGA application, swap the
software-based function with the FPGA-based implementation.
4. If the multi-FPGA application remains functionally correct, then repeat Steps 2
and 3 for the next part of the application. Repeat until the whole application is
implemented using FPGAs.
Chapter 4
Design Alternatives
This chapter introduces our first iteration of a cluster generation tool. We use our
own cluster generation tool to create an MPI software cluster. On top of the software
cluster we use an OpenCL network environment tool called SnuCL [43] to create an
OpenCL platform out of FPGA-connected virtual machines (SnuCL was modified for
FPGA support). We will first introduce SnuCL, then highlight the additions we had
to make to our OpenStack environment, describe our cluster generation tool and lastly
describe and analyze the results observed.
4.1 SnuCL
SnuCL provides a single OpenCL environment for a host device communicating with a
cluster of network-connected CPUs and GPUs. The communication between the host and
devices within the network cluster is implemented using the Message Passing Interface
(MPI). In MPI, a single application is split into a set of processes that can then be run
on top of a cluster of network devices. The processes communicate with each other using
messages, as there is no shared memory between the processes. The physical locations
of these processes are independent of the MPI implementation of the application. The
underlying physical infrastructure is specified to the MPI run-time (and not when the MPI
application is being compiled or implemented).
In SnuCL, the host and each of the devices are executed in separate processes. Traditionally
in OpenCL, the host will use the device-specific ICD on the same machine to
implement OpenCL functions; however, in SnuCL the host process first sends a message
to the particular device process. The device process is responsible for using the ICD to
relay information back and forth to the device, and then to relay the information back to
the host process. The underlying communication between the host and device processes
as well as the network architecture are handled by MPI and hidden from the user. Thus,
this can provide a shared view of the OpenCL environment to a user, abstracting away
the locations of the devices. Figure 4.1 highlights how SnuCL works in a cluster and
then gets logically transformed into Figure 3.4. More details on the implementation of
SnuCL can be found in [43].
Figure 4.1: Simplified SnuCL cluster; this gets logically translated into Figure 3.4
4.2 Modi�cations for SnuCL OpenStack Support
To support this in OpenStack we had to make our own OpenStack flavor and disk image.
A flavor can specify the type of PCIe device as well as the number of PCIe devices of
that specific type (e.g., a flavor for 1 FPGA, 2 FPGAs, 1 GPU, 2 GPUs, etc.). In addition to
PCIe devices, a flavor also defines other machine specifications such as memory and hard-disk
space. Once these flavors are created, they can be used to create multiple virtual
machines described by the specifications of the flavor. In our current implementation,
our flavor grants virtual machines 40 GB of hard disk space and 8 GB of RAM; however,
we can shrink this requirement as most of the processing is done on the FPGA.
SnuCL was modified to work with the Xilinx SDAccel environment. SnuCL was
previously tested on CPU and GPU clusters and required slight modifications to work
with FPGA devices. Our virtual machine disk image is implemented using the CentOS 6.6
image, as this supports the SDAccel driver. The following software is installed onto the
CentOS image:
1. Xilinx SDAccel 2015.1. This version has ICD support, which is needed for SnuCL.
2. OpenMPI 1.6.4, which is needed for SnuCL.
3. SnuCL modified for FPGA support.
Once the software tools are installed onto the CentOS image, a snapshot of the
virtual machine is taken. A snapshot refers to the creation of a new virtual machine disk
image that includes everything installed on a running virtual machine at the moment
the snapshot was taken. This snapshot can now be used to build new virtual machines
that come pre-packaged with our custom software tools. With the appropriate flavor
and virtual machine disk image, we can make fully functioning virtual machines that are
ready to launch distributed FPGA OpenCL applications.
4.3 Cluster Orchestration
This section goes over the automation of clusters within our environment. Our cluster
orchestration takes a Cluster Generation File (CGF) as input, acquires the requested
resources and forms a cluster. OpenStack is used to make virtual machines with the
flavor that includes the FPGA resources, using the virtual machine disk image that
has the necessary software tools.
There are several flavors that correspond with the FPGA device, with the flavors
differing by the number of FPGA devices. The flavor with the largest number of available
FPGAs less than or equal to the number of requested FPGAs is used. This is repeated until
the total number of FPGAs requested is reserved, to ensure the highest degree of locality
possible. Our orchestration system uses SnuCL to allow FPGAs in different physical
machines (thus different virtual machines) to be combined within one OpenCL
environment. However, if we wanted a small number of FPGAs that are available in one
physical machine, we could use standard OpenCL by reserving the flavor associated with
the number of FPGAs.
In our cloud SnuCL environment, after we reserve virtual machines for our devices,
another virtual machine is created to represent the host machine. This is of a different
flavor than the rest of the cluster, as this machine does not require a PCIe device.
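The greedy flavor-selection step described in this section can be sketched as follows. The available flavor sizes and the fallback to the smallest flavor when nothing fits exactly are assumptions for illustration, not the deployed tool's exact logic.

```python
# Hedged sketch of greedy flavor selection: repeatedly pick the flavor with
# the largest FPGA count that is less than or equal to what is still needed,
# until the whole request is reserved.

def select_flavors(requested_fpgas, flavor_sizes):
    """Return the list of flavor sizes (in FPGAs) chosen for the request."""
    sizes = sorted(flavor_sizes, reverse=True)
    chosen = []
    remaining = requested_fpgas
    while remaining > 0:
        # Largest flavor not exceeding the remaining request; if none fits,
        # fall back to the smallest flavor (slightly over-allocating).
        fit = next((s for s in sizes if s <= remaining), sizes[-1])
        chosen.append(fit)
        remaining -= fit
    return chosen

# e.g. flavors with 1, 2 and 4 FPGAs; a request for 7 FPGAs
print(select_flavors(7, [1, 2, 4]))  # [4, 2, 1]
```

Picking the largest fitting flavor first keeps as many of the requested FPGAs as possible on the same physical machine, which is the locality goal stated above.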
(a) The client requests a cluster with a cluster generation file (CGF), which is translated to the appropriate OpenStack commands.
(b) The cluster generated with OpenStack is prepared by connecting the appropriate nodes and preparing the nodes file required for MPI.
Figure 4.2: Demonstrates the two steps when orchestrating a cluster. First the cluster is reserved using OpenStack and second the cluster is prepared and connected for SnuCL.
Once the cluster is formed, the nodes are connected so that SnuCL works between
them. This involves modifying the firewall between these nodes and ensuring that there
is ssh access between the host node and the cluster nodes (OpenMPI spawns processes
on other nodes by executing them through ssh). A nodes file is then generated specifying
the IP addresses of the other nodes in the cluster and moved to the host node. The
SnuCL cluster application can now run on the host node by specifying the nodes file
to MPI. The generation and connection of the nodes and the preparation of the nodes file
are all done automatically with the cluster generation tool. The user, after requesting the
cluster from the tool, would then just have to log in to the host virtual machine and run
their cluster application. Figure 4.2 shows the formation of clusters using the cluster
generation tool.
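The nodes-file preparation can be sketched as follows. The IP addresses, file location and launch command shown are illustrative, not the tool's actual values; the one-IP-per-line layout matches the common OpenMPI hostfile convention.

```python
import os
import tempfile

def write_nodes_file(node_ips, path):
    """Write one cluster-node IP per line, as OpenMPI hostfiles expect."""
    with open(path, "w") as f:
        f.write("\n".join(node_ips) + "\n")

# Illustrative cluster-node addresses; the real tool fills these in from
# the virtual machines it reserved through OpenStack.
nodes_path = os.path.join(tempfile.gettempdir(), "nodes")
write_nodes_file(["10.0.0.11", "10.0.0.12"], nodes_path)

# The host node would then launch the SnuCL application with something like:
#   mpirun --hostfile nodes ./snucl_host_app
```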
4.4 Results
We implemented simple video processing kernels in OpenCL and ran the kernels with and
without SnuCL, both versions on the FPGA. The video kernels perform object tracking
and recognition on the FPGA.
We averaged the execution time per frame over the execution of 25 frames in our
environment. In the SnuCL environment the execution time per frame is averaged over
20 executions of 25 frames at a time. The average execution time of the kernel in the SnuCL
library is 224 ms, while it is 2.23 ms directly on the FPGA virtual machine. This is a
100-fold slowdown when going to SnuCL.
This experiment highlights that there is a lot of future work required in the communication
protocol of our SnuCL environment. SnuCL was available and easy to use,
but the overhead introduced by this system leaves room for improvement. SnuCL implements
its communication through MPI, which is readily available; however, a more
light-weight protocol could be investigated to replace or enhance SnuCL. On top of a communication
protocol, direct communication to the FPGA between compute kernels could
also prove to be beneficial. In Chapters 5 and 6, we explore our final design alternative,
which uses direct FPGA communication without MPI software overhead.
Chapter 5
FPGA Network Cluster
Infrastructure
This chapter addresses our second design alternative. This design alternative provides
a cluster of network-connected FPGAs to the user, given a description of what a cluster
of kernels will look like. The work in this chapter is based on the paper [44]. Thomas Lin
helped with the networking back-end required, and Eric Fukuda helped with the
application case study.
This alternative builds on our first design alternative by allowing users to work at
a high level with the cloud client. The user provides a description of their desired FPGA
cluster. This description is on a logical level and describes how different FPGA kernels
are to be connected together. Along with the logical description, the user provides an
FPGA mapping. This FPGA mapping specifies the number of FPGAs the user requires
and places the kernels on the appropriate FPGAs. Kernel connections across FPGAs are
implemented via Ethernet. Furthermore, kernels may also fan out to schedulers instead
of making direct kernel connections. The intricacies of the network connections and
schedulers are discussed later in Section 5.5.
In this work we define a logical cluster description as a cluster description without a
notion of an FPGA mapping; a physical cluster description is what results after the logical
cluster is partitioned and placed onto the appropriate physical FPGAs.
5.1 Logical View of Kernels
The kernels in this system are streaming kernels and they use the AXI stream protocol
for input and output. The AXI stream interface our system uses has the following fields
(a subset of all the fields offered by the protocol):
- 32-bit data field. Stores the data of each transfer.
- 32-bit dest field. Stores the destination of each transfer. The destination corresponds
to an address of each kernel on the FPGA.
- 1-bit last field. For a packet with multiple transfers, this is asserted on the last
transfer of the packet.
- 1-bit ready field. This is asserted downstream to notify the stream that it is ready
for input.
- 1-bit valid field. This is asserted on a valid transfer.
These bit fields correspond to a single flit of a transfer; an AXI stream packet can
correspond to multiple flits, where the concluding flit will have the last field asserted.
This is the protocol that we use within our module. However, when we transfer packets
over the Ethernet we do not have a dest field, as the Ethernet module does not use a
dest field. We append the 32-bit dest field as part of the header of our Ethernet packet.
For simplicity we currently use a 32-bit dest field because this easily aligns to a 32-bit
word boundary. This is the case because packets read from the Ethernet module are
read 32 bits at a time. This creates a significant overhead when there are packets being
distributed within an FPGA, as large multi-flit packets will transmit 32-bit dest fields for
each flit transfer. To save on wires on the FPGA we can look to shrink the destination
overhead.
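As a rough sketch of the framing just described (not the actual on-wire format), the 32-bit dest word can be prepended to the 32-bit payload words of a packet before it enters the Ethernet module; the field values below are illustrative.

```python
import struct

def pack_packet(dest, payload_words):
    """Prepend the 32-bit dest word to a packet's 32-bit payload words."""
    words = [dest] + list(payload_words)
    return b"".join(struct.pack(">I", w) for w in words)

# A packet addressed to kernel 2 carrying two 32-bit payload words.
pkt = pack_packet(0x00000002, [0xDEADBEEF, 0x00000001])
assert len(pkt) == 12  # one dest word + two payload words, 4 bytes each
```

The sketch also makes the overhead noted above concrete: every packet carries a full 4-byte dest word regardless of payload length.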
All kernel inputs to the system are addressed by a specific dest entry. Logically
speaking, unless otherwise stated, any kernel output can connect to any input. This can
be seen as all kernels being connected to a large logical switch. These kernels may be
mapped to the same FPGA or to different FPGAs. Furthermore, these kernels can be
replicated with directives in the input scripts, and they can be scheduled in different ways
with the use of schedulers.
Figure 5.1: The simple logical view of a kernel cluster. In this situation all the kernels output to a switch, and their inputs are addressed through the switch.
Figure 5.2 shows the XML file the user specifies, corresponding to the logical cluster
in Figure 5.1. Each kernel is assigned an address that corresponds to the address of its
input port. There is also a replication field that specifies the number of times we wish
to replicate the kernel.
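The exact schema is shown in Figure 5.2; since that figure is not reproduced here, the sketch below invents plausible element and attribute names purely to illustrate the idea of a kernel list with per-kernel address and replication fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical-cluster description; the element and attribute
# names here are invented for illustration, not the exact schema.
CLUSTER_XML = """
<cluster>
  <kernel name="A" address="0x2" replication="1"/>
  <kernel name="B" address="0x3" replication="1"/>
  <kernel name="C" address="0x4" replication="2"/>
</cluster>
"""

def parse_cluster(text):
    """Map each kernel name to its (input-port address, replication count)."""
    root = ET.fromstring(text)
    return {k.get("name"): (int(k.get("address"), 16), int(k.get("replication")))
            for k in root.findall("kernel")}
```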
5.1.1 Sub-Clusters
In Figure 5.1 we show three kernels connected via one logical switch. All kernels are
connected to each other in a fully connected network. Edges can be removed if we
directly connect kernels. Figure 5.3 shows four kernels with direct connections between
some of the kernels. Such sub-clusters are then connected to the logical switch.

Figure 5.2: Example logical cluster XML file.
We can also have our own schedulers, where the output of a kernel might not be
connected to all the other kernel inputs but to a subset of kernel inputs arbitrated by a
scheduler. This type of sub-cluster is shown in Figure 5.4 and explained in more detail
in Section 5.5. Figure 5.5 shows how multiple sub-clusters can be connected to the same
logical switch.

Figure 5.3: An example of a directly connected sub-cluster that would be connected to the logical switch.
Figure 5.4: An example of a sub-cluster where a kernel fans out to a local scheduler that arbitrates between three kernels within the sub-cluster.
5.2 Physical Mapping of the Kernels
Each kernel in the logical topology is mapped to a physical FPGA. More than one kernel
can be mapped to an FPGA. Direct kernel connections on the same FPGA are simply
connected within the FPGA. Kernels with connections that cross an FPGA boundary
are wrapped with logic to help with the crossing. Figure 5.6 shows a sample mapping
file that our infrastructure takes as input.
Figure 5.5: How the sub-clusters fit with the logical switch.
Figure 5.6: Example FPGA mapping file.
When connections on the large logical switch are divided across multiple FPGAs,
the logical switch is implemented as physical switches on each of the FPGAs. Figure 5.1
shows three kernels fully connected with a logical switch. Now let's consider the following
scenario: Kernels A and B are on FPGA 1 and Kernel C is on FPGA 2. The physical
mapping is shown in Figure 5.7.
Figure 5.7 shows the logical switch split into two physical switches. The inputs to
the respective kernels on the two FPGAs always come from the physical switch on the
same FPGA. The first FPGA sends all packets addressed to Kernel C to the switch on
the second FPGA, and the second FPGA's switch sends all packets destined for Kernels
A and B to the first FPGA. The output of each kernel feeds into the physical switch on
that FPGA, and the physical switch determines the destination FPGA of each packet.

Figure 5.7: This figure translates the logical cluster specified in Figure 5.1 into a physical cluster with two FPGAs.
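The forwarding decision each physical switch makes can be derived from the kernel-to-FPGA mapping alone. The snippet below is a behavioral sketch of the scenario above; the addresses and FPGA names are illustrative.

```python
# Kernel address -> hosting FPGA, for the scenario above:
# Kernels A (0x2) and B (0x3) on FPGA 1, Kernel C (0x4) on FPGA 2.
MAPPING = {0x2: "fpga1", 0x3: "fpga1", 0x4: "fpga2"}

def route(local_fpga, dest):
    """Physical-switch decision: deliver locally or forward to the remote FPGA."""
    target = MAPPING[dest]
    return "local" if target == local_fpga else target
```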
For edges between kernels that are not connected to the large logical switch (sub-
clusters), the direct connections must also be facilitated between FPGAs.
5.3 FPGA Infrastructure
To facilitate the connection of FPGAs in the network we need specific hardware and
software modules. The hardware that we use is the SDAccel framework, modified to
include Ethernet capabilities. This platform does not support the high-level OpenCL
calls; instead we directly use the HAL to communicate with the FPGA. Figure 5.8 shows
the experimental version of the SDAccel shell from Xilinx with Ethernet capabilities,
before we make modifications to support our infrastructure.
Figure 5.8: The experimental version of the SDAccel platform before modifications were made to support our infrastructure.
This experimental version of the shell does not include an application region. Instead
it has a MicroBlaze soft processor that is configured with a program to send packets
through the Ethernet. The MicroBlaze reads from a certain address in off-chip memory;
this address contains the packet the user wishes to send over the Ethernet. The off-chip
memory is populated by the software application, which uses the HAL to send the packet
via PCIe to the FPGA.
5.4 SDAccel Platform Modifications
Figure 5.9 shows the modified Ethernet platform. The modifications to the base platform
are as follows:

• An application region was added. However, unlike the default non-Ethernet version
of SDAccel (as seen in Figure 3.5), the Application Region is not part of a partially
reconfigurable region.

• The processor was removed from the critical path for sending packets to the Ethernet.
This is necessary for the application region to process packets at line rate, as the
processor introduces too much overhead. The application region can now stream
packets directly to and from the Ethernet.

• The processor was kept in the shell for debugging purposes as well as for the
configuration of some hardware blocks.

• The PCIe module can also drive signals in the application region, which is used
for the configuration of hardware blocks from the software driver.
With the modifications, our system-level multi-FPGA system includes many
lightweight virtual CPUs that are coupled with FPGAs. The CPUs are responsible for
configuring certain hardware modules within the application region required for the
networking of FPGAs. The network interfaces of the FPGAs are physically connected to
a network switch. With the help of specific hardware modules and the networking
backend SAVI provides, we can connect the FPGAs in our own specific topologies as
specified by the user. Figure 5.10 shows the multi-FPGA system.
The virtual machines with FPGAs are generated with an OpenStack flavor consisting
of a lightweight CPU and a single FPGA device. The software image is a stripped-down
version of the Xilinx Vivado tools that only has FPGA programming capabilities. The
FPGA software driver waits to receive a bitstream over the network. Once a bitstream
is received, the FPGA is programmed and the FPGA hardware modules are configured
with the appropriate network metadata. The machine without an FPGA is generated
with an OpenStack flavor that has more CPU cores and memory; its software image has
the complete Xilinx tools to build the bitstreams.

Figure 5.9: The experimental version of the SDAccel platform after modifications were made to support our infrastructure.
5.4.1 FPGA Application Region
The FPGA Application Region includes helper modules for the User Kernel to interface
directly with the network through the Ethernet interface. The helper modules are
responsible for filtering packets, formatting packets, and arbitrating for the network port.
The Application Region is shown in Figure 5.11.
Figure 5.10: How a multi-FPGA system is situated in our environment.
The configuration bus is used to configure the input and the output modules. These
signals are driven by the PCIe Module on the FPGA, which receives signals from the
PCIe-connected virtual CPU.
5.4.2 Input Module
All the packets that the FPGA receives via the Ethernet are forwarded to the Input
Module. The packets that are received at the network port follow the Ethernet packet
convention with a 14-byte header. On top of this we add our own protocol by appending
two bytes (the Kernel Address) to specify the destination kernel for the packet, as we
may have multiple kernels on the FPGA that are requesting input packets.
Figure 5.12 shows the protocol details used by our FPGA infrastructure. Each FPGA
in our infrastructure is assigned a MAC address within the SAVI infrastructure; the
process by which we obtain the MAC address is discussed in Section 5.6. The destination
MAC address should match the MAC address assigned to the particular FPGA. The
source MAC address is that of the FPGA, or of the virtual machine within SAVI that
is sending the FPGA data. The next two bytes, according to the Ethernet frame protocol,
are the ether-type, which we hardcode to 0x7400, and the last field is the address of the
kernel within the FPGA.

Figure 5.11: The details of the application region. The input and output modules are both configured by the configuration bus.

Figure 5.12: The Ethernet protocol plus our custom protocol to differentiate the kernels.
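Packing this header can be written down concretely. The following sketch builds a frame in the format just described; the MAC values used in examples are placeholders.

```python
import struct

ETHERTYPE = 0x7400  # hardcoded ether-type used by our protocol

def build_frame(dst_mac, src_mac, kernel_addr, payload):
    """14-byte Ethernet header + 2-byte kernel address, then the payload."""
    header = struct.pack("!6s6sHH", dst_mac, src_mac, ETHERTYPE, kernel_addr)
    return header + payload
```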
The Input Module consists of an Input Bridge and an Input Demultiplexer. The
Input Bridge is configured after the FPGA is programmed with the bitstream and before
the application can run. The Input Bridge acts as a firewall and converts an Ethernet
packet into an AXI stream packet. The Input Bridge's firewall is configured with the
MAC address assigned to the FPGA. The Input Bridge also drops the Ethernet header
and adds a dest field as part of the AXI stream, where the dest field corresponds to
the Kernel Address specified within the header. The Input Demultiplexer either outputs
to kernels on this FPGA that are expecting Ethernet input, or to kernels on a different
FPGA; in the latter case all packets matching the corresponding dest field are sent
straight to the Output Module. The input to the switch comes from both the Ethernet
module and all other user kernels that can output to any other kernel on the FPGA. An
example of an Input Module is shown in Figure 5.13. For details refer to Section 5.2.
Figure 5.13: The Input Module, consisting of the Input Bridge (labeled IB) and the Input Demultiplexer (labeled ID). In this example the dest fields 0x2 and 0x3 feed into different User Kernels on this FPGA, and 0x4 feeds into another FPGA by going through the Output Module.
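The bridge-plus-demultiplexer behavior can be summarized in a few lines. This Python sketch mirrors the description above, modeling only the data path, not the AXI handshaking:

```python
def input_bridge(frame, my_mac):
    """Firewall + Ethernet-to-AXI-stream conversion (simplified model)."""
    if frame[0:6] != my_mac:                     # firewall: wrong destination MAC
        return None                              # drop the frame
    dest = int.from_bytes(frame[14:16], "big")   # kernel address from the header
    return dest, frame[16:]                      # (AXI dest field, payload)

def input_demux(dest, payload, local_dests):
    """Route to a local kernel, or straight to the Output Module otherwise."""
    target = "kernel" if dest in local_dests else "output"
    return target, dest, payload
```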
5.4.3 Output Module
This module receives streams from the User Kernels and from the Input Demultiplexer.
The Output Module consists of Packet Formatters (PF) and an Output Switch. Each
stream (whether from the User Kernels or from the Input Module) needs a Packet
Formatter before it can be sent out to the network. Each stream is formatted with the
appropriate MAC headers: the source MAC address is that of the FPGA, and the
destination MAC address is that of the destination FPGA or virtual machine. The
ether-type is 0x7400, as it was in the input stream, and the dest of the stream is appended
to the header of the packet. All the Packet Formatters feed into an Output Switch that
arbitrates using the last field of the AXI stream; the Output Switch uses a round-robin
scheduling algorithm. The Output Module is shown in Figure 5.14. The input to the
Packet Formatter is an AXI stream with a dest field; the formatter uses the dest field
as the kernel address when outputting to the network.
Figure 5.14: The Output Module for two streams, consisting of a Packet Formatter (labeled PF) for each stream that needs to be output.
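The round-robin arbitration on the last flag can be modeled in a few lines. This sketch treats each formatted stream as a list of (data, last) flits and interleaves streams only at packet boundaries:

```python
from collections import deque

def output_switch(streams):
    """Round-robin over streams, switching only at packet boundaries
    (after a flit with the AXI-stream `last` flag has been sent)."""
    queues = [deque(s) for s in streams]
    out = []
    while any(queues):
        for q in queues:
            while q:                 # drain one whole packet from this stream
                data, last = q.popleft()
                out.append(data)
                if last:             # `last` asserted: yield to the next stream
                    break
    return out
```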
5.5 Scaling up FPGA Clusters
Nodes within the cluster can also be replicated without replicating the entire cluster.
Replicating a node within the cluster requires all nodes that fan in to that specific
node to include a Scheduler. The Schedulers currently support any-cast, which uses
a round-robin scheduler, or broadcast. Figure 5.15 shows how a node is replicated within
a cluster and where a Scheduler is inserted.
The Schedulers are also FPGA kernels. If the replicated kernels span multiple FPGAs,
the Scheduler is placed on the FPGA with the most replications of that kernel, to reduce
latency for the more common case. For example, in Figure 5.15, if two of three replications
are on FPGA 1 and the other is on FPGA 2, then the script will place the Scheduler on
FPGA 1. The script will then create connections from the Scheduler to the replicated
nodes and one connection to the Output Module on FPGA 1. The remaining replicated
kernel will be connected to the Input Module on FPGA 2. Figure 5.16 illustrates this
scenario.

Figure 5.15: The replication of Node 2. The replicated nodes are Node 2 1, Node 2 2 and Node 2 3. Node 1 has a Scheduler that fans out to the replicated nodes.
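The two scheduling policies can be sketched as follows. This is a software model of the Scheduler's dispatch decision, not the hardware kernel itself; the replica names are illustrative.

```python
import itertools

class Scheduler:
    """Model of a replication Scheduler: any-cast round-robins packets
    across the replicas, while broadcast copies each packet to all of them."""
    def __init__(self, replicas, mode="any-cast"):
        self.replicas = list(replicas)
        self.mode = mode
        self._rr = itertools.cycle(self.replicas)

    def dispatch(self, packet):
        """Return the list of (replica, packet) deliveries for one packet."""
        if self.mode == "broadcast":
            return [(r, packet) for r in self.replicas]
        return [(next(self._rr), packet)]
```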
5.6 FPGA Software Drivers
Each virtual machine with an FPGA is responsible for sending control signals to the
FPGA. These control signals configure the Input Module and the Output Module with
the appropriate MAC addresses. We use software to configure the Input and Output
Modules because the alternative, encoding the MAC addresses in hardware, would
require resynthesizing FPGA bitstreams for different physical FPGAs when replicating
the cluster. Our approach gives us the option to generate our cluster with one set of
FPGAs and then replicate the cluster onto more FPGAs with the same bitstreams.

Figure 5.16: The physical configuration if Node 1, Node 2 1 and Node 2 2 are on FPGA 1 and Node 2 3 is on FPGA 2.
The software drivers can configure the Input Bridge and the Packet Formatters in
the hardware because the PCIe module in the hardware is a master (a driver of signals)
to these modules. This means that writing to a certain address on the PCIe module
can be used to send data to the Input Bridge or a Packet Formatter. We can write
to different addresses of the PCIe module with the HAL driver provided in the SDAccel
tool kit. When a virtual machine with an FPGA is booted, the software driver accepts
bitstreams; once a bitstream is received, the FPGA is programmed with the HAL, and
the Input Bridge and Packet Formatters are also configured by the HAL. Our justification
for providing the Packet Formatters as software-configurable blocks is scalability: if we
wish to scale up our cluster with more networked FPGAs, the MAC address of each
FPGA can be configured by software instead of synthesizing bitstreams on a per-FPGA
basis.
Each FPGA obtains a network connection by first receiving a network port from
the OpenStack networking service, Neutron. Each network port consists of a MAC
address and an IP address. This port is then registered with the physical port on the
network switch that has the FPGA connection. Our scripts can determine the physical
switch port of a particular FPGA connection by observing which physical server hosts
the virtual machine containing the PCIe-connected FPGA. In our setup we have one
FPGA per physical server; if this were to change we would need a new mechanism to
infer the physical network port of a particular FPGA. Once the port returned by Neutron
is registered with the physical port, the FPGA is accessible on the network from any
other device in the SAVI data center, including other virtual CPUs, IoT devices and
FPGA clusters.
5.7 Tool Flow
We summarize the use of our system by describing the tool flow. First, the user submits
a logical cluster description and an FPGA mapping file to a global FPGA parser;
eventually, these could be generated by a higher-level framework or application.
OpenStack calls are generated to create virtual machines: lightweight CPU virtual
machines connected to an FPGA, and one virtual machine dedicated to synthesizing
bitstreams. Subsequent OpenStack calls are generated to create network ports, each
with valid MAC and IP addresses. These ports are registered with the SAVI switch,
after which all packets sent to these addresses are forwarded to the right switch port.
After all the OpenStack calls are generated, the individual FPGA designs are synthesized
on the large virtual machine dedicated to synthesizing bitstreams. Once the bitstreams
are synthesized they are forwarded to the individual FPGAs to be programmed. Once
programmed, the Packet Formatters are configured by the FPGA software driver running
on the lightweight CPU attached to the FPGA via PCIe. After the user submits the
initial cluster description files, the rest of the calls are generated automatically by our
infrastructure.
5.8 Limitations of the Infrastructure
The main limitation of this version of the SDAccel platform is the lack of the partially
reconfigurable region offered in the default platform. Due to this limitation, each new
application region requires programming the entire FPGA. This reconfiguration turns off
the PCIe interface momentarily and, on a physical machine, requires a hard reboot to
be visible. In the context of an FPGA allocated to a virtual machine, rebooting the
entire physical server is not feasible, as there may be other virtual machines on that
physical server. Future SDAccel platforms will include the Ethernet in their standard
base and thus will have the application region within a partially reconfigurable region.
Our workaround is that instead of programming the FPGA via the software driver,
we have a separate machine dedicated to managing FPGA bitstreams. This machine
is physically connected via JTAG to each of the machines; this is necessary because
the partial reconfiguration flow is not available through the HAL. The bitstream server
is responsible for programming the FPGA. Furthermore, without rebooting the physical
machine, the PCIe interface is not available to configure the Input Bridge and Packet
Formatters, so this is also done via the JTAG/UART connection on the server. This is
a temporary workaround until partial reconfiguration is available for the SDAccel
application region in the platform.
Chapter 6
Evaluation
This chapter explores our results. First we quantify the resource overhead, latency
and throughput of our FPGA infrastructure. We then test a full application: a database
acceleration application. The designs are implemented on the Alpha Data 7V3 card,
which has the following specifications: a Xilinx Virtex 7 XC7VX690TFFG-1157 FPGA
(433200 LUTs, 866400 flip-flops, 1470 BRAM tiles), two 8 GB ECC-SODIMMs for
memory speeds up to 1333 MT/s, and dual SFP+ cages for high-speed optical
communication, including 10 Gigabit Ethernet.
Our network infrastructure connects the 10 GbE SFP+ ports to a network switch using
10 GbE to 1 GbE transceivers. The switch can support 10 GbE links, but due to the
1 GbE FPGA core in our FPGA hypervisor we have to use a 1 GbE link. The goal of
the evaluation is to demonstrate that our FPGA network modules add little throughput
overhead and very little latency overhead. The absolute latency and throughput numbers
are limited by the 1 GbE network connection, but the infrastructure we have built can
be used on 10 GbE, or better, systems, where we would expect these numbers to improve.
We also wish to highlight the scalability of our infrastructure with a case study,
demonstrating that by simply changing a directive in the script, our clusters can be
replicated with the throughput scaling accordingly.
6.1 Resource Overhead
The resource overhead from our infrastructure is shown in Table 6.1. Absolute numbers
are given with the percentage of the device total shown in brackets.
Table 6.1: Resource Overhead of our System

Hardware Setup                          LUTs             Flip-Flops       BRAM
SDAccel Base                            53346 (12.3 %)   64550 (7.45 %)   228 (15.5 %)
SDAccel Base with Ethernet Support      62344 (14.4 %)   76124 (8.79 %)   228 (15.5 %)
Input Module
  Input Bridge                          87 (0.02 %)      170 (0.019 %)    2 (1.36 %)
  Input Demultiplexer (16 outputs)      82 (0.019 %)     124 (0.014 %)    0 (0 %)
Output Module
  Ethernet FIFO Controller              26 (0.006 %)     12 (0.014 %)     2 (1.36 %)
  Output Switch (16 inputs)             517 (0.119 %)    138 (0.016 %)    0 (0 %)
  Packet Formatter (one per network
  output stream)                        230 (0.053 %)    252 (0.029 %)    2 (1.36 %)
Total Available                         433200           866400           1470
The SDAccel Base refers to the standard SDAccel environment that has no network
connection for the FPGA. The SDAccel Base with Ethernet Support includes a 1 Gb
Ethernet port. We can see that the addition of the Ethernet port requires only 2.1% of
the resources of the whole device. The Input Module is divided into a firewall (the Input
Bridge) and the input switch (the Input Demultiplexer). The size of the firewall is
independent of the number of network input streams, while the size of the input switch
depends on the number of streams; Table 6.1 shows the overhead corresponding to a
16-port switch. The Output Module is divided into the Ethernet FIFO Controller, the
Output Switch and the Packet Formatter. The Ethernet FIFO Controller overhead is
independent of the number of output streams. The Output Switch size, analogous to the
input switch size, depends on the number of output streams, as does the number of
Packet Formatters on the FPGA. It can be seen that the resource usage of the firewall,
the input and output switches and the Packet Formatters is small relative to the device.
6.1.1 Microbenchmarks
Our microbenchmarks consist of an application that is a direct connection between the
Input Module and the Output Module of an Application Region. The goal is to
show the overhead of our Input and Output Modules and to show that they can handle
packets at line rate, as all of the modules have single-cycle latency.
6.1.2 Micro-experiment Setup
For Microbenchmark 0 the CPU is directly connected to the FPGA. The CPU sends
packets to the raw network interface and the FPGA echoes them back. The packets
traverse the Input Module and the Application Region FIFO, and exit through the
Output Module back to the CPU. The CPU for this data point is not a virtual machine;
its specifications are as follows: Intel Xeon E5-2637 CPU at 3.5 GHz, four cores with
hyperthreading, 32 GB RAM.
Latency
The round-trip latencies are shown in Figure 6.2. There is no switch latency and no
virtualization overhead for Microbenchmark 0. After that point we notice a linear
progression as we increase the number of FPGAs: each extra FPGA on the path requires
two trips to the switch. When we compare this to the second iteration of the Microsoft
Catapult project, which also used network-connected FPGAs, our latency is on the
order of 20 times worse than their network [35]. This is mainly due to the 1 Gb/s
network module used in our current infrastructure.
Figure 6.1: (a) Microbenchmark 0 is a CPU directly connected to an FPGA (not through a network switch). (b) Microbenchmark 1 is a CPU connected through a network switch to an FPGA chain of length 1. (c) Microbenchmark 2 is a CPU connected through a network switch to an FPGA chain of length 2. (d) Microbenchmark 3 is a CPU connected through a network switch to an FPGA chain of length 3. Microbenchmarks 1 to 3 have a network hop (NH); each network hop travels to the network switch connected to all the FPGAs. Microbenchmark 0 does not use a virtualized CPU, whereas the others use virtual CPUs provisioned in SAVI.
An example path for a single FPGA is as follows:
1. Virtual CPU to switch
2. Switch to FPGA 1
3. FPGA 1 to switch
4. Switch to Virtual CPU
Figure 6.2: Round-trip latency observed across the microbenchmarks.
Throughput
Figure 6.3 shows the throughput for the different microbenchmarks. The red line is the
bandwidth limit of the network cable. The throughputs of Microbenchmarks 0 to 3 are
measured with the iperf tool [45], a network tool used to measure the throughput of
network connections. A CPU directly connected to the FPGA (CPU + FPGA) saturates
the network link, showing that our FPGA infrastructure can keep up at line rate. Next
we look at connecting a virtual machine to an FPGA chain of one, two and three FPGAs
(VM + n FPGA). We notice a drop in throughput because the virtual machine is a
weaker CPU than the directly connected CPU, and because of some virtualization
overhead. Since the FPGA is not the bottleneck, the throughput remains the same as
we increase the length of the FPGA chain from one to two to three. To further
demonstrate that the FPGA is not the bottleneck we look at two additional data points.
The first data point is two virtual machines connected in the SAVI network (VM + VM).
The throughput observed between two virtual machines is half that of a virtual machine
connected to an FPGA chain, because the data enters the software and network stack
twice (once on each machine). The second data point is the calculated throughput within
the FPGA (Internal FPGA B/W). The internal FPGA bandwidth is 4 Gb/s, much
higher than the network link rate. The internal FPGA throughput is calculated by
multiplying the bus width, which is 4 bytes wide, by the clock speed, which is 125 MHz.
The network switch is designed to switch at 4G rates and therefore is not the bottleneck
of our system.
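The internal bandwidth figure follows directly from the bus width and clock:

```python
bus_width_bytes = 4        # 32-bit internal AXI stream data path
clock_hz = 125_000_000     # 125 MHz fabric clock

internal_bw_bits_per_s = bus_width_bytes * 8 * clock_hz
print(internal_bw_bits_per_s / 1e9, "Gb/s")  # 4.0 Gb/s
```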
Both the Input and Output Modules work with single-cycle latency. The Input Module
needs a four-cycle warm-up period before it bursts the rest of the packet, and the
Output Module requires a five-cycle warm-up period. These warm-up periods are
accommodated with additional FIFOs, which add to the latency but do not affect the
throughput.
Figure 6.3: Throughput observed across the microbenchmarks.
6.1.3 Application Case-study
Our application case study is a database query accelerator. Several works, such as [46, 47],
have shown FPGAs are a good target for such applications, as they can provide
low-latency, high-throughput processing. Furthermore, frameworks such as Apache Drill
have shown that distributed clusters are a good way to accelerate database services [48].
The combination of these observations suggests that a distributed FPGA cluster is ideal
for a database query accelerator.

The application we have built is a naive implementation of a query, broken down
into several sub-queries. Even though it is a naive implementation, the purpose of the
infrastructure is to show that laying out the circuit is easy, and so is replicating that
circuit (changing one number in the logical cluster file).
6.1.4 Query Implementation Details
The query is composed of five streaming components connected as a chain:

1. SQL Read: This component is responsible for reading SQL columns and outputting
the data in a format that enables the rest of the components to process the data.

2. SQL Where: This operation is used to match column predicates and values with
respect to a boolean operation (equal, greater than, less than, etc.).

3. SQL Like: This operation is used on string column data to match a string using
a substring.

4. SQL Group: This operation aggregates different records using a grouping operation,
such as counting.

5. SQL Write: This component is responsible for separating the stream coming out
of SQL Group into columns.
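To make the chain concrete, here is a toy software model of the five stages as Python generator and aggregation steps. The real engines are streaming hardware kernels; the column names and predicates below are invented, and SQL Write is folded into returning the grouped columns.

```python
import operator

def sql_read(rows):                 # 1. format raw column data for the chain
    yield from rows

def sql_where(rows, col, op, val):  # 2. boolean predicate on a column
    return (r for r in rows if op(r[col], val))

def sql_like(rows, col, sub):       # 3. substring match on a string column
    return (r for r in rows if sub in r[col])

def sql_group_count(rows, col):     # 4./5. group by a column, count per group
    counts = {}
    for r in rows:
        counts[r[col]] = counts.get(r[col], 0) + 1
    return counts

def query(rows):
    s = sql_read(rows)
    s = sql_where(s, "qty", operator.gt, 1)
    s = sql_like(s, "name", "wid")
    return sql_group_count(s, "name")
```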
Figure 6.4 shows how the streaming components are connected to form a single query
engine. Our infrastructure allows us to easily replicate the query processing engines,
even across multiple FPGAs. When considering the number of processing engines, we
first observe the resource usage of one replication of this processing engine, shown in
Table 6.2.
Figure 6.4: The sub-components chained together as one query processing engine.

Table 6.2: Resource Overhead of a single Query Processing Engine

Characteristic   Total amount   Percentage of FPGA
LUTs             11561          2.669 %
Flip Flops       17176          1.982 %
Block RAM        504            34.286 %

The Block RAM utilization limits our replication, so we are limited to two query
processing engines per FPGA. In our logical FPGA cluster file we would specify this
as six replications (a maximum of two replications per FPGA, with three FPGAs), and
in our FPGA mapping we would divide the kernel nodes onto three FPGAs. We do the
replication with a Scheduler. The Scheduler is located on one FPGA and forwards the
data either to the replicated engines on the same FPGA or to another FPGA: the user
sends all the data to one destination, and the Scheduler is responsible for forwarding
the data to the appropriate query processing engine. The first FPGA has the Scheduler
connected to two replicated query processing engines. The second and third FPGAs
also have two replicated query processing engines each, connected directly to the Input
Module rather than to a Scheduler. The Scheduler on the first FPGA is responsible for
scheduling work to all six replicated query processing engines across the three FPGAs.
This makes it simpler for the user, since they do not have to change their interface to
the cluster as they change the number of replications.
This Scheduler-based approach is the model that we used in our experiment, as the
user application remained the same from one replication all the way up to six replications
across three FPGAs. The first FPGA in this cluster is shown in Figure 6.6; the second
and third FPGAs look like the FPGA in Figure 6.5.
Figure 6.5: One FPGA with two entire clusters replicated.
Figure 6.6: One FPGA with two entire clusters replicated and the scheduler.
6.1.5 Case Study Evaluation
Our evaluation compares the throughput of one replication versus six replications across
three FPGAs. As expected, Figure 6.7 shows that the throughput increases as the
replications increase, and we expect it to continue increasing until it reaches the
maximum of the FPGA chains observed earlier, at about 240 Mb/s. This would be at
about 12 replications, which would require six FPGAs. The throughput limit of 240 Mb/s
is due to the speed of the CPU inputting table data into the FPGA chain. With a faster
CPU we could theoretically saturate the network cable's limit of 1 Gb/s, which in turn
could be raised with a faster network.
Figure 6.7: Throughput of a query processing engine
Chapter 7
Conclusion
The ability to provision FPGA clusters will become essential if systems like the Microsoft
Catapult project are to become more generally accessible. Our infrastructure provides
a lightweight cluster provisioning tool that, given a logical cluster description and an
FPGA mapping, can generate scalable clusters from a heterogeneous cloud. Moreover,
these clusters are connected to the network as network devices ready to interact with
other network devices. Our infrastructure makes it easy to scale up: with a simple
pragma we saw throughput scale almost linearly from one to six replicated processing
units in our database acceleration case study. With this success, our approach is seen to
work, but there is much that can be done to improve this first step.
7.1 Future Work
This section describes the future work that we plan to explore. This includes short-term
goals such as physical infrastructure upgrades and reliability protocol upgrades, and
lastly the implementation of true virtualization.
7.1.1 Physical Infrastructure Upgrades
The limitations of our experiments �rst come from physical infrastructure limitations. A
few infrastructure upgrades that we plan to address in the short-term are:
1. Upgrade the 1G physical Ethernet links to 10G. This will involve upgrading the FPGA IP
in the SDAccel shell from the 1G core to the 10G core, which will in turn result in
better latency and throughput in our applications. The additional infrastructure we
introduced should scale to these cores, as all of our cores have a single-cycle
latency.
2. Add more physical FPGAs to the network. This will include more of the same
FPGA as well as other types of FPGAs. Our infrastructure should port easily to other
platforms, as our input and output modules use simple AXI streams and should
be able to interface with any Ethernet module that uses an AXI stream interface.
7.1.2 Scalability and Reliability
This subsection explores how to enforce reliability on a network with more nodes and
nodes that are many network hops away. This includes reliability on the network and
reliability for the compute nodes.
Networking Scalability and Reliability
Our infrastructure builds on top of raw Ethernet frames with an additional two bytes
to address specific FPGA kernels within an FPGA. This is a lightweight transmission
protocol, and it is suitable in our small environment where all the FPGAs are connected
to the same network switch. There is at most one network hop between FPGAs, and
the CPU virtual machines that we communicate with are on the same network edge
within the SAVI infrastructure, so there is also at most one network hop between VMs and FPGAs.
At a larger scale, however, reliability becomes a concern: we can expect corrupted,
dropped, duplicated, or out-of-order packets. Furthermore, the user has to limit
packet sizes to less than 1536 bytes, the physical limit for packets in the data-link
layer (the raw Ethernet frame limit). These issues can be alleviated by building on
top of the network stack.
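To make the kernel-addressing scheme concrete, the following Python sketch shows how a sender might assemble one of these raw frames. Only the 0x7400 ether-type, the two kernel-address bytes, and the 1536-byte frame limit come from the system described above; the function and constant names are hypothetical.

```python
import struct

ETHERTYPE_KERNEL = 0x7400   # ether-type hard-coded by our framework
MAX_FRAME = 1536            # data-link layer (raw Ethernet) frame limit

def build_kernel_frame(dst_mac: bytes, src_mac: bytes,
                       kernel_id: int, payload: bytes) -> bytes:
    """Build a raw Ethernet frame addressed to one FPGA kernel.

    The two bytes after the ether-type select the kernel within the
    destination FPGA (the "kernel layer" of Figure 7.1).
    """
    frame = (dst_mac + src_mac
             + struct.pack("!H", ETHERTYPE_KERNEL)   # ether-type, big-endian
             + struct.pack("!H", kernel_id)          # kernel-address bytes
             + payload)
    if len(frame) > MAX_FRAME:
        raise ValueError("payload exceeds raw Ethernet frame limit")
    return frame
```

A transport layer that fragments large payloads, as discussed next, would remove the need for the size check at the end.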
Figure 7.1: The proposed network stack that builds on top of the standard network stack.
Our current system implements most of the stack shown in Figure 7.1. The physical
layer currently consists of 1G Ethernet cables, which we wish to upgrade to 10G cables once we
upgrade the FPGA core. The data-link and network layers are currently handled by a
combination of OpenStack and SAVI's network registration system. Our call to OpenStack
(through OpenStack's networking API, Neutron) gives us the IP and MAC address, which
we then register to a physical port on the network switch in SAVI. This uses Software
Defined Networking to update the routing tables within our network so that packets
addressed to the IP and MAC address returned by Neutron are routed to the registered
physical port. Our custom layer, the kernel layer, refers to the extra two bytes that are
used to address a particular kernel within the FPGA. The layer that we are missing
is the transport layer, which is where we can implement network reliability as we scale
to larger networks. Figure 7.2 shows where the transport-layer module would fit. In this
example we support three transport-layer protocols, but this can be modified depending
on the application and the amount of FPGA resources we wish to use. The
ether-type within a raw Ethernet frame (currently hard-coded to 0x7400) would
multiplex the packet to the appropriate transport layer.
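In software terms, the multiplexing decision the input module would make can be sketched as a lookup on the ether-type field. The handler names and the set of supported ether-types below are illustrative; only 0x7400 is fixed by the current system.

```python
# Hypothetical handler table: which transport-layer module should
# receive a frame, keyed by its ether-type. Only 0x7400 (the raw
# kernel protocol) is fixed by the current system; the rest are
# placeholders for future transport layers.
HANDLERS = {
    0x7400: "raw-kernel",  # current lightweight kernel protocol
    0x0800: "ip",          # IPv4 frames carrying TCP or UDP segments
}

def dispatch(frame: bytes) -> str:
    """Return the transport-layer handler for a raw Ethernet frame.

    Bytes 12-13 of an Ethernet frame hold the ether-type; frames
    with an unknown ether-type are dropped.
    """
    ethertype = int.from_bytes(frame[12:14], "big")
    return HANDLERS.get(ethertype, "drop")
```

On the FPGA this dispatch would be a small comparator in the input module rather than a table lookup in software.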
Figure 7.2: The input module modified to include a transport layer.

    Characteristic   Total amount   Percentage of FPGA
    LUTs             36419          8.41 %
    Flip Flops       35588          4.11 %
    Block RAM        392            26.7 %

Table 7.1: Resource overhead of the TCP transport layer on the FPGA

    Characteristic   Total amount   Percentage of FPGA
    LUTs             74             0.017 %
    Flip Flops       72             0.00831 %
    Block RAM        0              0 %

Table 7.2: Resource overhead of the UDP transport layer on the FPGA

TCP and UDP are two transport layers used by many networking applications.
Both will fragment large packets for the user as needed; this removes the
small-packet-size restriction imposed by using raw Ethernet frames directly
within our network. TCP provides reliable transmission, as it handles the
retransmission of packets on a packet drop, whereas UDP is connectionless and does not
retransmit, so it does not provide the reliable connection that TCP does. We have
example TCP and UDP cores implemented on the FPGA, and their overheads are shown
in Tables 7.1 and 7.2.
Implementing TCP and/or UDP will allow the FPGA clusters to interface directly
with distributed applications that use the same transport layer. Examples
are adding a node to a distributed file system that uses TCP (e.g., the Hadoop Distributed
File System [49]), or using UDP for multimedia applications such as Voice over IP [50].
These are two transport layers that are popular in distributed applications,
but we are not limited to them. There has also been research into implementing custom
transport layers for data centers. This involves exploiting assumptions about the
data-center environment to provide a lightweight (at least relative to
TCP) reliable transport layer. Some examples of such work are [51, 52, 35].
Scalability and Reliability of Compute Nodes
Upon scaling the cluster, we should expect compute nodes to fail [15]. In the
conventional CPU domain this can happen for many reasons, such as CPU power failures, disk
failures, and memory failures. Failures can also be due to network link failures,
router failures, or network congestion. Some of these failures can be addressed with the
implementation of a reliable transport layer, but not all. For example, reliability in the
transport layer guarantees the delivery of a packet as long as a path to the node
exists; however, if a network path is destroyed and no alternate path exists, the
resulting failure must be handled by the application. Our FPGA clusters
can experience failure for many reasons, such as bitstream corruption and memory failure,
and sometimes an FPGA can be stuck in an unforeseen state.
To ensure reliability we need to monitor our FPGA clusters. This will require an agent
process that alerts our provisioning system about the "health" of each FPGA. The agent
process can run on the FPGA Hypervisor and send heart-beats to our provisioning
system, notifying the cloud system manager of the FPGA's health. This is analogous to how CPU
servers are managed with OpenStack [27]. These heart-beats can help the cloud-managing
software determine whether an FPGA is ready to be provisioned. However, failures
can also occur after provisioning: when an unrecoverable failure happens to
an FPGA in the cluster, the application can fail. To mitigate this in a
distributed CPU application, redundant compute devices can be used to replace failed
compute nodes [53]. We can introduce redundancy in an FPGA cluster as well by
over-provisioning FPGA devices in the cluster and running compute tasks in parallel on the
redundant nodes. This can, however, be wasteful, as the extra redundant resources are
provisioned by the cloud even when they are unnecessary. An example of redundant provisioning
is shown in Figure 7.3.
(a) The original clusters before failure. (b) The cluster after node 2 fails: the traffic is redirected to the redundant instantiation. Note that the outgoing traffic is also sent to the redundant FPGA 3 to maintain FPGA 3's state.

Figure 7.3: How traffic is duplicated to a redundant cluster to maintain its state, and how traffic is redirected from a node in the original cluster to a node in the redundant cluster upon failure.
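A minimal sketch of such a monitoring agent is shown below, assuming a UDP channel to the provisioning system and a hypothetical `check_health()` hook into the FPGA Hypervisor; none of these names or message fields come from the actual system.

```python
import json
import socket
import time

def check_health() -> bool:
    # Placeholder: a real agent would query the FPGA shell or driver
    # for bitstream, memory, and link status.
    return True

def make_heartbeat(fpga_id: str, healthy: bool) -> bytes:
    """Encode one heart-beat message for the provisioning system."""
    return json.dumps({"fpga": fpga_id,
                       "healthy": healthy,
                       "time": time.time()}).encode()

def run_agent(fpga_id: str, manager_addr, period_s=5.0, max_beats=None):
    """Send periodic heart-beats over UDP from the FPGA Hypervisor."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    while max_beats is None or sent < max_beats:
        sock.sendto(make_heartbeat(fpga_id, check_health()), manager_addr)
        sent += 1
        time.sleep(period_s)
```

The provisioning system would mark an FPGA as failed after missing several consecutive heart-beats, mirroring how OpenStack tracks CPU server liveness.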
An alternative to over-provisioning FPGA clusters is to provision the extra compute
nodes after a failure. This first requires monitoring FPGA health, and furthermore
requires saving the context on the failed FPGA and migrating that context to a new FPGA.
The migration has a time cost for provisioning and programming the FPGA, which
over-provisioning minimizes. This trade-off between over-provisioning and provisioning
on demand has yet to be investigated.
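As a first-order way to reason about this trade-off, one can compare the expected downtime under each policy; the model and all the numbers below are purely illustrative, not measurements from our system.

```python
def expected_downtime_s(failure_rate_per_hr: float, hours: float,
                        recovery_s: float) -> float:
    """Expected downtime from node failures over one run.

    Over-provisioning makes recovery near-instant (traffic is simply
    redirected to the redundant node), while on-demand provisioning
    pays the full provision-and-program latency on every failure.
    """
    expected_failures = failure_rate_per_hr * hours
    return expected_failures * recovery_s

# Illustrative only: a 10-hour job, one failure per 100 hours, and
# roughly 5 minutes to provision and program a replacement FPGA.
on_demand = expected_downtime_s(0.01, 10, 300)  # pay reprogramming cost
redundant = expected_downtime_s(0.01, 10, 1)    # near-instant redirect
```

Against this downtime saving one would weigh the cost of the extra FPGAs held idle by the redundant cluster, which is where the trade-off lies.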
7.1.3 FPGA Cluster Debugging
The standard FPGA design flow includes simulating individual circuits, integrating the
circuits into larger systems, simulating the larger systems with a testbench, and then
implementing the design on the FPGA. Even then there can still be problems within the
circuit, so probes are inserted into the FPGA design; these interface with specific
debugging tools that notify the user when a certain signal in the circuit takes a certain
value, or when a Boolean function of several signals does. This allows the user to debug
these values in real time.
The design flow in our environment should be similar, as we do not modify the circuits
provided by the user. Assuming these circuits are fully simulated, we can then integrate
them into our multi-FPGA cluster. However, once a circuit is in the cluster we do not have the
unified debugging view we would have on a single FPGA. One alternative for now is to
run the probing tool (Altera's SignalTap or Xilinx's ChipScope [54, 55]) on each FPGA
in the cluster. This can be cumbersome, especially for very large clusters. Furthermore, a
user would have to navigate through a lot of automatically generated hardware that was
initially abstracted away (inter-FPGA connections, schedulers, switches).
Another area of future work is to provide a unified debugging interface for such
clusters. One possible implementation is to forward local probing information
from the individual FPGA environments to a centralized view of the cluster. The user
can be given the option to view the cluster as a logical cluster or even as a physical cluster.
The logical-cluster view would abstract away all the automatically generated hardware and the
FPGA mapping of the kernels. This possible implementation is shown in Figure 7.4. In
this example the user has a global view of the logical cluster; the global debugger
presents information to the user that it gathers from the local FPGA debugging tools
on the individual FPGAs.
Figure 7.4: An example multi-FPGA cluster that is attached to a debugger.
7.1.4 True FPGA Virtualization
Section 2.8 highlights our level of abstraction and the differences between true
virtualization and what we provide. Our level of abstraction does not hide the physical
details of the hardware, as we require an FPGA mapping. Our first step towards
true virtualization is to abstract away these mappings by generating them
automatically. This would allow the user to provision a cluster purely from logical FPGA kernel
connections. This is a multi-FPGA placement problem. Several works have looked into
placing circuits across multiple FPGAs [56, 57]. These works consider FPGAs on the
same die or on the same board. They model the inter-FPGA I/O, which the FPGA
placement and routing tools then take into consideration while placing and routing a
user circuit. We can take a similar approach, but we would have to model the
network connection. This involves modeling the FPGA hardware blocks (the input and
output modules) that we append to allow multiple-FPGA connections, as well as the network
switches in the data center. Such a tool would, for example, try to place FPGA kernels that are
tightly coupled on the same FPGA, or on FPGAs as close together as possible within the
data-center network. This level of service provisioning is quite analogous to the Software
as a Service model provided by cloud managers: the user requests a software application,
and the underlying physical hardware is provisioned and managed by a cloud manager.
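As a simple illustration of automatic mapping, the greedy heuristic below assigns kernels to FPGAs while preferring devices that already hold a kernel's neighbours. A real tool would also weight the network distance between FPGAs and the cost of the input and output modules; this sketch and all of its names are hypothetical, not part of our infrastructure.

```python
def greedy_place(kernels, edges, capacity):
    """Greedily assign kernels to FPGAs.

    Prefers the FPGA that already holds the most neighbours of a
    kernel, so tightly coupled kernels stay on one device and off
    the data-center network. New FPGAs are opened on demand.
    """
    placement = {}  # kernel -> fpga index
    loads = {}      # fpga index -> number of kernels placed
    for k in kernels:
        # Candidate FPGAs: every opened device plus one fresh one.
        candidates = list(loads.items()) + [(len(loads), 0)]
        best, best_score = None, -1
        for fpga, load in candidates:
            if load >= capacity:
                continue  # this FPGA is full
            # Count already-placed neighbours of k on this FPGA.
            score = sum(1 for a, b in edges
                        if (a == k and placement.get(b) == fpga) or
                           (b == k and placement.get(a) == fpga))
            if score > best_score:
                best, best_score = fpga, score
        placement[k] = best
        loads[best] = loads.get(best, 0) + 1
    return placement
```

With two kernel pairs and a capacity of two kernels per FPGA, each pair lands on its own device, keeping the coupled connections off the network.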
Another model of virtualization that can be built on top of this infrastructure is
Infrastructure as a Service. This is similar to the CPU provisioning provided by OpenStack,
where the size of the processor and its peripherals are specified and then
mapped onto a physical resource. In our infrastructure, we can provide different flavors
of FPGA sizes that abstract away the fact that these are actually multi-FPGA
clusters. Our goal here is to hide the physical implementation of the logical FPGA the
user requests: both the number of FPGAs actually used to create the
logical FPGA and the type of FPGA. We can create heterogeneous clusters, stitching
different kinds of FPGAs together into large clusters that form large logical
FPGAs. Once we provide the large FPGA (comprised of multiple FPGAs) to the user,
we will have a problem similar to the Software as a Service model, where we have to
map the kernels onto the physical FPGAs beneath the virtual FPGA. Furthermore, if the
kernels do not fit, kernels will have to be context-switched and swapped in and
out of the FPGAs.
Bibliography
[1] InformationWeek. Big Data, Analytics Market To Hit $203 Bil-
lion In 2020. https://www.informationweek.com/big-data/
big-data-analytics-market-to-hit-$203-billion-in-2020-/d/d-id/
1327092, 2016.
[2] ApCon. The Case for Scalability in Large Enterprise Data Centers.
https://www.apcon.com/sites/default/files/Resources%20for%20Download/
apcon_ebook_4_april_2014.pdf, 2014.
[3] Amazon Web Services Inc. Amazon Web Services (AWS). http://aws.amazon.com,
2014.
[4] Microsoft Inc. Microsoft Azure. https://azure.microsoft.com, 2015.
[5] Andrew Putnam et al. A Reconfigurable Fabric for Accelerating Large-scale Datacenter
Services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International
Symposium on, pages 13–24. IEEE, 2014.
[6] Ian Kuon, Russell Tessier, and Jonathan Rose. FPGA architecture: Survey and
challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–253,
2008.
[7] Jason Luu, Ian Kuon, Peter Jamieson, Ted Campbell, Andy Ye, Wei Mark Fang,
Kenneth Kent, and Jonathan Rose. VPR 5.0: FPGA CAD and architecture exploration
tools with single-driver routing, heterogeneity and process scaling. ACM
Transactions on Reconfigurable Technology and Systems (TRETS), 4(4):32, 2011.

[8] IEEE Standard for Verilog Hardware Description Language. IEEE Std 1364-2005,
pages 1–560, 2006.

[9] IEEE Standard for VHDL Language Reference Manual. IEEE Std 1076-2008,
pages c1–626, 2009.
[10] Xilinx Inc. Vivado High Level Synthesis. https://www.xilinx.com/products/
design-tools/vivado/integration/esl-design.html, 2016.
[11] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona,
Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. LegUp: High-level
Synthesis for FPGA-based Processor/Accelerator Systems. In International Symposium
on Field Programmable Gate Arrays, FPGA '11, pages 33–36, New York, NY,
USA, 2011. ACM.
[12] Xilinx Inc. SDAccel Development Environment. https://www.xilinx.com/
products/design-tools/software-zone/sdaccel.html, 2016.
[13] Intel Inc. Intel FPGA SDK. https://www.altera.com/products/
design-software/embedded-software-developers/opencl/overview.htmll,
2016.
[14] SAP Data Center. How a Data Center Works. http://www.sapdatacenter.com/
article/data_center_functionality/, 2016.
[15] Albert Greenberg, James Hamilton, David A. Maltz, and Parveen Patel. The cost
of a cloud: research problems in data center networks. ACM SIGCOMM Computer
Communication Review, 39(1):68–73, 2008.
[16] Tech Republic. How Power Works in a Data Center: What
you Need to know. http://www.techrepublic.com/article/
how-power-works-in-a-data-center-what-you-need-to-know/, 2014.
[17] Data Center Knowledge. World's Largest Data
Centers. http://www.datacenterknowledge.com/
special-report-the-worlds-largest-data-centers/
worlds-largest-data-center-350-e-cermak/, 2016.
[18] IBM Inc. What is Cloud Computing. https://www.ibm.com/cloud-computing/
learn-more/what-is-cloud-computing, 2016.
[19] Alberto Leon-Garcia and Indra Widjaja. Communication networks. McGraw-Hill,
Inc., 2003.
[20] Nick McKeown. Software-Defined Networking. INFOCOM Keynote Talk, 2009.
[21] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson,
Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: Enabling Innovation
in Campus Networks. ACM SIGCOMM Computer Communication Review,
38(2):69–74, 2008.
[22] Stuart Byma, Naif Tarafdar, Talia Xu, Hadi Bannazadeh, Alberto Leon-Garcia,
and Paul Chow. Expanding OpenFlow Capabilities with Virtualized Reconfigurable
Hardware. In FPGA '15: Proceedings of the 2015 ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pages 94–97, 2015.
[23] Feng Xia, Laurence T Yang, Lizhe Wang, and Alexey Vinel. Internet of things.
International Journal of Communication Systems, 25(9):1101, 2012.
[24] Jesse M. Shapiro. Smart cities: quality of life, productivity, and the growth effects
of human capital. The Review of Economics and Statistics, 88(2):324–335, 2006.
[25] Joon-Myung Kang et al. SAVI Testbed: Control and Management of Converged Virtual
ICT Resources. In IFIP/IEEE International Symposium on Integrated Network
Management, pages 664–667. IEEE, 2013.
[26] Stuart Byma et al. FPGAs in the Cloud: Booting Virtualized Hardware Accelerators
with OpenStack. In Field-Programmable Custom Computing Machines (FCCM).
IEEE, 2014.
[27] Omar Sefraoui et al. OpenStack: Toward an Open-Source Solution for Cloud Computing.
In International Journal of Computer Applications, 2012.
[28] OpenStack Inc. Welcome to Nova's developer documentation! http://docs.
openstack.org/developer/nova/, 2016.
[29] OpenStack Inc. OpenStack Networking (neutron). http://docs.
openstack.org/icehouse/install-guide/install/apt/content/
basics-networking-neutron.html, 2016.
[30] K. Fleming, Hsin-Jung Yang, M. Adler, and J. Emer. The LEAP FPGA operating
system. In Field Programmable Logic and Applications (FPL), pages 1–8, 2014.
[31] Fei Chen et al. Enabling FPGAS in the Cloud. In Computing Frontiers, 2014.
[32] KVM. Kernel Virtual Machine. http://www.linux-kvm.org, 2015.
[33] Maxeler Technologies. MPC-X Series. https://www.maxeler.com/products/
mpc-xseries, 2015.
[34] IBM Research. OpenPOWER Cloud: Accelerating Cloud Computing. https://
www.research.ibm.com/labs/china/supervessel.html, 2016.
[35] Adrian Caulfield et al. A Cloud-Scale Acceleration Architecture. In Proceedings of
the 49th Annual IEEE/ACM International Symposium on Microarchitecture, October
2016.
[36] Amazon. Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/
instance-types/f1/, 2016.
[37] OpenStack Inc. OpenStack Orchestration. http://docs.openstack.org/wiki/
Heat/, 2016.
[38] Apache Software Foundation. Apache Mesos. https://mesos.apache.org, 2015.
[39] Andy Yoo, Morris Jette, and Mark Grondona. SLURM: Simple Linux Utility for
Resource Management. In Job Scheduling Strategies for Parallel Processing, pages
44–60. Springer Berlin Heidelberg, 2003.
[40] NVidia Inc. NVidia Cuda Zone, Cluster Management Library. https://developer.
nvidia.com/cluster-management, 2015.
[41] Joon-Myung Kang, T. Lin, H. Bannazadeh, and A. Leon-Garcia. Software-Defined
Infrastructure and the SAVI Testbed. In TRIDENTCOM, 2014.
[42] The Khronos Group. OpenCL Standard. https://www.khronos.org/opencl/,
2015.
[43] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee.
SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters. In Proceedings
of the 26th ACM International Conference on Supercomputing, pages 341–352. ACM,
2012.
[44] Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia,
and Paul Chow. Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud
Data Center. In International Symposium on Field-Programmable Gate Arrays.
ACM, February 2017. To appear.
[45] Iperf. Iperf – The TCP/UDP Bandwidth Measurement Tool. https://iperf.fr,
2014.
[46] Christopher Dennl, Daniel Ziener, and Jürgen Teich. On-the-fly composition of
FPGA-based SQL query accelerators using a partially reconfigurable module library.
In Field Programmable Custom Computing Machines (FCCM), pages 45–52, 2012.

[47] Christopher Dennl et al. Acceleration of SQL Restrictions and Aggregations
through FPGA-Based Dynamic Partial Reconfiguration. In Field Programmable
Custom Computing Machines (FCCM), pages 25–28, 2013.

[48] Michael Hausenblas and Jacques Nadeau. Apache Drill: interactive ad-hoc analysis
at scale. Big Data, 1(2):100–104, 2013.
[49] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The
Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST), pages 1–10. IEEE, 2010.

[50] Sachin Garg and Martin Kappes. An experimental study of throughput for UDP
and VoIP traffic in IEEE 802.11b networks. In Wireless Communications and
Networking, 2003 (WCNC 2003), volume 3, pages 1748–1753. IEEE,
2003.

[51] Sang-Woo Jun, Ming Liu, Shuotao Xu, et al. A transport-layer network for distributed
FPGA platforms. In 2015 25th International Conference on Field Programmable
Logic and Applications (FPL), pages 1–4. IEEE, 2015.

[52] David Sidler, Zsolt István, and Gustavo Alonso. Low-latency TCP/IP stack for data
center applications. In Field Programmable Logic and Applications (FPL), 2016 26th
International Conference on, pages 1–4. EPFL, 2016.
[53] Mirantis. Understanding your options: Deployment topologies for High
Availability (HA) with OpenStack. https://www.mirantis.com/blog/
understanding-options-deployment-topologies-high-availability-ha-openstack/,
2012.
[54] Altera Corporation. SignalTap II Embedded Logic Analyzer, 2006.

[55] Xilinx Inc. ChipScope Pro 11.1 Software and Cores User Guide, April 2009.
[56] Kalapi Roy-Neogi and Carl Sechen. Multiple FPGA partitioning with performance
optimization. In Proceedings of the 1995 ACM Third International Symposium on
Field-Programmable Gate Arrays, pages 146–152. ACM, 1995.

[57] Nam Sung Woo and Jaeseok Kim. An efficient method of partitioning circuits for
multiple-FPGA implementation. In Proceedings of the 30th International Design
Automation Conference, pages 202–207. ACM, 1993.