AN INTERACTIVE EMBEDDED SYSTEM & AI TO PLAY CONNECT5 David Biancolin, Ritchie Zhao, Mohamad Kayed


CONTENTS

Overview
Outcome
  AI Performance
  Achieved Features vs Proposed Features
  Future Work
Project Schedule
System Description
  AI Algorithm
  Hardware Accelerator Description
    Interface Description
    Sub-Block Description
  Embedded C Description
    File Structure and Description
    Embedded Code Flow and Function Description
Tips and Tricks
  Tips & Tricks: Accelerators
  Tips & Tricks: Vivado HLS
    Understanding Interfaces
    Understanding Pragmas and Directives
    Some Final Tricks…


OVERVIEW

Connect-5 is a well-known board game where two players alternate placing stones onto a board. The first player

to form a pattern of five stones in a row wins the game. Competitive two-player board games are a popular topic in

the FPGA research community, as evidenced by the appearance of games such as Connect-6 and Blokus in the

2012 and 2013 International Conferences on Field Programmable Technology (ICFPT). The project topic was

motivated by a desire by the team members to learn about High Level Synthesis tools such as Xilinx’s Vivado HLS

and how they can be used to generate hardware out of C code.

Figure F3. Example board state in a Connect-5 game running on the final project

Two main goals were proposed for the project:

1. Build a standalone FPGA implementation of the Connect-5 board game using the touch screen peripheral.

The system should support matches between any combination of human and AI players.

2. Build an AI using hardware accelerators which is capable of beating human players and outperforms the

same AI running on modern PCs in runtime.


Figure F4. Basic block diagram of the project showing all pcores used.

Figure F4 above gives an overview of the FPGA project. A list of the IP used, along with descriptions, is given

below:

IP Name            Description
Microblaze         Xilinx’s proprietary embedded microcontroller. Clocked at 50 MHz.
HW Accelerator     Vivado HLS generated accelerator capable of evaluating (scoring) all potential moves for a given board state.
TouchSensor        Interrupt-managed touch screen used to accept input for human game moves, to change game state, and to reset the game.
TFT Display        Conjoined with the TouchSensor, the TFT renders the 11x11 game board and buttons. Renderings map to buttons.
Memory Controller  Manages transactions to the on-board DDR2. The TFT reads its input images from this shared memory space.
DDR2               Managed by the memory controller. 128 MB total size.
UART               Used to communicate with a PC; enables the FPGA to be pitted against the PC version.
Timer              Generates the interrupts required to do software-intrusive profiling with mb-gprof.

Table T2. List of IP used in the project

Figure F4 block labels: Microblaze, HW Accelerator, Memory Controller, TouchSensor, DDR2, UART, TFT Display, Timer. Interconnect labels: FSL, AXI4Lite, AXI4.

OUTCOME

The two key objectives outlined in the Overview were met. Firstly, an interactive game environment was built

using the TFT touchsensor. The game implements a reduced board size (11 x 11) as larger boards would cause each

square to be too small on the touch screen. The game allows the user to switch between three different player

modes including:

1) Human: A play is generated when a human player touches a valid square on the touchsensor’s

gameboard.

2) FPGA: The FPGA’s native AI generates a move.

3) UART: The FPGA waits on an interrupt from a USB connected host-cpu.

Figure X. Final version of the UI, showing the Black Player Mode Select, the White Player Mode Select, the 11x11 Gameboard, and the Reset Button.

AI PERFORMANCE

The second objective of this project was to implement an AI on the FPGA that outperformed its equivalent

implementation running on a PC. A description of the AI algorithm can be found in section X. With a speculation

depth of 5 future moves and a breadth of the best 8 available moves per level the FPGA implementation ran BX

faster than the PC implementation.

Profiles of the PC and embedded variants follow in Figures X2 and X3, and elucidate where the PC and FPGA spend

most of their execution time.


Figure X2. Profile of the PC implementation:

     % cumulative     self                self    total
  time    seconds  seconds     calls   s/call   s/call  name
 27.20      15.21     5.21   2897546     0.00     0.00  count_vert
 27.04      10.38     5.18   2897546     0.00     0.00  count_horiz
 19.56      14.12     3.74   2897546     0.00     0.00  count_se
 18.46      17.66     3.53   2897546     0.00     0.00  count_ne
  5.28      18.67     1.01     22621     0.00     0.00  get_best_move_n
  1.67      18.99     0.32   2648713     0.00     0.00  copy_board
  0.52      19.09     0.10  11590184     0.00     0.00  add_counts
  0.16      19.12     0.03         1     0.03    19.14  tree_search
  0.10      19.14     0.02   2897545     0.00     0.00  board_count_score

Figure X3. Profile of Final FPGA Variant (Breadth of 36, Depth of 3):

     % cumulative     self                self    total
  time    seconds  seconds     calls   s/call   s/call  name
 55.37       6.34     6.34    116207     0.00     0.00  hardened_generate_board_count_score
 24.61       9.17     2.82      3321     0.00     0.00  hardened_get_best_move_n
  4.91       9.73     0.56    505538     0.00     0.00  XGenerate_board_count_score_IsDone
  3.38      10.11     0.39         3     0.13     3.67  tree_search
  2.08      10.35     0.24    268033     0.00     0.00  TFT_WriteToPixel
  1.68      10.55     0.19    239052     0.00     0.00  set_square
  1.56      10.73     0.18      3321     0.00     0.00  generate_unsorted_scores
  1.29      10.87     0.15    119528     0.00     0.00  XGenerate_board_count_score_Start
  1.14      11.00     0.13    119528     0.00     0.00  XGenerate_board_count_score_SetP
  1.10      11.13     0.13    116207     0.00     0.00  board_count_score
  1.08      11.25     0.12    119528     0.00     0.00  XGenerate_board_count_score_GetReturn

As visible in the profile above, the majority of the system run time is spent in functions that directly interface with

the hardware accelerator, demonstrating good utilization of the hardware accelerator.

ACHIEVED FEATURES VS PROPOSED FEATURES

Both objectives set out for the project were met. In fact the eventual speed up we got from the FPGA

implementation exceeded our initial expectation. This speed up was achieved despite implementing less of the AI

in the hardware than was proposed. The proposed system would do the entire tree traversal in hardware, while

the actual system does the game tree traversal in the MicroBlaze while using the accelerator to evaluate all of

the potential moves given a particular board state.

Moving beyond what was initially proposed, a UART interface was created to enable a PC to play against the FPGA

AI, to visibly show the relative speed up of the FPGA implementation versus the initial PC variant. This feature was

not part of the original proposal and was added as further demonstration of the FPGA implementation’s superior

performance.

FUTURE WORK


Integration of UART functionality was intended to be a stepping stone towards playing the FPGA system against

competitive software Connect-5 AIs found online. A number of these AIs and a framework to compete against

them are available at gomocup.org, which organizes an annual world championship for Connect-5 (also known as

Gomoku).

By constraining the time budgets of these competitive AIs to make moves, the quality of our FPGA implemented

algorithm could be measured. This would enable data-driven parameter space exploration of our AI, where we

could trade off move breadth for depth, in addition to tuning the coefficients of the board threat evaluation. This

advanced tuning would greatly improve the strength of the current system.

PROJECT SCHEDULE

Week: Feb 26
  Milestones proposed:
  - Display game grid and identify square pressed
  - Write function to evaluate a single board state
  Milestones met:
  - All met

Week: Mar 5
  Milestones proposed:
  - Display game state
  - Create minimax search tree
  Milestones met:
  - All met

Week: Mar 12
  Milestones proposed:
  - AI finished
  - AI code ported to MicroBlaze
  - Finish touchscreen GUI
  Milestones met:
  - AI is complete but buggy, makes mistakes while playing
  - No-Tree AI ported
  - Met

Week: Mar 19
  Milestones proposed:
  - Create prototype accelerator
  - Speed up AI compared to previous week
  Milestones met:
  - Met
  - Was not able to interface accelerator with MicroBlaze

Week: Mar 26
  Milestones proposed:
  - Move board evaluation function into accelerator
  - FPGA now faster than PC
  Milestones met:
  - Met, but resulting accelerator was unable to achieve speedup
  - Was not able to achieve speed up over PC

Week: Apr 2
  Milestones proposed:
  - Move search tree into hardware
  - Tune design to maximize search depth
  Milestones met:
  - Milestone deemed not possible
  - Was not able to achieve speed up over PC
  - UART game interface created

Week: Apr 9
  Milestones met:
  - Achieved speedup using hardware accelerator
  - Implemented ability to play games with a PC over UART

Figure F1. Comparison of milestones proposed with milestones achieved

As Figure F1 shows, while we were able to meet the project goals, it was not until the final week that we achieved

speedup over a PC. The project met two main unexpected challenges which set back the project:

Creating a strong AI for Connect-5: While the board evaluation function and minimax search tree were written on schedule, the AI routinely underperformed after it was meant to be complete due to the difficulty of tuning the numerical constants in the algorithm. The AI code should have been frozen on March 12 but saw many changes in mid-March, which affected work on hardware generation.


Defining interfaces for a custom block: While the processing elements were pushed through the HLS flow to create

a hardware accelerator on schedule, defining interfaces for the pcore proved difficult. A number of different

interfaces such as streaming and FIFO were tried, but were either non-functional or too slow. As a result

milestones regarding connecting the pcore to the MicroBlaze or using it to speed up runtime were not met.

As mentioned before, the UART interface was a feature implemented in the final project which was not part of the

original proposal. This was done because of setbacks to the project which freed up manpower in the final weeks.


SYSTEM DESCRIPTION

AI ALGORITHM

The game algorithm consists of two main concepts, a board evaluation function and a minimax search tree. The

board evaluation function assigns an integer score to a given board position which corresponds to how “good” the

position is for the current player. A winning position gets a high score while a losing position gets a low (possibly

negative) score. To decide on a move, an AI can play every possible next move, evaluate the board position

afterwards, and choose the best scored move. Such an AI performs a single move look-ahead, and was the first

version that was built.

The board evaluation function for the game of Connect-5 uses the concept of threats. Consider a pattern of five

consecutive squares in any direction (vertical, horizontal, either diagonal) – if all five squares are occupied by allied

stones, the game is a victory for the current player. If any four squares are occupied (and the pattern contains no

opposing stones), this constitutes a one-turn threat, as the opponent must play on the remaining square on its

next turn to avoid a loss. Two simultaneous four-stone threats result in a win. If any three squares are occupied,

the pattern is a two turn threat. The board evaluation function operates by counting up all instances of two, three,

and four stone threats on the board for both the player and opponent, and computes the final score by multiplying

the number of each type of threat by a scaling constant, and summing up the products. This is illustrated in the

following code:

    int score = C_P4*C.p4[p] - C_O3*C.p3[o] + C_P3*C.p3[p]
              - C_O2*C.p2[o] + C_P2*C.p2[p];
    if (C.p5[p] > 0) score = MAX_SCORE;              // win
    else if (C.p5[o] > 0) score = -MAX_SCORE;        // lose
    else if (C.p4[o] > 0) score = -(MAX_SCORE-2);    // lose in 1
    else if (C.p4[p] > 1) score = MAX_SCORE-3;       // win in 1

where C_P4 and the like are the constants and C.p4[p] and C.p4[o] are the number of one type of player and

opponent threats. The if statements following ensure the function returns an extreme (essentially positive or

negative infinity) result if a win or loss is ensured.

To count the number of threats on the board, the following method is used:

1) Define a 5 square “stencil” or mask.

2) Put the stencil on the board and find what threat, if any, is inside it. A stencil contains an N-stone threat if

there are N player stones and zero opposing stones on the squares in the stencil.

3) Sweep the stencil over the board, performing step 2 at each location.

4) Perform the above steps for horizontal, vertical, and both diagonal directions.

5) Sum up all of the threats detected.
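The five steps above can be sketched in C as follows. This is a minimal illustration, not the project's board.c: the names count_dir, count_threats, and threat_counts are hypothetical, and the board is assumed to be a flat 121-element array holding 0 (empty), 1, or 2 per square.

```c
#include <stdint.h>

#define BOARD_DIM 11   /* 11x11 board, as used in this project */
#define STENCIL   5    /* five stones in a row wins */

enum { EMPTY = 0, P1 = 1, P2 = 2 };

/* n[k] holds the number of k-stone threats found for one player. */
typedef struct {
    int n[STENCIL + 1];
} threat_counts;

/* Steps 1-3: sweep a 5-square stencil along direction (dr, dc). */
static void count_dir(const uint8_t *board, int player,
                      int dr, int dc, threat_counts *out)
{
    for (int r = 0; r < BOARD_DIM; r++) {
        for (int c = 0; c < BOARD_DIM; c++) {
            int end_r = r + dr * (STENCIL - 1);
            int end_c = c + dc * (STENCIL - 1);
            if (end_r < 0 || end_r >= BOARD_DIM ||
                end_c < 0 || end_c >= BOARD_DIM)
                continue;   /* stencil would fall off the board */

            int own = 0, opp = 0;
            for (int k = 0; k < STENCIL; k++) {
                uint8_t s = board[(r + dr * k) * BOARD_DIM + (c + dc * k)];
                if (s == player)
                    own++;
                else if (s != EMPTY)
                    opp++;
            }
            /* An N-stone threat: N own stones and no opposing stones. */
            if (opp == 0 && own > 0)
                out->n[own]++;
        }
    }
}

/* Steps 4-5: run the sweep in all four directions and sum the results. */
static threat_counts count_threats(const uint8_t *board, int player)
{
    threat_counts t = {{0}};
    count_dir(board, player, 0, 1, &t);    /* horizontal */
    count_dir(board, player, 1, 0, &t);    /* vertical */
    count_dir(board, player, 1, 1, &t);    /* diagonal, south-east */
    count_dir(board, player, -1, 1, &t);   /* diagonal, north-east */
    return t;
}
```

With four player-one stones side by side in the middle of a row, two different stencil positions cover all four stones, so the sketch reports two four-stone threats. An opposing stone at one end of the line removes one of them.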


The board evaluation function thus consists of a large number of bit based comparisons and shifts when

implemented in the hardware. To maximize speed, the board is stored as a 121 element array, where each

element is a 2-bit value capable of being 0 (unoccupied), 1 (player one stone), or 2 (player 2 stone).

The board evaluation function is defined in board.c, with count_horiz, count_vert, count_ne, count_se

implementing the directional stencil sweeps, and generate_board_count_score implementing the entire evaluation

function.

In order to perform a look-ahead of two or more moves, an AI must take into account the fact that the opponent

will also be attempting to maximize its board score. The most common way to do this is through a minimax game

tree, which is illustrated in Figure F2. The algorithm works by building a tree where each node has N children, N

being the search breadth. The layers of the tree alternate between max (player) and min (opponent) layers,

starting with a max layer. Each node has an associated move to explore and a score, except the root node. Scores

are calculated as follows from the bottom of the tree upwards:

- For each leaf node, the score is the board score of the resulting board after playing all moves from the root to the leaf.

- For each non-leaf node in a max layer, the score is the maximum score of the children. This simulates the player choosing its best move.

- For each non-leaf node in a min layer, the score is the minimum score of the children. This simulates the opponent choosing its best move.

- The root node is always on a max layer. The move associated with the best scoring child of the root node is the correct move to make.

Figure F2. First four layers of a minimax search tree with a breadth of 3.

In this project, the minimax search tree is implemented using the function tree_search in ai.c. This function

performs a recursive depth first traversal of the entire tree using the stack only. On a non-leaf node, tree_search

calls get_best_move_n to obtain the N best moves to make, and calls itself on each move. On a leaf node,

tree_search computes the board score. Because the tree resides on the stack, a depth first traversal was chosen as

it limits the number of recursion calls at any given time to the tree depth, thus saving memory.

Figure F2 layer labels:
Layer 0 (max layer): each node generates 3 player moves
Layer 1 (min layer): each node generates 3 opponent moves
Layer 2 (max layer): each node generates 3 player moves
Layer 3 (min layer): each node generates 3 opponent moves


One simple optimization is that if at any point a max node receives an infinite score from its child or a min node

receives a negative infinite score, the traversal is stopped immediately and the score returned to the previous

layer.
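The traversal and the early-exit optimization can be sketched on a toy tree as follows. This is an illustration under stated assumptions, not the project's tree_search: the leaf scores are supplied in an array instead of being computed by the board evaluation, the child index stands in for a move chosen by get_best_move_n, and the names minimax_sketch and the value of MAX_SCORE are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_SCORE 1000000   /* assumed stand-in for "infinity" */

/*
 * leaves:     leaf scores, left to right (breadth^depth entries at the root)
 * node:       index of the current node within its layer
 * depth:      layers remaining below this node (0 means leaf)
 * is_max:     nonzero on a max (player) layer, zero on a min (opponent) layer
 * best_child: if non-NULL, receives the index of the chosen child
 */
static int minimax_sketch(const int *leaves, int breadth, int depth,
                          int node, int is_max, int *best_child)
{
    if (depth == 0)
        return leaves[node];   /* leaf: the score of the resulting board */

    int best = is_max ? INT_MIN : INT_MAX;
    for (int k = 0; k < breadth; k++) {
        /* Depth-first: fully explore child k before moving to child k+1. */
        int score = minimax_sketch(leaves, breadth, depth - 1,
                                   node * breadth + k, !is_max, NULL);
        if (is_max ? (score > best) : (score < best)) {
            best = score;
            if (best_child)
                *best_child = k;
        }
        /* Early exit: a forced win (max) or forced loss (min) cannot be beaten. */
        if (is_max ? (score >= MAX_SCORE) : (score <= -MAX_SCORE))
            break;
    }
    return best;
}
```

For a breadth of 2 and a depth of 2 with leaf scores {3, 5, 2, 9}, the two min nodes score 3 and 2, so the root picks child 0 with a score of 3. Because only the current path lives on the call stack, memory use grows with the depth of the tree and not with the number of nodes.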

HARDWARE ACCELERATOR DESCRIPTION

The only custom IP built for the project was an accelerator used to evaluate all possible speculative moves

available given a current game state. In other words, this block returns the board scores for all of the possible (121-(N-1)) moves available to a player on turn N. As the MicroBlaze traverses down the game tree in tree_search it explores the strongest speculative moves, and passes the updated board state back to the AI.

The accelerated function is most similar to get_best_move_n, described in the previous section. However, that function returns, as its name suggests, the best n moves available, whereas the accelerator returns the scores for all moves and expects the MicroBlaze to select from them.

The accelerator is clocked with the 50 MHz system clock, and is reset with the peripheral reset (active high).

INTERFACE DESCRIPTION

The C signature of the function passed through the HLS flow was as follows:

int generate_board_count_score (unsigned int* master, int p, int* move_score)

Argument interfaces:

master (input): The board game state to be evaluated. This is read sequentially and thus is implemented as an ap_fifo built upon FSL.

p (input): The current player ID; implemented as ap_none, set through the AXI4Lite slave interface.

move_score (output): A buffer of all possible scores as they are generated (row-major order). This interface is implemented as an ap_fifo built upon FSL.

Return interface:

The block returns one of two different integer values based on the value of p. If p is 0 or 1, the number of speculative moves evaluated is returned. If p is 2 or 3, the current board evaluation, given no speculation, is returned. The latter case is important to evaluate leaf nodes of the game tree, where no further speculation is required, and for the win condition.

SUB-BLOCK DESCRIPTION

Since HLS defines sub-blocks according to the function call hierarchy, it is natural to describe the blocks within the

accelerator by referring to the functions that would be invoked in the C-implementation.

count_vert, count_horiz, count_ne, count_se:

As described in section X, a complete board evaluation involves the application of four different stencil operators

(horizontal, vertical, and two diagonals) to every conceivable winning position on the board. These functions


perform these operations for each of the 4 possible orientations. Each function consists of three for loops: the inner loop counts all threats within the stencil scope, and the outer two loops sweep the scope over the board. The HLS description of these functions applies the flatten directive to the innermost loop, and pipelines the loop it is nested in.

As shown in figure N, these functions account for the bulk of PC run time.

generate_board_count_score_internal:

This function takes the counts generated by the stencil functions and multiplies them by the coefficients for each

threat length (2,3,4,5). It also adds a small weight to the score to bias moves towards the centre of the board. This

weight is omitted if there is a threat that would force a win or loss for the current player.

generate_board_count_score:

This is the top level function (whose arguments were described in the previous section). Here, two loops sweep through every possible speculative position, and pass the now-speculative game state to generate_board_count_score_internal for evaluation. Upon completion, the score and the move's position (x,y) are written to the output fifo [move_score]. The move is unplayed, and the block proceeds to evaluate the next available move.
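The play, evaluate, unplay sweep of the top level function can be sketched in software as follows. This is a hedged illustration and not the HLS source: score_all_moves, move_score_t, and centre_bias_eval are hypothetical names, the output array stands in for the move_score fifo, and the evaluation is a simple placeholder that only rewards stones near the centre.

```c
#include <stdint.h>
#include <stdlib.h>

#define BOARD_DIM     11
#define BOARD_SQUARES (BOARD_DIM * BOARD_DIM)

/* One entry per speculative move: its position and the resulting score. */
typedef struct {
    int x, y, score;
} move_score_t;

typedef int (*eval_fn)(const uint8_t *board, int player);

/* Placeholder evaluation (assumption): rewards stones close to the centre. */
static int centre_bias_eval(const uint8_t *board, int player)
{
    int score = 0;
    for (int i = 0; i < BOARD_SQUARES; i++) {
        if (board[i] != player)
            continue;
        int dx = abs(i % BOARD_DIM - BOARD_DIM / 2);
        int dy = abs(i / BOARD_DIM - BOARD_DIM / 2);
        score += BOARD_DIM - (dx + dy);
    }
    return score;
}

/* Scores every empty square in row-major order; returns the number scored. */
static int score_all_moves(uint8_t *board, int player, eval_fn eval,
                           move_score_t *out)
{
    int n = 0;
    for (int y = 0; y < BOARD_DIM; y++) {
        for (int x = 0; x < BOARD_DIM; x++) {
            uint8_t *sq = &board[y * BOARD_DIM + x];
            if (*sq != 0)
                continue;              /* occupied: not a legal move */
            *sq = (uint8_t)player;     /* play the speculative move */
            out[n].x = x;
            out[n].y = y;
            out[n].score = eval(board, player);
            n++;
            *sq = 0;                   /* unplay before trying the next one */
        }
    }
    return n;
}
```

On an empty board the sketch scores all 121 squares and leaves the board unchanged. With one stone already placed it scores 120, which matches the 121-(N-1) count given earlier for turn N.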

EMBEDDED C DESCRIPTION:

The embedded C code performs three main functions:

1) Execute game flow

2) Interface with the hardware accelerator

3) Render and display the game User Interface (UI)

To best understand the embedded code, the following section describes the file structure of the code. In addition,

it provides an overview of the embedded code flow with descriptions of the functions used and how they serve to

perform each of the three main functions outlined above.

FILE STRUCTURE AND DESCRIPTION:

main.c

This file contains the IP and game state initialization, the main execution loop as well as the implementation

for several helper functions.

ai.c/ai.h

These files contain the C component of the AI implementation. The AI leverages the hardware accelerator core

to generate board score counts.

board.c/board.h

These files contain the functions related to manipulating and evaluating the game board.

gameboard.c/gameboard.h

These files implement the UI side of the game board. In particular, they provide the functionality used to

render the UI control buttons as well as rendering and re-rendering the game board following a move.

common.h


This file contains the definition of several game structures that are used throughout the embedded code. The

most important of these structures are: BOARD, PLAYER and COORD. In addition, it contains many important

macro definitions such as board dimensions and search tree parameters.

touchsensor_buttons.c/touchsensor_buttons.h

These files provide a Touch Sensor Button Library API. This allows for the creation of buttons with custom

functionality managed through a provided button manager.

xgenerate_board_count_score_ctrl.h/ xgenerate_board_count_score.c/ xgenerate_board_count_score.h

These files make up the driver for the hardware accelerator core.

EMBEDDED CODE FLOW AND FUNCTION DESCRIPTION:

As described above, the main function of the program lies in main.c. The beginning of this function consists of a

series of function calls that perform the required initialization of the game state and the numerous IPs used:

Touch Sensor IP and Button setup:

- TouchSensor_Initialize: This function initializes the touch screen IP which is used to receive user input for

gameplay and mode selection. The function is part of the touch sensor driver.

- TouchSensorButtons_InitializeManager: This function initializes the button manager which is used to direct a

touch sensor interrupt upon a button press to the correct button handler function. This function is part of the

Touch Sensor Buttons API which will be described in more detail later.

Next, the following set of functions is called to set up the main buttons that make up the game UI:

- Button_SetGridDim/Button_SetRectDim: These functions set up the dimensions of the buttons to be rendered

onto the TFT Display. The first function applies to grid type buttons while the second function applies to the

rectangle type buttons

- Button_AssignHandler: This function assigns the button handler in the button data structure.

- TouchSensorButtons_RegisterButton: This function registers the button with the button manager. As

described above, the button manager will invoke this handler upon receiving a touch sensor interrupt that

corresponds to the button

- TouchSensorButtons_EnableButton: Finally, this function enables the button within the button manager. This

functionality exists to allow buttons to be disabled/enabled during program execution.

The above set of functions is also part of the Touch Sensor Buttons API. In total, the UI consists of 4 buttons.

There are three grid type buttons which correspond to the main game board grid and player mode selection

for each of the black and white players. In addition, a rectangle type button exists to allow game play to be reset.
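The register, enable, and dispatch pattern described above can be sketched as follows. This is not the Touch Sensor Buttons API itself, whose signatures are not given here: every name in the block (sketch_button, sketch_manager, sketch_register, sketch_dispatch, record_touch) is hypothetical, and buttons are assumed to be simple rectangles in screen coordinates.

```c
#include <stddef.h>

#define MAX_BUTTONS 8

typedef void (*button_handler)(int x, int y);

typedef struct {
    int x0, y0, width, height;   /* bounding rectangle in screen pixels */
    int enabled;                 /* disabled buttons ignore touches */
    button_handler handler;      /* called with button-relative coordinates */
} sketch_button;

typedef struct {
    sketch_button *buttons[MAX_BUTTONS];
    size_t count;
} sketch_manager;

/* Example handler (assumption): records where the button was touched. */
static int last_x = -1, last_y = -1;
static void record_touch(int x, int y)
{
    last_x = x;
    last_y = y;
}

/* Adds a button to the manager; returns 0 on success, -1 if full. */
static int sketch_register(sketch_manager *m, sketch_button *b)
{
    if (m->count >= MAX_BUTTONS)
        return -1;
    m->buttons[m->count++] = b;
    return 0;
}

/* Called from the touch interrupt: routes the touch to the first enabled
 * button that contains it. Returns 1 if a handler ran, 0 otherwise. */
static int sketch_dispatch(const sketch_manager *m, int x, int y)
{
    for (size_t i = 0; i < m->count; i++) {
        const sketch_button *b = m->buttons[i];
        if (!b->enabled || b->handler == NULL)
            continue;
        if (x >= b->x0 && x < b->x0 + b->width &&
            y >= b->y0 && y < b->y0 + b->height) {
            b->handler(x - b->x0, y - b->y0);
            return 1;
        }
    }
    return 0;
}
```

Keeping the enabled flag in the button and checking it at dispatch time is what lets the main loop switch the game board grid on for a human player and off while an AI or UART move is pending, without re-registering anything.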

Hardware Accelerator Setup:

- initialize_accelerator: This function initializes the hardware accelerator. It is part of the accelerator driver.

TFT Display Setup:

- blackScreen: This function is defined in this file (main.c) and initializes the memory region corresponding to the TFT pixel values to black.


The following functions are responsible for setting up the TFT display and are part of the TFT driver:

- TFT_Initialize: Initializes the TFT Display IP.

- TFT_SetImageAddress: Points the TFT IP to the address of the memory region where the pixel values are stored.

- TFT_SetBrightness: Sets the brightness of the TFT Display.

- TFT_TurnOn: Turns the TFT Display on.

UART Setup:

- XUartLite_Initialize: Initializes the UART IP.

- XUartLite_SetRecvHandler: Sets the handler function for an interrupt generated by receiving something on

the UART. This is set to RecvUartCommand which is defined in “main.c”

- XUartLite_ResetFifos: Resets the UART Fifos.

- XUartLite_EnableInterrupts: Enables interrupts from the UART side.

Interrupt Controller Setup:

- XIntc_Initialize: This function initializes the interrupt controller.

- XIntc_Connect: This function connects an interrupt source to the interrupt controller and specifies the interrupt handler corresponding to that source. There are two IPs connected to the interrupt controller:
  1) UART, which is handled by XUartLite_InterruptHandler.
  2) Touch Screen, which is handled by TouchSensorButtons_InterruptHandler.

- XIntc_Start: Starts the interrupt controller.

- XIntc_Enable: Enables the interrupts for each interrupt source connected to the controller.

The above functions are all part of the interrupt controller driver. In addition, microblaze_enable_interrupts is called to enable interrupts on the microblaze side.

Gameboard Initialization and Rendering:

- Gameboard_Initialize: Initializes the game board.

- Gameboard_RenderBoard: Renders the game board.

- Gameboard_RenderAllCtrlButtons: Renders the UI control buttons.

The definitions for the above functions are all found in gameboard.c.

After the initial setup, the program enters the main execution loop, which is structured as follows:

- Outer while loop which iterates until a win has been achieved.
- Switch statement that selects between several functions to perform based on the current player mode in anticipation of the player move:
  o Human player:
    Enable the game board grid button using TouchSensorButtons_EnableButton.
  o FPGA AI:
    Disable the game board grid button using TouchSensorButtons_DisableButton.
    Get the AI move using HandleAiMove. HandleAiMove is defined in “main.c” and invokes the top level AI function get_move_ai2 defined in “ai.c”.
  o UART:
    Disable the game board grid button.
    Send the last played move to the UART using SendUartCommand.
- Inner while loop which iterates until a new move has been played.
- Win condition check which uses check_board_win (defined in “board.c”).


TIPS AND TRICKS

This project produced a number of interesting insights into two primary areas:

1) How to develop hardware accelerators in embedded environments

2) How to use Vivado HLS flow effectively

TIPS & TRICKS: ACCELERATORS

Profiling is as essential to optimization in an embedded environment as it is on a PC or supercomputer. Fortunately,

Xilinx provides a microblaze specific variant of gprof (mb-gprof) to help developers identify hot spots in their code.

Using gprof helped us identify the proportion of time we were keeping our accelerator active, and what the cost

overheads were of invoking it. Xilinx UG448 “EDK Profiling User Guide” is very helpful in describing how to use the

profiling from within SDK/EDK.

Setting up a MicroBlaze project for profiling is not trivial, however; there were a few key issues.

1) gprof uses timer-driven interrupts to sample code during execution. The timer must be provided in the XPS project and connected to the interrupt controller. This was all expected. However, when other interrupt devices were initialized and connected in our embedded project, all successive samples were lost. To avoid this, we used #ifdef to compile out this initialization when a profile flag was set.

2) In order to profile the system, the BSP must also be compiled with -pg enabled, and with software-intrusive profiling enabled. Compiling the BSP takes a considerable amount of time, so it was more convenient to define a profile-only variant, which helped reduce our iteration time.
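The #ifdef workaround from the first point can be sketched as below. This is a minimal illustration under stated assumptions: the PROFILE_BUILD flag and the platform_init and connect_app_interrupts names are hypothetical, and the stub only records that the application interrupts would have been connected.

```c
#include <stdbool.h>

/* Hypothetical build flag: pass -DPROFILE_BUILD=1 for the profiling variant. */
#ifndef PROFILE_BUILD
#define PROFILE_BUILD 0
#endif

static bool app_interrupts_connected = false;

/* Stand-in for the code that initializes and connects the application's
 * own interrupt sources to the interrupt controller. */
void connect_app_interrupts(void)
{
    app_interrupts_connected = true;
}

/* Returns true if the application interrupts were connected. */
bool platform_init(void)
{
#if !PROFILE_BUILD
    /* Normal build: connect the application's interrupt devices. */
    connect_app_interrupts();
#endif
    /* Profiling build: the step above is compiled out, leaving the
     * profiling timer as the only connected interrupt source. */
    return app_interrupts_connected;
}
```

Keeping the switch in a single macro means the profile-only variant differs from the normal build by one compiler flag.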

TIPS & TRICKS: VIVADO HLS

Vivado HLS is a powerful tool, provided the engineer understands how it performs its translations. At present, the best references for Vivado HLS-related information are:

Xilinx UG902 “Vivado Design Suite User Guide: High Level Synthesis”

Xilinx UG817, “Vivado Design Suite Tutorial – High Level Synthesis”

UG817 walks the new user through the process of building a pcore using Vivado HLS, while UG902 is a comprehensive reference on the tool itself. There are a few key areas a new user must study to become effective at using the flow.

Understanding interfaces:

The first insight is that arguments to the function being accelerated should not be thought of as arguments at all, but rather as interfaces.

If an argument or the return value is single-valued (scalar), a bundle of wires connected to a memory-mapped register is all that’s required to capture it: use an ap_none interface, and bundle it with the control interface (AXI4-Lite slave).


#pragma HLS INTERFACE ap_none port=p
#pragma HLS RESOURCE variable=p core=AXI4LiteS metadata="-bus_bundle ctrl"
#pragma HLS INTERFACE ap_ctrl_hs port=return
#pragma HLS RESOURCE variable=return core=AXI4LiteS metadata="-bus_bundle ctrl"

Vivado will conveniently generate driver code to write to and read from these slave fields. If reading from or writing to an array sequentially, use an ap_fifo interface. ap_fifo interfaces can use either FSL or AXI streaming cores.

#pragma HLS INTERFACE ap_fifo port=master
#pragma HLS RESOURCE variable=master core=FSL
#pragma HLS INTERFACE ap_fifo port=move_score
#pragma HLS RESOURCE variable=move_score core=FSL

We had considerably less difficulty integrating with FSLs than with an AXI Streaming FIFO. This, coupled with their reduced call overhead, made FSL the clear winner for this project.

If random access to an array is required, a bus interface (ap_bus) will be required. These have considerable area overhead and increased read and write latency.
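To show where these pragmas sit, here is a hedged sketch of a top-level function that combines a scalar argument with two sequentially accessed arrays. The port names p, master and move_score come from the pragmas quoted in this section; the function name score_moves, the array length, the body, and the assumption that all three ports belong to one function are invented for illustration. A regular C compiler ignores the HLS pragmas, so the function can also be tested on a PC.

```c
#define NUM_CELLS 8   /* hypothetical array length */

/* Hypothetical accelerated function. Each array is touched exactly once
 * per element and strictly in order, which is what ap_fifo requires. */
int score_moves(int p, const int master[NUM_CELLS], int move_score[NUM_CELLS])
{
#pragma HLS INTERFACE ap_none port=p
#pragma HLS RESOURCE variable=p core=AXI4LiteS metadata="-bus_bundle ctrl"
#pragma HLS INTERFACE ap_ctrl_hs port=return
#pragma HLS RESOURCE variable=return core=AXI4LiteS metadata="-bus_bundle ctrl"
#pragma HLS INTERFACE ap_fifo port=master
#pragma HLS RESOURCE variable=master core=FSL
#pragma HLS INTERFACE ap_fifo port=move_score
#pragma HLS RESOURCE variable=move_score core=FSL
    int total = 0;
    int i;

    for (i = 0; i < NUM_CELLS; i++) {
        int value = master[i];    /* one sequential read from the input FIFO */
        int score = value * p;    /* placeholder computation */
        move_score[i] = score;    /* one sequential write to the output FIFO */
        total += score;
    }
    return total;                 /* scalar result, read back over the control bus */
}
```

Reading master[i] twice, or indexing out of order, would break the FIFO access pattern and push the design toward the ap_bus interface described above.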

Understanding Pragmas and Directives:

UG902 is very important here. The final sections of the document expand on what transformations each pragma

performs.

1) Array optimizations: These tell Vivado how to implement and partition storage elements in hardware.

2) Loop optimizations: These tell Vivado how it should extract loop-level parallelism (there are a few different varieties). It’s also important to tell Vivado whether inter-iteration dependencies exist and of what variety they are.

3) Function optimizations: Functions are treated as blocks. Vivado can be told to multiplex different invocations of the same function onto the same piece of hardware, or to instantiate multiple instances of it.
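The three categories above might look like this in source form. This is a sketch, not code from the project: the function names, array shape and computation are invented, and the exact pragma options should be checked against UG902 for the tool version in use. A regular C compiler ignores the HLS pragmas, so the function behaves the same on a PC.

```c
#define ROWS 4
#define COLS 4

/* Function optimization: keep this helper as its own hardware block
 * instead of letting the tool inline it into the caller. */
static int weight(int v)
{
#pragma HLS INLINE off
    return v * v;
}

/* Hypothetical kernel that sums a weight over every cell. */
int sum_weights(int cells[ROWS][COLS])
{
/* Array optimization: split the second dimension into separate storage
 * elements so a whole row can be read in the same cycle. */
#pragma HLS ARRAY_PARTITION variable=cells complete dim=2
    int total = 0;
    int r, c;

    for (r = 0; r < ROWS; r++) {
/* Loop optimization: start a new row every cycle if possible. */
#pragma HLS PIPELINE II=1
        for (c = 0; c < COLS; c++) {
/* Loop optimization: replicate the body for every column. Pipelining the
 * outer loop already implies this; it is spelled out here for clarity. */
#pragma HLS UNROLL
            total += weight(cells[r][c]);
        }
    }
    return total;
}
```

The accumulation into total is an inter-iteration dependency of the kind mentioned in point 2, and it limits how far the tool can overlap iterations.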

Some Final Tricks…

Use reduced bit-width values wherever possible, e.g., uint4 if the values are known to be bounded between 0 and 15. Vivado may not be able to infer this from the C description, leading to completely unnecessary 32-bit counters.

But be aware of the pitfalls (which seem obvious after the fact):

1) Adding signed ints of different bit widths leads exactly to what you’d expect.

2) This is an infinite loop:

uint4 counter;
for (counter = 0; counter < 16; counter++) {  /* counter wraps from 15 to 0, so the test never fails */
    printf("There ain't no escape!");
}
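The wrap-around behind this pitfall can be reproduced on a PC without the HLS types. The sketch below emulates a 4-bit counter with a C bit-field, which is an assumption made for illustration (Vivado HLS supplies its own uint4 type), and shows one possible fix: a loop counter one bit wider, so that the value 16 is representable. The function names are hypothetical.

```c
/* Emulates a 4-bit unsigned value with a bit-field. */
typedef struct { unsigned v : 4; } u4;

/* Counts how many increments it takes for a 4-bit counter to return to 0.
 * Because the counter can only hold 0..15, a test such as counter < 16
 * is always true, which is what makes the loop above infinite. */
int steps_until_wrap(void)
{
    u4 counter = { 0 };
    int steps = 0;

    do {
        counter.v++;              /* 15 + 1 wraps to 0 in four bits */
        steps++;
    } while (counter.v != 0);

    return steps;
}

/* One possible fix: a 5-bit counter can hold 16, so the loop terminates. */
int count_iterations_safe(void)
{
    struct { unsigned v : 5; } counter = { 0 };
    int iterations = 0;

    for (counter.v = 0; counter.v < 16; counter.v++) {
        iterations++;
    }
    return iterations;
}
```

The wider counter costs one extra bit of hardware, still far less than the 32-bit counter a plain int would infer.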