Othello Processor
Bret Taylor, Olatunji Ruwase, and Jim Norris
CS343, Spring 2003
Project Overview
AI algorithms for games like Othello are highly parallelizable
Generally, algorithm performance is increased by adding more generic processors
What price-performance can we achieve with a custom processor?
What granularity of custom instructions achieves the greatest price-performance?
How generic can we make the instructions?
Overview of Othello
Goal: Have more pieces when the game ends
You flip your opponent’s pieces when you surround them with your pieces on each end
Othello in Academia
Software
  Rosenbloom, Paul S.: A World-Championship-Level Othello Program
  Lee, K.; Mahajan, S.: The Development of a World Class Othello Program
No specific hardware implementations, but related work for MiniMax-style algorithms
  Powley, Curtis Nelson: Parallel Tree Search on a Single-Instruction, Multiple-Data (SIMD) Machine
Common Algorithm Structure
MiniMax search to a depth determined by global or per-move time limit
Heuristics evaluate “value” of move at leaves
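The structure above can be sketched as a depth-limited MiniMax. This is a minimal illustration, not the project's code: the `successors` and `evaluate` callbacks and the toy tree are hypothetical, and a position with no successors is simply treated as a leaf (real Othello also has to handle passing).

```python
def minimax(state, depth, maximizing, successors, evaluate):
    """Return the MiniMax value of `state`, searching at most `depth` plies."""
    # Stop expanding at the depth limit; the heuristic scores the leaf.
    children = successors(state) if depth > 0 else []
    if not children:
        return evaluate(state)
    values = [
        minimax(child, depth - 1, not maximizing, successors, evaluate)
        for child in children
    ]
    return max(values) if maximizing else min(values)


# Hypothetical toy game tree standing in for Othello positions.
TOY_TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
TOY_SCORES = {"root": 0, "a": 4, "b": 6, "a1": 3, "a2": 5, "b1": 2, "b2": 9}


def toy_successors(state):
    return TOY_TREE.get(state, [])


def toy_evaluate(state):
    return TOY_SCORES[state]


# Depth 2 reaches the real leaves; depth 1 has to trust the heuristic one ply up.
deep_value = minimax("root", 2, True, toy_successors, toy_evaluate)
shallow_value = minimax("root", 1, True, toy_successors, toy_evaluate)
```

A time-limited player would typically call this with increasing `depth` until the time budget runs out.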
Common Algorithm Properties
The deeper the search, the better the processor
  Assumes heuristic is “reasonable,” i.e., it does not get worse with more information
Effectively infinitely parallelizable
Many operations are expensive in software:
  Determining successors (“is valid move”)
  Calculating successors (“make move”)
  Heuristic calculation
Our Othello Implementation
Based on Iago, concentrating on high-quality heuristic variables:
  Stability: number of pieces that can never be flipped
  Mobility: number of available moves
  Piece differential
  Vulnerability: entrance points to stable squares on the corners and sides
Heuristic value is weighted sum of variables
  Weights learned through reinforcement learning
  Weights vary over the course of the game
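A weighted sum with stage-dependent weights might look like the sketch below. Everything numeric here is a placeholder: the learned weights and the way the game is split into stages are not given in the slides, so `STAGE_WEIGHTS` and `stage_for` are assumptions for illustration only.

```python
FEATURES = ("stability", "mobility", "piece_differential", "vulnerability")

# Placeholder weights per game stage (NOT the learned values from the project).
STAGE_WEIGHTS = {
    "opening": {"stability": 10, "mobility": 5, "piece_differential": 1, "vulnerability": -4},
    "midgame": {"stability": 8, "mobility": 3, "piece_differential": 3, "vulnerability": -2},
    "endgame": {"stability": 6, "mobility": 1, "piece_differential": 8, "vulnerability": -1},
}


def stage_for(pieces_on_board):
    """Assumed staging rule: pick a stage from the number of pieces placed."""
    if pieces_on_board < 20:
        return "opening"
    if pieces_on_board < 44:
        return "midgame"
    return "endgame"


def heuristic_value(features, pieces_on_board):
    """Weighted sum of the heuristic variables, using the current stage's weights."""
    weights = STAGE_WEIGHTS[stage_for(pieces_on_board)]
    return sum(weights[name] * features[name] for name in FEATURES)


example = {"stability": 2, "mobility": 5, "piece_differential": -3, "vulnerability": 1}
opening_value = heuristic_value(example, pieces_on_board=10)
endgame_value = heuristic_value(example, pieces_on_board=60)
```

The same position features score differently early and late, which is the point of letting the weights vary over the course of the game.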
Software Overview
Boards are 128-bit entries (2 bits per piece)
Lookup tables for things like stability
Lookup tables are indexed by the ternary number represented by the row or column:
  1 + 2 * 3^2 + 2 * 3^3 + 3^4 = 154
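The ternary indexing can be sketched as follows. The 2-bit cell codes, the square ordering inside the 128-bit board, and which end of the row is the least significant digit are assumptions; the slides only fix the 2-bits-per-piece packing and the example value 154.

```python
EMPTY, BLACK, WHITE = 0, 1, 2  # assumed 2-bit cell codes


def row_cells(board, row):
    """Extract the 8 cells of `row` from a 128-bit board packed 2 bits per square."""
    cells = []
    for col in range(8):
        shift = 2 * (row * 8 + col)  # assumed row-major square order
        cells.append((board >> shift) & 0b11)
    return cells


def line_index(cells):
    """Ternary number in which cells[i] is the digit multiplying 3**i."""
    index = 0
    for position, cell in enumerate(cells):
        index += cell * 3 ** position
    return index


# The example from the slide: digits 1, 0, 2, 2, 1 followed by empty squares.
example_row = [BLACK, EMPTY, WHITE, WHITE, BLACK, EMPTY, EMPTY, EMPTY]
example_index = line_index(example_row)  # 1 + 2*3**2 + 2*3**3 + 3**4
```

An 8-cell line has 3**8 = 6561 possible indices, so a per-line table stays small.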
Software Trace Profile
Vast majority of CPU time consumed in IsLegalMove and MakeMove
  53.21% DoOneDirection
  30.64% DoAllDirections
Loops in all rows/diagonals to find/flip valid directions
Called to find successors, to calculate mobility, and to make moves
Operations are common to all Othello players (extensions are at least slightly generic)!
Flip Instruction Granularity
Split a single MakeMove or IsLegalMove operation into a sequence of four operations corresponding to the four flip directions (row, column, rdiag, ldiag)
Made a lookup table of the 3^8 row/column/diagonal configurations to look up which pieces get flipped on an axis given a piece placement
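A software model of such a per-axis lookup table is sketched below. It is an illustration under assumptions: cells are encoded from the mover's point of view (0 empty, 1 mover, 2 opponent), the line is always 8 cells long, and the table layout `FLIP_TABLE[index][pos]` is hypothetical rather than the hardware's actual organization.

```python
EMPTY, MOVER, OPPONENT = 0, 1, 2  # assumed encoding, relative to the side to move


def decode(index):
    """Turn a ternary index back into 8 cells (cell i is the digit for 3**i)."""
    cells = []
    for _ in range(8):
        cells.append(index % 3)
        index //= 3
    return cells


def flips_on_line(cells, pos):
    """Bitmask of opponent cells flipped when the mover plays at `pos` on this line."""
    if cells[pos] != EMPTY:
        return 0
    mask = 0
    for step in (-1, 1):
        run = 0
        i = pos + step
        # Walk over a contiguous run of opponent pieces.
        while 0 <= i < 8 and cells[i] == OPPONENT:
            run |= 1 << i
            i += step
        # The run only flips if it is capped by one of the mover's pieces.
        if 0 <= i < 8 and cells[i] == MOVER:
            mask |= run
    return mask


# One entry per line configuration and placement position: 3**8 * 8 masks.
FLIP_TABLE = [
    [flips_on_line(decode(index), pos) for pos in range(8)]
    for index in range(3 ** 8)
]
```

With the table built once, finding the flips on an axis becomes a single indexed read instead of a loop in each direction.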
Reducing Die Area
We only implemented FLIPROW and FLIPDIAGONAL instructions
We do a 90° rotation of the board and back again to flip the other two directions
Saved on die area and cycle time; transposing and rotating are very cheap instructions
New DoAllDirections
Output dependencies galore!
B = FLIPROW(B, row, col);
B = FLIPDIAG(B, row, col);
B = ROTATEBOARD(B, CLW);
B = FLIPROW(B, col, 7 - row);
B = FLIPDIAG(B, col, 7 - row);
B = ROTATEBOARD(B, ACLW);
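The coordinate change in the sequence above can be checked with a small software model. This sketch uses a plain 8x8 list of lists rather than the packed board, and the function names are hypothetical; it only illustrates why a clockwise rotation turns the square (row, col) into (col, 7 - row), so a column of the original board becomes a row that a row-flip operation can handle.

```python
def rotate_clockwise(board):
    """Return a copy of the 8x8 board rotated 90 degrees clockwise."""
    rotated = [[0] * 8 for _ in range(8)]
    for row in range(8):
        for col in range(8):
            rotated[col][7 - row] = board[row][col]
    return rotated


def rotate_anticlockwise(board):
    """Inverse of rotate_clockwise: restores the original orientation."""
    rotated = [[0] * 8 for _ in range(8)]
    for row in range(8):
        for col in range(8):
            rotated[7 - col][row] = board[row][col]
    return rotated


# Label every square with a unique number so positions can be tracked.
labelled = [[8 * row + col for col in range(8)] for row in range(8)]
turned = rotate_clockwise(labelled)

# Column 5 of the original board is now row 5 of the rotated board (reversed).
original_column = [labelled[row][5] for row in range(8)]
rotated_row = turned[5]
```

Because the rotated row holds the same squares as the original column, the same row and diagonal logic covers all four axes.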
State Registers
Store 64-bit FLIPTABLE state register to keep track of which pieces should be flipped: no output dependencies between instructions
Added benefit: seeing if a move is valid simply amounts to (FLIPTABLE != 0) after flip operations
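A software model of the state-register idea is sketched below. It assumes a two-bitboard representation (one 64-bit mask per side) instead of the 128-bit packed board, and the class and function names are hypothetical. Each flip step ORs its result into the register instead of rewriting the board, so the steps do not depend on each other's output.

```python
class FlipState:
    """Hypothetical model of a 64-bit FLIPTABLE state register."""

    def __init__(self):
        self.fliptable = 0

    def record(self, squares):
        """Accumulate flipped squares found along one axis."""
        for row, col in squares:
            self.fliptable |= 1 << (row * 8 + col)

    def is_legal(self):
        """A move is valid exactly when at least one piece would be flipped."""
        return self.fliptable != 0


def apply_move(mover, opponent, fliptable, row, col):
    """Place a piece and flip every square recorded in `fliptable`."""
    placed = 1 << (row * 8 + col)
    return mover | placed | fliptable, opponent & ~fliptable


state = FlipState()
legal_before = state.is_legal()
state.record([(3, 4)])          # e.g. result of one axis
state.record([(4, 4), (5, 4)])  # e.g. result of another axis
legal_after = state.is_legal()
```

The board is only updated once, after all four axes have contributed to the register.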
Results
New instructions are extremely effective with relatively little complexity compared to optimizing for a multi-processor environment
~4.1 times better performance than base processor
          CPU Cycles   Cycle Time   Area        PricePerf
Base      492143906    ~10 ns       ~4.2 mm^2   ?
Extended  126832159    10.63 ns     20389?      ?
Conclusions
Positives
  Optimizations can be used for all Othello players
  With very little work, we could reduce the cycle time to that of the base processor
Negatives
  Cost of custom processors is prohibitive
  It may be more effective to exploit parallelism of search algorithm
Combining custom processors with MP parallelism for best results?