SYNCHRONIZATION METHODS FOR SCRAMNET+ REPLICATED SHARED-MEMORY SYSTEMS
by
Stephen Frank Menke
BEE, Georgia Institute of Technology, 1993
Submitted to the Graduate Faculty of
Arts and Sciences in partial fulfillment
of the requirements for the degree of
Master of Science
University of Pittsburgh
1999
This thesis was presented
by
Stephen Frank Menke
It was defended on
April 28, 1999
and approved by
Rami Melhem, Professor of Computer Science, Committee Member
Mark Moir, Assistant Professor of Computer Science, Thesis Advisor
Daniel Mossé, Associate Professor of Computer Science, Committee Member
Copyright by Stephen Frank Menke
1999
SYNCHRONIZATION METHODS FOR SCRAMNET+ REPLICATED
SHARED-MEMORY SYSTEMS
Stephen Frank Menke, MS
University of Pittsburgh, 1999
SCRAMNet+ (Shared Common Random Access Memory Network) is a communications
network that transparently provides replicated shared-memory via a high-speed fiber-
optic ring topology. Such systems combine the ease of programming of shared-memory
multiprocessor systems with the distance and heterogeneity of message-passing networks.
These features are ideal for a variety of distributed real-time applications.
This thesis explores both blocking and non-blocking synchronization methods in such
systems. We first develop a mutual exclusion algorithm, the most common blocking
synchronization method, by exploiting unique features of the SCRAMNet+ hardware.
Through theoretical and experimental analysis we compare our algorithm to a mutual
exclusion algorithm suggested by the manufacturer, Systran Corp. The analysis concludes
that our algorithm is both scalable and fair, whereas Systran’s algorithm is not. Our algorithm also
has faster execution times for any size SCRAMNet+ network.
Although mutual exclusion is the most common method for synchronization, non-
blocking methods overcome a number of problems caused by the use of mutual
exclusion, such as deadlock. It is well known that strong primitives such as compare and
swap (CAS) or load-linked/store-conditional (LL/SC) are required for general non-
blocking synchronization. We therefore present and evaluate a CAS algorithm for
SCRAMNet+ systems. We validate the algorithm by incrementing a shared-memory
counter with the CAS operation. More significantly, we use the CAS algorithm to
construct lock-free and wait-free large shared objects, which are designed to
overcome the problems associated with mutual exclusion. We experiment with both lock-
free and wait-free versions of a queue to validate the large object implementation on a
real system.
Although we used a real system to perform experiments on all the algorithms, it was
limited to only two nodes. Therefore, we also built a simulator, based on Augmint, which
can model any size SCRAMNet+ network. We used experiments to validate our
simulation against our real-world results, which then allowed us to extend our analysis to
systems with more than two nodes.
Acknowledgements
First and foremost, I would like to thank my future wife Carolyn for her love, patience
and support. Both work and school demanded many long hours. However, her
encouragement and smile always kept me going.
I would also like to thank my advisor, Mark Moir, for his flexibility and guidance. He has
balanced my knowledge by adding the theoretical. I truly believe this will contribute to my
career, wherever it may lead.
Thanks, too, to the rest of my committee: Rami Melhem and Daniel Mossé. I am grateful
for their flexibility in arranging their schedules for my defense. This also includes the
help from Daniel’s group in setting up RT-Mach.
Finally, I would like to thank Systran Corp. for supplying the hardware, software and
documentation necessary to complete this thesis. Most importantly, Chris Fought from
technical support, whose assistance was key to developing the driver for RT-Mach.
Table of Contents
1 INTRODUCTION..................................................................................................................................1
2 SCRAMNET+ HARDWARE..............................................................................................................6
2.1 GENERAL PURPOSE COUNTER / GLOBAL TIMER................................................................................6
2.2 ERROR CORRECTION...........................................................................................................................7
2.3 INTERRUPTS........................................................................................................................................7
2.4 WRITE-ME-LAST MODE.....................................................................................................................8
3 BLOCKING SYNCHRONIZATION.................................................................................................9
3.1 SYSTRAN’S MUTUAL EXCLUSION ALGORITHM..................................................................................9
3.1.1 Acquire...................................................................................................................................10
3.1.2 Release....................................................................................................................................10
3.2 OUR MUTUAL EXCLUSION ALGORITHM...........................................................................................12
3.2.1 Acquire...................................................................................................................................12
3.2.2 Release....................................................................................................................................13
3.2.3 ISR0.........................................................................................................................................13
3.3 THEORETICAL COMPARISON.............................................................................................................14
3.4 SYSTEM EXPERIMENTS.....................................................................................................................16
3.4.1 No Contention.........................................................................................................................16
3.4.2 Contention..............................................................................................................................19
3.5 SIMULATION EXPERIMENTS..............................................................................................................21
3.5.1 No Contention.........................................................................................................................22
3.5.2 Contention..............................................................................................................................22
3.5.3 Polling....................................................................................................................................25
3.5.4 Heavy Contention...................................................................................................................27
3.6 CONCLUSIONS AND FUTURE WORK.................................................................................................29
4 NON-BLOCKING SYNCHRONIZATION.....................................................................................31
4.1 COMPARE AND SWAP.......................................................................................................................31
4.1.1 CAS.........................................................................................................................................32
4.1.2 Read........................................................................................................................................32
4.1.3 Analysis...................................................................................................................................34
4.1.4 Experiments............................................................................................................................34
4.2 LARGE OBJECTS...............................................................................................................................36
4.2.1 Experiments............................................................................................................................36
4.2.2 Conclusions and Future Work................................................................................................38
5 SIMULATION....................................................................................................................................39
5.1 COMPILE-TIME.................................................................................................................................39
5.2 RUN-TIME.........................................................................................................................................40
5.2.1 Events.....................................................................................................................................40
5.2.2 Data Movement......................................................................................................................41
5.2.3 Tasks.......................................................................................................................................41
5.2.4 Threads...................................................................................................................................42
5.2.5 Backend..................................................................................................................................42
5.2.6 Execution................................................................................................................................42
5.3 SCRAMNET+ BACKENDS................................................................................................................43
5.3.1 Memory Model........................................................................................................................43
5.3.2 User Events.............................................................................................................................45
5.3.3 Write-Me-Last Backend..........................................................................................................45
5.3.4 Interrupt Backend...................................................................................................................46
5.3.5 Polling Backend......................................................................................................................49
5.4 SIMULATION PARAMETERS...............................................................................................................49
5.4.1 Transit Time............................................................................................................................49
5.4.2 Access Times...........................................................................................................................50
5.4.3 Context Switch Time...............................................................................................................50
5.5 CONCLUSIONS AND FUTURE WORK.................................................................................................51
6 SUMMARY AND CONCLUSIONS.................................................................................................53
APPENDIX A................................................................................................................................................55
A.1 SCRAMNET+ DRIVER.....................................................................................................................55
A.2 SCRAMNET+ API..........................................................................................................................56
A.2.1 scr_mem_mm..........................................................................................................................56
A.2.2 get_base_mem........................................................................................................................56
A.2.3 scr_csr_read...........................................................................................................................56
A.2.4 scr_csr_write..........................................................................................................................57
A.2.5 scr_id_mm..............................................................................................................................57
A.2.6 scr_acr_read...........................................................................................................................57
A.2.7 scr_acr_write..........................................................................................................................57
APPENDIX B................................................................................................................................................58
B.1 SYNTAX............................................................................................................................................58
B.1.1 Augmint Parameters...............................................................................................................59
B.1.2 Backend Parameters...............................................................................................................59
B.1.3 Simulation Parameters...........................................................................................................59
B.2 EXPERIMENTS...................................................................................................................................60
BIBLIOGRAPHY.........................................................................................................................................61
List of Tables
TABLE 1 COMPARISON OF AVERAGE EXECUTION TIMES FOR PAIR OF ACQUIRE/RELEASE OPERATIONS WHEN THE MAXIMUM NUMBER OF NODES EQUALS 256 (S).....................................................................19
TABLE 2 AVERAGE EXECUTION TIME FOR A READ OPERATION (S)..............................................................35
TABLE 3 AVERAGE EXECUTION TIME FOR CAS OPERATION (S)..................................................................35
TABLE 4 AVERAGE EXECUTION TIME FOR PAIR OF ENQUEUE/DEQUEUE OPERATIONS FOR LOCK-FREE CONSTRUCTION OF LARGE OBJECTS (S)..........................................................................................37
TABLE 5 AVERAGE EXECUTION TIME FOR PAIR OF ENQUEUE/DEQUEUE OPERATIONS FOR WAIT-FREE CONSTRUCTION OF LARGE OBJECTS (S)..........................................................................................38
TABLE 6 SIMULATOR EXECUTABLE DIRECTORIES..........................................................................................58
TABLE 7 SCRIPTS TO RUN SIMULATION EXPERIMENTS...................................................................................60
List of Figures
FIGURE 1 SYSTRAN’S MUTUAL EXCLUSION ALGORITHM................................................................................11
FIGURE 2 OUR MUTUAL EXCLUSION ALGORITHM...........................................................................................14
FIGURE 3 COMPARISON OF ME ALGORITHMS WITHOUT CONTENTION ON A REAL SYSTEM...........................17
FIGURE 4 CLOSE-UP COMPARISON OF ME ALGORITHMS WITHOUT CONTENTION ON A REAL SYSTEM..........18
FIGURE 5 TIMING SEQUENCE OF OUR ALGORITHM’S ACQUIRE PROCEDURE WITHOUT CONTENTION.............18
FIGURE 6 COMPARISON OF ME ALGORITHMS WITH CONTENTION ON A REAL SYSTEM..................................20
FIGURE 7 CLOSE-UP COMPARISON OF ME ALGORITHMS WITH CONTENTION ON A REAL SYSTEM.................21
FIGURE 8 COMPARISON OF ME ALGORITHMS WITHOUT CONTENTION ON A SIMULATED SYSTEM.................23
FIGURE 9 CLOSE-UP COMPARISON OF ME ALGORITHMS WITHOUT CONTENTION ON A SIMULATED SYSTEM 23
FIGURE 10 COMPARISON OF ME ALGORITHMS WITH CONTENTION ON A SIMULATED SYSTEM......................24
FIGURE 11 CLOSE-UP COMPARISON OF ME ALGORITHMS WITH CONTENTION ON A SIMULATED SYSTEM.....24
FIGURE 12 COMPARISON OF POLLING AND INTERRUPT VERSIONS WITHOUT CONTENTION ON A SIMULATED SYSTEM......................................................................................................................26
FIGURE 13 COMPARISON OF POLLING AND INTERRUPT VERSIONS WITH CONTENTION ON A SIMULATED SYSTEM......................................................................................................................26
FIGURE 14 COMPARISON OF ALL ME ALGORITHMS UNDER HEAVY CONTENTION ON A SIMULATED SYSTEM 28
FIGURE 15 CLOSE-UP COMPARISON OF ALL ME ALGORITHMS UNDER HEAVY CONTENTION ON A SIMULATED SYSTEM......................................................................................................................28
FIGURE 16 COMPARISON OF AVERAGE EXECUTION TIMES FOR EACH NODE UNDER HEAVY CONTENTION....29
FIGURE 17 SEMANTICS OF COMPARE AND SWAP............................................................................................31
FIGURE 18 COMPARE AND SWAP ALGORITHM................................................................................................33
FIGURE 19 TIMING DIAGRAM OF OUR ALGORITHM’S ACQUIRE PROCEDURE WITHOUT CONTENTION.............51
1 Introduction
This thesis presents and evaluates synchronization mechanisms for SCRAMNet+ (Shared
Common Random Access Memory Network) systems. SCRAMNet+ is a
communications network geared toward real-time applications, and based on a replicated
shared-memory concept [12]. By combining the advantages of shared-memory multi-
processors and message passing systems, SCRAMNet+ offers distributed shared-memory
with reliable, deterministic and low-latency updates. Thus, SCRAMNet+ has proven to
be ideal for many real-time applications [19].
SCRAMNet+ systems offer the benefits of a shared-memory multiprocessor, namely ease
of programming, low-latency communications, and little or no software overhead for
communications [4]. A SCRAMNet+ network consists of up to 256 computers (nodes)
each with a SCRAMNet+ network card. The network cards are interconnected through
fiber-optic cables in a serial-ring topology. Each network card has dual-ported RAM
(Random Access Memory) that can be mapped into the address space of any process on a
node. Any write to the dual-ported RAM is transparently replicated to each node, and
hence every process, in the network.
In addition to providing a shared-memory abstraction to applications, SCRAMNet+
systems also have the advantages of a message passing system. First, processors can be
connected at distances of hundreds or even thousands of meters [4]. In contrast, a typical
multiprocessor system is limited to only a few meters. SCRAMNet+ networks can also
connect machines with different architectures or operating systems. This might be an
advantage, for example, in an industrial control system where the data acquisition and
control are run on distributed embedded processors and the graphical interface runs on
standard PCs.
Typical distributed systems, such as industrial control systems, require concurrent access
to shared data. Usually some synchronization is required to protect the consistency of the
data. The most common method is mutual exclusion, which protects shared data by
controlling access to a critical section. The semantics of mutual exclusion prevent more
than one process from entering a critical section at a time, thereby limiting access to the
data. The manufacturer, Systran Corp., presents a mutual exclusion
algorithm for SCRAMNet+ memory systems in [15]. However, this algorithm has several
shortcomings.
- Its performance is drastically affected by the number of nodes in the system;
- The solution is not starvation free: it is theoretically possible for one process
to repeatedly attempt to acquire the lock but never succeed; and
- The solution necessarily prioritizes the processes, but does not make any
concrete guarantees. Furthermore, the prioritization mechanism is unavoidable
and leads to starvation of lower-priority nodes.
In this thesis we present our own mutual exclusion algorithm for SCRAMNet+ systems.
This algorithm exploits special hardware features of the SCRAMNet+ network and is
both fair and starvation free. We compared the two algorithms using both real-system
experiments and simulations that compute the average execution time for a pair of
acquire/release operations. The results demonstrate that our algorithm has faster
execution times both with and without contention, regardless of the network’s size.
Although both algorithms are sufficient for synchronization, when one process enters the
critical section, any other process desiring access to the shared data must wait, possibly
indefinitely, for that process to exit the critical section.
Recently, significant progress has been made toward efficient lock-free and wait-free
implementation of shared objects (e.g. [2, 3, 6, 7, 8, 9]). A shared object is a shared data
structure and associated operations. A lock-free implementation of a shared object
guarantees that after a finite number of steps of a process p’s operation, some process
(not necessarily p) completes an operation on the object. A wait-free implementation
guarantees that each operation of a process p completes after a finite number of p’s steps.
The result is fault tolerance, meaning some process (lock-free) or the actual process
(wait-free) will continue to progress, regardless of the failure of any other process. A
mutual exclusion algorithm cannot be either lock-free or wait-free because if a process
never exits the critical section, no other process can continue.
In [6] Herlihy defines universal objects that can construct any wait-free object. He
assigns a consensus number to each object, where an object with consensus number n
can implement any wait-free object shared by up to n processes. Herlihy also proved that CAS
(Compare and Swap) is universal and has a consensus number of infinity. Therefore,
CAS is an important primitive to implement in a shared-memory system that requires
wait-free objects. Given this, we have implemented and evaluated a CAS algorithm for
SCRAMNet+ systems. By conducting a simple experiment that used a CAS to increment
a shared counter concurrently, we validated the correctness of this algorithm. We also
compared experiments with and without contention, and found that our algorithm
performs well under contention. However, the contention experiments were only run with
two nodes and therefore further testing is needed. Now that we have created an effective
CAS primitive for SCRAMNet+ systems, we can construct wait-free objects for such
systems.
Herlihy extended his work in [6] by suggesting lock-free and wait-free constructions for
large shared objects. However, the implementation is inefficient due to the large amount
of data being copied – especially when much of the copying may be unnecessary. In [2]
Anderson and Moir present a more efficient implementation of lock-free and wait-free
constructions for large shared objects. In [5], Filachek furthers their work by
implementing and testing their algorithms in simulations. We have furthered this work by
porting the algorithms to a SCRAMNet+ system and testing them there. Our main objective was to
validate the operation of the algorithms. We accomplished this task by testing concurrent
access to lock-free and wait-free implementations of a queue on an actual system and
verifying the consistency of the queue.
The original evaluation of all our algorithms was performed on an actual system
consisting of two 266 MHz Pentium II PCs running the RT-Mach operating system. Each
PC was equipped with a SCRAMNet+ network card with 2MB RAM interconnected with
single-mode fiber optic cables. However, due to the availability and cost of the hardware,
we were only able to construct a system with two nodes. This was sufficient for testing
the algorithms without contention, but provided little insight to situations with many
nodes and heavy contention. Therefore, we designed SCRAMNet+ simulators using
Augmint.
Augmint is a fast, execution-driven multiprocessor simulator for Intel x86 architectures
[16]. Augmint allows the modification of a library called the backend to implement
various memory models. We created three different backend libraries to model different
configurations of the SCRAMNet+ system, and then duplicated the original experiments
for mutual exclusion in order to compare the simulations to our real world results. The
comparison verified the accuracy of our simulators, allowing us to continue the
simulations with confidence in the results. We then used the simulators to evaluate the
mutual exclusion algorithms under heavy contention. The results of these experiments
show that Systran’s algorithm fails to guarantee its prioritization scheme. They also show
that by modifying our algorithm to use polling techniques instead of interrupts, the resulting
algorithm will outperform Systran’s algorithm with heavy contention regardless of the
number of nodes in the network.
The remainder of this thesis is organized as follows. We provide an overview of the
SCRAMNet+ hardware in Section 2. Section 3 covers blocking synchronization methods
for SCRAMNet+ systems. It contains a detailed description of Systran’s and our mutual
exclusion algorithms and an analysis of the experiments performed on the real system
and on simulations. Section 4 covers non-blocking synchronization methods for
SCRAMNet+ systems. It describes a CAS algorithm and analyzes the results of real
world experiments. It then presents results of experiments for lock-free and wait-free
objects implemented with the CAS algorithm. Section 5 contains an overview of
Augmint and a full description of the simulation implementation. In Section 6, we
summarize the overall results and conclusions.
2 SCRAMNet+ Hardware
SCRAMNet+ cards have many configurable features. This section describes the features
of interest to this thesis. For more detailed information or a complete listing and
explanation of all features, see [12]. To understand the algorithms in this thesis it is first
necessary to understand how the SCRAMNet+ network operates.
A SCRAMNet+ node updates the shared-memory on all other nodes by inserting a
message on the ring for every write to shared-memory. The message contains the
memory offset and value of the word written. When the message is received by another
node, the write is replicated by writing the same value to its memory. When the
originating node receives its own message, the message is removed from the ring.
Although SCRAMNet+ uses a ring topology, it is essentially a point-to-point network in
a ring orientation. That is, a message must be received and retransmitted by each
intermediate node to traverse the ring. This introduces a minimum delay of 247
nanoseconds at each node [12]. For our experiments we used the fixed size packet
configuration, which according to [12] has a maximum delay of 800 nanoseconds at each
node. Therefore, our two-node system should have a round-trip transit time between 494
and 1600 nanoseconds.
2.1 General Purpose Counter / Global Timer
SCRAMNet+ cards provide a General Purpose Counter / Global Timer that can measure
the round-trip transit time of a message with a resolution of 26.66 nanoseconds. Using
this timer the transit time on our two-node network was measured as 1270 nanoseconds,
which is within the expected range.
2.2 Error Correction
The SCRAMNet+ network has a bit error rate of 10⁻¹⁵, meaning that an error might occur
once every 76 days of continuous, 24-hour, 100% bandwidth-saturated network utilization
[20]. Although rare, these errors must still be handled. We configured the SCRAMNet+
card in PLATINUM mode to detect and handle any errors. PLATINUM mode can detect
and correct two types of errors. First, bit errors are detected with a bit-by-bit comparison
of the message once it has returned back to the originating node. Second, a configurable
time-out can detect the loss of any originated message. If either type of error occurs, they
are corrected by automatically re-transmitting the original message until it is received
correctly. Also, once an error has been detected, any new messages from that node are
stored in a transmit FIFO and not sent until the message that was in error is received
correctly. Therefore, PLATINUM mode guarantees that every message is eventually
delivered correctly. SCRAMNet+ cards can also be configured to generate an interrupt on
the host whenever an error occurs. We used this interrupt in all of our experiments to
generate an error message; however, no error ever occurred.
2.3 Interrupts
In addition to interrupting on errors, the SCRAMNet+ network cards can be configured to
generate an interrupt whenever a given 32-bit memory word is written. Each 32-bit word
in SCRAMNet+ memory has an associated ACR (Auxiliary Control RAM) location that
is used to configure this feature. Each ACR can be configured to send interrupts, receive
interrupts or both. Although the memory of the cards is replicated on every node, the
ACRs are not. Therefore, the interrupt configuration for each word can be different on
every node.
Whenever a node writes a 32-bit memory word, the ACR for that word on that node is
checked. If it is configured to send interrupts, an interrupt message is generated
containing the memory offset of the word written. Whenever a node receives an interrupt
message, the ACR for the word written is also checked. If the ACR is configured to
receive interrupts, the memory offset for that word is stored in a FIFO (First-In, First-Out
data buffer) for the ISR (Interrupt Service Routine) to interrogate. The first entry into the
FIFO generates an interrupt on the host and disables the interrupt hardware until re-
enabled by the ISR. Any subsequent interrupt messages are inserted in the FIFO without
generating an interrupt. The ISR then continually processes the interrupt FIFO until it is
empty. This allows the ISR to process multiple interrupts with only one context switch.
Once the ISR detects the FIFO is empty, it re-enables the interrupt and exits.
Both the mutual exclusion and CAS algorithms presented in this thesis exploit this
interrupt feature by enabling node 0 to receive interrupts. Writing to specific shared-
memory words generates interrupt messages signaling the ISR on node 0 of a request.
The ISR essentially arbitrates between concurrent requests from other nodes. Processes
on the ISR node may also participate in the algorithm, because node 0’s ACRs for the
appropriate words are configured to both send and receive interrupts. The SCRAMNet+
card must also be configured to enable self-interrupts, which allows a node to receive its
own interrupt messages. Systran’s algorithm does not use the interrupt features of their
cards. Instead they must use the Write-Me-Last mode which is described next.
2.4 Write-Me-Last Mode
Normally when a node writes to a shared-memory word, the word is immediately
modified on the originating node and a message is propagated around the ring replicating
the write to all other nodes. In Write-Me-Last mode, the originating node of a write is the
last node to have its memory word written. This is achieved by only modifying the
originating node’s memory when it receives its own message. This can be used to
guarantee that data is available on all other nodes by writing a value to a shared-memory
word and then spinning on the word written until it changes to that value. Systran uses
this technique in their mutual exclusion algorithm (see Section 3.1).
3 Blocking Synchronization
Mutual exclusion algorithms are a form of blocking synchronization. The semantics of
mutual exclusion prevent more than one process from entering the critical section at a
time. A process enters the critical section via the Acquire() procedure. If a process B
attempts to enter a critical section while another process A is already in the critical
section, that process B remains in the Acquire() procedure until process A performs a
Release(), which exits the critical section. Therefore, process B is blocked until process A
exits the critical section.
In this section we present a mutual exclusion algorithm suggested by Systran in [15] and
a new mutual exclusion algorithm based on interrupt features of the SCRAMNet+
hardware. We also present the results of both real-world and simulation experiments
comparing the two.
3.1 Systran’s Mutual Exclusion Algorithm
Figure 1 contains Systran’s mutual exclusion algorithm, which is described in [15]. The
programming notation used is similar to notation of most shared-memory algorithms and
should be self-explanatory. Their algorithm requires that a node request to enter the
critical section by setting a flag. It must then determine that no other nodes are in the
critical section by reading the flags of all the other nodes. If any other node’s flag is set,
there has been a collision (more than one node has simultaneously written to its flag) and
one of the nodes must continue while the others reset their flags and retry. Systran
suggests a prioritization scheme whereby the lower priority node retries and the higher
priority node may continue.
Systran’s algorithm also requires that the SCRAMNet+ system be configured for Write-
Me-Last mode, as described in Section 2.4. This is necessary to guarantee that all nodes
have seen a write before the originating node continues. This is achieved by writing a
value and spinning on the word until it changes to that value. The acquire() and release() procedures,
described next, implement the entering and exiting of the critical section respectively.
3.1.1 Acquire
Each node’s flag is represented by an element in the array FLAG[N] where N is the
number of nodes in the system. Nodes are prioritized with the highest priority node as the
first element in the array and the lowest priority node as the last. To enter the critical
section, a node n must continually read the entire FLAG array until every element is zero.
This indicates that no other node is currently in the critical section. Then the node writes
a non-zero value to its element in the array, FLAG[n]. It then spins on that array element
until that value is read back. Since the SCRAMNet+ network is in Write-Me-Last mode,
this guarantees that all other nodes have seen its request. Now the node must scan the
FLAG array from highest to lowest priority to see if there have been any collisions.
If a collision is detected, the lower priority node removes its request by writing a zero to
FLAG[n] and starts back at the beginning of the loop. The higher priority node spins on
the lower priority node’s array location until it changes to zero or a time-out expires. The
time-out is necessary because the lower priority node may not have even seen the higher
priority node’s request and will not have cleared its flag. Systran suggests at time-out of
one message transit time. This is easily achieved in Write-Me-Last mode by incrementing
FLAG[n] and waiting to see it change. If the higher priority node does time-out, it
revokes its request by writing a zero to FLAG[n] and starts back at the beginning of the
loop. Otherwise, it continues and enters the critical section.
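The priority rule for resolving collisions can be illustrated with a deliberately simplified Python sketch; it shows only the scan of the FLAG array, omitting the Write-Me-Last waiting and the time-out:

```python
# Simplified view of Systran's collision rule: after a node's request is
# visible everywhere, it scans FLAG in priority order (index 0 highest).
# A node that sees a higher-priority request must back off and retry; it
# may wait out any lower-priority requests it sees.

N = 4
FLAG = [0] * N

def collision_outcome(n):
    """Return 'proceed' if node n out-prioritizes every collision it
    sees, or 'retry' if a higher-priority node (lower index) also has
    its flag set."""
    for i in range(N):
        if FLAG[i] != 0 and i != n:
            if i < n:
                return "retry"   # defer to the higher-priority node
            # i > n: the lower-priority node will clear its own flag
    return "proceed"

FLAG[1] = 1
FLAG[3] = 1                      # nodes 1 and 3 collide
assert collision_outcome(1) == "proceed"
assert collision_outcome(3) == "retry"
```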
3.1.2 Release
To exit the critical section node n simply writes a zero to its designated array location,
FLAG[n].
Shared variable FLAG: array[0..N-1] of integer

Local variable i: 0..N-1; zero: boolean; grant: boolean; attempts: integer

procedure Acquire()
begin
    attempts := 0;
    do
        grant := true;
        FLAG[n] := 0;
        do
            zero := true;
            for i := 0 to N-1 do
                if FLAG[i] ≠ 0 then zero := false; break fi
            od
        while ¬zero;
        /* Write and wait for own request */
        attempts := attempts + 1;
        FLAG[n] := attempts;
        while FLAG[n] ≠ attempts do od;
        for i := 0 to N-1 do
            if FLAG[i] ≠ 0 then
                if i < n then
                    grant := false; break
                else if i > n then
                    /* Write and wait for one round trip or revoke */
                    attempts := attempts + 1;
                    FLAG[n] := attempts;
                    while (FLAG[n] ≠ attempts) ∧ (FLAG[i] ≠ 0) do od;
                    if FLAG[i] ≠ 0 then grant := false; break fi
                fi
            fi
        od
    while ¬grant
end

procedure Release()
begin
    FLAG[n] := 0
end
Figure 1 Systran’s mutual exclusion algorithm
3.2 Our Mutual Exclusion Algorithm
As explained in Section 2.3, the ISR on node 0 is configured to receive all interrupts in
the system. Our algorithm uses three shared variables to communicate between the ISR
and the nodes. The REQ array is configured to generate interrupts that signal the ISR that
a node requests access to the critical section. The GRANT array is used as a spinlock that
the ISR will write to notify a node that it has been granted access to the critical section.
Finally, RELEASE is configured to generate an interrupt, notifying the ISR that a node
has exited the critical section. The following three sections explain the code for acquire(),
release() and the ISR, which are shown in Figure 2. We have added one definition,
snaddr, to our programming notation; it denotes an address within SCRAMNet+ memory.
3.2.1 Acquire
To enter the critical section, a process p must first perform a local acquire. The local
acquire synchronizes processes on the same node by only allowing one process per node
to attempt to enter the critical section. This bounds the size of the arrays to the number of
nodes in the system, rather than the number of processes in the system. It also eliminates
unnecessary network and ISR activity by eliminating multiple requests from the same
node. Any local mutual exclusion algorithm can be used and the same method could be
applied to Systran’s algorithm. However, the local acquire was excluded from our
experiments so that timing particular to the algorithms could be studied.
Once process p has returned from the local acquire, it writes false to GRANT[n], where n
is the node that process p resides on, to initialize the spinlock. Then the process writes
true to REQ[n]. This generates an interrupt message to node 0. The ISR on node 0 then
determines which node is requesting to enter the critical section from the offset in the
interrupt FIFO. Meanwhile process p is spinning on GRANT[n]. Once the ISR has
determined to let the node enter the critical section, it writes true to GRANT[n] and
process p on node n exits the spinlock and continues. A process may now access any
shared data and exit the critical section via the release() procedure.
3.2.2 Release
To exit the critical section, process p writes true to RELEASE. This generates an interrupt
message to node 0, thereby notifying the ISR that a node is exiting the critical section.
RELEASE does not need to be an array due to the semantics of mutual exclusion. That is,
only one node can be in the critical section at a time. In contrast, because multiple nodes
may make concurrent requests, GRANT and REQ must be arrays.
3.2.3 ISR0
The ISR on node 0 maintains two local variables: owner is the node currently in the
critical section and wait is a FIFO queue of nodes waiting for the critical section. If an
interrupt occurs and the offset is within the REQ array, the ISR determines whether to
grant the critical section to the requesting node (req) based on the state of owner. If there
is no current owner (owner equals –1), owner is set to req and the request is granted by
writing true to GRANT[req]. Otherwise req is inserted in the wait FIFO.
If the interrupt is caused by a write to RELEASE, then the next node in the wait FIFO
queue is dequeued into owner and granted the critical section by writing true to
GRANT[owner]. If wait is empty, then owner is set to –1, indicating that the critical
section is available. Whenever the ISR
grants the critical section to a node, by writing true to either GRANT[req] or
GRANT[owner], the corresponding process p on node req or owner is released from its
spinlock.
Figure 2 Our mutual exclusion algorithm
Shared variable REQ, GRANT: array[0..N-1] of boolean initially false; RELEASE: integer

interrupts writes to REQ and RELEASE interrupt node 0

private variable for node 0 wait: queue of 0..N-1; req: integer; owner: -1..N-1 initially –1

isr ISR0(addr: snaddr)
begin
    if addr = &RELEASE then
        if empty(wait) then
            owner := -1
        else
            owner := dequeue(wait);
            GRANT[owner] := true
        fi
    else
        req := (addr - &REQ[0])/4;  /* Determine which node made the request; there are 4 bytes per word */
        if 0 ≤ req ∧ req < N then
            if owner ≠ -1 then
                enqueue(wait, req)
            else
                owner := req; GRANT[req] := true
            fi
        fi
    fi
end

procedure Acquire()
begin
    LocalAcquire();
    GRANT[n] := false;
    REQ[n] := true;
    while ¬GRANT[n] do od
end

procedure Release()
begin
    RELEASE := true;
    LocalRelease()
end
3.3 Theoretical Comparison
Both algorithms have arrays of length N and our algorithm also has a queue of size N.
Therefore, the space complexity of both algorithms is O(N), where N is the number of
nodes on the network. However, the time complexity of Systran’s algorithm in the
absence of contention is O(N) versus O(1) for our algorithm. This is because Systran’s
algorithm scans every element of its FLAG array, whereas our algorithm only uses the
REQ array to generate interrupts and the GRANT array as a spinlock. The execution time
of the ISR is also constant and is not affected by the number of nodes when there is no
contention.
However, increasing the ring size does increase the execution time of both algorithms by
also increasing the round-trip transit time of the network. This is inherent to the design of
the network and cannot be avoided by any algorithm. Thus we do not consider the transit
time when computing the time complexity of either algorithm.
Furthermore, Systran’s algorithm is not starvation-free because of the chance of
collisions. It is possible that collisions never end and no progress is made by some node
(starvation) or by any node (live-lock). The likelihood of such a situation increases as the
number of nodes increases. Although Systran suggests a priority scheme, there is no
guarantee of the prioritization because a higher priority node may never get into the
critical section if it has a much slower processor speed than some other nodes. This is
because when a node releases the critical section and the array becomes all zeros, a faster
processor might detect this earlier and enter the critical section before the higher priority
but slower node.
In contrast, our algorithm is starvation free due to the use of two FIFOs to order the
requests. First, the interrupt FIFO on the SCRAMNet+ cards ensures that the ISR on
node 0 receives the interrupt messages in First-In First-Out order. The error correction
and retransmission feature of PLATINUM mode also ensures that all interrupt messages
are eventually received correctly by node 0. Second, the wait FIFO is used by the ISR to
order the nodes waiting for the critical section. All of the analytical comparisons above
were verified through experiments on an actual SCRAMNet+ system and through
simulations.
3.4 System Experiments
Experiments were performed to test each algorithm both with and without contention. In
general, each experiment performed 10,000,000 iterations of an Acquire/Release pair
with an increment of one global variable between them. The average execution time of
the pair of operations was determined by dividing the total time to execute the experiment
by the number of iterations. Due to our limited hardware, all experiments were performed
on a system with only two nodes. However, the total possible number of nodes in the
experiment was varied. For example, the array sizes increase for both algorithms and the
queue size increases for our algorithm. This does not take into consideration the increase
in the round-trip time, but does examine the effects from the algorithm.
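The measurement loop can be sketched as follows (hypothetical Python, not the RT-Mach test code; `acquire` and `release` stand in for either algorithm's procedures):

```python
import time

def measure(acquire, release, iterations=100_000):
    """Time an Acquire/Release pair around a one-word critical section
    and return (counter, average pair time in seconds), computed as the
    total elapsed time divided by the iteration count."""
    counter = 0
    start = time.perf_counter()
    for _ in range(iterations):
        acquire()
        counter += 1               # increment of one global variable
        release()
    elapsed = time.perf_counter() - start
    return counter, elapsed / iterations

# No-op lock, just to exercise the harness.
count, avg = measure(lambda: None, lambda: None, iterations=10_000)
assert count == 10_000 and avg >= 0.0
```

The final counter value also doubles as the correctness check used later in the contention experiments.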
For all of the experiments on Systran’s algorithm, node 0 was configured as the highest
priority node and was therefore assigned the first element of the FLAG array. Likewise,
node 1 was configured as the lowest priority node and assigned the second element in the
FLAG array. For all the experiments on our algorithms, node 0 was configured to execute
the ISR and participate in the algorithm and node 1 was configured to only participate in
the algorithm. The first experiments performed were in the absence of contention.
3.4.1 No Contention
The experiments without contention were executed on each node and for each algorithm
but without the other node participating. Figure 3 graphs the average execution time for
each algorithm executed on each node as the maximum number of nodes is increased.
The graphs show that our algorithm scales far better than Systran’s algorithm. This is due
to our O(1) time complexity in absence of contention compared to Systran’s O(N), where
N is the number of nodes on the ring. This is because Systran’s algorithm scans its FLAG
array of length N at least twice per acquire. In contrast, our algorithm only uses its REQ
array to generate interrupts and its GRANT array as a spinlock. Therefore, their graphs
rise as the number of nodes increase where our graphs are flat.
Figure 4 shows a close-up view of the same graphs as Figure 3. It shows that our
algorithm performs better than Systran’s algorithm when a system contains 9 nodes or
more. The graphs also indicate a difference between node 0’s and node 1’s average
execution times for our algorithm. This is due to the different sequence of events for the
acquire procedure on each node, as shown in Figure 5. Node 0 does not have to wait for a
transit time for the ISR to see its writes and vice versa, since they both are on the same
node. However, the process on node 0 does suffer from a context switch delay between
the time the ISR finishes and when the process can continue. In contrast, node 1 must
always wait a transit time for the ISR to see its writes and vice versa, but it does not have
to wait for a context switch before continuing. This is because a message is sent to node 1
as soon as the ISR writes to GRANT[1] and then the context switch occurs as the ISR
exits. This context switch does not affect node 1 since it occurs on node 0.
[Graph: average execution time for an Acquire/Release pair (µs, 0–550) vs. maximum number of nodes (0–256). Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node w/ ISR), Our Alg. (Node w/o ISR).]
Figure 3 Comparison of ME algorithms without contention on a real system
[Graph: close-up of the same data for 2–16 nodes; average execution time for an Acquire/Release pair (µs, 0–45). Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node w/ ISR), Our Alg. (Node w/o ISR).]
Figure 4 Close-up comparison of ME algorithms without contention on a real system
Figure 5 Timing sequence of our algorithm’s acquire procedure without contention
Node 0:
1. Process writes to REQ[0]
2. Interrupt occurs and context switch changes to ISR
3. ISR writes true to GRANT[0]
4. ISR exits and context switch changes to the process
5. Process sees GRANT[0] = true

Node 1:
1. Process writes to REQ[1]
2. Transit time delay for write to reach node 0
3. Interrupt occurs and context switch changes to ISR
4. ISR writes true to GRANT[1]
5. Transit time delay for write to reach node 1
6. Process sees GRANT[1] = true
3.4.2 Contention
The experiments with contention were executed simultaneously on both nodes with a
barrier to synchronize the start of the experiment. The first test was to verify that the
semantics of mutual exclusion were maintained. This was achieved by simply verifying
that the global counter, which is incremented between the Acquire/Release pair, was
twice the number of iterations, which was always true for both algorithms. The
experiments also calculated the average execution times as before.
Figure 6 and Figure 7 graph the average execution time for each algorithm as the
maximum possible number of nodes is increased. The results indicate that the average
execution time increases with contention for Systran's algorithm whereas it decreases
for our algorithm. This is demonstrated by computing the combined average execution
time of both nodes and comparing the result to the average without contention. The
average time with contention was calculated by taking the maximum value of the two
nodes and dividing it by two. The maximum value is used because it is the time that both
nodes finished all the iterations. The average time without contention was computed by
adding the results for the two nodes without contention and dividing that value by two.
Table 1 contains the computations from the results with a maximum number of nodes
equal to 256.
AVERAGE              SYSTRAN'S ALGORITHM            OUR ALGORITHM
With contention      (1020.4 / 2) = 510.2           (40.3 / 2) = 20.2
Without contention   (509.4 + 507.3) / 2 = 508.4    (22.8 + 19.0) / 2 = 20.9

Table 1 Comparison of average execution times for a pair of acquire/release operations when the maximum number of nodes equals 256 (µs)
Both graphs also demonstrate that our algorithm is fair to both nodes, while Systran’s
algorithm is not. That is, with our algorithm both nodes have identical execution times
of 40.3 microseconds, while there is a significant difference with Systran’s algorithm.
Their lower priority node takes twice as long as the higher priority node, because it is
completely starved by the priority mechanism. The highest priority node essentially runs
to completion and then the lowest priority node runs to completion. Although this is a
consequence of their design, it is not guaranteed as explained in Section 3.3. However,
our algorithm could guarantee this through sorting its wait queue by priority.
[Graph: average execution time for an Acquire/Release pair (µs, 0–1200) vs. maximum number of nodes (0–256), with contention. Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR), Our Alg. (Node 1 w/o ISR).]
Figure 6 Comparison of ME algorithms with contention on a real system
[Graph: close-up of the same data for 2–16 nodes; average execution time for an Acquire/Release pair (µs, 0–100), with contention. Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node w/ ISR), Our Alg. (Node w/o ISR).]
Figure 7 Close-up comparison of ME algorithms with contention on a real system
3.5 Simulation Experiments
Our real system only consisted of two nodes, which was sufficient for testing the
algorithms without contention but neglected the effects of the network's round-trip time.
It also limited the number of nodes that could participate in the experiments. Therefore,
we designed a SCRAMNet+ simulator for any number of nodes, which is described in
Section 5. The results of the simulation experiments for the mutual exclusion algorithms
are provided below.
The first experiments were identical to those on the real system and were performed both
with and without contention. As before, only two nodes were used and the maximum
possible number of nodes was increased. Only the trends of the simulation and real-system
experiments should be compared since the simulation does not take all factors into
account. One factor is the activity of the RT-Mach operating system, such as clock
interrupts, swapping, daemons, etc. Not considering these factors should make the results
of the simulation faster than the real-system results, which is the case.
3.5.1 No Contention
The simulation results of the experiments without contention are shown in Figure 8 and
Figure 9, which correspond to the real-system results in Figure 3 and Figure 4. These
graphs are similar in both the slope and magnitude of the graphs. They also show a
similar difference between the nodes with and without the ISR for our algorithm. That is,
the node without the ISR has faster execution times than the node with the ISR when
executed without contention.
3.5.2 Contention
The simulation results of experiments with contention are shown in Figure 10 and Figure
11, which correspond to the real-system results in Figure 6 and Figure 7. These graphs
also show similar trends in slope and magnitude. They also show that both nodes in our
algorithm have identical execution times when under contention.
The similarity of the simulation and real-world results validate the implementation and
parameters used for our simulations (see Section 5.4). This allowed us to perform more
simulation experiments with confidence in the results.
[Graph: average execution time for an Acquire/Release pair (µs, 0–600) vs. maximum number of nodes (0–256), simulated, without contention. Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR), Our Alg. (Node 1 w/o ISR).]
Figure 8 Comparison of ME algorithms without contention on a simulated system
[Graph: close-up of the same simulated data for 2–16 nodes; average execution time for an Acquire/Release pair (µs, 0–45). Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR), Our Alg. (Node 1 w/o ISR).]
Figure 9 Close-up comparison of ME algorithms without contention on a simulated system
[Graph: average execution time for an Acquire/Release pair (µs, 0–1200) vs. maximum number of nodes (0–256), simulated, with contention. Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR), Our Alg. (Node 1 w/o ISR).]
Figure 10 Comparison of ME algorithms with contention on a simulated system
[Graph: close-up of the same simulated data for 2–16 nodes; average execution time for an Acquire/Release pair (µs, 0–100), with contention. Series: Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR), Our Alg. (Node 1 w/o ISR).]
Figure 11 Close-up comparison of ME algorithms with contention on a simulated system
3.5.3 Polling
We also used the simulator to implement a polling version of our mutual exclusion
algorithm. It uses a dedicated node to continually poll the interrupt FIFO and execute the
same code as the ISR. This eliminates any context switch times, but adds one extra transit
time to the round trip time. We used the same experiments from the previous mutual
exclusion simulations to compare against the new polling version.
The results comparing the ISR and polling versions without contention are shown in
Figure 12. With the polling version, both nodes have the same average execution time
without contention. This is because neither thread runs on the ISR node. There is also a
4-microsecond, or 27 percent, improvement because there is no context switch delay in
the polling version. The improvement is less than the full 5-microsecond context switch
time because the polling version uses a dedicated node, requiring three nodes for two
nodes to participate. This increases the ring size and the round-trip transit time. The
results also further validate the context switch calculation from the timing diagrams in
Figure 19, which show one context switch for nodes without the ISR when using the
interrupt version of our algorithm.
The results comparing the interrupt and polling versions under contention are shown in
Figure 13. The trends are similar for both algorithms. That is, both nodes have identical
execution times. However, the removal of the context switch reduces the total execution
time of the polling version by 18 microseconds or 50 percent.
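The dedicated node's loop can be sketched in Python; the FIFO is modeled as a plain queue and the ISR body as a stub, both our assumptions:

```python
from collections import deque

interrupt_fifo = deque()     # models the card's interrupt FIFO

def handle(offset):
    """Stand-in for the ISR body: process one interrupt message."""
    return offset

def poll(max_empty_polls):
    """Dedicated-node loop: continually drain the FIFO, avoiding the
    context switch that interrupt delivery would cost. A real node
    polls forever; max_empty_polls just lets this sketch terminate."""
    handled = []
    empty = 0
    while empty < max_empty_polls:
        if interrupt_fifo:
            handled.append(handle(interrupt_fifo.popleft()))
            empty = 0
        else:
            empty += 1
    return handled

interrupt_fifo.extend([4, 8, 12])   # three pending interrupt offsets
assert poll(3) == [4, 8, 12]        # processed in FIFO order
```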
[Graph: average execution time for an Acquire/Release pair (µs, 0–25) vs. maximum number of nodes (0–256), without contention. Series: Polling Ver. (Node 1), Polling Ver. (Node 2), Interrupt Ver. (Node 0 w/ ISR), Interrupt Ver. (Node 1 w/o ISR).]
Figure 12 Comparison of polling and interrupt versions without contention on a simulated system
[Graph: average execution time for an Acquire/Release pair (µs, 0–40) vs. maximum number of nodes (0–256), with contention. Series: Polling Ver. (Node 1), Polling Ver. (Node 2), Interrupt Ver. (Node 0 w/ ISR), Interrupt Ver. (Node 1 w/o ISR).]
Figure 13 Comparison of polling and interrupt versions with contention on a simulated system
3.5.4 Heavy Contention
The main goal of developing the simulators was to perform experiments with a large
number of nodes without the actual hardware. Because our real system only had two
nodes, we could only increase the total possible number of nodes by increasing the sizes
of the structures used by each algorithm. However, this did not affect the ring size and
round-trip time of the messages. With our new simulations we created experiments where
every node added increases the ring size and that node participates in acquiring the
critical section.
The results comparing the mutual exclusion algorithms are shown in Figure 14 and
Figure 15. These figures graph the average time for a node to execute an Acquire/Release
pair of operations. The average was computed by taking the maximum execution time
and dividing that value by the number of nodes in the experiment. This computation was
used because when under contention, all nodes executed simultaneously. Therefore, all
nodes are finished at that maximum time. The results show that both the interrupt and
polling versions of our algorithm clearly outperform Systran’s algorithm when under
heavy contention. In fact, the polling version outperforms their algorithm with any
number of nodes in the ring. Our execution time does increase as the number of nodes
increase, but this is attributed to the increase in ring size. As the size increases it takes
longer for a message to traverse the ring, ultimately increasing the execution times.
Figure 16 shows the average execution time for each node on a 256-node system under
heavy contention. Our algorithm is clearly fair since each node has an identical average
for both the interrupt and polling versions. In contrast, Systran’s algorithm starves any
lower priority node. The graph indicates this: as the node number increases, its priority
decreases and its execution time increases. However, the graph is not strictly
monotonically increasing, which indicates this prioritization is not always guaranteed.
[Graph: maximum execution time for an Acquire/Release pair (µs, 0–1000) vs. number of nodes (0–256), under heavy contention. Series: Systran's Alg., Interrupt Ver., Polling Ver.]
Figure 14 Comparison of all ME algorithms under heavy contention on a simulated system
[Graph: close-up of the same data for 2–16 nodes; maximum execution time for an Acquire/Release pair (µs, 0–65). Series: Systran's Alg., Interrupt Ver., Polling Ver.]
Figure 15 Close-up comparison of all ME algorithms under heavy contention on a simulated system
[Graph: average execution time for an Acquire/Release pair (µs, 0–1000) vs. node number (0–256), under heavy contention. Series: Systran's Alg., Interrupt Ver., Polling Ver.]
Figure 16 Comparison of average execution times for each node under heavy contention
3.6 Conclusions and Future Work
Our algorithm scales far better than Systran’s algorithm both with and without
contention. In fact, the polling version of our algorithm outperforms Systran’s algorithm
regardless of the number of nodes. The results of our experiments also show that our
algorithm is starvation-free whereas Systran's algorithm is not.
The removal of context switches in the polling version of our algorithm also leads
to an even more interesting extension. The mutual exclusion functionality could be
embedded into the SCRAMNet+ card. Each card could use a microprocessor and
firmware to execute the algorithm, thus avoiding interrupts and context switches on the
nodes themselves. The load could also be distributed by configuring cards on each node
to process different critical sections. For example, node 0 could process the first 5 critical
sections, node 1 the next 5 and so on.
Another modification to the hardware could reduce the amount of memory used by our
mutual exclusion algorithm. Currently, the interrupt FIFO only contains the offset of the
word written. However, if the value of the write was also included, the REQ and GRANT
arrays could be reduced to just two words: REQ2 and GRANT2. A process could then
write its node number to REQ2 to signal the ISR and read GRANT2 to determine who
currently has the critical section, including itself. The ISR would use the value from the
FIFO, instead of the offset, to determine which node is requesting the critical section and
grant the critical section by writing the node number to GRANT2. RELEASE would be
used as before. Currently the variables must be arrays to avoid a race condition where the
value of a write may change before the ISR can read the value of the write.
Finally, priority-based schemes could be implemented and tested to compare against
Systran’s algorithm. However, the nondeterministic nature of their prioritization scheme
will depend on the system configuration, such as the heterogeneity of processor speeds
among nodes, the correlation between the priority of a node and its position on the ring,
and the ring size.
4 Non-blocking Synchronization
Non-blocking algorithms avoid the pitfalls of blocking algorithms, such as deadlock.
Compare and Swap is a universal object that can construct any wait-free or lock-free
object [6]. This section presents a CAS algorithm for SCRAMNet+ systems and uses it to
construct both lock-free and wait-free large objects.
4.1 Compare and Swap
To understand our implementation, it is important to first understand the semantics of
CAS, which are equivalent to the atomic code fragment in Figure 17. We have also
provided a Read operation, because most non-blocking algorithms that use CAS require
it. The semantics of the Read operation are to return the new value from the last
successful CAS operation.
Figure 17 Semantics of compare and swap
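The same semantics can be written as a small executable Python class; the atomicity of the cas body is assumed here, as it is in the figure:

```python
class Register:
    """A register with the CAS and Read semantics of Figure 17. In a
    real system the body of cas must execute atomically."""

    def __init__(self, value):
        self.value = value

    def cas(self, old, new):
        """If the value equals old, set it to new and report success;
        otherwise leave it unchanged and fail."""
        if self.value == old:
            self.value = new
            return True
        return False

    def read(self):
        """Return the value installed by the last successful cas."""
        return self.value

r = Register(0)
assert r.cas(0, 5)        # succeeds: the old value matches
assert not r.cas(0, 9)    # fails: the value is now 5
assert r.read() == 5
```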
Our compare and swap algorithm uses similar ISR and spinlock techniques to our mutual
exclusion algorithm. However, instead of each element of the array representing a node,
they represent a process. Node 0 is again configured to execute the ISR, which maintains
the current value (cur) of the implemented register and arbitrates the CAS and Read
requests for all nodes.
The algorithm uses four shared arrays of length P, where P is the number of processes in
the system. The first two arrays, OLD and NEW, are used by the CAS() procedure to pass
CAS(X, old, new)
    if X = old then
        X := new;
        return true
    else
        return false
    fi
the old and new parameters to the ISR. The NEW array is also used to return the value to
the Read() procedure. The second two arrays, STAT and DONE, are used to indicate the
type of operation, read or cas, and the result of the operation, succ or fail. The STAT
array is also configured to generate an interrupt to inform the ISR of a request. The
CAS(), Read() and ISR procedures are shown in Figure 18 and are explained in the
following two sections.
4.1.1 CAS
To perform a compare and swap, a process p writes its old value to OLD[p] and its new
value to NEW[p]. This is used to pass the information to the ISR. Then cas is written to
both DONE[p] and STAT[p]. Writing cas to DONE[p] initializes the spinlock and writing
cas to STAT[p] generates an interrupt message to node 0, indicating a cas operation. Then
process p spins on DONE[p], waiting for the ISR to indicate that the operation is
complete.
When the ISR receives a cas operation it compares OLD[req] to cur, where req is the
process requesting the operation. If they are the same, the operation is successful. In this
case, cur is updated to the value of NEW[req], and succ is written to DONE[req]. If they
are different, then fail is written to DONE[req]. Either way process p is released from its
spinlock and returns the result of the operation, DONE[p].
4.1.2 Read
To read the current value of the variable, a process p must write read to both DONE[p]
and STAT[p]. Writing read to DONE[p] initializes the spinlock and writing read to
STAT[p] generates an interrupt message to node 0, indicating a read operation. Then
process p spins on DONE[p], waiting for the ISR to indicate that the operation is
complete.
35
When the ISR receives a read operation it simply writes the current value (cur) to
NEW[req] and writes succ to DONE[req], where req is the node requesting the operation.
The read procedure is thereby released from its spinlock and returns the value of NEW[p].
Figure 18 Compare and swap algorithm

shared variable
    OLD, NEW: array[0..P-1] of valtype;
    STAT, DONE: array[0..P-1] of {read, cas, succ, fail}

interrupts
    writes to STAT interrupt node 0

private variable for node 0
    cur: valtype;
    req: 0..P-1

isr ISR0(addr: snaddr)
begin
    req := (addr - &STAT[0]) / 4;  /* Determine which process wrote STAT. */
    if 0 ≤ req ∧ req < P then
        case STAT[req] of
            cas:  if OLD[req] ≠ cur then DONE[req] := fail
                  else cur := NEW[req]; DONE[req] := succ fi
            read: NEW[req] := cur; DONE[req] := succ
        esac
    fi
end

procedure Read() returns valtype
begin
    DONE[p] := read;
    STAT[p] := read;
    while DONE[p] = read do od;
    return NEW[p]
end

procedure CAS(old, new: valtype) returns boolean
begin
    OLD[p] := old;
    NEW[p] := new;
    DONE[p] := cas;
    STAT[p] := cas;
    while DONE[p] = cas do od;
    return DONE[p] = succ
end
4.1.3 Analysis
The space complexity of our algorithm is O(P), where P is the number of processes. This
is because all of the arrays are of size P. The time complexity of our algorithm is O(1) in
the absence of contention, since every access is directly to one element in each array. It is
also easy to see that our algorithm provides the correct semantics for both operations by
using the ISR to serialize requests for both operations.
It is tempting to implement the current value (cur) as a shared-memory variable, allowing
any process to simply read the location instead of using the ISR. However, this would
allow a node “downstream” to read an old value of the register after the ISR updates the
new value. This would be an improper serialization of the operations and violate the
semantics of Read and CAS.
One might also question why both the STAT and DONE arrays were not combined into
one array, say STAT2. The arrays were not combined because the ISR would generate an
unnecessary interrupt to itself whenever it writes succ or fail to STAT2[n]. Therefore, the
arrays were kept separate, avoiding unnecessary context switches and delays.
4.1.4 Experiments
The first experiment used a shared-memory counter incremented with the Read and CAS
operations to validate our algorithm. The counter was incremented by reading the current
value with Read(), incrementing that value, then updating that value with CAS(). If CAS()
failed, the sequence was repeated until a success, and that would be considered one
iteration of the loop. Both nodes simultaneously ran 100,000,000 iterations and the final
value was verified as twice that.
Two more experiments were performed for each procedure. One experiment tested the
algorithm without contention and the other with contention. Each experiment simply
performed 100,000,000 iterations of each procedure tested and an average execution time
was calculated from the total execution time of the loop. The results were as follows.
Table 2 contains the results of the Read procedure both with and without contention.
Likewise, Table 3 contains the results for the CAS procedure. As with our mutual
exclusion algorithm, the maximum possible number of processes does not affect the
performance. Therefore both tables include the results for a maximum number of processes equal to 256, or one per node. However, increasing the actual number of nodes in the
network would affect the performance due to the increase in the round-trip transit time.
READ             NODE 0 W/ ISR    NODE 1 W/O ISR
No contention    34.10            31.67
Contention       39.25            38.11

Table 2 Average execution time for a read operation (μs)
CAS              NODE 0 W/ ISR    NODE 1 W/O ISR
No contention    37.38            34.00
Contention       39.67            39.08

Table 3 Average execution time for a CAS operation (μs)
The results for the contention experiments were measured simultaneously by using a
barrier to synchronize the start of nodes 0 and 1. Therefore the average time for both
nodes to complete one operation is the higher of the two nodes, 39.25 microseconds for
Read() and 39.67 microseconds for CAS(). In contrast, the experiments without
contention calculate the average time for only one node to perform one operation.
Therefore, the average times with contention actually cover twice as many operations as the average times without contention, yet are nearly the same. This happens because
the ISR can read multiple requests (FIFO entries) within one context switch. This is
highly likely since the nodes are operating in parallel. This concurrency was not as
evident in the mutual exclusion experiments, due to the semantics of mutual exclusion.
That is, although the ISR may simultaneously process two Acquire requests, only one
will receive an immediate response.
4.2 Large Objects
The lock-free and wait-free constructions for large objects of Anderson and Moir from
[2] were simulated and evaluated by Filachek in [5]. This thesis furthers their work by
implementing the large object constructs on a SCRAMNet+ system using the CAS from
above.
The implementation of the large shared objects required several modifications to the code
from [5]. One of the modifications was due to the different models used. [5] uses a thread-based model running on a multiprocessor machine. However, a SCRAMNet+ system uses a process-based model running on different machines. This difference required
the addition of barriers to synchronize the initialization of the data structures between the
two nodes. A thread would just inherit such information from its parent.
All of the large object constructions in [2, 5] use the load-linked (LL), store-conditional (SC) and validate (VL) primitives, which can be implemented with Read and CAS primitives as described in [10]. Therefore the LL, SC and VL primitives were modified to use our CAS
algorithm from Section 4. However, these primitives use 64-bit values and required our
CAS and Read operations to be modified to do likewise. This was achieved by doubling
the size of the OLD and NEW arrays, so they could be indexed as an array of long
integers.
4.2.1 Experiments
Two experiments were performed for both the lock-free and wait-free implementations of
a FIFO queue. Each experiment performed 100,000 iterations of the Enqueue/Dequeue
pair of operations on a queue. The number of iterations for these last experiments is lower
than the previous sections due to time constraints. The SCRAMNet+ cards were
borrowed for a limited time and the higher execution times caused 100,000 iterations to
take at least a day. As before, one experiment tested the algorithms without contention;
the other tested them with contention.
The most important result was from the experiments with contention, which validated the
correctness of the large object constructions. Checking the state of the queue throughout
the experiments validated the correctness. First, each node inserted its node number with
each enqueue operation and verified the number returned by each dequeue operation. A
valid number was either 0 or 1, since both nodes were operating in parallel. Second, a barrier was used to detect when both nodes were finished, at which point the queue was checked to confirm it was empty. Finally, the total number of dequeues for each node was verified
to be the same as the number of enqueues or iterations.
The lock-free contention results from Table 4 show that the algorithm may not be fair.
Node 0 has significantly shorter average execution times than node 1. Therefore it may
be possible for one node to be starved. However, this is acceptable because of the
definition of lock-free, which is that some process will make progress in a finite number
of steps. The definition does not indicate which process should make progress and
therefore allows starvation.
LOCK-FREE        NODE 0 W/ ISR    NODE 1 W/O ISR
No contention    1187.5           1375.3
Contention       1255.0           2583.1

Table 4 Average execution time for a pair of enqueue/dequeue operations for the lock-free construction of large objects (μs)
Just as before, the results for contention for node 0 and node 1 were measured
simultaneously. Therefore the total time for both to complete is the higher of the two
results, 2583.1 microseconds. This value is near the sum of the two nodes without
contention (1187.5 + 1375.3 = 2562.8 ≈ 2583.1). This indicates that the operations are
not concurrent as mentioned in both [2] and [5].
WAIT-FREE        NODE 0 W/ ISR    NODE 1 W/O ISR
No contention    2438.6           2643.4
Contention       4393.3           3186.0

Table 5 Average execution time for a pair of enqueue/dequeue operations for the wait-free construction of large objects (μs)
The wait-free results are far more difficult to interpret. One would expect that under contention the execution times for both nodes would be nearly the same. However, since the interrupts occur on node 0, it may not be requesting operations as quickly as node 1. Therefore it will end later than node 1 as it finishes the rest of its operations. We believe
further experiments and analysis of systems with many nodes is necessary to explain this.
4.2.2 Conclusions and Future Work
There are two important conclusions from our CAS experiments. First, our CAS algorithm works, allowing us to construct both lock-free and wait-free large objects.
Second, the algorithm handles contention very well. However, more nodes are necessary
to solidify this conclusion. Comparing the results with and without contention also
indicates that the ISR operates more efficiently when heavily utilized. This is because it
can process more than one request within one context switch.
The most significant result was the implementation of both lock-free and wait-free
large objects on a memory model as unique as SCRAMNet+. Although the execution
times are higher than hoped, the validation of the algorithm is significant. Further work in
improving the compare and swap algorithm, such as a dedicated polling node, may
increase performance. The LL, VL and SC primitives could also be implemented directly
in the ISR or other SCRAMNet+ memory to improve the performance further. The
benefits of lock-free and wait-free operations demand continued research in this area.
5 Simulation
Augmint is a software package on top of which multiprocessor memory hierarchy
simulators can be constructed for Intel Architecture specific platforms [1]. A simulator is
constructed by creating a test application with C/C++ and m4 macros supplied by
Augmint. The m4 macros are used to implement constructs such as locks, barriers,
semaphores, etc. The test application is augmented to generate events during memory
accesses. Events are also generated directly by the m4 macros. Each event has an
associated procedure in a library called the backend. By developing different backends,
different memory models can be simulated. As each thread runs in the simulation, its
execution time is calculated by the time spent to process each event in the backend and
by the time to execute each machine instruction.
The following sections provide an overview of Augmint. See [1, 16, 17, 21] for more
details. Augmint is composed of a compile-time and run-time component. The compile-
time component performs the code augmentation and the run-time component schedules
and executes the generated events.
5.1 Compile-Time
Application code is first written in C/C++ and m4 macros supplied by Augmint. A GNU
C compiler compiles the C file to generate 80x86 assembly instructions. Then a program,
called the Doctor, parses the assembly code for memory references and inserts code just
before each memory reference. This code calculates the address, size and value of the
memory reference and generates an event corresponding to the memory access. The code
also updates the thread’s time in processor cycles, to account for the execution of the
instructions leading up to the event. The Doctor determines the number of cycles from a
table of mnemonics and corresponding processor cycles found in the file
mnemonics.unix.x86. Finally the augmented code is linked to the Augmint and backend
libraries and is ready to run.
5.2 Run-Time
The run-time component consists of three parts: the application, Augmint and the
backend. The application is the user’s C code, written with main() replaced by
appl_main(). Augmint is the main thread of execution for the simulation and manages
events, tasks, and threads. The backend is a library that executes the actual events.
When a simulation executable is run, Augmint executes first since it contains the actual
main(). Then Augmint schedules a task to switch to the main application thread by
calling appl_main(). The application code then executes as usual until an augmented code
region is executed. The augmented code causes an event and context switch back to the
Augmint thread. The Augmint thread creates and schedules a new task to process the
event.
5.2.1 Events
Events are generated directly from m4 macros or indirectly through memory references in
the application code. When a thread generates an event, the thread is blocked and a task is
scheduled to process the event. Each event has an associated procedure in the backend,
which is called when the task is scheduled to execute. The return value from this
procedure controls the execution of the thread that generated the event. A return value of
T_ADVANCE or T_CONTINUE allows the thread to continue. A return value of
T_FREE, T_YIELD or T_NO_ADVANCE leaves the thread blocked. The other
difference in the return values is how they affect the memory of a task, as described in
Section 5.2.3.
Each event is represented by a structure containing the process identifier (pid) of the
thread that generated the event, the time the event occurred and the type of event. When
an event returns T_ADVANCE, that event’s time is used to update the time of the thread
that generated the event. This allows the backend to arbitrarily delay a thread. This is
fundamental to any memory simulation, such as a cache, and is key for our
implementation of an ISR context switch (see Section 5.3.4.2). The event structure also
contains the address for use in Data Movement mode, which is described next.
5.2.2 Data Movement
Normally, Augmint performs a read and returns that value to the thread after a read event
returns from the backend. Likewise, Augmint writes the actual value of a write after a
write event returns from the backend. However, with Data Movement, the backend is
responsible for performing the actual reads and writes instead of Augmint. This is
achieved by passing the backend a pointer to the data accessed by the read or write
operation. On a read event, the backend writes the read value via the pointer and Augmint
returns that value to the thread. On a write event, the pointer references the value written
by the thread and the backend copies that value to the proper address, thus performing the
write.
The Data Movement option was key to implementing the Write-Me-Last mode (see
Section 5.3.3) and CSR registers (see Section 5.3.4.1). In general, our backends require
each thread to allocate its own copy of SCRAMNet+ memory. With Data Movement, our
backends control the values read and written from each copy of memory. In Write-Me-
Last mode, when a thread writes to SCRAMNet+ memory, its copy should change some
time after the write returns due to the transit time of the write message. Without Data
Movement the write would occur immediately after the backend returns. When the
Doctor is passed the –V option, it generates code for data movement; when used in conjunction with the –V command-line option to Augmint, Data Movement is enabled (see Section B.1.1).
5.2.3 Tasks
To accommodate concurrent read and write operations, Augmint provides for scheduling
of arbitrary independent tasks. Each task has an associated structure containing a time,
priority, function pointer and pid. Tasks are added to a structure called the time wheel,
which orders the tasks by time and then by priority. Augmint executes the tasks in that
order by calling a task’s function pointer. In the case of an event, the function pointer
contains the appropriate backend procedure. As mentioned in Section 5.2.1, the return
value of a backend function controls the execution of the application thread, but it also affects the memory of the task associated with the event. Return values of
T_ADVANCE, T_CONTINUE and T_FREE all free the memory associated with the
task’s structure. However, returning T_YIELD does not. This allows a task to be saved
and rescheduled later. We used this feature to implement our ISR context switch, as
discussed in Section 5.3.4.2.
5.2.4 Threads
Since Augmint uses a single thread of execution, an Augmint thread is just a passive
structure simulating an actual thread. The thread structure contains state and context
switch information used by an associated task that actually executes the code. Each time
the application code calls CREATE(), a new application thread is created, simulating a
fork(), and a task is created and scheduled to execute the thread. The thread structure also
contains the current time, which is updated at every context switch and return of each
event.
5.2.5 Backend
The backend is a customizable event execution library. Each event is implemented by a
procedure in the backend. For example, a shared-memory write is implemented by
sim_write() and a read by sim_read(). Augmint passes a pid value to each event in the
backend to indicate the thread that generated the event. Therefore, a thread and pid are
interchangeable when discussing the backend.
5.2.6 Execution
The execution of a simulation is based on Threads, Events, and Tasks. However, Tasks
perform all the work by executing the events associated with them. For example, when a
thread generates a read event, a task is created and the thread is blocked. When the task is
created it is assigned the thread’s pid and its function pointer is assigned to sim_read().
When that task reaches the front of the time wheel, Augmint calls sim_read() via the
task’s functions pointer. If the event returns T_ADVANCE, Augmint reads the pid from
the current task and unblocks that thread. When a thread unblocks, its time is updated to
the time of the task and a context switch is made to begin executing the application code
until another event occurs. If sim_read() were to return T_FREE, the thread would
remain blocked until some task with the same pid returns T_ADVANCE. Therefore, a
task’s pid and associated event’s return values are used to control a thread.
5.3 SCRAMNet+ Backends
A backend can be written to simulate any given memory model. We wrote three backends for the following SCRAMNet+ memory models: Write-Me-Last mode, SCRAMNet+ interrupts and SCRAMNet+ polling. The Write-Me-Last backend was used to
simulate Systran’s mutual exclusion algorithm, which must be in Write-Me-Last mode.
The interrupt backend was used to simulate our mutual exclusion algorithm, which
includes context switches between the application process and the ISR on node 0. Finally,
the polling backend was used to simulate a dedicated node polling the interrupt FIFO as
suggested in Section 3.5. Each of these backends uses the same techniques to implement
the basic SCRAMNet+ memory model.
5.3.1 Memory Model
There are three common parameters to all SCRAMNet+ memory models: the read access
time, write access time and transit time. Since the SCRAMNet+ card's memory is dual-ported RAM and is mapped into each process, it cannot be cached. Therefore, every
read and write must directly access the bus. The read and write access times represent the
time to access the bus and the time for the card to respond. The transit time represents the
time it takes a write message to propagate from one node to the next.
In our models each thread represents one processor. Therefore, the pid of the thread is
equivalent to its node number. To simulate SCRAMNet+, each thread uses the m4 macro
G_MALLOC() to allocate memory in the backend. The memory address returned is used
as the address of the SCRAMNet+ memory. This way each node reads and writes out of
its own SCRAMNet+ memory, just like on the real system. Execution of G_MALLOC()
generates a sim_shalloc() event in the backend. When sim_shalloc() executes, it stores
the newly allocated memory’s size and address in a memory map table indexed by the pid
of the thread that generated the event. This information is then used by sim_read() and
sim_write() as described next.
Whenever a thread reads a memory location a sim_read() event is generated. The
memory size, memory address and the thread’s pid are passed to sim_read(). Sim_read()
first checks the memory map table to see if the address is in the SCRAMNet+ memory of
the pid. If it is not, then the value is immediately read and T_ADVANCE is returned in
order to unblock the thread. If it is, a new task, node_read(), is scheduled one read access
time after the current time and the thread is blocked by returning T_FREE. When
node_read() is scheduled to execute, it performs the read and returns T_ADVANCE,
thereby unblocking the thread. This simulates the delay of accessing the SCRAMNet+
card for a read.
Whenever a thread writes a memory location a sim_write() event is generated. The
memory size, memory address, memory value and the thread's pid are passed to sim_write().
Sim_write() first checks the memory map to see if the address is in the SCRAMNet+
memory of the pid. If it is not, then the value is immediately written and T_ADVANCE
is returned to unblock the thread. Otherwise, a new task, issue_ring_write(), is
scheduled one write access time after the current time and the thread is blocked by
returning T_FREE. When issue_ring_write() is scheduled to execute, it unblocks the
thread by returning T_ADVANCE. This simulates the delay for writing to the
SCRAMNet+ card. Issue_ring_write() also starts propagating a write around the ring.
This is achieved by creating and scheduling a new task, node_write(). Since
issue_ring_write() unblocks the thread by returning T_ADVANCE, the thread may
proceed normally while the node_write() propagates the write around the ring.
Node_write() is passed the originating node, memory offset, destination node and value
of a write. When it executes, the SCRAMNet+ memory address is found in the memory
map table by indexing by the destination node. Then the value is written to the same
offset in the destination node’s memory. If the destination node is not the originating
node, then the destination node is incremented and another node task is scheduled one
transit time later. If the destination and originating node are the same, node_write()
simply ends by returning T_NO_ADVANCE.
One might suggest that normal memory reads and write should be modeled with a cache.
However, Augmint only supplies an infinite cache model. Since our threads all run on
different processors, the cache would never be invalidated and would only waste
computation time. Also, most memory accesses in our algorithms are to SCRAMNet+
memory, therefore the added accuracy of a realistic cache was deemed unnecessary.
The Write-Me-Last, interrupt and polling backends are variations of the basic memory
model described above. Each backend is different in how the originating node and initial
delay are used by issue_ring_write() and node_write(). The interrupt and polling
backends also model the interrupt features of the SCRAMNet+ card.
5.3.2 User Events
All three models use the GEN_USER_EVENT macro to generate the sim_user() event in
the backend. We defined the first parameter of GEN_USER_EVENT to specify the type
of user_event() and the second parameter to pass in data such as a return pointer. The
GET_PID type of user_event() returns the pid used by the simulation and backend. The
GET_TIME type of user_event() returns the current simulation time in cycles. Both are
used for the timing and analysis of the simulations.
5.3.3 Write-Me-Last Backend
In the Write-Me-Last backend, issue_ring_write() assigns the destination node as the
originating node plus one and schedules the first node_write() one transit time after the
current time. This causes the originating node to be written last. The Data Movement
option is essential for the Write-Me-Last mode to work. Without this option, Augmint
would automatically perform the write to the originating node’s memory after the thread
is unblocked, making Write-Me-Last mode unachievable. However, with Data Movement
it is the responsibility of the backend to perform a write. Therefore, when
issue_ring_write() returns T_ADVANCE, the thread continues. However, subsequent
reads will return the old value until the node_write() for the originating node executes,
which is scheduled last.
5.3.4 Interrupt Backend
In the interrupt backend, issue_ring_write() uses the originating node as the destination
node and schedules the first node_write() at the current time. This causes the write on the
originating node to occur immediately and all others one transit time apart. When the
backend propagates a write, it uses a thread’s pid as the node number. However, the
interrupt backend simulates the application thread for node 0 and the ISR thread as being on the
same node. This is achieved by assigning the ISR thread a pid of 0 and the application
thread for node 0 a pid of 1. Then the backend checks for writes propagating from pid 0
to pid 1. If this occurs, the node_write() for pid 1 is scheduled at the current time instead
of one transit time later. This causes the writes on the application thread and the ISR
thread to occur simultaneously. Both the interrupt and polling backends also simulate the
interrupt FIFO information on each SCRAMNet+ card.
5.3.4.1 Interrupt FIFO
As described in Section 2.2, each SCRAMNet+ card contains a FIFO of interrupt offsets.
The backend maintains a queue of memory offsets to simulate this FIFO. Access to the
interrupt FIFO is provided through CSRs (Control/Status Registers). The CSR registers
are mapped into SCRAMNet+ memory above 0x80000. We modeled CSR access exactly, so that a port of the ISR code would not require any major modifications.
Therefore, a thread must allocate 0x100000 bytes through G_MALLOC() to access these
registers.
CSR4 contains the 16 least significant bits of the interrupt offset at the top of the FIFO.
CSR5 contains the 8 most significant bits and a FIFO “not empty” status bit. To simulate
this, the node_read() event was modified in both the interrupt and polling backends to
check if the read address equals CSR4 or CSR5. If a read is from CSR5, the status of the
interrupt queue is checked. If it is empty, sim_read() simply returns with the FIFO “not
empty” status bit cleared as the value of CSR5. If it is not empty, it dequeues an offset
and returns with the FIFO “not empty” status bit set and the 8 most significant bits of the
offset as the value of CSR5. The remaining 16 least significant bits of the interrupt offset
are stored in a static variable in the backend, which is returned by a subsequent read of
CSR4. The Data Movement option was also essential in implementing the CSR registers
by allowing the backend to control the return values of the CSR reads. Otherwise
Augmint would automatically calculate the value of a read.
To simulate the interrupt FIFO, the interrupt and polling backends also modified
node_write() to check if the address written is configured to generate interrupts. If it is,
the offset of the write is put in the interrupt-offset queue. Node_write() must then
determine if it should generate an interrupt and create a context switch on node 0 from
the application thread to the ISR thread.
5.3.4.2 Context Switches
The interrupt backend is designed to simulate the execution of both the ISR and the
application thread on node 0. To implement this, the backend assumes that the ISR’s pid
is 0 and the node 0 application thread’s pid is 1. The backend then controls the execution
of each thread through its return values to the appropriate pid.
5.3.4.2.1 ISR Context Switch
First, the ISR thread must appear to be in an idle or blocked state and then it can be
awakened whenever an interrupt occurs. This is achieved by using the WAIT_FOR_ISR
type of user_event(). The ISR thread continually calls WAIT_FOR_ISR and checks its
return value. If the return is 0, the thread executes its ISR code. If the return is 1, the ISR
thread exits the loop and terminates. The first time WAIT_FOR_ISR is called, the
generated user_event() returns T_YIELD to block the ISR thread. However, returning
T_YIELD does not destroy the current task, which is stored in the backend and is
scheduled later to unblock the ISR thread.
The interrupt backend has one additional parameter, context switch time, which simulates
the time for node 0 to switch between the execution of the ISR thread and application
thread. When node_write() determines that there is an interrupt, it reschedules the saved
task at the current time plus one context switch time. The function pointer of the task is
also changed to execute_isr(). When the task is scheduled, it calls execute_isr() which
sets the return value of WAIT_FOR_ISR to 0 and returns T_ADVANCE. T_ADVANCE
unblocks the thread and the return value of 0 causes the ISR thread to execute.
The backend maintains an isr_flag to determine whether to generate an interrupt or not.
The isr_flag is set by execute_isr() when the ISR is started and is cleared by the
WAIT_FOR_ISR user event when the ISR finishes. If the isr_flag is not set when
node_write() executes, node_write() will enqueue the interrupt offset and schedule the
ISR to run. Otherwise, node_write() will only enqueue the interrupt offset. This simulates
the enabling and disabling of hardware interrupts that allows one ISR thread to process
more than one interrupt message at a time.
The END_ISR type of user_event() is used by all other threads to signal that they are
done. Once all the threads finish and execute END_ISR, the generated user_event()
schedules the task saved for the blocked ISR thread to execute end_isr(). When end_isr()
executes, it sets the WAIT_FOR_ISR return value to 1 and returns T_ADVANCE. The T_ADVANCE unblocks the ISR thread and the return value of 1 terminates the ISR.
This was necessary to signal the ISR that the simulation was over, otherwise it would
loop forever.
5.3.4.2.2 Application Thread Context Switch
The application thread’s execution is controlled by checking the isr_flag at each
sim_read() and sim_write() event created by pid 1. The isr_flag indicates whether the ISR
is currently executing. If the ISR is executing, the application thread on node 0 should be
blocked so that both threads do not execute simultaneously. Therefore, if isr_flag is not
set, the event executes as usual. If isr_flag is set, the current task is saved and the event
returns T_YIELD to block the thread.
Once the application thread has been blocked, the task saved for the thread is used to
unblock the thread once the ISR finishes. Since the ISR thread is in a loop, it calls the
WAIT_FOR_ISR user event when it finishes servicing an interrupt. The generated user_event()
then reschedules the task saved for the application thread at the current time plus one
context switch. The user_event() does not modify the function pointer as the same event
is still desired. Because the application thread is blocked while the ISR is executing, it is
essentially delayed by the ISR's execution time plus one context-switch time.
One might suggest that waiting for a read or write to block the application thread is not
accurate enough and that it should be blocked immediately. However, the threads only
perform local non-memory operations between each read and write and the operations are
transparent to the other threads. As long as the total delay is accounted for, the final
simulation will be accurate.
5.3.5 Polling Backend
The polling backend was designed to test a dedicated node that polled the interrupt FIFO,
from Section 3.5.3, instead of using interrupts. The polling backend is similar to the
interrupt backend except that it does not execute the ISR thread and the polling thread on
the same node. Because of this, the context switching capabilities were unnecessary and
removed. However, the WAIT_FOR_ISR technique was still used instead of actually
polling. It was necessary because otherwise it would be impossible for the ISR thread to
determine when to end. Although it is not a perfect simulation, the timing is accurate to
within one read access time and is sufficient for our purposes.
5.4 Simulation Parameters
The first goal of the simulations was to duplicate the results of our real-system
experiments from Section 3.4. To achieve this, four backend parameters were used to
match the results (see Section B.1.2). Since Augmint uses processor cycles as its unit of
time, all times were converted into cycles by multiplying them by our real system's clock
rate of 266 MHz.
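For example, the conversion can be written as a one-line helper (the 266 MHz rate is our real system's clock; the function name is ours):

```c
#include <assert.h>

/* Convert a wall-clock time in nanoseconds to Augmint cycles at the
   real system's 266 MHz clock: cycles = time_ns * 0.266 cycles/ns. */
long ns_to_cycles(double time_ns)
{
    return (long)(time_ns * 0.266 + 0.5);  /* round to the nearest cycle */
}
```

This maps the 635 ns transit time to 169 cycles, the 1 microsecond access time to 266 cycles, and the 5 microsecond context switch to 1330 cycles, matching the parameters used below.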
5.4.1 Transit Time
The first backend parameter is the transit time of messages between nodes. We measured
the total round-trip time of our two-node system as 1270 nanoseconds in Section 2.1.
The transit time between nodes is half of that because there were only two nodes.
Therefore, 169 cycles (635 nanoseconds) was used as the transit time for our simulations.
5.4.2 Access Times
The second and third parameters are the read and write access times. Since we did not
have any experimental values for the access times of the SCRAMNet+ cards, these two
parameters were determined by matching against our experiments on the real systems.
We used experiments identical to those used in the real system and varied each
parameter. This analysis showed that the read and write access times affect the slopes of
the resulting graphs and the transit time only affects the offset of the graphs. Therefore,
we adjusted both the read and write access times so that the slopes of the real and
simulated experiments were similar. The value used was 266 cycles or 1 microsecond,
which we argue is reasonable for two reasons. First, [11] specifies the typical read and
write access times for PCI based SCRAMNet+ cards as 133 and 240 nanoseconds
respectively. However, such marketing materials tend to use best-case numbers. Second,
according to [18], the typical access time for a PCI device is approximately 2-4
microseconds. However, it was published in 1995 and there has been considerable
advances in PCI chipsets since then. Therefore, our choice of 1 microsecond is within a
reasonable range of these two numbers.
5.4.3 Context Switch Time
The fourth parameter, the context switch time, was derived mathematically from the
results of our real-system experiments without contention in Section 3.4.1. In these
results the timing difference between the nodes with and without the ISR was 3.8
microseconds. As described in Section 3.4.1, this corresponds to the fact that the node
without the ISR does not have to wait for the context switch when the ISR finishes, since
it is not on the same processor. Furthermore, the ISR node does not have to wait for any
transit times since it is on the same node as the ISR and we are not using Write-Me-last
mode. The timing diagrams in Figure 19 correspond to the timing sequences of an
Acquire shown in Figure 5. Time flows from left to right and is of no particular scale.
Figure 19 Timing diagram of our algorithm's acquire procedure without contention

ISR Node:    Context Switch → ISR → Context Switch
Normal Node: Transit Time → Context Switch → ISR → Transit Time
The context switch time was derived from the timing diagrams and the following
calculation:
[ISR Node Time] – [Normal Node Time] = 22.8 – 19.0 = 3.8 μs
[(2 * Context Switch) + ISR] – [(2 * Transit Time) + Context Switch + ISR] = 3.8 μs
Context Switch = (2 * Transit Time) + 3.8 μs
Context Switch = (2 * 0.635 μs) + 3.8 μs
Context Switch ≈ 5.0 μs or 1330 cycles
5.5 Conclusions and Future Work
Implementing experiments identical to those performed on the real system in Section 3.5
and comparing the results allowed us to verify our models and continue testing with
confidence. However, the main
advantage of the simulations is that any algorithm for SCRAMNet+ memory systems can
be implemented and tested. Therefore, future work should include porting and simulating
the compare-and-swap algorithm so that it can be studied under heavy contention. Finally, the
renaming algorithm from [9] should be implemented and tested.
Another product of the simulation was a closer understanding of the SCRAMNet+ card's
operation. Although the SCRAMNet+ memory is 32-bit aligned, the CSRs are 16-bit
registers, and CSR4 and CSR5 are not contiguous. Combining these two registers
would allow a single 32-bit read to fetch the interrupt FIFO information, eliminating a 1
microsecond access time delay. As mentioned in Section 2.2, SCRAMNet+ interrupts are
automatically disabled after the first interrupt until the ISR finishes. The ISR re-enables
them by writing to CSR0. This also adds a 1 microsecond access time delay. Designing
the card to automatically re-enable interrupts when the ISR reads the combined CSR4
and CSR5 registers and sees the FIFO empty would eliminate this delay. One might think
that redesigning the card is unreasonable, however Systran is currently developing a new
version of the SCRAMNet+ card.
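If CSR4 and CSR5 were made contiguous as suggested above, splitting the single 32-bit value back into its two 16-bit halves would be trivial. A hypothetical sketch (the low/high bit layout is our assumption, not Systran's design):

```c
#include <assert.h>
#include <stdint.h>

/* Split a hypothetical combined 32-bit CSR read into its two 16-bit
   halves: CSR4 in the low word, CSR5 in the high word (assumed layout). */
uint16_t csr4_of(uint32_t combined) { return (uint16_t)(combined & 0xFFFFu); }
uint16_t csr5_of(uint32_t combined) { return (uint16_t)(combined >> 16); }
```

One such read would replace two separate 16-bit CSR accesses, saving one access-time delay per interrupt serviced.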
6 Summary and Conclusions
We presented both blocking and non-blocking synchronization algorithms for
SCRAMNet+ systems. These algorithms were tested with both real-system experiments
and simulations.
First, we reviewed a mutual exclusion algorithm suggested by the manufacturer, Systran
Corp. After discussing its shortcomings, namely poor scalability and starvation, we
present our own mutual exclusion algorithm, which exploits unique features of the
SCRAMNet+ hardware. Our results comparing the two algorithms indicate that our
algorithm has faster execution times, both with and without contention, regardless of the
size of the network. More importantly, our results demonstrate that our algorithm is more
scalable than Systran's, and is fair, unlike Systran's algorithm. Our algorithm also has the
advantage that its design does not require the nodes to be prioritized, although a priority
ordering could be provided if necessary by simply sorting the queue in the ISR. This
would guarantee that the critical section is granted in order of priority. In contrast, our experiments
show that Systran’s algorithm cannot guarantee any prioritization.
Next, we presented non-blocking algorithms for SCRAMNet+ systems. First, we
designed and implemented a Compare and Swap algorithm. We used experiments on a
real system to test the algorithm. We then used this algorithm to implement lock-free and
wait-free constructions for large objects developed by Anderson and Moir. Experiments
were performed on both lock-free and wait-free implementations of a shared queue.
These experiments tested the algorithms and demonstrated that they could be
implemented on a memory architecture as unique as SCRAMNet+.
Unfortunately, the lack of hardware prevented extensive experiments with large networks
or with heavy contention. Therefore, we developed a simulator based on Augmint, which
allows modification of a library called the backend to implement different memory
models. We implemented three backends to simulate the Write-Me-Last, interrupt and
polling configurations of the mutual exclusion algorithms. We verified the simulation
against our real-system experiments and continued with experiments for large networks
and heavy contention. The simulations gave us insight into the timing of the SCRAMNet+
hardware, allowing us to suggest simple hardware changes to improve the performance of
the next hardware design, which is currently under way. Most importantly, we now have a
solid model on which to build other simulations.
Future work should include simulation of the non-blocking algorithms presented in this
paper. We also believe that both our mutual exclusion and CAS algorithms could be
implemented directly in SCRAMNet+ hardware. This would both eliminate costly
context switches and make the implementation transparent to the programmer. Currently,
the programmer must incorporate the algorithms into the ISR, which might already be
used by the programmer.
Appendix A
SCRAMNet+ Software

Both a driver and an API (Application Programmer's Interface) library were developed to
implement the tests on real hardware.
A.1 SCRAMNet+ Driver
The SCRAMNet+ driver was developed for the RT-Mach operating system on 80x86
Intel architectures. The driver was built on the rk97a version of RT-Mach. The following
files were modified to configure the kernel to build the new driver:
./rtmach/src/mk/kernel/conf/i386/files
./rtmach/src/mk/kernel/conf/i386/MASTER
./rtmach/src/mk/kernel/conf/i386/MASTER.local
./rtmach/src/mk/kernel/i386at/autoconf.c
./rtmach/src/mk/kernel/i386at/conf.c
The following files were modified to change pcibus_read() and pcibus_write() from static
to global functions. They were needed by the SCRAMNet+ driver to configure the PCI FIFO
and interrupt registers.
./rtmach/src/mk/kernel/i386at/pcibus.c
./rtmach/src/mk/kernel/i386at/pcibus.h
The following files were modified to implement the SCRAMNet+ driver itself:
./rtmach/src/mk/kernel/i386at/scramnet.c
./rtmach/src/mk/kernel/i386at/scramnet.h
./rtmach/src/mk/kernel/i386at/scramnet_defs.h
./rtmach/src/mk/kernel/i386at/scramnet_ioctl.h
A.2 SCRAMNet+ API
Systran supplies an API for interfacing to SCRAMNet+ cards. We developed a library called
scrplus to interface to our driver with function prototypes identical to those of the
SCRAMNet+ library. A full description of the SCRAMNet+ library can be found in [13].
This way our test code could be easily ported to any existing operating system and
platform supported by Systran. We only implemented the functions necessary for our
testing as follows:
A.2.1 scr_mem_mm
Prototype: unsigned int scr_mem_mm(int arg)
This function maps or unmaps the SCRAMNet+ card’s memory to the API library. The
action is based on the values (MAP or UNMAP) passed for arg. A zero is returned on
success. After success, calling get_base_mem() will return the address of the
SCRAMNet+ card’s memory.
A.2.2 get_base_mem
Prototype: unsigned long int get_base_mem()
This function returns the address of the SCRAMNet+ card’s memory.
scr_mem_mm(MAP) must be called before this function will return a valid value.
A.2.3 scr_csr_read
Prototype: unsigned short scr_csr_read(unsigned int csr_number)
The SCRAMNet+ cards use 16 Control/Status Registers (CSRs) to configure and monitor
the status of the card. This function returns the value of the CSR indicated by
csr_number.
A.2.4 scr_csr_write
Prototype: void scr_csr_write(unsigned int csr_number,
unsigned short value)
This function writes the value of value to the CSR number indicated by csr_number.
A.2.5 scr_id_mm
Prototype: void scr_id_mm(char *id, char *cnt)
This function assigns the node number to id and the total number of nodes in the network
to cnt. Valid values for both id and cnt are in the range 0-255.
A.2.6 scr_acr_read
Prototype: unsigned char scr_acr_read(unsigned long mem_loc)
As mentioned in Section 2.2, each 32-bit address has an associated memory location to
configure its interrupts. These locations are called Auxiliary Control RAM (ACR). This
function will return the ACR value associated with the address mem_loc.
A.2.7 scr_acr_write
Prototype: void scr_acr_write(unsigned long mem_loc,
unsigned char acr_val)
This function will write acr_val to the ACR register associated with the address
mem_loc.
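To illustrate how these functions fit together, the following sketch maps the card's memory, enables interrupts on word 0 via its ACR, and writes that word. The prototypes match Appendix A.2, but the scr_* bodies here are self-contained stand-ins for the real library (which talks to the hardware), and the ACR interrupt-enable bit (0x1) and the MAP value are assumptions:

```c
#include <assert.h>

#define MAP 1  /* assumed value; the real constant comes from the library */

/* Stand-ins for the real scrplus library so the example is runnable. */
static unsigned long fake_mem[256];   /* stands in for card memory */
static unsigned char fake_acr[256];   /* stands in for the ACR     */

unsigned int scr_mem_mm(int arg) { (void)arg; return 0; }
unsigned long int get_base_mem(void) { return (unsigned long)fake_mem; }
void scr_acr_write(unsigned long loc, unsigned char v) { fake_acr[loc % 256] = v; }
unsigned char scr_acr_read(unsigned long loc) { return fake_acr[loc % 256]; }

/* Map the card, enable interrupts (hypothetical ACR bit 0x1) on word 0,
   then write the word that would raise the interrupt on other nodes.
   Returns 0 on success. */
int setup_and_signal(unsigned long value)
{
    if (scr_mem_mm(MAP) != 0)
        return -1;
    volatile unsigned long *mem = (volatile unsigned long *)get_base_mem();
    scr_acr_write(0, scr_acr_read(0) | 0x1);  /* enable interrupt on word 0 */
    mem[0] = value;                           /* interrupt-generating write */
    return 0;
}
```

On real hardware the same call sequence would be used, with the stubs replaced by the scrplus library linked against our driver.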
Appendix B
Using the Simulators

This section contains instructions on how to use both of our simulators and how to
duplicate the results in this paper.
B.1 Syntax
Any program linked with the Augmint library will accept three sets of parameters: for
Augmint, for the backend, and for the application. The syntax is as follows:

run [Augmint Parameters] -- [Backend Parameters] -- [Application Parameters]
Note that the sets of parameters are separated by double dashes, which are required even
if no parameters are used. The executables for our simulators, all named run, are
contained in the Augmint directory tree as described in Table 6.
Simulator        Directory
SCRAMNet+ Mutex  ./applications/scramnet
Interrupt Mutex  ./applications/interrupt
Polling Mutex    ./applications/polling
Table 6 Simulator executable directories
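As an example, a hypothetical invocation of the interrupt-mutex simulator with 16 nodes, the default timing parameters, all nodes participating, and 100 iterations might look as follows. The flag spellings follow Sections B.1.1-B.1.3; the concatenated value style for the application parameters (e.g. -P-1) is our assumption:

```shell
# Run the interrupt backend: 16 simulated nodes, Data Movement enabled,
# all nodes participating, 100 iterations each.
cd ./applications/interrupt
./run -V -- -n 16 -- -P-1 -N16 -M16 -C100
```

Note the two double dashes separating the Augmint, backend, and application parameter sets, which are required even when a set is empty.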
B.1.1 Augmint Parameters
B.1.1.1 -V
The -V parameter indicates that Augmint should use Data Movement as described in
Section 5.2.2. All of our simulations require the Data Movement option.
B.1.2 Backend Parameters
B.1.2.1 -n Xn
The -n parameter indicates the number of nodes in the simulation. The default value is
256, which is the physical limit of a SCRAMNet+ network.
B.1.2.2 -t Xt
The -t parameter indicates the transit time, in cycles, for a message to propagate from
one node to another. The default value is 169 cycles.
B.1.2.3 -r Xr
The -r parameter indicates the read delay, in cycles, used to simulate the host read access
time of the SCRAMNet+ card. The default value is 266 cycles (1 microsecond).
B.1.2.4 -w Xw
The -w parameter indicates the write delay, in cycles, used to simulate the host write
access time of the SCRAMNet+ card. The default value is 266 cycles (1 microsecond).
B.1.2.5 -c Xc
The -c parameter indicates the context switch time, in cycles. The default value is 1330
cycles (5 microseconds). The Write-Me-Last backend does not use this command-line option.
B.1.3 Simulation Parameters
B.1.3.1 -PXp
The -P parameter indicates which node(s) are to participate in the test. The default value
is -1, indicating all nodes. Other valid values are nodes 1 through 256.
B.1.3.2 -NXn
The -N parameter indicates how many nodes there are. The default is 256. This number
should always be less than or equal to the number used with the -n backend option.
B.1.3.3 -MXn
The -M parameter indicates the total possible number of nodes in the system. The default
value is 256.
B.1.3.4 -CXi
The -C parameter indicates how many iterations the simulation should make. The default
value is 100.
B.2 Experiments
Scripts were used to generate the results of our experiments. Table 7 lists the scripts used to
generate the results in this paper. Directories are given relative to the root of the Augmint
directory tree. The figure column indicates which figures used the results of each script.
Directory                          Script           Figure
./applications/systran/results/    compare_nc_pid1  Figure 8 & Figure 9
./applications/systran/results/    compare_nc_pid2  Figure 8 & Figure 9
./applications/systran/results/    compare_c_all    Figure 10 & Figure 11
./applications/systran/results/    heavy_c_all      Figure 14, Figure 15 & Figure 16
./applications/interrupt/results/  compare_nc_pid1  Figure 8, Figure 9 & Figure 12
./applications/interrupt/results/  compare_nc_pid2  Figure 8, Figure 9 & Figure 12
./applications/interrupt/results/  compare_c_all    Figure 10, Figure 11 & Figure 13
./applications/interrupt/results/  heavy_c_all      Figure 14, Figure 15 & Figure 16
./applications/polling/results/    compare_nc_pid1  Figure 12
./applications/polling/results/    compare_nc_pid2  Figure 12
./applications/polling/results/    compare_c_all    Figure 13
./applications/polling/results/    heavy_c_all      Figure 14, Figure 15 & Figure 16
Table 7 Scripts to run simulation experiments
Bibliography
1. Augmint User’s Manual. Unpublished manuscript. http://iacoma.cs.uiuc.edu/iacoma/augmint/users-guide.ps
2. J. Anderson and M. Moir, "Universal Constructions for Large Objects", submitted to IEEE Transactions on Parallel and Distributed Computing, 1997.
3. G. Barnes. “A Method for Implementing Lock-Free Shared Data Structures”, Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, 1993, pp. 261-270.
4. T. Bowman, “Shared-Memory Computing Architectures for Real-Time Simulation – Simplicity and Elegance”, Systran technical paper available from http://www.systran.com/scramnet.htm, January 1997.
5. C. Filachek, “Evaluation and Optimizations of Lock-Free and Wait-Free Universal Constructions for Large Objects”, Master’s Thesis, University of Pittsburgh, 1997.
6. M. Herlihy, "A Methodology for Implementing Highly Concurrent Data Objects", ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, 1993, pp. 745-770.
7. M. Herlihy and J. E. B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures", Proceedings of the 20th International Symposium on Computer Architecture, 1993, pp. 289-300.
8. M. Herlihy, “Wait-Free Synchronization”, ACM Transactions on Programming Languages and Systems, Vol. 11, No. 1, 1991, pp. 124-149
9. S. Menke, M. Moir, and S. Ramamurthy, "Synchronization Primitives for SCRAMNet+ Systems", Proceedings of the 17th Annual ACM Symposium on Principles of Distributed Computing, 1998, pp. 71-80.
10. M. Moir, "Practical Implementations of Non-Blocking Synchronization Primitives", Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing, Santa Barbara, CA, August 1997, pp. 219-228.
11. “PCI/PMC Interface Overview”, Technical Note 131, Copyright 1996, Systran Corp.
12. Anthony-Trung Nguyen, Maged Michael, Arun Sharma, and Josep Torrellas, "The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures", Proceedings of the 1996 International Conference on Computer Design, October 1996.
13. “SCRAMNet Network PCI Bus Hardware Reference”, Document No. D-T-MR-PCI#####-A-0-A2, Copyright 1991, Systran Corp.
14. “SCRAMNet Network Programmer’s Reference Guide”, Document No. D-T-MR-PROGREF#-A-0-A6, Copyright 1997, Systran Corp.
15. “SCRAMNet VME Hardware Reference”, Document No. D-T-MR-VME#####-A-0-A2, Copyright 1994, Systran Corp., pp. F1-F2.
16. Arun Sharma, Augmint, A Multiprocessor Simulator. Master’s Thesis, University of Illinois at Urbana-Champaign, May 1996.
17. Arun Sharma, Anthony-Trung Nguyen, and Josep Torrellas. Augmint: A Multiprocessor Simulation Environment for Intel x86 Architectures. Center for Supercomputing Research and Development (CSRD) Technical Report 1463, March 1996.
18. Edward Solari and George Willse, PCI Hardware and Software, San Diego: Annabooks, March 1995, p. 434.
19. Systran Corp. World Wide Web Page. http://www.systran.com/scramnet.htm, January 1997.
20. Systran Corp. World Wide Web Page. http://www.systran.com/ftp/scramnet/snovervw.pdf
21. Jack E. Veenstra and Robert J. Fowler, MINT Tutorial and User Manual. Technical Report 452, University of Rochester, Computer Science Department, August 1994.