
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 51, NO. 3, MAY 2002 453

An Architecture for a Nondeterministic Distributed Simulator

Marc Bumble and Lee D. Coraor

Abstract—A computer architecture for an accelerated, parallel, nondeterministic, discrete event simulator is described. The machine is evaluated for accelerating road traffic simulation. The architecture employs reconfigurable logic, systolic arrays, and a reduction bus to perform microscopic discrete event simulation. The simulator, which achieves a speedup factor of at least 91 over its traffic software counterpart, is fast enough to be practical to municipal traffic management engineers handling road incidents in large metropolitan traffic networks.

Index Terms—Field programmable gate arrays (FPGAs), reconfigurable logic, road traffic simulation, simulation, simulation machine.

I. INTRODUCTION

SOFTWARE simulators are often unable to simulate road traffic at a rate much faster than the time required to actually run traffic on a network of roads. Their prediction abilities are therefore somewhat limited. A demonstration of MITSIM displaying a section of the Boston arterial flow project is able to simulate traffic moving at a stated rate approximately equal to 90% of real time. The speed of software simulators is adequate for the design of new traffic pattern construction and for optimizing traffic light timing sequences; however, the response time is inadequate for handling traffic incidents, which require a greater level of simulation acceleration to be useful to traffic managers attempting to optimize existing networks experiencing unanticipated crises.

Metropolitan traffic grids are often strained by the advent of celebrations or demonstrations, which may foster abnormal traffic loads. Concentrations of congregants may induce localized surges of congestion. Even a simple traffic incident in an already strained metropolitan street grid yields immediate consequences. Traffic engineering presents a practical and realistic simulation application. One possible scenario models a controversial conference located in New York, NY. During an incident, changes could be made to existing models, factoring in the effects of traffic outages and thereby allowing the simulation and verification of proposed traffic detours. Traffic outages could occur due to construction, accidents, or terrorist activity. Accelerated simulators might prove to be highly effective in assisting engineers rerouting traffic during emergency situations. The same possibilities are available for rail [1] and airplane traffic. If transportation systems designers have fast simulators available, the following system design implications are more easily answered [2]:

• simulation can contribute to systems management rather than being used solely for the design process;

• real-time verifiable experimentation becomes plausible;

• users can be more confident of the accuracy of their implementation decisions.

This paper presents a simulation machine architecture capable of serving as a rapid traffic incident response simulation system. The accelerated machine is designed to assist traffic management officials in obtaining and testing detours. The machine is capable of running its simulations fast enough to be useful to the traffic officers on the street. Although anyone stuck in a traffic jam can attest to the benefits and increased satisfaction level gained by avoiding these situations, the implications of increased traffic throughput are not just a matter of convenience. Faster response time to injury victims can be directly correlated to an increased survival rate [3].

Manuscript received February 1, 2001; revised September 26, 2001. The authors are with Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16801 USA. Publisher Item Identifier S 0018-9545(02)02502-1.

For 30 years [4], computer architects have been designing and building special purpose deterministic logic simulation machines. These deterministic simulators greatly accelerate the simulation, verification, and construction of new computer hardware designs. This paper proposes an accelerated architecture composed of multiple processing elements for the purpose of accelerating nondeterministic traffic simulation. The processing elements are united and synchronized toward the common goal of accelerating general purpose discrete event simulation.

The paper presentation is organized into the following sections. First, Section I-A describes a basic model of discrete event simulation. Section II provides citations for related simulator work. The software simulator models used for comparison with the proposed hardware are discussed in Section III. Mathematical analysis in Section IV presents results which assist in selecting the fastest simulator run mode. Computer architecture methods used to accelerate the simulator are described in Section V. The simulator architecture is presented in Section VI. Finally, results are presented in Section VII.

A. Simulation Model

Discrete event simulations typically have three basic common denominators. First, they contain a set of state variables denoting the current state of the simulation. The state variables contain information such as the number and availability of system resources. Second, a typical discrete simulation contains an event queue, depicted in Fig. 1. The event queue is a list of pending events that have been created by an event generator but are not yet executed by the scheduler. These events require system resources to execute. The resources’ availability is described by the state variables. Events often contain an arrival time-stamp and possibly a duration. The arrival time-stamp indicates when the event impacts the system’s state variables. Event arrival times and service times are frequently generated based on statistical models. For example, events may arrive according to a Poisson distribution. Finally, the third common denominator of discrete event-driven simulation is the global simulation clock, which keeps track of the simulation’s progress. The simulation must maintain proper causal states, meaning that each event must be executed in the environment created by the execution of the prior events. Therefore, if prior events have depleted a particular resource, that resource will be unavailable for the execution of a following event.

Fig. 1. Simulator Model: The simulator is divided into the components illustrated. The Event Generator creates random events, according to a user selected statistical distribution, and the event’s resource requirements. The events and their attributes are placed in the Event Queue. The Scheduler steps through the Event Queue in chronological order according to the Global Simulation Clock, attempting to allocate resources for each event. If the resources are available, the event can execute. If not, the event is blocked.

0018-9545/02$17.00 © 2002 IEEE

The simulation generally executes a main loop [5] which repeatedly removes the event with the smallest time-stamp from the event queue. Each event is processed by making appropriate state changes to the simulation model’s state variables.
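The main loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the names `Event`, `run`, `take`, and `give`, and the use of a binary heap for the event queue, are introduced here for the example.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(order=True)
class Event:
    # Events compare by time-stamp only; the handler is excluded from ordering.
    timestamp: float
    handler: Callable[[Dict[str, int], float], None] = field(compare=False)


def run(events: List[Event], state: Dict[str, int]) -> float:
    """Process events in time-stamp order and return the final clock value."""
    queue = list(events)
    heapq.heapify(queue)             # event queue: smallest time-stamp first
    clock = 0.0                      # global simulation clock
    while queue:
        event = heapq.heappop(queue)
        clock = event.timestamp      # jump to the time of the next event
        event.handler(state, clock)  # apply the event's state changes
    return clock


def take(state: Dict[str, int], now: float) -> None:
    """Hypothetical event: consume one unit of a resource."""
    state["free"] -= 1


def give(state: Dict[str, int], now: float) -> None:
    """Hypothetical event: return one unit of a resource."""
    state["free"] += 1


state = {"free": 2}
final_clock = run([Event(3.0, give), Event(1.0, take), Event(2.0, take)], state)
```

Although the events are supplied out of order, the heap delivers them by time-stamp, so each handler sees the state left behind by all earlier events, which is the causality requirement stated in Section I-A.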

Discrete simulation creates a system which changes state at specific points in time. The simulation model jumps from one state to the next when an event occurs and is processed. A telephone system might contain a set of state variables which describe telephone trunks leading from a substation as either full or available to route new calls. Additional state variables might contain the number of calls currently being handled by the substation. Typical events at the substation might include call arrivals, inbound calls being routed through the station, calls being terminated, or calls being blocked.

II. RELATED WORK

Special purpose machines have been implemented expressly for the development of logic design and performance [4], [6]–[15]. In logic design, simulations are used to verify new projects and to run fault analysis of these designs. Unlike deterministic logic simulators, the proposed machine simulates nondeterministic behavior. Road traffic requires a wider variety of dynamic behavior than a limited set of deterministic logic functions. Recent research has begun to explore the application of parallel processing to real-time traffic simulation. Both microscopic [16] and macroscopic [17], [18] approaches are explored.

The proposed nondeterministic simulation architecture differs from the existing body of published research. The simulator architecture leverages the locality of data inherent in discrete event simulation. The architecture mitigates the von Neumann bottleneck by embedding simulation instructions in reconfigurable logic. Distributed processing is developed by applying both a global network of processing elements and the implementation of embedded instructions as pipelined, systolic arrays within reconfigurable logic. Conventional general purpose processors fetch data and instructions from memory, perform computation, and return the results to memory. In the proposed architecture, data flows from functional unit to functional unit, accomplishing computation as part of its transport process. The data channel is pipelined, allowing concurrent computation of different stages of the simulation within each processing element. Multiple processing elements are networked together into a scalable architecture facilitating further parallelism.

III. SOFTWARE TRAFFIC SIMULATION

The primary focus of this work is the creation of an architecture for a nondeterministic parallel discrete event simulation machine. The research focuses on road traffic as a selected application. In order to clearly concentrate acceleration efforts, software models were used as a guide for the hardware development. The software models included in this paper were studied in three phases. The first phase determined whether this project was worth pursuing. As part of that initial work, the small, simple code modules of Section III-A were used to establish what types of speedup can be obtained to accelerate discrete event simulation. Once it was established by the initial publications [19]–[21] that the simulator work is both desired and justified, a study [22] of a representative and well-established traffic simulator, CORSIM, was undertaken as the second research phase. This study is described in Section III-B. Since CORSIM is not an open-source simulator, sharing verifiable results was not practical using CORSIM as a standard for comparison. Other possible candidates for study were rejected for similar reasons. The final stage of the simulator work did require a system to verify the accuracy of the selected Scheduler algorithm employed in Section VI-B-3. Therefore, as a separate effort, the simulator Trafix was generated during the third project phase as an open source, freely available traffic simulator. Unlike other conventional simulators, Trafix is both open source and modular. The Trafix simulator is briefly described in Section III-C.

It is important to emphasize that the proposed simulator is reconfigurable. Therefore, users may choose to substitute and use a variety of traffic models [23]–[26]. The traffic models selected for this study were adapted from [27]–[31].

A. Event Generation and Queue

As part of the initial studies used to gauge the effectiveness and direction of the selected acceleration approach, the event generator and the event queue of Fig. 1 are examined. The event generator is prototyped using software which is then translated into a reconfigurable logic implementation. The same approach is followed with the event queue. In Section III-A.1, the event generation software is implemented following the methods applied in [32]. These methods facilitate the fine-grained, parallel, systolic hardware implementation described in Section VI-B.1.

TABLE I. Event Generation Code I: The initial event generation implementation followed [32], creating random arrival and service times as fine-grained parallel discrete steps in a systolic array. The approach is also illustrated in the hardware event generation block diagrams of Section VI-B.1.

In Section III-A.2, the Event Queue software applies standard GNU C++ classes to manage both the event queue and the random distribution calculations. Therefore, the event generation code is rewritten in Section III-A.2, so that standardized code can be applied, and attention focused on the queuing software.

1) Event Generation Software: An abbreviated software outline is listed in Table I. First, the Poisson event arrival offset, $\tau$, is calculated according to (1) [32]. In (1), $u_1$, or rand1 in the code, is an independent random variable uniformly distributed over [0, 1). $\lambda$, or LAMBDA in the code, is the object or event arrival rate. The event generator dynamically allocates space for the new event and enqueues the object

$$\tau = -\frac{1}{\lambda}\ln(u_1) \tag{1}$$

The resulting values generated by this equation can be seen in Fig. 2. $\tau_1$ is the distance from the beginning of the timeline to the first event. $\tau_2$ is the distance from the beginning of the first event to the beginning of the second event and so on. New event arrival times are calculated by adding the arrival offset, $\tau$, to the previous event arrival time. The clock is then advanced to the new event arrival time. Service events that overlap the end of the current timeline segment are carried over into the next segment.

The Poisson service time, $s$, is calculated according to (2) [32]:

$$s = -\frac{1}{\mu}\ln(u_2) \tag{2}$$

where $\mu$, given as an average number of events per second, is the object or event service rate, with the remaining notation the same as in Table I. The values generated by (2) are illustrated in Fig. 2 to be the offsets from the beginning to the end of each event.

$u_2$, or rand2 in the code, is also an independent random variable which is uniformly distributed over [0, 1). The service time offset, $s$, is added as an offset to the event’s arrival time to determine the end of the event’s service time. Event resources are released at the end of this service time. Both $\tau$ and $s$ are independent and exponentially distributed. In software, to allocate memory and then generate Poisson arrival and service times requires approximately 30.5 μs on an Ultra Sparc.

Fig. 2. Simulation Timeline Generation: Each succeeding arrival starts an offset of $\tau$ from the previous arrival. Similarly, each service time $s$ is an offset from the event’s corresponding arrival time. These dependencies that constrain event arrival time and event service time generation appear to prevent speedup through parallelism.
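The timeline construction described in this subsection can be illustrated with a short Python sketch. It is not the code of Table I: the function names are hypothetical, and the sketch assumes the standard inverse-transform method for drawing exponentially distributed offsets from a uniform random variable. It takes the logarithm of `1 - u` rather than `u`, which leaves the distribution unchanged but avoids evaluating the logarithm at zero.

```python
import math
import random
from typing import List, Tuple


def exponential_offset(rate: float, rng: random.Random) -> float:
    """Inverse-transform sample of an exponential offset with the given rate."""
    u = rng.random()                  # uniform over [0, 1)
    return -math.log(1.0 - u) / rate  # 1 - u lies in (0, 1], so the log is defined


def generate_events(n: int, arrival_rate: float, service_rate: float,
                    seed: int = 0) -> List[Tuple[float, float]]:
    """Return n (arrival_time, service_end_time) pairs along one timeline."""
    rng = random.Random(seed)
    clock = 0.0
    events = []
    for _ in range(n):
        # Each arrival is a random offset from the previous arrival.
        clock += exponential_offset(arrival_rate, rng)
        # Each service end is a random offset from the event's own arrival.
        service_end = clock + exponential_offset(service_rate, rng)
        events.append((clock, service_end))
    return events


events = generate_events(1000, arrival_rate=2.0, service_rate=5.0)
```

The loop makes the dependency noted in the Fig. 2 caption visible: every arrival time is computed from the one before it, so the arrival times cannot be produced independently of one another in this form.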

TABLE II. Event Generation Code II: The event generation code allocates an event with an arrival time which is a random offset from the previous event’s arrival time. The service time for the event is then selected to be a random offset from its own arrival time. The two random values need not necessarily use the same statistical distribution. The event is also constructed to randomly require resources when it is executed by the scheduler. This code differs from Table I in that GNU LIBG++ standard classes are applied. Table I creates its random offsets using distribution methods from [32].

2) Event Queue Software: This section focuses primarily on the software used for the service queue. The software version is implemented as a GNU LIBG++ XPPQ Priority Queue class. In the software simulation, the time required for the insertion and extraction of events to and from the event queue increases as the queue strays from its optimum size. The proposed hardware queue speed, on the other hand, is not affected by size and it provides a 10 speedup over the software model.

The software simulation model, used for comparison, is written in C++ and is illustrated in Tables II and III. Some additional processing is performed when the event data structure is allocated. The arrival and service queues are maintained as a single heap data structure, unlike the proposed dual queue hardware mechanism of Fig. 4. To gather accurate timing results, the number of events in the queue is kept constant. The extra time used to generate additional arrival events in order to maintain the queue size is not included in the speedup plot of Fig. 3.

TABLE III. Event Queue Loop Code: The arrival and service queues are maintained as a single heap data structure, unlike the proposed dual-queue hardware mechanism illustrated in Fig. 4. If the dequeued event was an arrival event, then the resources available were compared against the resources required by the event. If the required resources were available, a service event was enqueued; otherwise, the resources were unavailable and the event was recorded as blocked. When a service event is dequeued, its resources are returned to the available resource pool. To gather accurate timing results, the number of events in the event queue was kept constant. The extra time used to generate additional arrival events in order to maintain the queue size is not included in the speedup plot of Fig. 3.
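The single-heap event loop summarized for Table III can be sketched as follows. This is a Python illustration rather than the authors' C++ code. The tuple layout, the `simulate` function, and the tie rule (a release sorts ahead of an arrival at equal time-stamps) are assumptions of the sketch, and the step that keeps the queue size constant for timing purposes is omitted.

```python
import heapq
from typing import List, Tuple

# At equal time-stamps a release sorts ahead of an arrival (sketch assumption).
SERVICE, ARRIVAL = 0, 1

# Each heap entry is (time-stamp, kind, resources needed, service duration).
HeapEntry = Tuple[float, int, int, float]


def simulate(arrivals: List[Tuple[float, int, float]], capacity: int) -> Tuple[int, int]:
    """Return (served, blocked) counts for a pool of `capacity` resources."""
    heap: List[HeapEntry] = [(t, ARRIVAL, need, dur) for t, need, dur in arrivals]
    heapq.heapify(heap)              # arrivals and services share one heap
    available = capacity
    served = blocked = 0
    while heap:
        time, kind, need, dur = heapq.heappop(heap)
        if kind == ARRIVAL:
            if need <= available:
                available -= need    # allocate and schedule the release
                served += 1
                heapq.heappush(heap, (time + dur, SERVICE, need, 0.0))
            else:
                blocked += 1         # required resources unavailable
        else:
            available += need        # service finished: return resources
    return served, blocked


# Three arrivals, each given as (arrival time, resources needed, duration).
served, blocked = simulate([(0.0, 2, 5.0), (1.0, 2, 1.0), (6.0, 2, 1.0)], capacity=3)
```

In this example the second arrival is blocked because the first still holds two of the three resources at time 1.0, while the third succeeds because the resources were returned at time 5.0.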

Fig. 3 illustrates the speedup expected for other distributions if the 80 ns clock is maintained in their hardware implementations. The implemented hardware distribution provides the speedup illustrated for the negative exponential curve in Fig. 3. Note that the proposed linear array queue need not be implemented as reconfigurable logic. The queue could be implemented as an application-specific integrated circuit (ASIC), and would probably be able to function at an even faster clock rate with many more queue elements. The code execution times were clocked on a Dual Pentium 350-MHz machine running Linux kernel version 2.2.15. The code was compiled using the GNU gcc compiler, version 2.95.2.

B. CORSIM: An Established Software Simulator

As part of the effort to develop a profile of a traffic simulator, CORSIM (CORridor SIMulator) was selected as a representative software simulation model. CORSIM microscopically models congestion and emissions and accounts for pedestrians. Developed in FORTRAN by the Federal Highway Administration (FHWA), CORSIM is part of the TRAF family of simulation models. [CORSIM] combines TRAF-NETSIM, a simulation model of nonfreeway traffic, and FRESIM, a simulation model of freeway traffic [33]. NETSIM, the older of the two simulators, grew out of the Urban Traffic Control System developed for mainframes in the early 1970s. The CORSIM model and its components comprise one of the first traffic simulation environments of their kind. CORSIM has been widely used in the traffic engineering community and claims to have been calibrated and validated in a wide variety of traffic and highway design conditions. The FHWA granted access to the CORSIM source code for study and evaluation.

Fig. 3. Event Generation Speedup versus Queue Size: Illustrated are the speedup values obtained by comparing the software event generation and queuing from the code in Table III to their hardware implementation counterparts. The speedup values were derived on a dual Intel Pentium 350-MHz RedHat Linux box running the 2.2.15 kernel. Compilation of the software was with the GNU gcc compiler, version 2.95.2, using the optimization flag. The speedup results indicate a second order of magnitude speedup.

Fig. 4. The Local Processing Element Design: The local processing element (PE) design uses two queues. The arrival queue holds the sorted list of arrival events from the event generator. Service events, which are created from processing successful arrival events, and events from adjacent network processing elements, are stored in the service queue. A comparator samples the heads of both queues and indicates where the next minimum local time-stamped event resides.
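The comparator behavior described in the Fig. 4 caption can be modeled in software as a comparison of the heads of two sorted queues. The Python sketch below is a hypothetical software analogue, not the hardware design: the `next_event` function, the event tuples, and the tie rule favoring the service queue are assumptions, since the text here does not state how equal time-stamps are resolved.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Event = Tuple[float, str]  # (time-stamp, label)


def next_event(arrival_q: Deque[Event], service_q: Deque[Event]) -> Optional[Event]:
    """Compare the heads of both sorted queues and dequeue the earlier event."""
    if not arrival_q and not service_q:
        return None
    if not service_q:
        return arrival_q.popleft()
    if not arrival_q:
        return service_q.popleft()
    # Ties go to the service queue; this tie rule is an assumption of the sketch.
    if service_q[0][0] <= arrival_q[0][0]:
        return service_q.popleft()
    return arrival_q.popleft()


# Both queues are assumed to be already sorted by time-stamp.
arrival_q = deque([(1.0, "a1"), (4.0, "a2")])
service_q = deque([(2.5, "s1"), (4.0, "s2")])
order = []
while (event := next_event(arrival_q, service_q)) is not None:
    order.append(event[1])
```

Only the two queue heads are inspected at each step, so selecting the next local event takes the same amount of work regardless of how many events are waiting.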

In current applications, CORSIM is used to evaluate alternatives planned for highway networks [33], [34]; it may be used, for example, to evaluate new traffic signal optimization strategies. The runtime required by the simulator has caused CORSIM to be used in off-line applications only. Real-time applications, however, are becoming more prevalent in transportation engineering, and in such applications, speed is critical. A study of CORSIM runtime characteristics determined that the processor tended to dwell in simulation scheduling and overhead routines [35]. Therefore, attempts to accelerate traffic simulation need to accelerate or eliminate overhead and event scheduling.

TABLE IV. CORSIM Function Classifications: The CORSIM functions were classified according to eight categories. CORSIM functions often contained a myriad of category functionality but were classified according to the majority of the function code. In some cases, the subroutines performed approximately 50% routing and 50% event list work, so a combinational category was created. The event generation, event list, scheduling, and timer categories are derived from Fig. 1.

Fig. 5. Profile Chart of CORSIM on NT: Illustrated are the percentages of CORSIM runtime used by eight categories of simulation functions when run under the NT operating system. The graph represents an average of 20 simulations from the Georgia Institute of Technology CORSIM repository run under the NT operating system and profiled using NT’s profiling tools. The CORSIM functions were classified into the eight categories of scheduling, scheduling/event_list, event_list, timer, event_generator, overhead, statistics, and shutdown. These categories are described in Table IV.

1) CORSIM Profile: CORSIM was profiled under the NT operating system as a stand-alone application without the rest of TSIS. Perl scripts were written to parse the resulting profile data. The runtime statistics from 20 traffic models were averaged and joined with classification categories based on the CORSIM functions. The pie chart data in Fig. 5 is the result of these scripts. This figure illustrates the percentage of CORSIM runtime devoted to each category of simulation function. CORSIM dwells mostly in its scheduling and overhead functions. Therefore, this simulation architecture proposal must carefully consider the acceleration of functions in these categories.
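The averaging and classification step described above can be illustrated with a small Python sketch. The original work used Perl scripts whose details are not given here, so everything below is an assumption made for illustration: the `category_shares` function, the function names, the timings, and the choice to average per-run percentages.

```python
from collections import defaultdict
from typing import Dict, List


def category_shares(runs: List[Dict[str, float]],
                    category_of: Dict[str, str]) -> Dict[str, float]:
    """Average, over all runs, the percentage of runtime spent in each category."""
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        run_total = sum(run.values())
        for function, seconds in run.items():
            # Join each profiled function with its classification category.
            totals[category_of[function]] += 100.0 * seconds / run_total
    return {category: share / len(runs) for category, share in totals.items()}


# Hypothetical function names and timings, for illustration only.
category_of = {"read_input": "overhead", "move_vehicle": "scheduling", "tick": "timer"}
runs = [
    {"read_input": 2.0, "move_vehicle": 6.0, "tick": 2.0},
    {"read_input": 4.0, "move_vehicle": 4.0, "tick": 2.0},
]
shares = category_shares(runs, category_of)
```

Normalizing each run to percentages before averaging keeps a long-running model from dominating the result; averaging raw seconds instead would be an equally plausible reading of the description.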

The first CORSIM category, overhead, is dominated by its data integrity routines which read data from input files, verify that data, and then store the results for later retrieval. The proposed architecture assists in alleviating much of the overhead required by CORSIM. For example, with the reconfigurable logic approach, the simulator system must be configured before it is used, and error checking on the input data occurs once during initialization. Algorithms implemented in reconfigurable hardware must be configured before the system starts the simulation. Much of the data which is input into the CORSIM simulation is configured as hardware in the proposed simulator. The setup is based on selections from available configurations or subconfiguration model segments.

The version of CORSIM provided with version 4.2 of the TSIS package, having been constructed over time, is not modular in its software functionality. Routines regularly blend input data integrity, event list handling, and event scheduling functions. CORSIM source code is not generally publicly available for research study and comparison. For these and a myriad of other reasons, a second simulator, Trafix, was developed for modeling the traffic scheduling software functionality. Trafix is written in C++ and is open source. Its development follows the research provided in [31].

C. Trafix: A Road Traffic Simulator

In working with road traffic simulators, it becomes immediately clear that the traffic research and management community requires an open-source traffic simulator. The simulator should also have a standard general input file format, so that simulations can be easily tested under various simulators for verification. The open-source approach allows researchers to test various traffic theories on a standardized software platform in a reliable and reproducible fashion. Further, an inexpensive system can provide smaller municipalities with access to a tool for making their own local road networks and traffic signal timing schemes more efficient. Maximizing traffic throughput and minimizing delay would benefit trade and tourism both domestically and internationally. The possible windfall of benefits is potentially large.

Fig. 6. The Trafix Display: A view of the Trafix simulator is illustrated. Vehicles are moving from the left and bottom lanes to the top and right. Turning decisions at the second intersection depend on the vehicle’s randomly assigned destination. Cars, buses, and trucks are animated as different sized and colored boxes.

Fig. 7. The Trafix Software Structure: Written to be modular, the Trafix software is composed of three levels. The bottom, physical, level allows Trafix to interface with its input and output systems. As currently written, Trafix reads its input from Xfig files and displays its output to X windows displays. The middle symbolic layer serves as an intermediate level between the simulator and its physical files and converts raw data into conceptual objects. These objects include roads, intersections, and places. The top layer consists of simulation objects, which include timelines, road queues, and intersection queues.

The Trafix simulator was developed by one of the authorsin C++ with the GNU gcc compiler under the GNU-Linuxoperating system. The program was designed to verify thevehicle routing algorithms selected for this hardware simulatorstudy. Trafix is currently in a stage of development whichis analogous to the program precursors offetchmail [36].An open source program was needed to verify the trafficmovement algorithms, and after failing to find an existingsolution, Trafix was generated to fill the void. Trafix is nowavailable as a starting point for others who need to test trafficrouting algorithms, roadway patterns, etc. Trafix uses Xfig,a freely available, open-source UNIX drawing package, togenerate its input files which describe the input road networks.

Trafix displays its animated output in X windows as illustrated in Fig. 6. The code is written to be modular so that various components can be replaced as the user community requires. An overview of the modular design concept is presented in Fig. 7. Attempts were made to allow the code to be easily changed in the future, altering the current dependence on Xfig input files and X windows output displays. In addition to using Xfig for its input and X windows for output, Trafix employs the Standard Template Library (STLPORT) routines wherever expedient to foster the reuse of code, which is intended both to improve efficiency and to reduce errors.

At this time, Trafix forks two processes. The first process displays two windows, each containing maps. One window holds input map symbols which have been used to generate the simulation, and the second, illustrated in Fig. 6, depicts the background map for the animated traffic display. This first process is intended to eventually migrate into a more suitable user interface as community interest materializes. The second process handles the animation of the vehicle traffic. The animation is visible only on the single graphic map display. Trafix simulates car, bus, and truck traffic moving through intersections and along roads.

Trafix was created to allow a verification of the car-following acceleration schemes employed. However, incidental materials were added as convenient, and hooks are available in the software. The Trafix simulator code is GNU public licensed and freely available from its web site at trafix.sourceforge.net.

IV. ANALYSIS

Section IV performs some basic mathematical analysis of simulation properties. Section IV-A uses mathematics to determine whether event or time-driven simulation yields results faster under particular constraints.

Deciding between running the simulator in the event or time-driven mode is important for simulations in which event processing is not continuous. For an example of noncontinuous event processing, consider the case of simulating telephone calls where the calls are temporarily assigned virtual circuits within the communications network. For this example, the circuits are the resources required by each event. An event generator creates a sequence of calls which are placed into an event queue. Calls can be initiated if their required circuits are available when they actually execute. The executed call temporarily depletes the circuit used from the available pool. However, in the telephone simulation, the circuit does not require continuous adjustment or modification. The execution of the telephone call simply needs to schedule a secondary event which will return the circuits to the available pool for other calls to use when the current call completes. In this telephone example, the call does not require continuous event processing. Logic simulation is a similar example, although deterministic. When a gate executes, it changes the state of its output signals, but it does not need continuous event processing. The gate only needs attention when one of its input signals is scheduled to change. Contrast these examples with traffic simulation. The event generator schedules new vehicles to enter the traffic network. The vehicle arrival times are placed in the event queue. At the appropriate time, the vehicle is popped off the event queue and enters the traffic network. Once moving within the network, the vehicle’s acceleration, velocity, and position require constant updates. Traffic is a type of simulation which requires continuous event processing by the scheduler during every simulation cycle.

Since traffic needs constant attention, the model naturally falls into the time-driven mode of simulation. Virtual circuit telephone simulation and logic simulation may be better suited to an event-driven model. Additional hardware, referenced in Section VI-B.4b), can be included in the simulator design, allowing the accelerated execution of both time and event-driven simulation.
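The telephone example can be sketched in software to show why it suits the event-driven mode: the only work performed is at call arrivals and at the secondary release events, with nothing to update in between. The following Python sketch is illustrative only; the function name, the tuple layout, and the blocking policy (a call that finds no free circuit is simply dropped) are assumptions and are not taken from the simulator described here.

```python
import heapq

RELEASE, ARRIVAL = 0, 1  # releases sort before arrivals at equal times


def simulate_calls(calls, circuits):
    """Event-driven sketch. `calls` is a list of (arrival_time, duration).

    Returns (completed, blocked, events_processed).
    """
    # Each event is (time, kind, duration); the heap orders by time first.
    events = [(t, ARRIVAL, d) for t, d in calls]
    heapq.heapify(events)
    free = circuits
    completed = blocked = processed = 0
    while events:
        time, kind, duration = heapq.heappop(events)
        processed += 1
        if kind == ARRIVAL:
            if free > 0:
                free -= 1
                # Secondary event: return the circuit when the call ends.
                heapq.heappush(events, (time + duration, RELEASE, 0.0))
            else:
                blocked += 1  # assumed policy: no circuit, call is dropped
        else:
            free += 1
            completed += 1
    return completed, blocked, processed


result = simulate_calls([(0.0, 5.0), (1.0, 2.0), (2.0, 1.0), (6.0, 1.0)], circuits=1)
```

The clock jumps directly from one event time to the next. A traffic model cannot do this, because every vehicle already in the network needs new state on every cycle.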

A. Event Versus Time Driven Simulation

The analysis in this section compares and examines event versus time-driven simulation. Section IV-A.1 illustrates the maximum speedup which can be expected from running under an event versus a time-driven approach. Section IV-A.2 uses statistics to find the solution point which optimizes runtime by selecting between the event and the time-driven mode.

1) Expected Advantage of Event versus Time-Driven Simulation: Reference [37] compares two distributed processing methods which are analogous to event and time-driven simulation. One method is asynchronous and the other is synchronous. In the synchronous method, after each subtask is completed, all processes must reach a barrier before being allowed to process the next subtask. The synchronous method is analogous to the time-driven approach. The asynchronous method does not require the barrier. Individual jobs run to completion as fast as they progress. Felderman demonstrates that the asynchronous method has an expected potential speedup over the synchronous method of no more than $\ln P$, where $P$ is the number of processors used. So the speedup gained from event-driven processing over time-driven processing will be no more than $\ln P$.

2) Decision between Event versus Time-Driven Modes: When confronted with a network of random event generators, the next expected event time to occur can be calculated using ordered statistics. Frequently, an objective is to determine the fastest car in a race or the heaviest mouse among those fed a certain diet [38]. Similarly, random variables can be ordered according to their magnitudes. For this work, the shortest expected arrival time must be found in order to determine which simulation approach, time or event-driven, is the most appropriate.

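Let $X_1, X_2, \ldots, X_n$ denote independent continuous random variables which have the distribution functions shown in (3)

$$F_{X_1}(x),\; F_{X_2}(x),\; \ldots,\; F_{X_n}(x). \tag{3}$$

The distribution functions of (3) have the corresponding density functions of (4)

$$f_{X_1}(x),\; f_{X_2}(x),\; \ldots,\; f_{X_n}(x). \tag{4}$$

Ordered random variables are denoted

$$Y_1,\; Y_2,\; \ldots,\; Y_n \tag{5}$$

where $Y_1 \le Y_2 \le \cdots \le Y_n$. Continuous random variables allow the equality signs to be dropped. So the maximum value of $X_1, \ldots, X_n$ is

$$Y_n = \max(X_1, X_2, \ldots, X_n) \tag{6}$$

and the minimum value is

$$Y_1 = \min(X_1, X_2, \ldots, X_n). \tag{7}$$

For this work, the goal is to determine the minimum next expected event time, which is $Y_1$. The distribution function of $Y_1$, denoted $F_{Y_1}(y)$, can be found as

$$F_{Y_1}(y) = P(Y_1 \le y) = 1 - P(X_1 > y, \ldots, X_n > y) = 1 - \prod_{i=1}^{n} \left[1 - F_{X_i}(y)\right]. \tag{8}$$

Taking the derivative of both sides yields the density function

$$f_{Y_1}(y) = \sum_{i=1}^{n} f_{X_i}(y) \prod_{j \ne i} \left[1 - F_{X_j}(y)\right]. \tag{9}$$

The expected time of the next arrival event can then be calculated by finding the expectation of $Y_1$ as follows:

$$E[Y_1] = \int_{-\infty}^{\infty} y\, f_{Y_1}(y)\, dy. \tag{10}$$

For an actual simulator, the computation of (10) would be automated given that the user supplies the appropriate $F_{X_i}$ and $f_{X_i}$. Specific statistical examples can be found in [35].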
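As an illustration of how the expectation in (10) might be automated, the sketch below evaluates the expected minimum arrival time numerically. It assumes nonnegative arrival times, for which the expectation of the minimum equals the integral of the product of the survival functions $1 - F_{X_i}(y)$ over $y \ge 0$. The exponential generators, their rates, the integration limit, and all function names are hypothetical choices made for the example.

```python
import math


def expected_min_arrival(cdfs, upper, steps=100_000):
    """Approximate E[min X_i] for nonnegative X_i with the trapezoid rule.

    Integrates prod_i (1 - F_i(y)) from 0 to `upper`; `upper` must be large
    enough that the remaining tail is negligible.
    """
    def survival(y):
        p = 1.0
        for cdf in cdfs:
            p *= 1.0 - cdf(y)
        return p

    h = upper / steps
    total = 0.5 * (survival(0.0) + survival(upper))
    for k in range(1, steps):
        total += survival(k * h)
    return total * h


def exponential_cdf(rate):
    """Distribution function of an exponential generator (assumed example)."""
    return lambda y: 1.0 - math.exp(-rate * y)


rates = [0.5, 1.0, 2.5]  # hypothetical generator rates, events per second
estimate = expected_min_arrival([exponential_cdf(r) for r in rates], upper=20.0)
exact = 1.0 / sum(rates)  # closed form for the minimum of exponentials
```

For exponential generators the minimum is itself exponential with the summed rate, which gives a closed form to check the numerical result against. One plausible use of the estimate is to compare it with the time-driven cycle length: an expected gap much longer than one cycle suggests that many time-driven cycles would find nothing to do.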

V. METHODS

The overall design employs three general architecture methods to optimize and accelerate its performance. The first method, providing perhaps the greatest acceleration, is the implementation of reconfigurable logic. Reconfigurable logic is composed of uncommitted elements whose interconnections can be programmed by the user. A design is partitioned among the available logic blocks. Signals among the blocks are then routed using a programmable interconnect network. For this research, reconfigurable logic provides a compromise between the flexibility of software programming and the speed of ASICs. The flexibility allows the operators to implement stochastic traffic generators of their choosing. The second employed method is referred to as systolic arrays, and these arrays are a natural consequence of the application of reconfigurable logic. In a systolic array [39], data is pumped from functional unit to functional unit at regular intervals, until the computation completes. Intermediate results can be passed along in the pipeline, instead of being written back to a register file after every instruction [40]. Systolic systems consist of interconnected cells, each of which is capable of performing a simple operation. Systolic systems tend to have uncomplicated communication and control structures which provide an advantage in design and implementation. Several cells are generally joined together to form an array or tree. Data flows through the cells which are pipelined together. The third method employed to accelerate parallel event simulation is the application of a reduction bus. A reduction bus is a communications structure which has the dual purpose of both communications and some minor simultaneous computation. A parallel bus is employed to synchronize the processing elements of the simulator and simultaneously calculate the value of the next smallest global simulation event time when the simulator runs in event-driven mode.
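The computation folded into the reduction bus, finding the smallest next event time across all processing elements, can be illustrated with a pairwise reduction. The Python sketch below is a software analogy only: the tree-shaped combining order and the stage count are assumptions made for illustration, and the electrical organization of the actual bus is not modeled.

```python
def reduce_min(values):
    """Pairwise reduction returning (minimum, number_of_combining_stages).

    `values` holds one candidate next-event time per processing element.
    """
    if not values:
        raise ValueError("at least one processing element is required")
    level = list(values)
    stages = 0
    while len(level) > 1:
        # Combine neighbours two at a time; within a stage the comparisons
        # are independent, so hardware could perform them simultaneously.
        nxt = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
        stages += 1
    return level[0], stages


next_time, stages = reduce_min([42.0, 17.5, 63.0, 17.0, 99.0])
```

The number of stages grows with the logarithm of the number of processing elements, which is why a reduction structure is attractive compared with polling each element in turn.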

VI. ARCHITECTURE

The proposed architecture is composed of multiple processing elements united and synchronized toward the common goal of accelerating discrete event simulation. In the case of traffic simulation, each processing element is responsible for simulating the vehicles within an intersection and the traffic on the intersection’s outgoing roads. Adjacent traffic intersections are similarly co-located on the nodes within the architecture to profit from the resulting data locality. Data dependencies are local to each processing element, providing ample opportunity for the concurrent processing of events on different nodes.

As illustrated in Fig. 1, the simulation architecture for each processing element is divided into the three main categories of event generation, storage, and scheduling. Additionally, there is the interconnection network which binds the elements into a single cohesive computing machine. The architecture of the system is subdivided into four smaller subarchitectures. An overview of the architecture is presented in Section VI-A. Individual processing element subcomponents are discussed in Sections VI-B.1–VI-B.3. Section VI-B.1 describes the event generation design. Section VI-B.2 describes the event queue, which stores events generated by the hardware described in Section VI-B.1. At each processing element, events are retrieved from the event queues and scheduled and processed by scheduler hardware described in Section VI-B.3. Finally, Section VI-B.4 describes the network which unites and synchronizes the processing elements.

A. Distributed Multiprocessors

Fig. 8 illustrates the operational environment of the simulator, which is similar to [12]. A general purpose machine serves as a user interface and system controller. The controller preprocesses the simulation data, partitioning and loading it across the various simulator processing elements. There may be more than one controller initializing the simulator to ensure reasonable response time due to the size and scalability of the system. However, if there are several controller units, one will be designated as the main controller, responsible for the initial simulation partitioning. Once the simulator is loaded, the main controller provides the initial start signal, illustrated in Fig. 9. The controller machines receive, post-process, and provide the simulation results to the user. Intermediate simulation results can be obtained by either programming the processing elements to automatically post results as certain thresholds are crossed or by interruptions from the controller. The interruptions allow the user to monitor simulation progress. Major adjustments are not possible without halting the simulator and reconfiguring processing element logic. Results include traffic runtime statistics and output values from monitored points at specific simulation times or conditions.

Fig. 8. System User Interface: The proposed preprocessing and post-processing operational environments for the simulator architecture are similar to [12]. The Simulator Processing Element Network is composed of a parallel reduction bus structure, a cross-point matrix, and a nearest neighbor interconnect. The parallel bus is used for synchronization and initialization. The cross-point matrix and the nearest neighbor interconnect are for interprocessor communications.

For the traffic simulation example, the road network is partitioned by the controller and distributed across the processing element network of Fig. 9. Each processing element receives a subsection of the traffic network map. Vehicle data flows into each processing element as the vehicle enters the corresponding traffic map section. A cross-point matrix and nearest neighbor connections are used to transport the vehicle data between processing elements. Greater detail of the processing element subcomponents is provided in the following sections. A controlling processor at the core of the simulator initiates each simulation cycle using the start signal in both its time and event-driven modes. In time-driven mode, the processing elements have already exchanged input values for the beginning of the next simulation cycle during the previous cycle using the communications structure. In event-driven mode, the processing elements must wait until the next event time is determined before exchanging data. Only data required for the next simulation cycle is exchanged by the processing elements, alleviating the need to exchange event scheduling times with the vehicle data. Processing elements signal that they are ready using the done signal line on the reduction bus, illustrated in Fig. 9.

In event-driven mode, when all processing elements have signaled the end of the current simulation cycle, the next event time is determined using the reduction bus of Section VI-B.4b). Data between the processing elements can then be exchanged for the next time cycle. The time-driven mode avoids the next simulation time cycle determination. Once all processing for the previous simulation time cycle is complete, the main controlling processor initiates the next simulation cycle using the start signal.
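The cycle control described above can be summarized as a small control loop. The sketch below is a sequential Python model under stated assumptions: each processing element is reduced to a list of pending event times, the names `run_cycles`, `dt`, and `end_time` are hypothetical, and data exchange between elements is omitted. It shows only how the controller chooses the time of the next start signal in each mode.

```python
def run_cycles(pe_events, mode, dt=1.0, end_time=10.0):
    """Return the simulation times at which the controller issues `start`.

    pe_events: one list of pending event times per processing element.
    mode: "time" advances by a fixed dt; "event" jumps to the global minimum.
    """
    queues = [sorted(q) for q in pe_events]
    clock = 0.0
    starts = []
    while clock <= end_time:
        starts.append(clock)  # controller raises `start` for this cycle
        for q in queues:  # in hardware the elements work concurrently
            while q and q[0] <= clock:
                q.pop(0)  # process every event that is due
        # At this point every element has raised `done`.
        if mode == "time":
            clock += dt  # no next-event determination is needed
        else:
            pending = [q[0] for q in queues if q]
            if not pending:
                break  # nothing left to simulate
            clock = min(pending)  # the value the reduction bus would supply
    return starts


event_starts = run_cycles([[2.0, 7.0], [2.0, 4.5]], mode="event")
time_starts = run_cycles([[2.0, 7.0], [2.0, 4.5]], mode="time")
```

With sparse events the event-driven run issues far fewer start signals than the time-driven run, at the cost of one global minimum determination per cycle.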

B. Processing Elements

Fig. 9. Processing Element Network: The simulator consists of a controller and a network of PEs, interconnected by both a shared parallel reduction bus and a dedicated communications structure. The communications structure is composed of cross-point matrices laid out in approximately fully connected star topologies. Each PE is also directly connected to the neighboring PEs.

The local processing element architecture, consisting of an event generator, local event queues, and a scheduler, is shown in Fig. 4. Processing elements perform the actual scheduling calculation of each discrete event in the system. The processing elements compute the acceleration, velocity, and position of each vehicle as it traverses the simulated road network. Each processing element is responsible for part of the overall simulation map, as partitioned by the controller during the simulation initialization. Parts of both the vehicle routing table and map attributes are implemented in reconfigurable logic before the simulation starts. The processing elements each contain a microprocessor, RAM, and EEPROM to provide added design flexibility.
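The per-cycle update of velocity and position that each processing element performs can be illustrated with a simple kinematic step. The integration scheme below (clamped velocity with a trapezoidal position update) is an assumption chosen for the example, as are the function and parameter names; it is not taken from the hardware design.

```python
def step_vehicle(position, velocity, acceleration, dt, speed_limit):
    """Advance one vehicle by one simulation cycle of length dt.

    Units are assumed to be metres, metres/second, and seconds.
    """
    # Velocity is clamped so a vehicle neither reverses nor exceeds the limit.
    new_velocity = min(max(velocity + acceleration * dt, 0.0), speed_limit)
    # Trapezoidal position update using the old and new velocities.
    new_position = position + 0.5 * (velocity + new_velocity) * dt
    return new_position, new_velocity


cruising = step_vehicle(0.0, 10.0, 2.0, dt=1.0, speed_limit=30.0)
braking = step_vehicle(0.0, 1.0, -5.0, dt=1.0, speed_limit=30.0)
```

Because this step must run for every vehicle on every cycle, its cost dominates a time-driven traffic simulation, which is the motivation for placing it in pipelined hardware.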

Each processing element contains hardware to exchange simulation data with other elements connected to the communications structure of Fig. 9. Inbound events are handled by additional small communications FIFO queues not illustrated in Fig. 4. These communications queues are used to maintain the ordered inbound events received from other processing elements and the ordered outbound events sent to other processing elements.

1) Event Generation: Speedup of event-driven simulation is attacked from two vantage points. First, a separate event generator is created which functions in parallel with, and independently from, the event scheduler. The event generator computes event arrival times, service times, and resource requirements with some partial parallelism (see Fig. 10). The resulting event objects are stored in a memory queue which is accessible to the scheduling software. Although some data dependency exists during event generation, partial parallelism at this stage is reasonable. Data dependency exists because event arrivals are calculated as random offsets from the previous event’s arrival time. Also, service durations are calculated as random offsets from the event’s arrival time. Partial parallelism is beneficial because the arrival and service offsets are not themselves dependent on anything. However, those offsets must then be added to either the previous or current event’s arrival time, respectively.

Fig. 10. The Event Generator Flow Diagram: The event generator of Fig. 1 is subdivided into arrival and service time generation. The time offsets can be created in parallel. This design converts event generation software into a two-dimensional (2-D) reconfigurable systolic array. Reconfigurable logic boosts the execution speed of event generation by fostering parallel computation. In the systolic array depicted above, data is pumped from one processing block to the next at regular intervals until the data circulates to the event queue.

Speedup is accomplished by translating some simulation loop software into parallel, systolic hardware and by implementing some filter processing as parallel hardware. The hardware is designed through a combination of reconfigurable logic technology, systolic arrays, and content addressable memory. The event generator from Fig. 1 is translated into both software and the hardware of Fig. 10 for timing comparisons.

In the hardware version, multiple calculations happen simultaneously. First, the three outer pipeline blocks of Fig. 10, Create Service Time Offset, Create Arrival Time Offset, and Set Resources, execute simultaneously. Create Service Time Offset and Create Arrival Time Offset generate the Poisson arrival and service times of (1) and (2). Next, the arrival time offset is added to the current clock time to determine the actual arrival time in the Add-Offset-to-Previous-Arrival-Time block. In the next step, the service time offset is added to the actual arrival time, yielding the time at which the event is finished and its resources become available again. Simultaneously, the start event data is matched to its resource requirements. The Add-Offset-to-Current-Arrival-Time block pumps out its value in the next step. However, when the pipe is loaded, start and finish events emerge from the pipeline simultaneously, with each cycle.
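The data flow of the event generator can be mirrored in a few lines of software. The Python sketch below assumes exponentially distributed offsets produced by the inverse transform $-\ln(U)/\text{rate}$, an assumption that is consistent with the natural logarithm units of the hardware but is not confirmed by the text available here; the function name, the rate parameters, and the seed are hypothetical. The two offsets are independent of one another, which is the part the hardware computes in parallel, while the two additions carry the data dependency.

```python
import math
import random


def generate_events(count, arrival_rate, service_rate, seed=0):
    """Return a list of (arrival_time, finish_time) pairs."""
    rng = random.Random(seed)
    events = []
    previous_arrival = 0.0
    for _ in range(count):
        # Independent offsets: candidates for parallel evaluation.
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        arrival_offset = -math.log(1.0 - rng.random()) / arrival_rate
        service_offset = -math.log(1.0 - rng.random()) / service_rate
        # Dependent additions: each arrival builds on the previous arrival,
        # and each finish time builds on the current arrival.
        arrival = previous_arrival + arrival_offset
        finish = arrival + service_offset
        events.append((arrival, finish))
        previous_arrival = arrival
    return events


events = generate_events(count=1000, arrival_rate=2.0, service_rate=0.5)
```

In a pipelined implementation the offset generation for event i+1 overlaps with the additions for event i, so one start/finish pair can emerge per clock once the pipe is full.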

The hardware version of Section III-A.1 was modeled using Altera’s MaxPlus II® Field Programmable Gate Array (FPGA) simulation package. The design, written in the AHDL language, used the Flex 10K series FPGA chips. The MaxPlus II® design automation package consists of a series of tools including an editor, a compiler, and a simulator. The editor allows designs to be entered as text files in Altera high-level design language (AHDL), Verilog, or VHDL (VHSIC hardware description language, where VHSIC stands for very high-speed integrated circuits). The compiler translates the design into files for simulation, timing, and device programming. The MaxPlus II® simulator provides timing information and allows design functionality to be verified.


TABLE V
EVENT GENERATOR AND EVENT QUEUE FPGA IMPLEMENTATION: THE ALTERA FPGAS USED TO SIMULATE THE EVENT GENERATION AND EVENT LIST HARDWARE ARE ITEMIZED. THE NATURAL LOGARITHM UNIT USES PAIRS OF FPGAS TO FACILITATE ONE RESULT PER CLOCK CYCLE. THE LINEAR ARRAY IMPLEMENTATION IS DESCRIBED IN SECTION VI-B-2.A

The Altera Flex 10K series of FPGAs, which serve as reconfigurable logic, have the following features. The devices contain 10 000 to 100 000 typical gates, 720 to 5392 registers, and 6144 to 24 576 RAM bits. Additional routing features on the chip facilitate predictable interconnect delays, which provide reliable simulation results. Software design support and automatic place-and-route tools are provided by Altera’s MaxPlus II® development system.

a) Event Generator Results: The design depicted in Fig. 10 was translated into AHDL. The event generator is synthesized as a combination of five chips. Four logarithm units are required, two producing their results on the even clock cycles and the other two producing results on the odd clock cycles. The Create Arrival Time Offset and the Create Service Time Offset blocks of Fig. 10 each require one odd and one even logarithm unit. In simulation at a 200 ns clock rate, the hardware version requires 200 ns, producing one event per clock cycle. Therefore, we achieve a speedup of 150. This speedup is just for the translation of the event generation software code as pipelined, systolic hardware. The event generator implementation results are listed in Table V.

2) Event Queue: The second problem strike point occurs within the queue of waiting events. This queue is designed to hold the events in order of their arrival. The proposed event queue inserts new elements in $O(1)$ time and can pop the smallest element off in $O(1)$ time.

a) The Linear Array: The service queue mechanism consists of a linear array which is described in [41] and illustrated in Fig. 11. The linear array maintains fast access to the minimum time-stamped event. All new simulation events are inserted into the left-most array element, the queue head, and when removed, the elements are popped off the head. Each element of the queue contains two registers and a comparator. The larger of the two resident elements may be passed to the right, and the smaller of the two elements may be passed to the left. Therefore, the smallest entry is always at the left-most queue element. Comparators in each element and the queue push/pop signal steer the 2 × 2 multiplexor logic to route the correct entries into and out of the processing element registers.

Fig. 11. Linear Array Queue: The queue consists of a linear array of processing elements. All new elements are passed into the left-most array element, and when removed, the elements exit the same left-most element. Each element of the queue contains two registers and a comparator. The larger of the two resident elements may be passed to the right, and the smaller of the two elements may be passed to the left. Therefore, the smallest entry is always at the left-most queue element. Comparators in each queue element steer the multiplexor logic to route the correct entries in and out of the processing element registers.

The service queue always has the smallest element at its head, whose position can be reasoned as follows. Assume that at some time, $t$, the queue contains $n$ elements. Therefore, the left-most element, $e_1$, has examined a sequence of $n$ values, retaining the smallest value. This value can be popped off in one move. The element to $e_1$’s right, $e_2$, has examined at least $n-1$ values, so the second smallest value can be either at element $e_1$ or at element $e_2$, but it must be in one of those two places and can be accessed in two moves since the smallest element must be removed first.

The $k$th smallest element to enter the array is in any position from $e_k$ down to $e_1$. Then the next smallest element to enter the queue will be in any position from $e_{k+1}$ down to $e_1$, which provides the inductive step for the $(k+1)$th smallest element. So the $k$th smallest element can always be retrieved in $k$ steps. This queue allows us to push and pop each element in $O(1)$ time. Examples are illustrated in Figs. 12 and 13.

Fig. 12 illustrates a sequence of values being pushed into the array. The top array illustrates the first time step, with each successive array depicting the same array during the next clock cycle. Comparators on each processing element and their associated multiplexers steer the values into each element of the array. Larger elements are pushed to the right. When events are popped off the queue, the analogous sequence of steps is illustrated in Fig. 13. Smaller elements are pushed to the left during insertion.
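The behavior of the linear array can be approximated in software to make the push and pop rules concrete. The Python class below is a functional sketch under simplifying assumptions: each cell holds a single value rather than two registers, and the ripple of larger values to the right completes inside one call instead of advancing one cell per clock as it would in hardware. The class and method names are hypothetical.

```python
class LinearArrayQueue:
    """Functional model of a sorted linear array with a fixed cell count."""

    def __init__(self, size):
        self.cells = [None] * size  # occupied cells are contiguous from the left

    def push(self, value):
        if self.cells[-1] is not None:
            raise OverflowError("queue is full")
        carry = value
        for i, resident in enumerate(self.cells):
            if resident is None:
                self.cells[i] = carry  # first empty cell takes the carry
                return
            if carry < resident:
                # Keep the smaller value here, pass the larger to the right.
                self.cells[i], carry = carry, resident

    def pop(self):
        if self.cells[0] is None:
            raise IndexError("queue is empty")
        smallest = self.cells[0]
        # Every remaining value moves one cell to the left.
        self.cells = self.cells[1:] + [None]
        return smallest


queue = LinearArrayQueue(5)
for timestamp in (7, 3, 9, 1, 5):
    queue.push(timestamp)
snapshot = list(queue.cells)
first, second = queue.pop(), queue.pop()
queue.push(2)
third = queue.pop()
```

In the hardware the head cell accepts a new value or releases the minimum in a single clock because the comparisons further down the array proceed in parallel, whereas this model performs them one after another.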

b) The Queue Model Results: The implemented hardware service queue is a five-element design closely resembling Fig. 11. The linear array queue is capable of pushing one 16-bit value per 40 ns. The smallest queue value can also be popped out at that rate. It is assumed that each simulator cycle needs to push one event and pop one event from the service queue. Therefore, the queue achieves an 80-ns cycle time. Queue data values also require pointers to the event data, so pairs of values must be pushed onto and popped off the queue. Conversely, when new elements are pushed into a software data structure, the existing software elements must be fetched from memory to allow the CPU to compare the stored elements to the new arrival so that the insertion point can be determined, or an address must be calculated to determine a proper bin on which to chain the new entry for hashing. Software methods require more time and variable amounts of it.

Fig. 12. Linear Sort Array Input Example: A sequence of values being pushed into the array is illustrated. The top array shows the first time step, with each successive array below depicting the same array during successive clock cycles. Comparators and multiplexers associated with each element steer the values of the array shown in Fig. 11. Larger elements are pushed to the right.

Using Altera’s MaxPlus II® FPGA simulation package, the Event Generator and the Service Queue have been simulated as individual parts running with a clock rate of 80 ns. The service queue was simulated, allowing it to push an event during the first 40-ns half cycle and pop an event during the next 40 ns. A five processing element queue was implemented on one Altera EPF10K20TC144-3 chip utilizing 90% of the chip’s resources.

3) Scheduler: The results of Section III-B.1 determined that discrete event simulation scheduling algorithms are an important target of acceleration research. However, unlike the event generation and event queue implementations, the scheduler implementation is very simulation dependent. For instance, a discrete event simulation of road traffic might have a very different scheduling algorithm than a biological scenario. For this study, the simulation of traffic was selected. The nature of microscopic traffic simulation is the determination of position, velocity, and acceleration along with routing and other considerations. Traffic simulation has the added benefit of a significant amount of data locality. Vehicles in a system tend to dwell in the same neighborhoods and their data dependencies rely on additional data associated with that locality. Even when vehicles move, they move to an adjacent node within the traffic network. The work in this study, and especially within this section, must be viewed in light of these properties of traffic simulation.

The queues in Fig. 14 are not event queues as illustrated in Fig. 1, where the number of vehicles generated and expelled is determined by the distribution of the user’s selected statistical generator. The event queues of Section VI-B.2 are required only in conjunction with event generation and only on simulation source nodes. Conversely, for the traffic model scheduler, all queued vehicles in Fig. 14 are processed every simulation time cycle. The queued data represent vehicles moving on either a road or through an intersection. Each vehicle is circulated from the appropriate queue into the updating calculation hardware of either Fig. 15 or Fig. 16, and then either back to the main FIFO or to the next traffic network node once per simulation cycle.

Fig. 13. Linear Sort Array Output Example: The figure illustrates a sequence of values being popped out of the array. Comparators on each array element and multiplexers between each element steer the values moving through the array. Smaller elements are pushed to the left.

Fig. 14. Scheduler Vehicle Queue: The scheduler vehicle queue implementation is similar in both movement on a road and movement in an intersection. Two FIFOs are maintained. Newly arrived vehicles are placed in the entry FIFO. The main FIFO is for those vehicles which are in progress, either along a road or through an intersection. The comparator between the two queues selects the vehicle with the most advanced position down the lane, and routes that vehicle’s data into the appropriate functional units of Fig. 15 or Fig. 16 for either road-handling or intersection-handling movement calculations, respectively. Vehicle data sets which are circulated back from either Fig. 15 or Fig. 16 are placed in the main FIFO for the next simulation time cycle’s computation.
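The two-FIFO arrangement of Fig. 14 can be sketched in software as follows. The Python model is an illustration under stated assumptions: vehicles are plain dictionaries with hypothetical `id` and `position` fields, the update hardware of Figs. 15 and 16 is replaced by a caller-supplied function, and every processed vehicle is recirculated to the main FIFO, so hand-off to the next network node is not modeled.

```python
from collections import deque


def next_vehicle(entry_fifo, main_fifo):
    """Pop the head vehicle with the most advanced position, or None."""
    if not entry_fifo and not main_fifo:
        return None
    if not entry_fifo:
        return main_fifo.popleft()
    if not main_fifo:
        return entry_fifo.popleft()
    # Comparator: the vehicle furthest down the lane is served first.
    if entry_fifo[0]["position"] > main_fifo[0]["position"]:
        return entry_fifo.popleft()
    return main_fifo.popleft()


def run_cycle(entry_fifo, main_fifo, update):
    """Process every queued vehicle once and recirculate the results."""
    processed = []
    for _ in range(len(entry_fifo) + len(main_fifo)):
        processed.append(update(next_vehicle(entry_fifo, main_fifo)))
    main_fifo.extend(processed)  # ready for the next simulation cycle
    return [vehicle["id"] for vehicle in processed]


main_fifo = deque([{"id": "a", "position": 80.0}, {"id": "b", "position": 35.0}])
entry_fifo = deque([{"id": "c", "position": 50.0}])
order = run_cycle(
    entry_fifo, main_fifo,
    update=lambda v: {**v, "position": v["position"] + 1.0},
)
```

Serving vehicles from the front of the lane backward means each vehicle's leader has already been updated when the follower enters the pipeline, which is what the car-following calculation requires.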

The scheduler hardware implementation for the selected traffic example takes significant advantage of data locality. The selected model distinguishes road traffic and intersection traffic. Vehicles are assumed to be initialized and injected onto a road. Properties such as the speed limit, grade, and other road characteristics are considered constant with respect to each road. The vehicle is composed of data which includes the destination, velocity, type, etc. Some vehicle properties, free-flow acceleration for example, need not move with the vehicle, but can be locally accessed based on the vehicle type, road grade, and current velocity in each simulation node.

Fig. 15. Calculations for Vehicle Movement on a Road: The steps required to calculate a vehicle’s movement along a road are represented. For each simulation time cycle, vehicles on the road are moved from the main FIFO queue and processed to adjust their acceleration, velocity, and position. The vehicle data words are stored in the queues of Fig. 14 and are passed to the multiplexer at the top center of the diagram. Each vehicle passing through the calculation pipeline may depend on its immediate predecessor’s calculation if the previous vehicle is the current vehicle’s leader in traffic. Because the lead time required to calculate the acceleration is four cycles and because a dependency may exist where this vehicle’s acceleration may be required to determine the following vehicle’s acceleration, all the possible acceleration outcomes commence calculation immediately and concurrently. Acceleration determines the duration of each vehicle’s process time in the pipeline. As the accelerations are being computed, the vehicle’s traveling lane and possible lead vehicle are determined. The appropriate acceleration is selected based on the vehicle’s relation to its leader, the vehicle’s distance to the end of the road, the traffic signal value at the end of the road, the vehicle’s previous velocity, the speed limit, etc. The block diagram was implemented as a six-stage pipeline.

The challenge to describe vehicular flow in a microscopic manner led Reuschel and Pipes to formulate the phenomena of the motion of pairs of vehicles following each other as described in (11) [29]. The derivation of the expression is described graphically in Fig. 17

$$x_n - x_{n+1} = L + S\,\dot{x}_{n+1}. \tag{11}$$

Differentiation of (11) leads to (12), which is referred to as the basic equation of the car-following models. Research groups associated with the General Motors Corporation developed a linear mathematical formula which fitted well against high-density traffic data. The formula they derived, with the desire to maintain a linear relationship, is provided in (13). Equation (13) differs from (12) by the introduction of $T$, which is defined to be the time lag response to the stimulus [29]

$$\ddot{x}_{n+1} = \lambda\left[\dot{x}_n - \dot{x}_{n+1}\right], \qquad \lambda = \frac{1}{S} \tag{12}$$

$$\ddot{x}_{n+1}(t+T) = \lambda\left[\dot{x}_n(t) - \dot{x}_{n+1}(t)\right]. \tag{13}$$

Reference [27] changed the linear property of (13) by al-lowing the constant sensitivity factor,, to become inverselyproportional to the separation distance between the vehicles.Gazis’s modification is illustrated in (14), where is a new

Fig. 16. Calculations for Vehicle Movement Through an Intersection: A vehicle’s movement through an intersection is similar to its movement along a road as illustrated in Fig. 15. Again, the acceleration calculations determine the length of the six-cycle pipeline. Movement through the intersection is slightly different from movement along a road. For instance, it is assumed that there is no traffic signal at the end of the intersection lane. For this study, a polar coordinate system was applied in the intersections so the angular acceleration, velocity, and distance are computed for each vehicle.

Fig. 17. Car-Following Acceleration: The challenge to describe vehicular flow in a microscopic manner led Reuschel and Pipes to formulate the phenomena of the motion of pairs of vehicles following each other by the expression $x_n - x_{n+1} = L + S(\dot{x}_{n+1})$ [29]. In this expression, it is assumed that each driver maintains a separation distance proportional to the speed of his vehicle, $\dot{x}_{n+1}$, plus a constant distance $L$, which is composed of the length of the vehicle plus a distance headway as determined at standstill, $\dot{x}_n = \dot{x}_{n+1} = 0$. The constant $S$ is measured in time.

constant. In work subsequent to [27], Gazis proposed a more general expression yielding the final version of the acceleration formula for microscopic car following illustrated in (15)

(14)

(15)

Equation (14) is the special case of (15) obtained for particular values of its parameters. Equation (15) was used in the software simulator of Section III-C and in the hardware reconfigurable logic implementations resulting from Figs. 15 and 16. In both the hardware and software simulations, the parameter values used are those of [31].

The formula for deceleration during speeding is derived from the basic principles of (16), (18), and


(20). Setting one term to zero reduces (16) and (18) to (17) and (19), respectively

(16)

(17)

(18)

(19)

(20)

Then, combining (17), (19), and (20), setting a further term to zero, and eliminating the common variable yields the final form, (21). This form is used in both the hardware and software models to compute deceleration when a vehicle is speeding. By letting the target velocity go to zero, the same equation is also used to stop at the end of a road

(21)

For this project, all roads are assumed to be straight, running either north/south or east/west. For turning computations required within intersections, angular acceleration equations analogous to (15) and (21) are used.
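As a rough illustration of the car-following rule referenced in (15), the sketch below uses the general form commonly given in the car-following literature, in which the follower’s acceleration is a sensitivity term multiplied by the relative velocity. The function name, the sensitivity `alpha`, the exponents `m` and `l`, and all numeric values are illustrative assumptions and are not the parameter values of [31]; the response time lag is omitted for brevity.

```python
def car_following_accel(x_lead, v_lead, x_follow, v_follow,
                        alpha=1.0, m=0.0, l=1.0):
    """Follower acceleration under a generic car-following rule.

    accel = alpha * v_follow**m / (x_lead - x_follow)**l * (v_lead - v_follow)

    alpha, m and l are hypothetical placeholders, not values from the paper.
    """
    gap = x_lead - x_follow
    if gap <= 0:
        raise ValueError("leader must be ahead of follower")
    sensitivity = alpha * (v_follow ** m) / (gap ** l)
    return sensitivity * (v_lead - v_follow)


# A follower at 15 m/s closing on a 10 m/s leader 20 m ahead must decelerate.
a = car_following_accel(x_lead=120.0, v_lead=10.0, x_follow=100.0, v_follow=15.0)
print(a)
```

With `m = 0` and `l = 1` the sensitivity is inversely proportional to the separation distance, which is the behavior described for the modification in (14); with `m = 0` and `l = 0` the rule reduces to a linear one.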

Vehicle acceleration on roads in the scheduler is determined by a variety of conditions. One factor is whether or not the vehicle has a leader within its headway. The headway is a 4-s following time based on the vehicle’s velocity. Other acceleration criteria include the distance to the end of the road, the value of the traffic signal at the end of the road, and the vehicle’s velocity relative to the speed limit. If the vehicle is determined to be in a following mode, the vehicle’s acceleration is calculated using (15). Vehicles which do not follow a leader, are not approaching the end of a road, and are not speeding use a table lookup to determine their free-flow acceleration. Similar to [31], the acceleration is determined by the most restrictive constraint of (22). Tables VI and VII provide the algorithms used to determine vehicle acceleration

(22)

Note that although particular movement algorithms were selected for this study, the simulator is reconfigurable. Therefore, users are free to model traffic movement with algorithms of their choosing. The quantity of available reconfigurable logic is the only constraint on the size of the possible traffic algorithms.
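The acceleration selection described above can be sketched as follows. This is only an illustration under stated assumptions: the candidate accelerations are combined by taking their minimum (one reading of “most restrictive constraint”), the braking term uses the constant-acceleration relation $v_t^2 = v^2 + 2ad$, the speeding branch brakes over the headway distance, and the function names and arguments are hypothetical stand-ins rather than the contents of Tables VI and VII.

```python
HEADWAY_S = 4.0  # following time from the text, in seconds


def braking_accel(v, v_target, distance):
    """Constant acceleration that reaches v_target after `distance` metres."""
    return (v_target ** 2 - v ** 2) / (2.0 * distance)


def choose_accel(v, speed_limit, dist_to_end, free_flow_accel,
                 leader_gap=None, follow_accel=None, must_stop=True):
    """Return the most restrictive (smallest) candidate acceleration."""
    candidates = [free_flow_accel]          # table-lookup value (assumed given)
    headway_dist = HEADWAY_S * v
    # Following mode: a leader lies within the 4-s headway distance.
    if leader_gap is not None and leader_gap <= headway_dist:
        candidates.append(follow_accel)
    # Speeding: brake toward the speed limit (distance choice is an assumption).
    if v > speed_limit:
        candidates.append(braking_accel(v, speed_limit, headway_dist))
    # End of road with a stop sign: brake to zero velocity at the road end.
    if must_stop and 0 < dist_to_end <= headway_dist:
        candidates.append(braking_accel(v, 0.0, dist_to_end))
    return min(candidates)


print(choose_accel(v=10.0, speed_limit=15.0, dist_to_end=25.0, free_flow_accel=1.0))
```

Taking the minimum keeps the sketch short; a hardware version would evaluate all candidates concurrently and select among them, as the caption of Fig. 15 describes.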

c) Scheduler Results: The scheduler shows the least speedup of the sections attempted. The major limitation of the experimental design lies in the division functional unit and the data dependency between leading and following vehicles. An implementation of just a simple division functional unit with registered input and output ports achieved a clock rate of 9.14 MHz, so one impediment to faster implementation on the FPGAs is division. During the fitting of the traffic designs, the slowest routing implementation paths were composed of division signal lines. If AHDL division library routines cannot be accelerated, providing hardwired division functional units

TABLE VI. ACCELERATION DECISIONS FOR ROAD: THE ALGORITHM FOR DETERMINING VEHICLE ACCELERATION DURING ROAD TRAVEL IS PROVIDED. ONLY STOP-SIGN TRAFFIC SIGNALS WERE SIMULATED; THEREFORE, VEHICLES STOP AT THE END OF A ROAD BEFORE PROCEEDING INTO THE INTERSECTION. ACCELERATION FOR FREE-FLOW TRAFFIC WAS DETERMINED BY TABLE LOOKUP BASED ON VEHICLE SPEED AND TYPE

on FPGAs would certainly accelerate the traffic implementations. A detailed processing element design capable of handling a four-way intersection is illustrated in Fig. 18.

Comparing Tables VIII and IX, system bottlenecks can be seen to occur within the function for traversing an intersection. In software, this routine required 48.4 µs. In hardware, due to a data dependency in calculating acceleration between consecutive vehicles, four cycles of a pipeline running at a 7.54-MHz clock rate are required. Therefore, the speedup of the hardware implementation over the software implementation is 91.
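The speedup figure can be checked directly from the numbers quoted above. The short calculation below assumes only that one vehicle is completed every four clock cycles in the limiting hardware module; the variable names are illustrative.

```python
SOFTWARE_TIME_US = 48.4      # intersection routine in software (Table VIII)
CLOCK_MHZ = 7.54             # clock rate of the limiting hardware module (Table IX)
CYCLES_PER_VEHICLE = 4       # cycles imposed by the data dependency

# A clock of f MHz delivers f cycles per microsecond.
hardware_time_us = CYCLES_PER_VEHICLE / CLOCK_MHZ
speedup = SOFTWARE_TIME_US / hardware_time_us

print(f"hardware time per vehicle: {hardware_time_us:.3f} us")
print(f"speedup: {speedup:.1f}")
```

Four cycles at 7.54 MHz take about 0.53 µs, and the ratio rounds to the reported factor of 91.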

4) Network: For simulation acceleration to be successful, speedup must occur within all facets of the architecture, including the processing element interconnection network. Section VI-B.4 presents a method of synchronizing individual nodes to form a processing element network capable of determining the smallest time-stamped event rapidly. The basic processor model used to implement the local processing elements is illustrated in Fig. 4.


TABLE VII. ACCELERATION DECISIONS FOR INTERSECTION: THE ACCELERATION DECISIONS FOR TRAVERSING AN INTERSECTION ARE SIMILAR TO THE DECISIONS FOR TRAVEL ON A ROAD EXPLAINED IN TABLE VI, BUT IT IS ASSUMED THAT A VEHICLE DOES NOT STOP AT THE END OF THE INTERSECTION BEFORE MOVING ON TO THE NEXT ROAD. FOR TURNING LANES, THE CORRESPONDING ANGULAR ACCELERATION EQUATIONS WERE IMPLEMENTED

The simulator is composed of individual nodes joined in a network. To prevent causality errors in conservative simulation, all nodes process the same simulation cycle simultaneously. In conservative event-driven simulation, all individual nodes jump to the simulation cycle which coincides with the smallest time-stamped event held within the network. Logistical difficulties occur in both the communications and sorting of event time-stamps. Each node’s local minimum time-stamp must be compared against all of the local minimum time-stamps in the global network.

In a simulation network, as shown in Fig. 19, nodes are generally synchronized using either a time-driven or an event-driven simulation approach. A single network architecture is constructed, allowing a simulation to run following either the time-driven or the event-driven model. The decision between the two models is made at the beginning of the simulation based on the analysis of Section IV, and the selected model is used for the simulation duration. A communications network which can be used to determine and select the smallest time-stamp in a network of nodes when running in event-driven mode is presented. A time-driven solution is also presented using the same implementation.

a) Communications Architectures: Communications synchronization has often been a source of delay. In work on the CM-5, Legendza notes that synchronization overhead accounts for 70% to 90% of the total simulation runtime and therefore severely limits speedup [43]. Traditional approaches in multi-processor simulation search for the smallest next time-stamp in a network of processing elements. The model may have active simulation model nodes distributed across the processing elements in a balanced fashion, but each processing element will have one minimum time-stamp for the model nodes it handles. Each processor time-stamp must be compared against the other minimum time-stamps in the network. Some of the more commonly expected network search

Fig. 18. Processing Element (PE) for Four-Way Intersection and Exit Roads: A detailed version of the PE design depicted in Fig. 4 is illustrated. This design is capable of modeling four-way intersections and consists of enough scheduler subcomponent units to model the traffic entering an intersection from four directions and exiting the intersection on four output roads. A single event generator module which contains its associated arrival and service queues is included to allow the PE to serve as a simulation source node. The PE contains four nearest neighbor interconnect FIFOs and a communications FIFO pair, which connects to its corresponding cross-point switch. An additional interface connects the processing element to the parallel bus illustrated in Fig. 22. A central crossbar matrix, similar to the Splash design described in [42], connects the various processing element subcomponents. The design described here requires approximately 30–34 FPGAs. Six FPGAs are required for the event generator and the event queues. Each scheduler subcomponent used in calculating vehicle movement requires three FPGAs, yielding a total of 24 FPGAs for the eight scheduler subcomponents. Additional FPGAs are reserved for channel control.

TABLE VIII. SCHEDULER SOFTWARE FUNCTION PROFILE: THE FOUR MODULAR VEHICLE MOVEMENT FUNCTIONS FROM TRAFIX WERE TIMED ON A 600-MHz PENTIUM III BOX RUNNING SuSE LINUX 7.0. THE TIMES PRESENTED ARE IN MICROSECONDS. THE SOFTWARE BOTTLENECK IS IN THE INTERSECTION HANDLING FUNCTION. THE TIME SHOWN IS THE TIME ELAPSED DURING FUNCTION EXECUTION. THE FUNCTIONS ARE EXECUTED EACH TIME A VEHICLE IS PROCESSED

algorithms include network structures constructed as K-ary trees, as depicted in Fig. 20. Determining the minimum time-stamp in such a network requires a number of communications steps proportional to the tree depth, on the order of $\log_K N$ for $N$ processing elements. The smallest time-stamp is filtered to the root of the tree, and from there, the result must be distributed to the rest of the network. This distribution requires a comparable number of additional communications steps.
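The tree search just described can be sketched in a few lines. The sketch assumes one communications step per tree level and is not taken from the paper; `tree_min` is a hypothetical helper name.

```python
def tree_min(timestamps, k):
    """Reduce local minima up a K-ary tree; return (minimum, levels climbed)."""
    level = list(timestamps)
    steps = 0
    while len(level) > 1:
        # Each parent keeps the smallest of up to k children.
        level = [min(level[i:i + k]) for i in range(0, len(level), k)]
        steps += 1
    return level[0], steps


minimum, up_steps = tree_min([7, 3, 9, 5, 12, 4, 8, 6], k=2)
round_trip_steps = 2 * up_steps   # up to the root, then back down to every node
print(minimum, up_steps, round_trip_steps)
```

For eight nodes and a binary tree, the minimum reaches the root in three steps and the round trip takes six, which is why the depth of the tree dominates the synchronization cost.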

Another view of the simulation notes that the larger the number of event generators which exist in the system, the shorter the expected time to the next event. Although the examples from [35] use homogeneous distributions, it is assumed that the trend holds for independent heterogeneous distributions as well. So the larger the number of event generators in the simulation, the faster the events will arrive, and the


TABLE IX. SCHEDULER CHIP IMPLEMENTATION: THE SCHEDULER SOFTWARE WAS IMPLEMENTED AS FIVE SEPARATE COMPONENTS. THE INITIALIZE VEHICLE SETS THE VEHICLE’S SOURCE LOCATION AND DESTINATION AS THE VEHICLE IS INJECTED INTO THE TRAFFIC STREAM. ALTHOUGH ITS CLOCK SPEED IS ONLY 5.44 MHz, IT CAN PROCESS VEHICLES ONCE PER CYCLE AND IS THEREFORE NOT A BOTTLENECK FOR THE SIMULATOR. VEH INTERSECT INIT PREPARES VEHICLES FOR TRANSIT THROUGH AN INTERSECTION BY INITIALIZING THEIR STARTING COORDINATES AND LANE DESIGNATION. THE MODULE ALSO PERFORMS SOME VEHICLE ROUTING. THE MOVE IN INTERSECT IMPLEMENTATION IS THE SYSTEM BOTTLENECK. A DATA DEPENDENCY BETWEEN CONSECUTIVE VEHICLES REQUIRES FOUR CLOCK CYCLES TO PROCESS BEFORE THE SUBSEQUENT VEHICLE ACCELERATION CALCULATION BEGINS, AND THE CLOCK ONLY RUNS AT 7.54 MHz DUE TO DIVISION OPERATIONS IN SOME OF THE PIPELINE STAGES. TABLE VIII ILLUSTRATES THAT FOR THE TIMED TRAFIX SCHEDULER FUNCTION, THE BOTTLENECK RESIDES IN THE INTERSECTION ROUTINE. HERE, THE ROUTINE REQUIRES FOUR CYCLES, SO THE SPEEDUP ATTAINED BY THE HARDWARE OVER THE SOFTWARE IS 91

Fig. 19. A Network of Processing Elements: A simulation consists of a network of event sources, sinks, and way points. Each must be synchronized to the global system time clock. Two common methods of synchronization are time-driven and event-driven synchronization. The analysis in Section IV can be used to gauge which method is faster. The illustrated time-driven simulation uses a controller/subordinate approach similar to Levendel [12]. The network core illustrated in Fig. 23 serves as the main synchronizer, which asserts the start line at the beginning of each time cycle. Each network processor signals it is ready for the next time cycle by asserting its done line. The start and done lines are configured as reduction network lines illustrated in Fig. 21.

smaller the mean time between events becomes. As the number of event generators increases, time-driven simulation becomes more and more practical.
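The trend can be illustrated with a small calculation. The sketch assumes, purely for illustration, that every event generator is an independent Poisson source; in that case the wait until the first event among all sources is exponentially distributed with a rate equal to the sum of the individual rates. The rates used are hypothetical.

```python
def expected_time_to_next_event(rates):
    """Mean wait until the first event among independent Poisson sources.

    The minimum of independent exponential waits with rates r_i is itself
    exponential with rate sum(r_i), so its mean is 1 / sum(r_i).
    """
    total = sum(rates)
    if total <= 0:
        raise ValueError("at least one positive rate is required")
    return 1.0 / total


for n in (1, 10, 100):
    # n identical generators, each producing 0.5 events per time unit
    print(n, expected_time_to_next_event([0.5] * n))
```

Because the rates are simply summed, the same function covers heterogeneous generators; once the expected gap approaches one simulation cycle, stepping every cycle wastes little work.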

b) Parallel Bus Architecture: The simulator incorporates a parallel bus architecture accelerating event-driven simulation synchronization. Microscopic traffic simulation, as noted in Section IV, and as modeled in Section VI-B.3, requires processing updates every simulation cycle. Once vehicles are injected into the traffic network, their position, velocity, and acceleration are computed continually. Therefore, traffic, as modeled in this paper, conforms naturally to a time-driven model. However, the introduction of the parallel bus architecture allows the simulator to further accelerate simulations which are more suitable to running in an event-driven mode. The addition of the parallel bus architecture greatly enhances

Fig. 20. K-ary Search Tree Network: The K-ary search network topology allows N processing elements in a network to compare individual local minimum time-stamp results to the winner of the K elements on the level below. Successive winners compete in tournament-style comparisons.

the simulator and broadens its functionality to cover a wider suite of applications. The parallel bus architecture is mentioned here and covered in greater detail in [21].

For the proposed algorithm, several transmitters must share the bus and be able to generate signals simultaneously. The bus architecture can be handled by a bi-directional reduction logic network. Employing a technology such as emitter-coupled logic (ECL) gives the interface reasonable transmission speed, and ECL hardware couples nicely with CMOS technology [44]. ECL switching speed is accomplished by keeping transistors always biased in their active regions. OR or NOR logic can be used to run buses in two directions as depicted in Fig. 21. Reduction logic can be accomplished directly at the processing element I/O points without processor intervention.

5) Search Algorithm: The algorithm for finding the network minimum time-stamp proceeds in two basic phases. The first phase consists of a general elimination which prunes processing elements having time-stamps larger than the base-two ceiling of the global minimum time-stamp. The second phase of the algorithm then finds the minimum among the remaining nodes.

6) Phase 1 Elimination: First, all network PEs find their local minimum values. This search involves comparing the lead


Fig. 21. The PE Interconnection Network: PEs can be interconnected in one or two dimensions. The interconnections consist of a high-speed emitter-coupled logic design. The buses link the PEs together, allowing a rapid and semi-parallel determination of the next smallest time in the network. The OR network assists in the computation of the smallest time-stamp and serves for both computation and signal driving. In addition, each processing element is directly connected to its north, south, east, and west neighbors.

elements of the service and arrival queues from Fig. 4. A hardware algorithm for maintaining the smallest event within a processing element is presented in Section VI-B.2. Next, each PE computes the difference between the current global simulation time cycle and its next local minimum time-stamp. Each PE then determines the number of bits required to express this difference. For example, 13 requires 4 bits, 1101. The PEs simultaneously pull low the signal line representing their bit count on the global parallel bus illustrated in Fig. 22. After all PEs have placed their values on the bus, the PEs whose value is greater than the bus minimum signal line eliminate themselves from the search. The smallest asserted signal line of the parallel bus narrows the scope of the search to the limited range of numbers expressed in

(23)

All elements not eliminated in this first phase are referred to as active elements in the second phase.
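The Phase 1 elimination can be mimicked in software as follows. The sketch models the parallel bus as a simple minimum over the asserted line indices, which is an abstraction of the wired reduction logic; `phase1_eliminate` and the sample time-stamps are hypothetical.

```python
def phase1_eliminate(local_min_times, current_time):
    """Return indices of PEs that stay active after Phase 1.

    Each PE computes delta = local_min - current_time and the number of bits
    needed to express it, then asserts the bus line with that index. The
    lowest asserted line wins; PEs on higher lines drop out.
    """
    bit_counts = [(t - current_time).bit_length() for t in local_min_times]
    lowest_line = min(bit_counts)          # stands in for the bus reduction
    return [i for i, b in enumerate(bit_counts) if b == lowest_line]


# Deltas are 13, 40, 9, 200 and 15, needing 4, 6, 4, 8 and 4 bits.
active = phase1_eliminate([113, 140, 109, 300, 115], current_time=100)
print(active)
```

Only the PEs whose difference fits in four bits remain, so the second phase has to separate values within a single power-of-two range.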

7) Phase 2 Selection: The second phase of the algorithm can proceed in either of two methods. Method one requires a 3-bit reduction network, and method two requires a 2-bit reduction network. The first method performs a binary search through the range of time-stamps isolated in Phase 1. The second method performs binary tournament-style eliminations among the remaining active nodes. The reduction network also provides the start and done lines for the main synchronizer under time-driven simulation as described by Fig. 19. Details are provided in [21].
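One way to picture the binary search of method one is a bit-serial scan from the most significant bit downward, using one reduction round per bit. The sketch below is an assumed variant written for illustration and is not the protocol of [21]; the function name and the sample values are hypothetical.

```python
def phase2_binary_search(deltas, active, n_bits):
    """Find the minimum delta among active PEs, one bit per bus round."""
    active = set(active)
    for bit in range(n_bits - 1, -1, -1):
        # Reduction line: asserted if any active PE holds a 0 in this bit.
        zero_seen = any(not (deltas[i] >> bit) & 1 for i in active)
        if zero_seen:
            # PEs holding a 1 in this bit cannot hold the minimum.
            active = {i for i in active if not (deltas[i] >> bit) & 1}
    winner = min(active)   # any survivor will do; ties share the same value
    return deltas[winner], sorted(active)


# PEs 0, 2 and 4 survived Phase 1 with four-bit deltas 13, 9 and 15.
result = phase2_binary_search([13, 40, 9, 200, 15], active=[0, 2, 4], n_bits=4)
print(result)
```

The number of rounds equals the bit width isolated in Phase 1, independent of how many PEs are still active.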

a) Cross Point Matrix: The simulator cross-point matrix communications structure of Fig. 9 is laid out in an approximately fully connected star topology. Each of the eight quadrants depicted in the three-dimensional (3-D) network layout of Fig. 23 contains 2-D arrays of processing elements. Each of these 2-D arrays is associated with a cross-point switch used to allow processing element communication. The cross-point network, although serial, allows more direct connections between the processing elements than the parallel reduction bus. For quadrant cubes with ten processing elements on an edge, each 2-D processing element subarray contains 100 processing elements. Using a 300-pin cross-point matrix, approximately one third of the lines will connect directly to the processing elements of the 2-D array. The other two thirds of the cross-point matrix

Fig. 22. Algorithm Phase 2 Method 2: Elements eliminated by the initial reduction step are illustrated inscribed with a cross. Signals flow through the eliminated processing elements. The data signals are shown traversing the upper bus. The lower two-signal bus represents the basic handshaking signals. The Edge signal indicates to each element whether or not that element is a network edge element. All elements which have not self-eliminated during the first phase generate an active Edge signal and propagate the signal toward the network core. The Adjacency signal is used to pair processing elements. Each active element which receives the Edge signal but not the corresponding Adjacency signal propagates its own Adjacency signal toward the direction of the network core. When either another active PE or the core receives the Adjacency signal, that element does not propagate the signal but instead compares its minimum local time-stamp with the time-stamp value received on the Data bus. The minimum value of the pair becomes the minimum value at the node closest to the core while the outer pair node is eliminated.

lines are used to connect the 2-D array to the rest of the 3-D network. There is a cross-point switch at the network core. Adjacent processing elements also connect directly to each other.

For traffic simulation, communications channels between processing elements are static. Roads and intersections remain fixed. The cross-point matrix virtual channels can be initialized when the reconfigurable logic is initialized, and similar to the reconfigurable logic, the virtual communications channels do not require “on-the-fly” reconfiguration. However, in order to allow the simulator to be used as a general purpose nondeterministic simulator, the rest of the section is devoted to a brief analysis of those simulations which do require varying communications channels.

The time required for communications using the cross-point matrix network can be analyzed by dividing the simulator processing time into the time spent processing vehicles and the communications time. Each simulation cycle is therefore composed of these two components, as illustrated in Fig. 24.

Looking first at the time required to process events in each processing element, using both the traffic scheduler and the pipeline [45] as our model, let the processing time be defined as in

num (24)

The limiting motion function of Table IX has a clock rate of 7.54 MHz, or 133 ns/cycle. This clock rate and the six-stage pipeline implementation of Fig. 15 serve as a conservative estimate of the number of cycles, with 25 vehicles to be processed. The result is the time required to process the vehicles moving on the road.

Next, the communications delay through the cross-point matrix is calculated. The simulator was implemented with 0.5-s time resolution, and the estimate counts the events, or vehicles, which have finished processing at the current processing element and need to be transferred to the next during one simulation cycle. This estimate assumes that each processing element handles one intersection and the roads


Fig. 23. The 3-Dimensional Network Structure: Although trees have a wonderfully logarithmic decreasing structure, they offer difficult geometric constraints for actual implementation. A linear parallel bus offers a much easier to implement structure, but poses more difficult adjacency problems. In the network illustrated, each parallel bus is composed of reduction logic as shown in Fig. 21. Much of the communications can be accomplished by the Processing Element (PE) I/O cells. The length of each bus is a trade-off between communications circuit element switching speed, bus signal propagation speed, and physical PE geometry constraints. In this figure, the PEs are arrayed along linear buses. Letting 10 elements reside on each bus, and 10 arrays of 100 PEs per quadrant, allows each network to contain 8000 elements. The core may be composed of more than one processor, but for the purposes of this paper, the core is assumed to be one unit.

Fig. 24. Processing and Communications Time: Each simulation time cycle is divided into processing-time and communications-time sub-components. The processing element cycles through the Main and Entry queues, updating the position, velocity, and acceleration information of each vehicle data structure during the processing time. Vehicles that must be transferred to the next node are moved during the communications time. User-directed system interrupts would also occur during this latter phase, as well as system synchronization.

which exit from it. In the traffic simulation, these events represent vehicles which have come to the end of a road and are now entering the next intersection.
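For orientation, the processing-time estimate can be reproduced approximately with textbook pipeline timing. The sketch below assumes that a pipeline of k stages finishes n items in k + n - 1 cycles (one fill, then one result per cycle); this formula is an assumption made here for illustration and may differ from the exact expression of (24).

```python
CLOCK_MHZ = 7.54          # limiting motion function, Table IX
PIPELINE_STAGES = 6       # six-stage pipeline of Fig. 15
VEHICLES = 25             # vehicles processed in one simulation cycle

cycle_ns = 1000.0 / CLOCK_MHZ                   # about 133 ns per cycle
cycles = PIPELINE_STAGES + VEHICLES - 1         # assumed textbook pipeline timing
t_proc_us = cycles * cycle_ns / 1000.0

print(f"cycle time: {cycle_ns:.0f} ns")
print(f"processing time for {VEHICLES} vehicles: {t_proc_us:.2f} us")
```

Under this assumption the processing time is a few microseconds per simulation cycle, far below the 0.5-s time resolution of the simulation.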

From [46], the signal propagation delay can be estimated as approximately 0.05 ns/cm. The worst-case communications scenario involves passage through three cross-point switches. The first half of the route is illustrated in Fig. 23; the second half of the route travels outward from the core to a different cube corner. The first cross-point switch in the worst-case scenario is connected to the sending processing element’s array. This first switch is located at approximately the tip of the second arrow displayed in Fig. 23. The second switch resides at the network core, where the sphere is located in Fig. 23, and the third connects the receiving array to the network. Each processing element is a 10-cm cube. The worst-case distance across a network composed of eight 1000-PE quadrants is 600 cm. Through that distance, the propagation delay is 30 ns. The vehicle data messages are relatively long compared to the gate value results of [12], so a conservative estimate is used for the delay in message transmission time.
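The geometry behind the 600-cm and 30-ns figures can be checked with a few lines of arithmetic. The sketch assumes the eight quadrants form a 2 x 2 x 2 cube and that the worst-case route runs along the three axes from one corner to the core and back out to another corner; both are assumptions made here to reproduce the quoted distance.

```python
PE_EDGE_CM = 10          # each processing element is a 10-cm cube
PES_PER_EDGE = 10        # ten PEs along each quadrant edge
QUADRANTS = 8            # assumed arrangement: 2 x 2 x 2 quadrant cubes

quadrant_edge_cm = PE_EDGE_CM * PES_PER_EDGE          # 100 cm
total_pes = QUADRANTS * PES_PER_EDGE ** 3             # 8000 PEs
# Assumed route: along three axes from a corner to the core, then out again.
corner_to_core_cm = 3 * quadrant_edge_cm              # 300 cm
worst_case_cm = 2 * corner_to_core_cm                 # 600 cm
delay_per_cm_ns = 30.0 / worst_case_cm                # implied by 30 ns over 600 cm

print(total_pes, worst_case_cm, delay_per_cm_ns)
```

The implied delay of 0.05 ns/cm is small next to the switching and message transmission terms, which is why the channel setup costs receive most of the attention in the communications estimate.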

To communicate across the cross-point matrix, a point-to-point channel is negotiated between the two processing elements. First, a channel request & grant time is required to establish the circuit. Once the transfer is complete, a channel release time is used to free the circuit. Finally, if the circuit is unavailable, a penalty of time wasted in processing a blocked request is incurred. The calculation of the communications time therefore also depends on the number of events which encounter a busy channel. Assume, on average, that the messages transmitted go half of the worst-case distance, or through 1.5 switching matrix hops. The formula for the transmission time is as follows:

(25)

Equation (25) assumes that the average communication requires 1.5 network hops, which can result in 1.5 possible call blocks. The parameter values used to compute the communications delay are based on the values from [12]. Using these values, the communications time computes to 1.2 µs. For this example, although the communications time is smaller than the processing time, the values are close enough to indicate that the implementation of the communications system is an important consideration in the machine design.

VII. RESULTS

Discrete event simulation acceleration is both needed and feasible. Examples of articles in the press [47] explicitly describe the requirement for access to accelerated means of simulation. This experimentation shows that by applying various architectural techniques, discrete event simulation can be significantly accelerated.

Section III-B.1, using a representative software simulation model, shows that the typical bottleneck areas of simulation processing, identified in Fig. 5, involve the scheduler routines. Overhead routines, as expected, must also be minimized. Because CORSIM was neither modular nor current, Trafix, a software simulator written modularly in C++, was used to verify the correctness of the car-following algorithms. The Trafix scheduler routine timing results are found in Table VIII. Within the scheduler software, for the traffic simulation example, the software bottleneck location is further refined and identified as the intersection movement routine, which requires 48.4 µs to process each vehicle.

Section IV reviews analysis which can be used to determine whether a simulation will proceed faster in time-driven or event-driven mode. Equation (10) can be used to determine the expected time of the next event. Knowing that interval time, the mode of operation which most rapidly advances a simulation can be used.

The interior design of the processing element architecture is divided into the event generation, event queue, and scheduler components. Each subsection of the processing element architecture is individually explored. The event generator design is presented in Section VI-B.1. The results of Section VI-B.1a)


yield an implementation which can generate an event every 200 ns. Events are produced rapidly enough that this section of the design is not a bottleneck to throughput. An event queue was implemented as a linear array in Section VI-B.2b). The event queue handles 16-bit words which can be used to point to the address of data interleaved in memory, or can be expanded to larger sized words by working queues in parallel. The event queue implementation is capable of working against an 80-ns cycle time, both pushing and popping elements off in each cycle. The logic implementation required for both the event generator and the service queue can be found in Table V. Again, the speed of the event queue results shows that the event queue is not a throughput bottleneck. The event scheduler results for the traffic simulator model can be found in Section VI-B.3a). The scheduler section modeled the scheduling algorithm in five components. The first component initialized a vehicle data object entering the network with its source and randomly selected destination. The next four components consisted of a pair to handle vehicles on a road and a pair to handle vehicles traveling through an intersection. Both the road and intersection initialization implementation models set attributes before injecting vehicles either onto a road or into an intersection. Section VI-B.3a) found the speedup of the software bottleneck in accelerated hardware to be a factor of 91. As expected, the scheduler component is the bottleneck in the processing element design detailed in Fig. 18. The overall design achieved an order of magnitude speedup over its software counterpart.
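The bottleneck claim can be cross-checked by comparing the per-item times quoted for each component. The sketch treats every component as handling one item per interval shown, which is a simplification made here; the scheduler interval is the four cycles at 7.54 MHz discussed earlier.

```python
# Time per item, in nanoseconds, for each processing element component.
stage_time_ns = {
    "event generator": 200.0,            # one event every 200 ns
    "event queue": 80.0,                 # push and pop within an 80-ns cycle
    "scheduler": 4 * 1000.0 / 7.54,      # four cycles at 7.54 MHz
}

bottleneck = max(stage_time_ns, key=stage_time_ns.get)
for name, t in stage_time_ns.items():
    print(f"{name}: {t:.0f} ns per item, {1000.0 / t:.2f} million items/s")
print("bottleneck:", bottleneck)
```

At roughly 530 ns per vehicle, the scheduler is the slowest of the three components, consistent with the conclusion drawn above.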

Although very implementation dependent, the scheduler can be further accelerated by focusing attention on division in reconfigurable logic, which currently causes the largest pipeline stage delays. Perhaps if the reconfigurable logic had some ASIC functional units embedded within the chip, faster designs would be implementable. An approach similar to the digital signal processors, which often include special functional units, may be worthwhile.

ACKNOWLEDGMENT

The authors would like to thank the Altera Corporation for the licenses, software, and support.

REFERENCES

[1] C. S. Chang, S. L. Ho, T. T. Chan, and K. K. Lee, “Fast AC train emergency rescheduling using an event driven approach,” Inst. Elect. Eng. Proc.-B, vol. 144, no. 4, pp. 281–288, July 1993.

[2] J. Kohn, R. Malm, C. Meiley, and F. Nemec, “The IBM Los Gatos logic simulation software,” Proc. IEEE Int. Conf. Computer Design: VLSI in Computers, pp. 588–591, 1983.

[3] R. Sandroff, “New jump start for hearts?,” Consumer Rep., vol. 66, no. 2, p. 8, Feb. 2001.

[4] A. W. VanAusdal, “Use of the Boeing computer simulator for logic design confirmation and failure diagnostics programs,” in Proc. Advances in the Astronautical Sciences 17th Annu. Meet., vol. 29, June 1971, pp. 573–594.

[5] R. M. Fujimoto, “Parallel discrete event simulation,” in Communications of the ACM, vol. 33, Oct. 1990, pp. 30–53.

[6] M. Abramovici, Y. H. Levendel, and P. R. Menon, “A logical simulation machine,” IEEE Trans. Comput.-Aided Design Integrated Circuits Syst., vol. CAD-2, pp. 82–94, Apr. 1983.

[7] P. Agrawal, W. J. Dally, W. C. Fischer, et al., “Mars: A multiprocessor-based programmable accelerator,” IEEE Design Test Comput., pp. 28–35, Oct. 1987.

[8] R. Barto and S. A. Szygenda, “A computer architecture for digital logic simulation,” Electron. Eng., vol. 52, no. 642, pp. 35–66, Sept. 1985.

[9] J. Bauer, M. Bershteyn, I. Kaplan, and P. Vyedin, “A reconfigurable logic machine for fast event-driven simulation,” in Proc. 1998 35th Design Automation Conf., 1998, pp. 668–671.

[10] T. Burggraff, A. Love, R. Malm, and A. Rudy, “The IBM Los Gatos logic simulation machine hardware,” in Proc. IEEE Int. Conf. Computer Design: VLSI in Computers, 1983, pp. 584–587.

[11] N. Koike, K. Ohmori, and T. Sasaki, “HAL: A high-speed logic simula-tion machine,”IEEE Design Test Comput., vol. 2, pp. 61–73, Oct. 1985.

[12] Y. H. Levendel, P. R. Menon, and S. H. Patel, “Special-purpose computerfor logic simulation using distributed processing,”Bell Syst. Tech. J., vol.61, no. 10, pp. 2873–2909, Dec. 1982.

[13] G. F. Pfister, “The IBM Yorktown simulation engine,”Proc. the IEEE,vol. 74, pp. 850–860, June 1986.

[14] S. Takasaki, N. Nomizu, Y. Hirabayashi, H. Ishikura, M. Kurashita, N.Koike, and T. Nakata, “HAL iii: Function level hardware logic simula-tion system,” inProc.—IEEE Int. Conf. Computer Design: VLSI in Com-puters and Processors Proc. IEEE Int. Conf. Computer Design: VLSIComputers and Processors—ICCD, Sept. 1990, pp. 167–170.

[15] M. Tomita, N. Suganuma, and K. Hirano, “Reconfigurable machineand its application to logic simulation,”IEICE Trans. Fund. Electron.Commun. Comput. Sci., vol. E76-A, no. 10, pp. 1705–1712, Oct. 1993.

[16] J. D. Myjak, “A massively parallel microscopic traffic simulation modelwith fuzzy logic,” M.S. thesis, Massachusetts Inst. Technol., Sept. 1993.

[17] A. T. Chronopoulos and C. M. Johnson, “A real-time traffic simulationsystem,”IEEE Trans. Veh. Technol., vol. 47, pp. 321–331, Feb. 1998.

[18] C. M. Johnson and A. T. Chronopoulos, “A communications latencyhiding parallelization of a traffic flow simulation,” inproc. 13th Int. Par-allel Processing Symp. 10th Symp. Parallel and Distributed Processing,Apr. 1999, pp. 688–695.

[19] M. Bumble and L. Coraor, “Architecture for a nondeterministic simulation machine,” in Proc. 1998 Winter Simulation Conf., vol. 2, Dec. 1998, pp. 1599–1606.

[20] M. Bumble and L. Coraor, “Implementing parallelism in random discrete event-driven simulation,” in Lecture Notes Comput. Sci. 1388, Parallel and Distributed Processing: IEEE Comput. Soc., Mar. 1998, pp. 418–427.

[21] M. Bumble and L. Coraor, “A global synchronization network for a nondeterministic simulation architecture,” in Proc. 1999 Winter Simulation Conf., Dec. 1999.

[22] M. Bumble, “A parallel architecture for nondeterministic discrete event simulation,” (in http://etda.libraries.psu.edu/theses/available/etd-0311101-115158/), Ph.D. dissertation, The Pennsylvania State University, University Park, PA, May 2001.

[23] P. A. Ioannou, C. C. Chen, and J. Hauser, “Autonomous intelligent cruise control,” in Proc. IVHS America 1992 Annu. Meet., vol. 1, May 17–20, 1992, pp. 97–112.

[24] A. Kanaris, P. Ioannou, and F.-S. Ho, “Spacing and capacity evaluations for different AHS concepts,” in Proc. American Control Conf., vol. 3, June 1997, pp. 2036–2040.

[25] P. G. Michalopoulos, P. Yi, and A. S. Lyrintzis, “Development of an improved high-order continuum traffic flow model,” Transportation Research Record, vol. 1365, pp. 125–132, 1993.

[26] S. O. Simonsson, “Car-following as a tool in road traffic simulation,” in Proc. IEEE–IEE Vehicle Navigation and Information Systems Conf. (VNIS), Ottawa, 1993, pp. 150–153.

[27] D. C. Gazis, R. Herman, and R. B. Potts, “Car-following theory of steady-state traffic flow,” Operations Res., vol. 7, no. 4, pp. 499–505, 1959.

[28] T. Junchaya and G. Chang, “Exploring real-time traffic simulation with massively parallel computing architecture,” Transport. Res. Comm., vol. 1, no. 1, pp. 57–76, 1993.

[29] A. D. May, Jr. and H. E. M. Keller, “Non-integer car-following models,” Highway Res. Board, vol. 199, pp. 19–32, 1967.

[30] A. D. May, Traffic Flow Fundamentals. Englewood Cliffs, NJ: Prentice-Hall, 1990, ISBN 0-13-926072-2.

[31] Q. Yang, “A microscopic traffic simulation model for IVHS applications,” M.S. thesis, Massachusetts Inst. Technol., Depart. Civil Environmental, Aug. 1993.

[32] J. Walrand, Communication Networks: A First Course. Aksen Assoc., 1991.

[33] J. Clark and G. Daigle, “The importance of simulation techniques in ITS research and analysis,” in Winter Simulation Conf., S. Andradottir, K. J. Healy, D. H. Withers, and B. L. Nelson, Eds. Piscataway, NJ: IEEE, 1997, pp. 1236–1243.

[34] G. Daigle, M. Thomas, and M. Vasudevan, “Field applications of CORSIM: I-40 freeway design evaluation,” in Winter Simulation Conference Proceedings, D. J. Medeiros, E. F. Watson, J. S. Carson, and M. S. Manivannan, Eds. Piscataway, NJ: IEEE, 1998, vol. 2, pp. 1161–1167.

[35] M. Bumble, L. Coraor, and L. Elefteriadou, “Exploring CORSIM run-time characteristics: Profiling a traffic simulator,” in Proc. 33rd Annu. Simulation Symp. 2000, Apr. 2000, pp. 139–146.

[36] E. S. Raymond, “The cathedral and the bazaar,” in 1997 Linux-Kongress, Atlanta Linux Showcase, 1997.

[37] R. E. Felderman and L. Kleinrock, “An upper bound on the improvement of asynchronous versus synchronous distributed processing,” in Simulation Series SCS Multiconf. Distributed Simulation, vol. 22, Jan. 1990, pp. 131–136.

[38] R. L. Scheaffer, “Introduction to probability and its applications,” in The Duxbury Advanced Series in Statistics and Decision Sciences. Boston, MA: PWS-K, 1990.

[39] H. T. Kung, “Why systolic architectures,” IEEE Comput., vol. 15, pp. 37–46, Jan. 1982.

[40] R. W. Hartenstein, J. Becker, R. Kress, and H. Reinig, “High-performance computing using a reconfigurable accelerator,” Concurrency: Practice and Experience, vol. 8, no. 6, pp. 429–443, July–Aug. 1996.

[41] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. San Mateo, CA: Morgan Kaufmann, 1992.

[42] M. Gokhale, W. Holmes, A. Kopser, S. Lucas, R. Minnich, D. Sweely, and D. Lopresti, “Building and using a highly parallel programmable logic array,” Comput., vol. 24, no. 1, pp. 81–89, Jan. 1991.

[43] U. Legedza and W. E. Weihl, “Reducing synchronization overhead in parallel simulation,” in Proc. 10th Workshop Parallel and Distributed Simulation (PADS ’96), May 1996, pp. 86–95, SCS, San Diego, CA.

[44] B. A. Chappell, T. I. Chappell, S. E. Schuster, H. M. Segmuller, J. W. Allan, R. L. Franch, and P. J. Restle, “Fast CMOS ECL receivers with 100-mV worst-case sensitivity,” IEEE J. Solid-State Circuits, vol. 23, pp. 59–67, Feb. 1988.

[45] M. M. Mano, Computer System Architecture, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[46] A. Clements, Microprocessor Systems Design: 68000 Hardware, Software, and Interface, 3rd ed. Boston, MA: PWS, 1997.

[47] B. Schecter, “Putting a Darwinian spin on the diesel engine,” The New York Times, p. D3, Sept. 19, 2000.

Marc Bumble received the Ph.D. degree from the Computer Science and Engineering Department, Pennsylvania State University, University Park, PA, and the M.S. and B.S. degrees in electrical engineering from the University of Pennsylvania, Philadelphia, in 1988 and 1993, respectively.

His current research investigates architectures for accelerating nondeterministic simulation, including the application of reconfigurable logic. Recent projects include the design and implementation of Air Traffic Control (ATC) software for the Enhanced Traffic Management System (ETMS), critiques of ETMS software from a safety and reliability standpoint, and digital signal processing of ATC station headset audio channels. His work applies open source architectures and algorithms toward solving transportation and industrial robotics problems and applications.

Lee D. Coraor received the B.S. degree in electrical engineering from Pennsylvania State University, University Park, PA, in 1974 and the Ph.D. degree in electrical engineering from the University of Iowa, Iowa City, in 1978.

He was a member of the faculty at Southern Illinois University-Carbondale, IL, from August 1978 to August 1980 and then joined the Pennsylvania State faculty, Department of Electrical Engineering. He is currently an Associate Professor in the Department of Computer Science and Engineering at Pennsylvania State University. His current research interests include reconfigurable computing applications, intelligent memory designs, computer architecture, and digital systems. Recent projects have included the use of reconfigurable FPGAs for implementing event-driven simulation, implementation of special purpose hardware/software on aircraft for real-time collision avoidance, the design and implementation of a dual computer control system for a microgravity continuous flow electrophoresis system, and the development of the SmartDIMM Platform for use as a reconfigurable System-On-Chip prototype.