Forum on Specification and Design Languages 2008
978-1-4244-2265-4/08/$25.00 © 2008 IEEE

[IEEE Design Languages (FDL) - Stuttgart (2008.09.23-2008.09.25)] 2008 Forum on Specification, Verification and Design Languages - Formal transformation of a KPN specification to a

Formal transformation of a KPN specification to a GALS implementation

Syed Suhaib
Signal Electronics and Embedded Systems Lab
General Electric Research Center
Albany, NY

Bijoy A. Jose and Sandeep K. Shukla
FERMAT Lab
Virginia Polytechnic Institute and State University
Blacksburg, VA

Deepak A. Mathaikutty
Microarchitecture Research Lab
Intel Corporation
Santa Clara, CA

Abstract

Kahn process networks (KPNs) provide a model of computation for streaming audio, video, and other multimedia applications. However, the KPN model assumes unbounded FIFOs between the communicating processes, which must be realized by other means in an implementation. Applying a design transformation to a KPN-style specification to obtain a globally asynchronous locally synchronous (GALS) implementation is one way of achieving this. Furthermore, this transformation must preserve the Kahn principle. In this paper, our main contribution is one such refinement-based design transformation that preserves the Kahn principle. We present a correctness-preserving transformation towards a lookup-based architecture in which communication between processes is facilitated by a shared on-chip lookup storage structure. The refinement methodology is generic, and various alternate schemes of GALS implementation can be derived from it.

1 Introduction

Globally asynchronous locally synchronous (GALS) designs are gaining importance due to the failure of the synchrony assumption¹ in large synchronous single-clock designs. This failure is caused by ever-increasing clock frequencies, which make the signal propagation time between different components longer than the clock period [3]. In a GALS design, the synchronous components run on their own independent clocks, and the communication between them occurs asynchronously. The synchrony assumption holds within each synchronous component. However, there are other challenges facing the design of GALS systems. There is a lack of tools and design methodologies to facilitate GALS designs. In most cases, GALS designs are constructed using ad hoc methods, where synchronous components are encapsulated with some wrapper logic and communication is either handshake-driven [8] or uses bounded FIFOs [10]. Improper handling of synchronization between these components may result in design errors such as unwanted deadlocks [7]. Furthermore, these ad hoc approaches are not easily subjected to formal reasoning about the correctness of a design. Hence, we need to identify the basic ingredients for a successful GALS design methodology.

¹ Computation and communication occur within a single clock cycle.

Most of the design languages used in industry, such as SystemC and SpecC, support synchronous specifications and use discrete event as the Model of Computation (MoC) for simulation. A GALS design can be modeled in these languages using the discrete-event MoC; however, doing so is not very natural. Furthermore, synchronization errors may be introduced by such ad hoc approaches, and the lack of validation tools for GALS makes it difficult to ensure correctness. We believe that, depending on the specific application domain, GALS design can be made easier by using alternative specification formalisms. For example, a GALS-based hardware implementation of multimedia processors can be facilitated by starting from a KPN-based model. We therefore use KPN as the model for capturing the correct behaviors of such applications, and propose refinement schemes for GALS architectures.

For designing GALS architectures, various communication protocols can be used. Some of the ideas have been borrowed from the existing literature, including signaling protocols [6, 1], latency-insensitive protocols [4, 13], and dataflow architectures [6, 2]. For example, in a handshake-based architecture, communication between processes is based on the well-known four-phase handshake protocol, whereas in the FIFO-based architecture, the communication protocol borrows ideas from latency-insensitive protocols, with a FIFO interface connecting processes that run on different clocks. The controller-based architecture consists of a centralized controller that governs the execution of the processes, similar to the dynamic tagged-token dataflow architecture [1]. The lookup-based architecture involves communication based on fast data accesses to a lookup storage located on-chip [14]. However, we need to ensure the correctness of these GALS architectures.

Design Methodology: The design of a GALS architecture can be summarized as follows (Figure 1). Given the description of a system, the first step is to identify the behaviors of the system as a collection of concurrent processes communicating asynchronously under the Kahn Process Network (KPN) MoC. A KPN consists of processes that communicate via point-to-point unbounded FIFO channels. The nodes represent the computation aspect of the model, whereas the channels connecting these nodes represent their communication. In the KPN models that we consider, a process executes when data is available on all FIFOs at its inputs, and it writes to all FIFOs at its outputs. To ensure the correctness of the KPN model, the specification can be validated via functional simulation. The KPN model is then transformed into the target GALS design, and refinements to the GALS design are made using a components library. Each step of the design methodology is described formally, and a proof of correctness is provided.
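The firing rule described above (execute only when every input FIFO holds data, then write to every output FIFO) can be illustrated with a small functional simulation. The following is a minimal sketch and not the authors' tooling; the names Channel, Process, and run, as well as the two example processes, are hypothetical and chosen only for illustration.

```python
from collections import deque


class Channel:
    """Point-to-point FIFO; unbounded, as in the KPN model."""

    def __init__(self):
        self.q = deque()


class Process:
    """A node that fires only when every input FIFO holds a token."""

    def __init__(self, func, inputs, outputs):
        self.func = func        # maps input values to a tuple of output values
        self.inputs = inputs
        self.outputs = outputs

    def can_fire(self):
        return all(ch.q for ch in self.inputs)

    def fire(self):
        # Blocking read: one token is consumed from every input.
        args = [ch.q.popleft() for ch in self.inputs]
        results = self.func(*args)
        # Non-blocking write: the unbounded FIFOs always accept data.
        for ch, value in zip(self.outputs, results):
            ch.q.append(value)


def run(processes):
    """Fire enabled processes until no process can fire."""
    progress = True
    while progress:
        progress = False
        for p in processes:
            if p.can_fire():
                p.fire()
                progress = True


# Example network: x -> double -> y -> add_one -> z
x, y, z = Channel(), Channel(), Channel()
x.q.extend([1, 2, 3])
double = Process(lambda v: (2 * v,), [x], [y])
add_one = Process(lambda v: (v + 1,), [y], [z])
run([add_one, double])      # the scheduling order does not change the result
result = list(z.q)
```

Because each process is a deterministic function of its input streams, the contents of z do not depend on the order in which the scheduler visits the processes, which is the property the refinement has to preserve.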

[Figure 1. Design Methodology for GALS. Blocks: Design Document (description, properties, constraints, etc.), KPN Model, Simulation-based Validation, Refinement, Components Library, Refinement to GALS.]

Main contributions: The main contributions of this work are as follows:

• Design transformation: We provide a correct-by-construction design methodology for creating a GALS design by refining a design specified as a KPN. Our refinement transforms each process with a blocking read and non-blocking write into a process with a blocking read and blocking write. Communication between the refined processes is facilitated asynchronously through a shared on-chip storage location.

• Preservation of Kahn principle: We show that our refinements preserve the Kahn properties, i.e., each refined process is deterministic, continuous, and monotonic. Instead of providing a proof similar to the one presented in [9] for the Kahn properties, we refer to deterministic I/O automata, which have been shown to preserve the Kahn properties [12]. In this paper, we show how our architecture can be mapped to equivalent deterministic I/O automata.

• Correctness preserving refinement: We show that our refined GALS model is correct with respect to its corresponding KPN model. Our notion of correctness is latency equivalence². We show that a KPN model and its corresponding GALS model in our architecture are latency equivalent. We first show that each process and its refinement are latency equivalent, and then show that the composition of processes preserves latency equivalence.

2 Architectural Description

The transformation from a KPN model to a GALS model involves: (i) replacing the unbounded storage elements (infinite FIFOs) with bounded storage elements, and (ii) enforcing blocking read and blocking write conditions on each process. In our architecture, we use an on-chip lookup storage unit (LUS) to store the data. The LUS is a bounded storage structure that enables communication between different processes. Each process reads from and writes to the LUS. Blocking read and blocking write conditions are enforced for each process by composing the process with a storage mapping unit (SMU).

2.1 On-chip Lookup Storage (LUS)

The data communicated between the processes is stored in the LUS. The LUS is split into segments, where each segment represents a communication connection between two processes in the KPN model. The size of each segment depends on how many data elements can be stored before the consumer process starts processing the data. For n point-to-point connections in the KPN model, there are n segments in the LUS. Each segment i has an associated bound sz_i, which denotes the maximum number of elements that can be stored between the two processes of connection i. Let SG represent the set of all segments of the storage, and let addr_k ∈ N denote the address of segment k. Each segment k is associated with an index idx_k and a static bound bnd_k, where bnd_k ∈ N. The index points to the location of the data in the segment, and the bound represents the maximum number of data elements that can be stored in the segment (i.e., the maximum value of the index).

In this paper, we do not focus on computing the optimal size for the storage; existing work such as [5] addresses the buffer optimization problem between processes.

We now look at how the data is organized in the LUS. We assume that the most significant bit (MSB) of each stored word is the present bit. If the present bit is set to 1, the data is valid; otherwise it is invalid. For example, if the data size is 31 bits, a 32-bit word is stored at each address location in the LUS, and its MSB is the present bit.

² We define latency equivalence later in the paper.

LUS behavior: Read and write access to the LUS is a major constraint in the multiprocess model. A producer-consumer model is ideal for understanding the LUS behavior. In an asynchronous environment, if a producer and a consumer try to access the same segment in the LUS, access has to be granted in a manner that protects data integrity. Because the LUS is divided into segments, one per point-to-point connection, each segment has its own read controller and write controller.

LUS Read Controller: Read access to the LUS is granted as shown in Figure 2. The read controller consists of four states: Idle, PRead, CRead, and PCRead. The Idle state denotes that there are no pending read requests. The PRead state denotes that the producer’s read request is being processed, and the CRead state denotes that the consumer’s read request is being processed. The PCRead state denotes that both the producer and the consumer are requesting a read operation. A transition into one of these states denotes that a request has been received. Read requests from the producer and the consumer are denoted by ProReq and ConReq respectively. A transition out of one of these states denotes that the read operation has been completed and the appropriate data has been sent back to the respective process. Completion of the operation for the producer and the consumer of the segment is denoted by /SendPro and /SendCon respectively. Simultaneous read requests from the producer and the consumer can be handled together by granting both processes access to the data in the same segment.
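The read controller can be sketched as a small state machine. The sketch below is an interpretation of the states and transition labels named in the text and in Figure 2, not the authors' implementation: the function read_step and the enum R are hypothetical names, and the choice to ignore a repeated request from the party already being served is an assumption.

```python
from enum import Enum


class R(Enum):
    IDLE = "Idle"
    PREAD = "PRead"
    CREAD = "CRead"
    PCREAD = "PCRead"


def read_step(state, pro_req=False, con_req=False):
    """Return (next_state, outputs) for one step of the read controller.

    outputs is the set of completion signals emitted on this step.
    """
    if state is R.IDLE:
        if pro_req and con_req:
            return R.PCREAD, set()          # ProReq+ConReq
        if pro_req:
            return R.PREAD, set()           # ProReq
        if con_req:
            return R.CREAD, set()           # ConReq
        return R.IDLE, set()
    if state is R.PREAD:
        # The producer's read completes; a consumer request arriving now
        # moves the controller directly to CRead (ConReq+/SendPro).
        return (R.CREAD if con_req else R.IDLE), {"SendPro"}
    if state is R.CREAD:
        # Symmetric case (ProReq+/SendCon).
        return (R.PREAD if pro_req else R.IDLE), {"SendCon"}
    # PCRead: both reads are served together.
    return R.IDLE, {"SendPro", "SendCon"}


# Simultaneous requests are granted together.
s1, out1 = read_step(R.IDLE, pro_req=True, con_req=True)
s2, out2 = read_step(s1)

# A consumer request that arrives while the producer is being served.
s3, _ = read_step(R.IDLE, pro_req=True)
s4, out4 = read_step(s3, con_req=True)
s5, out5 = read_step(s4)
```

Reads do not modify the segment, which is why the two requests can share the PCRead state instead of being serialized.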

[Figure 2. LUS Read Controller Behavior. States: Idle, PRead, CRead, PCRead. Transition labels: ProReq, ConReq, ProReq+ConReq, /SendPro, /SendCon, /SendPro+/SendCon, ConReq+/SendPro, ProReq+/SendCon.]

LUS Write Controller: Write access to the LUS has to be handled with greater care than read access. We need to ensure that no read can happen to the same address in a segment during a write operation. Hence, write requests are atomic. Figure 3 shows how the LUS handles write requests. For the write operation, the controller consists of three states: Idle, WrPro, and WrCon. The transitions ProReq and ConReq represent write requests from the producer and the consumer. A write request is accompanied by the address and the data, and the LUS writes the data to the corresponding address. WrPro and WrCon denote that the write operation is being performed for the producer and the consumer respectively. Once the write is complete, acknowledgement signals (AckPro and AckCon) are sent to the processes to denote completion of the write requests.

[Figure 3. LUS Write Controller Behavior. States: Idle, WrPro, WrCon. Transition labels: ProReq, ConReq, AckPro, AckCon.]
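The atomicity of writes can be illustrated with a controller that serves one write at a time. This is a hedged sketch based on the three states in Figure 3; the class WriteController, its methods, and the policy of rejecting a request while a write is in progress are assumptions made for the example, since the paper does not specify how a busy controller responds.

```python
class WriteController:
    """Serves one write at a time for a single LUS segment."""

    def __init__(self, size):
        self.mem = [0] * size      # storage of the segment
        self.state = "Idle"        # "Idle", "WrPro", or "WrCon"
        self.pending = None        # (addr, data) of the write in progress

    def request(self, who, addr, data):
        """Submit a write request; who is "Pro" or "Con".

        Returns True if the request was accepted, False if the
        controller is busy with another write (assumed policy).
        """
        if who not in ("Pro", "Con"):
            raise ValueError("who must be 'Pro' or 'Con'")
        if self.state != "Idle":
            return False
        self.state = "Wr" + who
        self.pending = (addr, data)
        return True

    def complete(self):
        """Finish the write in progress and return the acknowledgement name."""
        if self.state == "Idle":
            return None
        addr, data = self.pending
        self.mem[addr] = data
        ack = "Ack" + self.state[2:]    # "AckPro" or "AckCon"
        self.state, self.pending = "Idle", None
        return ack


wc = WriteController(size=4)
accepted_pro = wc.request("Pro", 0, 7)
accepted_con = wc.request("Con", 0, 9)    # arrives while WrPro is active
ack_first = wc.complete()
accepted_retry = wc.request("Con", 0, 9)
ack_second = wc.complete()
```

Serializing the two writers in this way means a location is never observed half-written, at the cost of making the second requester wait for the acknowledgement of the first.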

2.2 Process Refinement

The refinement involves composing each process p ∈ P of the KPN model with a storage mapping unit (SMU) that governs the execution of the process. The refinement yields P′, where each p′ ∈ P′ is a refinement of its corresponding p ∈ P. The refinement step is shown in Figure 4.

[Figure 4. Process Refinement. A process p with ports a1, a2, ..., an and b1, b2, ..., bn is refined into p′, in which p is composed with an SMU that is connected to the clock, control signals, and data signals.]

The SMU holds the addresses of the input and output data for its corresponding process. It contains local storage elements for the data of the inputs and outputs of the process; their number is based on the number of input and output fields. We assume that all storage is initially empty. The SMU can also extract the MSB of the data retrieved, which can be implemented as a simple function. The SMU maps the address of each input/output to the correct lookup storage location. Table 1 defines some of the basic operations for accessing the data. The SMU provides data to its corresponding process. Each input/output signal of the process has an associated segment in the LUS. Therefore, for each signal, there is a control signal and a data signal connecting to the appropriate segment. Next, we define the storage structure of the SMU.

Table 1. Lookup Operations

Function Name          Description
rdloc(LS, addr, idx)   Returns the data at address addr and index idx.
wrloc(addr, d, idx)    Writes the data d at address addr and index idx.
rdprs(d)               Reads the data d and returns its present bit.
wrprs(d, t)            Writes the present bit t to the data d.

Data Structure: The storage structure contains fields for the inputs and outputs. Each input and output field is divided into two parts: address and bound. The address part points to the location of the input/output in the LUS. Initially, the address part of each field contains its starting address. The bound part represents the maximum number of valid data values that can be stored, starting from the initial address location, at any given time.
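The SMU fields and the lookup operations of Table 1 can be sketched together as follows. This is only an illustration under stated assumptions: the LUS is modeled as a flat Python list of 32-bit words, the class Field and the constants WORD_BITS and PRESENT are hypothetical, and wrloc takes the storage as an explicit first argument (Table 1 lists it without one).

```python
from dataclasses import dataclass

WORD_BITS = 32                      # 31 data bits plus the present bit
PRESENT = 1 << (WORD_BITS - 1)      # mask selecting the MSB


@dataclass
class Field:
    """One SMU field describing the segment used by an input or output."""
    addr: int       # starting address of the segment in the LUS
    bound: int      # maximum number of valid values held at a time
    idx: int = 0    # current offset inside the segment


def rdloc(ls, addr, idx):
    """Return the word at offset idx of the segment starting at addr."""
    return ls[addr + idx]


def wrloc(ls, addr, d, idx):
    """Write word d at offset idx of the segment starting at addr."""
    ls[addr + idx] = d


def rdprs(d):
    """Return the present bit (MSB) of word d."""
    return (d >> (WORD_BITS - 1)) & 1


def wrprs(d, t):
    """Return word d with its present bit set to t."""
    return (d & ~PRESENT) | (PRESENT if t else 0)


# A storage of 8 words; the output field uses the segment at addresses 4..7.
lus = [0] * 8
out_field = Field(addr=4, bound=4)
wrloc(lus, out_field.addr, wrprs(42, 1), out_field.idx)
word = rdloc(lus, out_field.addr, out_field.idx)
```

Here rdprs(word) yields 1 and wrprs(word, 0) recovers the 31-bit payload 42, matching the word layout assumed for the LUS.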

Functionality of the refined process: The behavior of the refined process p′ is shown in Figure 5.

[Figure 5. Process Behavior. States: Idle, ReqData, Execute, WrData. Transition labels: Clock, ExeGnt, ExeDen.]

The process can be in one of four states: Idle, ReqData, WrData, and Execute. In the Idle state, p′ waits for the arrival of its clock. On arrival of the clock, the state of p′ changes to ReqData. In this state, the following happens: (i) The SMU sends a request to the LUS to retrieve the data at the addresses of its inputs and outputs. (ii) The SMU extracts the present bit of the data retrieved. (iii) If the present bits of the data at all inputs are ‘1’ and those of all outputs are ‘0’, then execution is granted (ExeGnt) and the transition to the Execute state is enabled. Otherwise, execution is denied (ExeDen) and the transition to the Idle state is enabled. In the Execute state, all the inputs are available and the output data can be produced. Hence, the data is sent to the process p for execution. After the execution, the output data is stored in the SMU and the transition to WrData is enabled. In the WrData state, the following happens: (i) The present bits of the output data are set to ‘1’. (ii) The present bits of the input data (the data retrieved earlier) are set to ‘0’. (iii) The data is written back to the corresponding addresses of the inputs and outputs. (iv) The addresses of all inputs are incremented by 1 modulo the bound to point to the next read location, and the addresses of all outputs are likewise incremented by 1 modulo the bound to point to the next write location. (v) Once the acknowledgement that the data has been stored is received from the LUS, the transition to the Idle state is enabled.
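The state sequence described above can be traced with a short simulation of a single clock arrival. This is a simplified sketch, not the authors' design: the LUS is a plain list of 32-bit words, the read and write handshakes with the LUS are collapsed into direct list accesses, and the names Port and clock_tick are hypothetical.

```python
PRESENT = 1 << 31      # MSB marks a valid word (31-bit payload assumed)


class Port:
    """SMU field for one input or output: segment base, bound, and index."""

    def __init__(self, addr, bound):
        self.addr, self.bound, self.idx = addr, bound, 0

    def loc(self):
        return self.addr + self.idx

    def advance(self):
        self.idx = (self.idx + 1) % self.bound


def clock_tick(lus, p, inputs, outputs):
    """Handle one clock arrival and return the list of states visited."""
    trace = ["Idle", "ReqData"]
    in_words = [lus[port.loc()] for port in inputs]
    out_words = [lus[port.loc()] for port in outputs]
    ready = (all(w & PRESENT for w in in_words)
             and not any(w & PRESENT for w in out_words))
    if not ready:                         # execution denied
        return trace + ["Idle"]
    trace.append("Execute")               # execution granted
    results = p(*[w & ~PRESENT for w in in_words])
    trace.append("WrData")
    for port, value in zip(outputs, results):
        lus[port.loc()] = PRESENT | value  # output becomes valid
        port.advance()
    for port, w in zip(inputs, in_words):
        lus[port.loc()] = w & ~PRESENT     # input is marked consumed
        port.advance()
    return trace + ["Idle"]


# One input segment at addresses 0..1 and one output segment at 2..3.
lus = [0] * 4
lus[0] = PRESENT | 5
inputs, outputs = [Port(0, 2)], [Port(2, 2)]
first = clock_tick(lus, lambda v: (v + 1,), inputs, outputs)
second = clock_tick(lus, lambda v: (v + 1,), inputs, outputs)
```

On the first clock the input is present and the output slot is free, so the process executes and writes back; on the second clock the next input slot is empty, so the process returns to Idle without executing.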

3 Ensuring Correctness of our Architecture

In this section, we present the correctness proofs for our architecture. To ensure correctness, we need to show that the refined architecture maintains the Kahn principle. To prove this, we show that (i) each encapsulated process in our architecture is deterministic, continuous, and monotonic, and (ii) the behavior of the processes in the GALS architecture is correct with respect to their KPN counterparts. Our notion of correctness is latency equivalence. Before we define latency equivalence, we provide some basic definitions.

Let D be the set of data values and T be a set of tags, where T ⊆ N. The set of all events is denoted by E, where an event e ∈ D × T is defined as a value-tag pair. However, in the systems we consider, a special event called the absent event, denoted by τ, may occur³. Hence, an event with an informative value is called a valid event (e ∈ D × T); otherwise it is an absent event (τ). Two events are said to be identical (e_i = e_j) iff they contain the same data values and their corresponding tags are identical.

Consider P to be the set of all processes of the KPN, where each p ∈ P follows the blocking read and non-blocking write policies. We now define the helper notation that we use in the paper. We use the notation $exp_1^n$ as a textual replacement for $exp_1, exp_2, \ldots, exp_n$. For example, the list of signals $s_1, s_2, \ldots, s_n$ can be denoted as $s_1^n$. Abusing the notation, we extend this to the scope of functions. A function func used as $func(s_1^n)$ implies that all $s_i \in s_1^n$, for i from 1 to n, are arguments to the function. On the other hand, a function used as $func(s)_1^n$ implies that the function is applied to each individual $s_i$ as a single argument. Throughout the paper, we use these shorthand notations.

Definition 1 Signal: A signal s = e_i e_j e_k ... is defined as a sequence of events ordered by their tags, where i < j < k and i, j, k ∈ N.

For a signal s, s[i] denotes its i-th event. The set of all signals is denoted by S. An empty signal is denoted by [ ]. Two signals s1, s2 are said to be identical, s1 = s2, iff they have the same length and s1[i] = s2[i] for every index i.

³ An absent event may occur due to lack of data in the producer or due to a consumer’s request to delay a transmission.

Definition 2 Signal prepend: Given s ∈ S and e ∈ E, we define the prepend operator ⊕, where e ⊕ s = s′ such that e is the first event of s′ and s is the rest of the signal.

Definition 3 Latency Equivalence: Two signals s1 and s2 are said to be latency equivalent, written s1 ≡e s2, iff

\[
s_1 \equiv_e s_2 \iff strip(s_1, n) = strip(s_2, n),
\]

where $strip : S \times \mathbb{N} \to S$ is defined as $strip(s, n) = \sigma(s, 1, n)$ and

\[
\sigma(s, i, n) =
\begin{cases}
s[i] \oplus \sigma(s, i+1, n), & \text{if } (s[i] \neq \tau \wedge i \neq n) \\
\sigma(s, i+1, n), & \text{if } (s[i] = \tau \wedge i \neq n) \\
s[i], & \text{if } (s[i] \neq \tau \wedge i = n) \\
[\ ], & \text{otherwise}
\end{cases}
\]

The function strip returns a signal without any absent events. The function σ takes the signal s, the start index i, and the signal size n, i.e., the number of events in the signal. It outputs only the valid events; the absent events are removed. These definitions can also be extended to processes. For two latency equivalent processes, if their corresponding input signals are latency equivalent, then their output signals are also latency equivalent.
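Definition 3 translates almost directly into code. The following minimal sketch mirrors the recursive definition of σ; as simplifying assumptions, the absent event τ is represented by Python's None, events are plain values whose tags are implied by their order in the list, and each signal is stripped over its own length. The names TAU, sigma, strip, and latency_equivalent are illustrative.

```python
TAU = None      # stands in for the absent event


def sigma(s, i, n):
    """Recursive definition of sigma; i is a 1-based index into s."""
    event = s[i - 1]
    if event is not TAU and i != n:
        return [event] + sigma(s, i + 1, n)    # keep the event and continue
    if event is TAU and i != n:
        return sigma(s, i + 1, n)              # skip the absent event
    if event is not TAU and i == n:
        return [event]                         # last event is valid
    return []                                  # last event is absent


def strip(s):
    """Return the signal s with all absent events removed."""
    return sigma(s, 1, len(s)) if s else []


def latency_equivalent(s1, s2):
    return strip(s1) == strip(s2)


a = [1, TAU, 2, 3]
b = [TAU, 1, 2, TAU, TAU, 3]
c = [1, 3, 2]
same = latency_equivalent(a, b)
different = latency_equivalent(a, c)
```

Signals a and b carry the same valid events in the same order and differ only in where the absent events fall, so they are latency equivalent; c reorders the valid events and is not.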

We first show that each process in our architecture has the Kahn property, i.e., it is deterministic, continuous, and monotonic. Instead of repeating the proof in [9], we refer to deterministic input/output (I/O) automata [11]; the composition of such deterministic I/O automata has been shown to preserve the Kahn properties [12]. An I/O automaton is a formalism used to describe and reason about networks of concurrently executing processes. An I/O automaton consists of input actions, output actions, and internal actions, with the requirement that all input actions are always enabled. This means that whenever inputs arrive, the automaton can react to them appropriately.

Theorem 1 If M is a deterministic I/O automaton, then it is continuous and monotonic [12].

Now, to show that the processes of our architecture are continuous and monotonic, we show that they can be correctly transformed into deterministic I/O automata. We first consider the refined process p′. Let $addri_1^n$, $addro_1^m$ denote the addresses of the n inputs and m outputs of the process, $idxi_1^n$, $idxo_1^m$ denote their corresponding indices, and $bndi_1^n$, $bndo_1^m$ denote their corresponding bounds. We denote the data values for inputs and outputs as $di_1^n$ and $do_1^m$. Assuming do′, di′, idxi′, and idxo′ denote newer values, and p ∈ P is the process being refined, the functionality of the refined process can be represented by the following functions.

The read operation for inputs and outputs is defined as follows, where LS represents the corresponding segment in the LUS.

\[
\begin{aligned}
&\text{For each } addri_j \in addri_1^n: & di_j &= rdloc(LS_j, addri_j, idxi_j) \\
&\text{For each } addro_j \in addro_1^m: & do_j &= rdloc(LS_j, addro_j, idxo_j)
\end{aligned}
\]

Now, the following is the execution condition for the process:

\[
\text{if } \big( (\forall di \in di_1^n : rdprs(di) = 1) \wedge (\forall do \in do_1^m : rdprs(do) = 0) \big)
\]

If the condition is false, the refined process transitions back to the Idle state. If the condition holds true, then after process execution, the following behavior occurs.

For each k,

\[
\begin{aligned}
do'_k &= wrprs(p(di_1^n), 1) \\
di'_k &= wrprs(di_k, 0) \\
&wrloc(addri_k, di'_k, idxi_k) \\
&wrloc(addro_k, do'_k, idxo_k) \\
idxi'_k &= (idxi_k + 1) \bmod bndi_k \\
idxo'_k &= (idxo_k + 1) \bmod bndo_k
\end{aligned}
\]

The definitions of the refined process presented above are deterministic. The functions rdloc and wrloc read and write data in the LUS and are clearly deterministic. Furthermore, the LUS contains data that can be modified only by the wrloc function, and hence is deterministic. Next, we show that these units are continuous and monotonic.

Theorem 2 The processes in our lookup-based architecture are continuous and monotonic.

Proof sketch: We can easily transform the behavior of the refined process into its equivalent I/O automaton. The inputs include the arrival of events on the clock signal, the arrival of data from a read operation, and the arrival of an acknowledgement after a write operation. These events are mapped to input actions. The output actions include the events on the control and data signals to the LUS. The remaining events are mapped to internal actions. Now, to ensure that the inputs are enabled in all states, we first consider the arrival of the clock. If the clock arrives in the Idle state, the transition is made to the ReqData state. In all other states, if the clock arrives, the process remains in the same state. The input data actions from the LUS are seen only in the ReqData state, after an asynchronous request is sent to the LUS. In any other state, this event will not be seen, as no read request is sent. Similarly, the acknowledgement (an input action from the LUS) will only occur after the write data signal is sent to the LUS. Hence, for all possible inputs, the I/O automaton of the refined process takes appropriate actions, and the refined process can therefore be represented as a deterministic I/O automaton. By Theorem 1, the processes are continuous and monotonic. Similarly, the LUS can also be represented by an I/O automaton, since all the actions that occur are asynchronous and the appropriate deterministic responses are generated.

Next, we show that the behavior of our refined process p′ ∈ P′ is correct. Along with p′, we also consider the corresponding segments where the data for its inputs and outputs is stored.


Theorem 3 The behavior of a process p ∈ P and its refined process p′ ∈ P′ are latency equivalent.

Proof sketch: The process p computes a valid value only when all its inputs are available (blocking read). The condition of the SMU ensures that the corresponding refined process computes when (a) all its inputs are available and (b) it is able to write data to its outputs. If we consider only condition (a), then the behavior is equivalent to that of p. Condition (b) ensures that the data does not overwrite an existing valid value. Once the consumer of p′ reads the existing valid value, it marks it invalid. So, for a single process, the consumer acts as the environment, and we assume that the environment is fair, as in the KPN model. Furthermore, because of the fair environment assumption, the producer to p′ will also not produce valid data until the earlier data is consumed by p′. Hence, the outputs of p and p′ will be identical for the same input.

Next, we show that the condition holds true for the composition of such processes.

Theorem 4 The behavior of two composed processes p1, p2 ∈ P (p1 and p2 are connected by a point-to-point channel) is equivalent to the behavior of the composition of the corresponding refined processes p′1, p′2 ∈ P′ (p′1 and p′2 communicate via an asynchronous segment from our LUS unit).

Proof sketch: Composing p1 and p2 implies that one process is a producer and the other is a consumer. In the KPN model, if p1 is the producer, p2 is the consumer, and X is a sequence of inputs, then p2(p1(X)) represents the behavior of this model. Now, consider their corresponding refined processes, p′1 and p′2, which communicate through a segment in the LUS. By Theorem 3, we know that p1(X) and p′1(X) are equivalent. Since p′1 and p′2 share the same segment, p′1 can write only at an address location that holds invalid data, i.e., no value still waiting to be read by p′2, and the address of p′2 will point to the correct reading location (a base assumption on how the initial address locations are assigned). Hence, the outputs of p′1(X) will be stored at locations that p′2 will read from. As a result, the output of the GALS model will be p′2(p′1(X)). By Theorem 3, p2(p1(X)) and p′2(p′1(X)) are equivalent.

Using induction on the number of processes and their corresponding segments, we can prove that the behavior of the Kahn model and our GALS model is equivalent.
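The composition argument can be checked empirically on a small example, although a simulation is of course not a proof. The sketch below composes two refined processes through bounded segments and compares the result with the KPN reference p2(p1(X)) for different clock periods. It makes several simplifying assumptions: a slot holding None plays the role of a cleared present bit, the environment feeds and drains the end segments on every time step, and the names Segment, refined, and simulate are hypothetical.

```python
class Segment:
    """Bounded circular segment; a slot holding None is not present."""

    def __init__(self, bound):
        self.slots = [None] * bound
        self.rd = 0     # index used by the consumer
        self.wr = 0     # index used by the producer

    def can_write(self):
        return self.slots[self.wr] is None

    def can_read(self):
        return self.slots[self.rd] is not None

    def write(self, value):
        self.slots[self.wr] = value
        self.wr = (self.wr + 1) % len(self.slots)

    def read(self):
        value = self.slots[self.rd]
        self.slots[self.rd] = None          # mark the slot invalid
        self.rd = (self.rd + 1) % len(self.slots)
        return value


def refined(func, src, dst):
    """Return the clock handler of a refined process.

    The process fires only if its input is present and its output
    slot is free (blocking read and blocking write).
    """
    def tick():
        if src.can_read() and dst.can_write():
            dst.write(func(src.read()))
    return tick


def simulate(xs, period1, period2, bound=2, steps=200):
    """Run p1' and p2' on independent clocks and collect the outputs."""
    a, b, c = Segment(bound), Segment(bound), Segment(bound)
    tick1 = refined(lambda v: 2 * v, a, b)    # p1: double
    tick2 = refined(lambda v: v + 1, b, c)    # p2: add one
    pending, out = list(xs), []
    for t in range(steps):
        if pending and a.can_write():         # environment as producer
            a.write(pending.pop(0))
        if t % period1 == 0:
            tick1()
        if t % period2 == 0:
            tick2()
        if c.can_read():                      # environment as consumer
            out.append(c.read())
    return out


xs = [1, 2, 3, 4, 5]
kpn_reference = [2 * v + 1 for v in xs]       # p2(p1(X))
fast_producer = simulate(xs, period1=2, period2=3)
slow_producer = simulate(xs, period1=5, period2=1, bound=1)
```

For these inputs both clock configurations yield the same sequence of valid outputs as the KPN reference; only the time steps at which the values appear differ, which is what latency equivalence permits.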

4 Conclusion

In this paper, we have presented a correctness-preserving design transformation from a KPN model towards a GALS model. Our formal refinement methodology from a KPN specification to a GALS implementation is generic, and various alternate schemes of GALS implementation can be derived. We showed refinements to a lookup-based architecture and how to establish correctness with respect to the KPN model. Lastly, metastability is avoided in this architecture, as read and write operations on the LUS happen only after the arrival of the clock. If a new clock arrives during a read/write to the LUS, it is ignored.

References

[1] Arvind and R. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Transactions on Computers, 39(3):300–318, March 1990.

[2] Arvind and D. Culler. Dataflow architectures. Annual Review of Computer Science, 1, 1986.

[3] M. Bohr. Interconnect scaling - The real limiter to high performance VLSI. In IEEE Int. Electron Devices Meeting, pages 241–244, 1995.

[4] L. Carloni, K. McMillan, and A. Sangiovanni-Vincentelli. Theory of latency-insensitive design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(9):1059–1076, 2001.

[5] E. Cheung and H. Hsieh. Automatic buffer sizing for rate-constrained KPN applications on multiprocessor system-on-chip. In Proceedings of High Level Design Validation and Test, 2007.

[6] J. Dennis. First version of a data flow procedure language. In G. Goos and J. Hartmanis, editors, Proceedings of the Programming Symposium. Springer-Verlag, 1974.

[7] M. Geilen and T. Basten. Requirements on the execution of Kahn process networks. In Proceedings of the 12th European Symposium on Programming, ESOP 2003, 2003.

[8] X. Jia and R. Vemuri. The GAPLA: A globally asynchronous locally synchronous FPGA architecture. In Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’05), pages 291–292, Washington, DC, USA, 2005. IEEE Computer Society.

[9] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Information Processing, pages 471–475, Stockholm, Sweden, Aug 1974. North Holland, Amsterdam.

[10] D. Kim, M. Kim, and G. Sobelman. Asynchronous FIFO Interfaces for GALS On-Chip Switched Networks. In International SoC Design Conference, 2005.

[11] N. A. Lynch and M. R. Tuttle. An introduction to input/output automata. Technical Memorandum TM-373, MIT LCS, Nov 1988. TM-351 revised.

[12] N. A. Lynch and E. W. Stark. A Proof of the Kahn Principle for Input/Output Automata. Information and Computation, 82(1):81–92, 1989.

[13] S. Suhaib, D. Mathaikutty, D. Berner, and S. Shukla. Validating families of latency insensitive protocols. IEEE Transactions on Computers, 55(11):1391–1401, 2006.

[14] S. Suhaib, D. Mathaikutty, and S. Shukla. Dataflow architectures for GALS. In Third International Workshop on Formal Methods for Globally Asynchronous Locally Synchronous Design, 2007.
