

CONCURRENCY PRACTICE AND EXPERIENCE, VOL. 3(3), 145–157 (JUNE 1991)

‘Blade Runner’: a real-time speech recognizer

MIKE CHONG*, FRANK FALLSIDE, TIM MARSLAND† AND RICHARD PRAGER
Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, UK

SUMMARY
A real-time prototype speech recognizer has been implemented on a 66-processor distributed-memory parallel computer. Simple phrases are recognized in approximately 4 to 10 seconds. Scalability, performance and flexibility are the three main aims of this implementation, the ultimate goal being to construct a large vocabulary speech recognizer which responds quickly. A set of three techniques is investigated in this implementation: asynchronous methodology to minimize synchronization overheads, distributed control to avoid a central communications bottleneck, and dynamic load balancing to provide a flexible response to an unpredictable computational load. The effect on memory, processor time allocation and communications is observed in real-time using hardware monitoring aids.

1. INTRODUCTION

Parallel processing is of rapidly increasing importance in many fields, including automatic speech recognition[22]. This trend is in part because the computational needs of speech recognition systems which accommodate large vocabularies are growing faster than the computational power which uniprocessor architectures can provide cost-effectively. This paper describes the implementation of Blade Runner, a real-time speech recognizer, on the ParSiFal T-rack[15], a distributed-memory multiprocessor computer built at Manchester University. Blade Runner is a prototype implementation of a voice-activated hologram browser system, named after the Warner Bros. film.

Three goals of a real-time speech recognition system are scalability, performance and flexibility. Scalability allows the implementation to be expanded to accommodate larger vocabularies, essential if speech recognition is to be used successfully in future applications, e.g. a speech typewriter. Performance is measured by the time taken to recognize a spoken phrase (performance regarding accuracy is discussed in Reference 7). The goal, to respond as quickly as possible, is of particular importance because of the interactive nature of the browser application. Finally, each spoken utterance is unique; speech is by nature unpredictable. The recognition system must be sufficiently flexible to respond to the recognition procedure's rapidly changing resource demands.

The ParSiFal T-rack (Figure 1)[15] consists of 65 processor nodes, connected to a Sun 3/110 host computer via a T-800 transputer B011 VME interface card and an RS-232 link; an INMOS T-800 transputer with 1 Mbyte of memory forms each processor node. Sixty-four nodes are hard-wired in a necklace configuration using two links of each transputer; the other two links are connected to two C004 switch cards,

* Currently with NEC Corporation, 1-1 Miyazaki 4-Chome, Miyamae-ku, Kawasaki, Kanagawa 216, Japan.
† Currently with Sun Microsystems Inc., 2550 Garcia Avenue, Mountain View, CA 94043, USA.

1040-3108/91/030145-13 © 1991 by John Wiley & Sons, Ltd.

Received 22 May 1990; Revised 8 February 1991


Figure 1. ParSiFal T-rack and Sun host computer system (65-transputer T-rack; C004 switches)

which are controlled by the switch control transputer. A unique feature of the T-rack is a backplane bus which runs through all 65 processors, enabling run-time debugging messages and monitoring information to be obtained directly, without having to rely on the integrity of a processor-to-processor communication chain[3]. The main advantage of a distributed-memory architecture such as the T-rack is that more computational power can be added by increasing the number of processors, important if the speech recognizer is to have a degree of scalability.

It is interesting to compare the T-rack (Figure 1) with other computer architectures used in speech recognition systems. In BEAM[4], a shared-bus architecture computer used by the SPHINX speech recognizer[19], bus contention problems are partly compensated for by having local cache memory for both data and instructions. Other architectures include processor trees[29], pipelines[6], SIMD[9,25] and dedicated hardware[24,26]. In each case the recognition algorithm is implemented using a synchronous methodology. In contrast, Blade Runner uses an asynchronous methodology to reduce synchronization overheads[13], and hence improve performance.

The authors used the C language[2] in preference to occam[27,16], because C allows the effective use of data structures, pointers and dynamic memory allocation, three techniques which are used extensively in Blade Runner.

2. SPEECH RECOGNITION ALGORITHM

The speech recognition algorithm uses a neural network[20] to recognize phones (a phone being a popular choice for a subword speech unit[22]), and hidden Markov models[28] to generate scores for words and phrases from the sequence of phones. Figure 2 shows how the speech recognition algorithm works.

Figure 2. Speech recognition algorithm

Digitized speech is converted into a stream of phonetic tokens using a neural network[7]. Phone classification using neural networks can lead to more accurate recognition results[30]; however, such networks can be computationally expensive to run. Blade Runner uses a novel CART tree neural network[8] which only needs two T-800 transputers for real-time operation.

In the recognition process, a stream of phonetic tokens derived from the utterance is matched against a sequence of word models by a set of word matching jobs. A word model consists of a precomputed set of parameters of a hidden Markov model[28]. When matched against a stream of phonetic tokens, each word model generates a score which expresses the closeness of the stream of tokens to that word. The word sequence which generates the highest total score is the resultant recognized phrase.
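As an illustration, the sketch below scores one word model against a token stream using the Viterbi recursion over a discrete-observation hidden Markov model, in log probabilities so that scores add. The real model format is described in Reference 7; the state count, the 64-token alphabet and the struct layout here are assumptions.

    /* Hypothetical sketch: Viterbi scoring of a phonetic token stream
     * against one word model (a discrete-observation HMM), using log
     * probabilities so scores are added rather than multiplied. */
    #include <float.h>

    #define MAX_STATES 16

    typedef struct {
        int    n_states;
        double log_trans[MAX_STATES][MAX_STATES]; /* log P(state j | state i) */
        double log_emit[MAX_STATES][64];          /* log P(token | state)     */
    } word_model;

    /* Returns the best log-probability of the token sequence under the model. */
    double score_word(const word_model *m, const int *tokens, int n_tokens)
    {
        double prev[MAX_STATES], cur[MAX_STATES];
        int s, t, j;

        for (s = 0; s < m->n_states; s++)            /* start in state 0 */
            prev[s] = (s == 0) ? m->log_emit[0][tokens[0]] : -DBL_MAX;

        for (t = 1; t < n_tokens; t++) {
            for (j = 0; j < m->n_states; j++) {
                double best = -DBL_MAX;
                for (s = 0; s < m->n_states; s++) {
                    double v = prev[s] + m->log_trans[s][j];
                    if (v > best) best = v;
                }
                cur[j] = best + m->log_emit[j][tokens[t]];
            }
            for (j = 0; j < m->n_states; j++) prev[j] = cur[j];
        }
        return prev[m->n_states - 1];                /* end in the final state */
    }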

Blade Runner uses an LALR(1) grammar[7,17] to control which word models should be matched next, rather than, say, to explicitly accept or reject a hypothesized phrase. In the example in Figure 2, the grammar determines that the 'Activate' word model should be followed by the word models for 'BladeRunner' and 'Grid'. Phrases that are not accepted by the grammar are therefore automatically not searched for. As an example, a total of six word matching jobs are generated to match the grammar shown in Figure 2 (a sketch of this grammar-driven expansion follows the list):

1. Activate
2. [Activate] BladeRunner
3. [Activate] Grid
4. Shutdown
5. [Shutdown] BladeRunner
6. [Shutdown] Grid
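One plausible, table-driven form of this expansion is sketched below; the structure names and the spawn_word_match() helper are hypothetical, standing in for the job creation machinery described in Section 3.

    /* Hypothetical sketch: each grammar state carries a precomputed table
     * of the word models that may legally follow it, so only phrases
     * accepted by the LALR(1) grammar are ever searched. */
    #define MAX_SUCC 8

    typedef struct word_model word_model;        /* opaque here */

    typedef struct {
        int               n_succ;
        const word_model *succ_model[MAX_SUCC];  /* word models to try next  */
        int               succ_state[MAX_SUCC];  /* grammar state after each */
    } grammar_state;

    extern grammar_state grammar[];              /* preloaded on every node  */

    extern void spawn_word_match(const word_model *m, int state,
                                 int first_token, double score);

    void spawn_successors(int state, int next_token, double score)
    {
        int i;
        for (i = 0; i < grammar[state].n_succ; i++)
            spawn_word_match(grammar[state].succ_model[i],
                             grammar[state].succ_state[i],
                             next_token, score);
    }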

Because of the low latency requirement, Blade Runner cannot wait until it detects the end of the utterance, and so starts the recognition process as soon as the first phonetic tokens become available. Thus word matching jobs are spawned as phonetic tokens appear and more hypotheses are generated by the grammar.

A word matching job matches a word model to a section of the phonetic sequence, and requires three components before it can run, namely: the phonetic tokens from the neural network classifier, the particular word model, and the state associated with the grammar that selects the next word model hypothesis.
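A job descriptor might therefore look like the following sketch; the field names are assumptions, and the extra bookkeeping fields anticipate the load balancing (hop count, Section 3.2) and queueing described later.

    /* Hypothetical descriptor for a word matching job: the three required
     * components named above, plus assumed bookkeeping fields. */
    typedef struct word_model word_model;     /* precomputed HMM parameters */

    typedef struct word_match_job {
        int               first_token;    /* where in the token stream to start */
        const word_model *model;          /* the word model to match            */
        int               grammar_state;  /* selects the next word hypotheses   */
        double            entry_score;    /* accumulated score so far           */
        int               hops;           /* nodes this request has visited     */
        struct word_match_job *next;      /* queue link on a word matching node */
    } word_match_job;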

Even with the simple grammar and limited vocabulary (29 words) used in Blade Runner, there were 195 possible phrases. A beam search technique[21] is therefore used to eliminate word matches which are unlikely to be correct, reducing significantly the amount of computation required.
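The pruning test itself is simple; a minimal sketch follows, assuming a fixed beam width in log-probability units (the actual width used is not given in the paper) and the replicated best-score table described in Section 3.1.

    /* Hypothetical beam pruning test: a hypothesis ending at token t is
     * kept only if its score lies within BEAM_WIDTH of the best score
     * recorded for that token anywhere in the network. */
    #define BEAM_WIDTH 50.0            /* log-probability units; assumed value */

    extern double best_score[];        /* one entry per phonetic token */

    int within_beam(int t, double score)
    {
        return score >= best_score[t] - BEAM_WIDTH;
    }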


3. IMPLEMENTATION

Figure 3 shows Blade Runner's topology on the T-rack. Eight transputers are used for signal processing and neural network tasks, generating a phonetic token every 12.8 ms. A sequence of tokens representing a spoken utterance is separated from the constant stream of tokens by a silence detection mechanism[7] running on the phonetic token control transputer node. These tokens are fed via TC1 and TC2 to 53 nodes arranged in a back-to-back ternary tree, which run word matching jobs as word matching processes.

The word matching algorithm is implemented as a loosely synchronous scheme[11]; word matches proceed asynchronously, but are synchronized to the arrival of phonetic tokens. A word matching process is not partitioned across nodes, the calculations required for a word match being performed in isolation on a node; this coarse-grained approach helps to reduce the inter-processor communication overhead. Since the vocabulary is small (29 words), a further reduction in internodal communications is gained by preloading the grammar and word models into the memory of each word matching node. Phonetic tokens are generated as the user is speaking, so these are broadcast to all 53 word matching nodes as they become available.

The recognition procedure starts with the grammar generating word matching job requests. These are distributed amongst the 53 word matching nodes, where they are run as word matching processes. When a word matching process terminates it may spawn further word matching job requests, which match the next words in the sequence. This continues until all required word matches have been performed against the utterance's token sequence.

3.1. Job control

The beam search technique[21] requires the best scores for every phonetic token, generated by the word matching processes running on the 53 word matching nodes, to be accessible by those processes. A separate copy of these scores is stored in each of the 53 word matching nodes. This is a mutual consistency problem with replicated data.

The consistency of the best scores for each phonetic token is maintained by broadcasting best score messages. These are generated by word matching processes which produce a new local best score for a phonetic token on a node. A node, on receiving such a best score message, compares and updates its score for the token specified in that message, and rebroadcasts the message onwards if it had to change its own score.

The compare and update task on each node is a multiple-reader, multiple-writer mutual exclusion problem. In this implementation the mutual exclusion overhead is minimized by implementing the compare and update procedure without using transputer machine code instructions that allow the process to be descheduled, thereby making that procedure an atomic action.
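A sketch of the protocol follows. On a transputer a process can only be descheduled at communication, timer and certain jump instructions, so a compare-and-update containing none of these runs to completion atomically; the rebroadcast is kept outside the critical section. The message layout and the broadcast_to_neighbours() helper are assumptions.

    typedef struct { int token; double score; } best_score_msg;

    extern double best_score[];                   /* this node's replicated copy */
    extern void broadcast_to_neighbours(const best_score_msg *m); /* hypothetical */

    /* Critical section: no channel or timer operations, so on a transputer
     * this runs to completion without being descheduled (an atomic action). */
    static int compare_and_update(const best_score_msg *m)
    {
        if (m->score > best_score[m->token]) {
            best_score[m->token] = m->score;
            return 1;                             /* local copy changed */
        }
        return 0;
    }

    void on_best_score_msg(const best_score_msg *m)
    {
        if (compare_and_update(m))
            broadcast_to_neighbours(m);           /* pass the improvement onwards */
    }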

3.2. Load balancing

Owing to the unpredictable nature of speech, resource demands cannot be predicted in advance; resources are therefore allocated as the recognition procedure proceeds. Load balancing is simplified to deciding which word matching jobs to run on each of the 53 word matching nodes. Word matching jobs are passed on from node to node until they are accepted by a node.


Figure 3. T-rack topology (phonetic token control transputer with silence detection; TC1 and TC2; image processing nodes)


Livelock is avoided by ensuring that those jobs which have travelled furthest are given priority to run on a node.
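A minimal sketch of this local acceptance rule is given below; the capacity constant and helper names are assumptions, and the hops field is the travel count that gives far-travelled jobs priority.

    #define MAX_LOCAL_JOBS 3           /* assumed capacity per node */

    typedef struct word_match_job word_match_job;  /* as sketched in Section 2 */

    extern int  local_job_count;
    extern void enqueue_by_hops(word_match_job *j); /* furthest-travelled first */
    extern void bump_hops(word_match_job *j);       /* record one more hop      */
    extern void forward_to_neighbour(word_match_job *j);

    void on_job_request(word_match_job *job)
    {
        if (local_job_count < MAX_LOCAL_JOBS) {
            enqueue_by_hops(job);      /* jobs that travelled furthest run first */
            local_job_count++;
        } else {
            bump_hops(job);            /* raises its priority at the next node */
            forward_to_neighbour(job);
        }
    }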

3.3. Message passing

Transputers use the same instructions to pass data down external communication links as down internal communication channels. It is this feature of the transputer's instruction set that provides the possibility of delaying the binding of processes to processors. Nevertheless, in Blade Runner, data is passed between processes running on the same processor by passing a pointer to that data down a channel. Although this technique is not applicable to processes which run on separate processors, it is substantially faster than copying data.
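The idea is illustrated by the following sketch: two processes on the same transputer pass a single machine word (the pointer) down a channel instead of copying the structure it points to. chan_send/chan_recv stand in for the C toolset's channel primitives, and the token buffer and helpers are assumptions.

    typedef struct channel        channel;         /* toolset channel handle */
    typedef struct phonetic_token phonetic_token;

    extern void chan_send(channel *c, const void *p, unsigned len);
    extern void chan_recv(channel *c, void *p, unsigned len);
    extern void fill_tokens(phonetic_token *buf);
    extern void match_tokens(phonetic_token *buf);

    void producer(channel *c, phonetic_token *buf)
    {
        fill_tokens(buf);
        chan_send(c, &buf, sizeof buf);    /* send the pointer, not the data */
    }

    void consumer(channel *c)
    {
        phonetic_token *buf;
        chan_recv(c, &buf, sizeof buf);    /* one word moves, however big *buf is */
        match_tokens(buf);
    }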

Transferring data between separate processors requires the data to be copied. In Blade Runner data structures are serialized, copied across and deserialized on arrival using a subset of the external data representation (XDR) routines[31]. Each data type is assigned a single routine which carries out both tasks of serializing and deserializing data.
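For illustration, here is a minimal XDR filter in the Sun style the paper cites[31]: the same routine both serializes and deserializes, the direction being carried by the XDR handle. The message layout is an assumption.

    #include <rpc/rpc.h>     /* Sun XDR: XDR, bool_t, xdr_int, xdr_double */

    typedef struct { int token; double score; } best_score_msg;

    /* One filter routine per data type: it encodes or decodes depending on
     * whether the handle was created with XDR_ENCODE or XDR_DECODE. */
    bool_t xdr_best_score_msg(XDR *xdrs, best_score_msg *m)
    {
        return xdr_int(xdrs, &m->token) &&
               xdr_double(xdrs, &m->score);
    }

A sender would create the handle with xdrmem_create(&x, buf, len, XDR_ENCODE) and call the filter to fill the buffer; the receiver repeats the call with XDR_DECODE to rebuild the structure from the incoming bytes.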

All inter-processor message traffic is buffered, making best use of the transputer’s DMA engine on each link, as well as allowing each processor to proceed asynchronously.

3.4. Memory allocation

Dynamic memory allocation is used in Blade Runner because this makes best use of the one megabyte of memory in each node. In addition to the C malloc() heap memory allocation routine, a buffer pool of 32-byte blocks is used[7], helping to overcome the two problems of heap fragmentation and memory allocation speed.
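A sketch of such a pool follows, assuming a simple free-list design (the actual pool implementation is described in Reference 7): small requests are served from a list of fixed 32-byte blocks, which is fast and cannot fragment, while larger requests fall through to malloc().

    #include <stdlib.h>

    #define BLOCK_SIZE 32

    typedef union block { union block *next; char bytes[BLOCK_SIZE]; } block;
    static block *free_list = NULL;

    void *pool_alloc(size_t n)
    {
        if (n <= BLOCK_SIZE && free_list != NULL) {
            block *b = free_list;          /* reuse a block: no heap search */
            free_list = b->next;
            return b;
        }
        return malloc(n > BLOCK_SIZE ? n : BLOCK_SIZE);
    }

    void pool_free(void *p, size_t n)
    {
        if (n <= BLOCK_SIZE) {             /* return small blocks to the pool */
            block *b = (block *)p;
            b->next = free_list;
            free_list = b;
        } else {
            free(p);
        }
    }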

3.5. Termination detection

The recognition procedure is not continuous: separate phrases are recognized one after another. Termination detection is therefore needed, and is carried out by TC2 (Figure 3) using a termination detection mechanism based on the diffusing computation algorithm of Dijkstra and Scholten[10]. Every word matching job request generated is reported to TC2 using a birth message, and the termination of a word matching job is reported to TC2 by a death message. Termination is detected once the difference between the number of birth messages and the number of death messages reaches zero. Note that there is a race condition between the birth messages and the death messages; it is ensured that the birth messages reach TC2 first by assigning them higher message priorities[7].
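The counting done on TC2 reduces to the following sketch (the function names are assumptions): each birth raises the outstanding-job count, each death lowers it, and the phrase is fully matched when the count returns to zero. The higher priority given to birth messages guarantees a death can never overtake the birth it belongs to, so the count cannot spuriously reach zero early.

    static long outstanding = 0;

    void on_birth(void)  { outstanding++; }

    int on_death(void)   /* returns 1 when recognition has terminated */
    {
        return (--outstanding == 0);
    }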

3.6. Debugging tools

Debugging was a major problem, owing principally to the asynchronous nature of Blade Runner's implementation. The T-rack's backplane bus is used by Backprof[23], a software debugging tool used to print variables and monitor processes in real-time. Backprof runs two types of process: a server process, which runs on the switch control transputer (Figure 1), and a client process on each of the 64 necklace processors. The server acts as an interface between the host computer and the client processes, communicating with the host directly via an RS-232 link (Figure 1). A client process


collects monitoring information and debugging messages from user processes running on its node, and this data is passed back to the server process via the T-rack’s backplane bus.

4. PROFILING RESULTS

In order to evaluate the implementation, the system was used to recognize 35 phrases spoken twice (70 utterances in total), with one to six words per phrase[7]. Blade Runner took from 4 seconds (for the phrase 'stop') to 10 seconds (for the phrase 'rotate left and pan down seven') to recognize a phrase. An extra profiling process was run on each of the 53 word matching nodes to collect information on memory usage, inter-processor message communication and the number of word matching processes running. The inclusion of this process made no observable difference either to the running time or to the accuracy of Blade Runner. After the recognition of each utterance, the data collected by each profiling process was printed out using Backprof and the T-rack's backplane bus.

4.1. An example run

Figure 4 shows a profiled run for the test phrase 'Activate BladeRunner', for the word matching node next to TC1 (Figure 3). Figure 4(a) shows the amount of heap memory used as the recognition procedure progresses. Figure 4(b) shows the arrival of messages during the recognition procedure; note that their arrival is unpredictable, owing to the asynchronous nature of the recognition process. Figure 4(c) shows the number of jobs running; in this case five word matching jobs were run sequentially.

Figure 5 shows the number of word matching processes active and the number of word matching job requests in transit, over the complete network of 53 word matching nodes, during the recognition procedure for the phrase ‘Activate BladeRunner’. Note that the number of word matching processes running remains relatively constant when compared with the number of job requests generated. This is evidence that the take-up of word matching jobs happens smoothly.

4.2. Processor usage

Measurements showed that the front end (Figure 3) required six 20 MHz T-800s to convert digitized speech samples into phonetic tokens in real-time. Table 1 shows the percentage of CPU allocated to tasks for the word matching node adjacent to TC1. This node spent up to 50% of its time alternating between processes (in monitor operations[5,14]), which is indicative of that node being in a high message traffic region. Processors in lower message traffic density regions spent up to 70% of their CPU time processing word matching job calculations.

4.3. Load balancing

Figure 6 shows the occurrence of the number of word matching processes run on a word matching node during an utterance. Note that the distribution is approximately Gaussian. Table 2 shows the peak number of word matching jobs run on a node. For only 3.7% of the time was there a node idle, and there were never more than three word matching jobs running concurrently on the same node.


Figure 4. Profiled run for the word matching node next to TC1, utterance 'Activate BladeRunner', versus 0.1 second time intervals: (a) heap memory usage in Kbytes; (b) messages received during each 0.1 second time interval; (c) word matching jobs running

Figure 5. Number of word matching jobs and word matching job requests versus 0.1 second time intervals for the 53 word matching processors, utterance 'Activate BladeRunner'


Table 1. Percentage of CPU usage for tasks on the word matching processor next to TC2

Task                              % CPU
Process alternation                 50
Word matching job calculation       40
Memory allocation                    5
Buffering                            3
Input communications                 2
Output communications                1

Figure 6. The occurrence of the number of word matching jobs run on a processor per utterance (x-axis: number of HMM jobs run on a transputer per utterance; fitted Poisson and normal distributions overlaid)

Table 2. The occurrence of peak number of jobs run over 35 utterances (spoken twice) for the 53 word matching processors

Peak jobs    Occurrences    Percentage
0                138            3.7
1               1058           28.5
2               2486           67.0
3                 28            0.8
4+                 0            0.0

Total           3710

These results (Figure 6, Table 2) demonstrate that the load balancing scheme was effective and the word matching jobs were distributed evenly across the 53 word matching node network.

4.4. Memory usage

Figure 7 shows the average amount of memory used plotted against the average number of messages received by a processor during the recognition of an utterance. Referring to Figure 3, the four plotting symbols in Figure 7 represent the two transputers adjacent to TC1 and TC2 (outermost), the next six of the adjacent layer (outer), the 18 of the inner layer (inner) and, finally, the 27 transputers of the centre layer (centre). This shows that transputers in high communication traffic regions use the most memory.


Figure 7. Average amount of memory used versus number of messages received, shown for each of the 53 word matching processors (legend: outermost (2), outer (6), inner (18), centre (27))

Table 3. Correlation matrix between word matching jobs, messages and memory averaged over 35 utterances spoken twice, over all 53 word matching processors

            Jobs    Messages    Memory
Jobs        1.00      0.15       0.36
Messages    0.15      1.00       0.64
Memory      0.36      0.64       1.00

On each word matching node, observed memory usage was more closely connected to traffic density than to the number of word matching processes run. Table 3 shows the correlation matrix between the number of word matching jobs run, the number of messages received and the average memory used over time, over all 53 word matching nodes. In this implementation it is therefore likely that memory limitations will be reached because of communication traffic buffering rather than word matching process requirements.

As a consequence of memory being mostly used for buffering messages, it was observed that 84% of memory requests were under 32 bytes in size; hence these requests were allocated memory from the buffer pool. The maximum amount of heap used on any one processor was 94 Kbytes (in addition to the 34 Kbytes of static memory used on each processor).

5. DISCUSSION

This paper has described the implementation of a real-time speech recognizer on a 66-node distributed-memory parallel computer. Although this implementation is still at a preliminary stage of development, it has demonstrated the importance of the following five points.

Asynchronous methodology. The T-rack, being a distributed-memory multiprocessor computer, must rely on the transputer links to pass messages. Synchronization of these messages can be a considerable overhead[1,13], so it is important to keep this


overhead to a minimum. An asynchronous methodology allows each processor to proceed independently, reducing the synchronization overhead.

Coarse-grained methodology. The transputer was originally designed with occam in mind, where the mapping of processes to processors is intended to be as automatic as possible. Although many processes can be run concurrently on a transputer in a time-sliced fashion, only one process is in fact run at any one instant in time, the transputer having only one CPU. Bearing this in mind, Blade Runner explicitly maps processes to processors using a coarse-grained methodology, the goal being to run a single process on each processor 100% of the time. In addition, a coarse-grained methodology also helps to reduce inter-nodal communications, by storing data locally on each node.

Dynamic load balancing. Speech is affected by many variables, e.g. background noise and speaker variability; therefore, the computational demands of a real-time speech recognition system are difficult to predict. Since the implementation is asynchronous, race conditions will exist between word matching processes, so, even if all external conditions were the same, an identical repeat run would be most improbable. The problem of load balancing was therefore solved by using a dynamic load balancing scheme which reacted to the varying computational load conditions as they developed.

Distributed control. If a central point of control were used, synchronization costs to the single central controlling process would be high, and a communication bottleneck would be likely to develop. Distributed control was therefore used, each word matching node deciding to accept a word matching job request locally.

The cost of such an implementation is that the information needed to make these control decisions has to be distributed to all 53 word matching processors. In Blade Runner's case this information was the best score for every phonetic token; indeed, 88% (by number) of all messages passing between the 53 word matching processors were best score messages. In addition, a distributed termination detection scheme is required; however, this introduces a relatively small overhead, birth and death messages contributing just 2% to the total number of messages.

Hardware debugging tools. The T-rack's backplane bus was invaluable to the implementation of Blade Runner. Although such a shared bus is not scalable, provided that it is used as a debugging aid, and not for application communications, scalability is maintained.

Although Blade Runner works, two problems remain: the distribution of the grammar and word models, and how to ensure that deadlock does not occur because of insufficient memory on a processor node.

Scalability is important because useful speech recognizers will need to accommodate large vocabularies (of perhaps 2000 words). Currently the grammar and word models are stored as a separate copy on each processor. This is not scalable, since it would be prohibitively expensive if the vocabulary were large; hence some way must be found to distribute this data from a central store, or from a limited number of distributed stores. Unfortunately, inter-processor communication is already a limiting factor. Changing from the current static local storage of word models and grammar data to a scheme where


that data is also distributed may therefore substantially increase the response time of Blade Runner. One possible solution would be to distribute word models following the grammar-generated topology, so that those word models that are more likely to be needed on a processor node are also more local.

The vocabulary used in this prototype recognizer is small, and in these tests the memory limit of 1 Mbyte per processor was not reached; therefore no method was needed to guarantee the avoidance of deadlock due to insufficient memory. However, as memory demands increase with larger vocabularies, such a deadlock avoidance scheme will be required. Techniques which require the acquiring and releasing of locks could be used[18,12], but this would also significantly reduce Blade Runner's performance.

6. CONCLUSION

Blade Runner's performance is acceptable, with the recognition of a spoken utterance running in near real-time. Although a distributed processor architecture of this kind does form a scalable solution, experiments with more processors are needed to determine the degree of scalability.

ACKNOWLEDGEMENTS

One of the authors (Mike Chong) was supported by a personal grant from the UK Science and Engineering Research Council whilst this work was carried out. The ParSiFal project was funded by Alvey grant number GRDD882. The authors would like to thank the academic and industrial partners that took part in the ParSiFal project, and in particular the team from Manchester University, without whom Blade Runner would not have been possible. Finally, the authors would like to thank the referees of this paper for their helpful and valuable comments.

REFERENCES

1. T. S. Axelrod, 'Effects of synchronisation barriers on multiprocessor performance', Parallel Computing, 3, 129-140 (1986).
2. K. Bailey, Logical Systems Transputer Toolset, Logical Systems, Corvallis, Oregon, 1989.
3. P. C. Bentley, Tracing Facilities for the T-rack, Logica Cambridge Limited, 1988.
4. R. Bisiani, T. Anantharaman and L. Butcher, 'BEAM: an accelerator for speech recognition', in ICASSP 89, 782-784, IEEE, May 1989.
5. P. Brinch Hansen, Operating Systems Principles, Prentice-Hall, Englewood Cliffs, NJ, 1973.
6. S. Chatterjee and P. Agrawal, 'Connected speech recognition on a multiple processor pipeline', in ICASSP 89, 774-777, IEEE, May 1989.
7. M. W. H. Chong, 'Subword units and parallel processing for automatic speech recognition', Ph.D. thesis, Cambridge University Engineering Department, Cambridge, UK, 1990.
8. M. W. H. Chong and F. Fallside, 'Classification and regression tree neural networks for automatic speech recognition', in Proc. INNS-90, 187-190, Paris, July 1990, IEEE.
9. N. T. Condick and D. T. Chalmers, 'A transputer based speech recognition system', in ICASSP 89, 797-800, IEEE, May 1989.
10. E. W. Dijkstra and C. S. Scholten, 'Termination detection for diffusing computations', Inf. Proc. Letters, 11, 1-4 (1980).
11. G. C. Fox, 'Parallel computing comes of age: supercomputer level parallel computations at CalTech', Concurrency: Practice and Experience, 1, 63-103 (1989).
12. D. Gelernter, 'A DAG-based algorithm for prevention of store-and-forward deadlock in packet networks', IEEE Trans. Computers, C-30, 709-715 (1981).
13. A. Greenbaum, 'Synchronization costs on multiprocessors', Parallel Computing, 10, 3-14 (1989).
14. C. A. R. Hoare, 'Communicating sequential processes', Commun. ACM, 21, 666-677 (1978).
15. M. S. Illiev and A. E. Knowles, 'Run-time program debugger for the MU T-rack', Technical Report PSF/MU/WP5/97/9, Manchester University, Manchester, UK, 1987.
16. INMOS Ltd., Bristol, UK, Transputer Development System, Prentice Hall, Englewood Cliffs, NJ, 1988.
17. S. C. Johnson, 'Yacc: yet another compiler compiler', Technical Report 32, Bell Laboratories, Murray Hill, NJ, 1975.
18. H. F. Korth, 'Deadlock freedom using edge locks', ACM Trans. Database Systems, 7, 632-652 (1982).
19. K. F. Lee, Automatic Speech Recognition: The Development of the SPHINX System, Kluwer Academic Publishers, Lancaster, UK, 1989.
20. R. P. Lippmann, 'An introduction to computing with neural nets', IEEE ASSP Magazine, April, 4-22 (1987).
21. B. T. Lowerre, 'The HARPY speech recognition system', Ph.D. thesis, Carnegie-Mellon University, Pittsburgh, PA, 1976.
22. J. Mariani, 'Recent advances in speech processing', in ICASSP 89, 429-440, IEEE, May 1989.
23. T. P. Marsland and F. Fallside, 'Backprof: a symbolic dynamic debugging and monitoring tool for the ParSiFal transputer rack', Technical report, Cambridge University Engineering Department, Cambridge, UK, 1989. In preparation.
24. S. Miki and K. Intoh, 'Speaker-independent isolated-word recognition LSI', in ICASSP 89, 793-796, IEEE, May 1989.
25. E. M. Mumolo and F. Pazienti, 'Large vocabulary isolated words recogniser on a cellular array processor', in ICASSP 89, 785-788, IEEE, May 1989.
26. H. Murveit, J. Mankoski, J. Rabaey, R. Brodersen, R. Schwartz and A. Santos, 'A large-vocabulary real-time continuous-speech recognition system', in ICASSP 89, 789-792, IEEE, May 1989.
27. D. Pountain and D. May, A Tutorial Introduction to Occam Programming, BSP Professional Books, Osney Mead, Oxford, 1987.
28. L. R. Rabiner and B.-H. Juang, 'An introduction to hidden Markov models', IEEE ASSP Magazine, January, 4-16 (1986).
29. D. B. Roe, A. L. Gorin and P. Ramesh, 'Incorporating syntax into the level-building algorithm', in ICASSP 89, 778-781, IEEE, May 1989.
30. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang, 'Phoneme recognition using time-delay neural networks', IEEE Trans. ASSP, 37, 328-339 (1989).
31. Sun Microsystems, External Data Representation: Sun Technical Notes, Revision A, May 1988.