Abstract
This thesis describes a computer program called Timewarp, which implements a model of
the processes used to segment and recognize melodic fragments while listening to a real-
time musical performance. The model, called the Listener, accepts as input a stream of MIDI
data. The output of the model is a representation of the performance data that includes the
segmentation of the music into melodic fragments and a listing of the recognized melodic
fragments.
The Listener model is divided into two discrete processes: the Segmenter and the
Recognizer. All of the processes operate in real-time and analyze the musical data as it is
performed. The Segmenter uses a preprocessor to provide the model with an interna1
representation of the performance and serves as a short-term perceptual memory that is used
by the other processes. The Segmenter parses individual voices in the performance into
musiwlly relevant fragments. The Recognizer uses the dynamic timewarp algorithm (DTW)
(Itakura, 1975), which was originally developed for time alignment and comparison of
speech and image patterns. The DTW enables the Recognizer to compare and wtegorize the
contour of a new melodic fragment with a collection of previously recognized melodies.
Good results were obtained in the segmentation and recognition of melodic fragments in
music ranging from Bach fugues to bebop jazz.
At the present time, computers cannot be expected to listen to music the way humans do.
But it is possible and instructive to build systems that attempt to deal with specific aspects
of musical intelligence. By becoming more aware of both the possibilities and the
limitations of such systems, one may learn how people listen to music while gradually
raising the musical quality and usefulness of computer music systems.
Résumé
This thesis presents the program "Timewarp," which models the processes used to
segment and recognize melodic fragments while listening to a piece of music. The
model receives a stream of MIDI data and returns a representation of the musical
performance that includes a list of the recognized melodic fragments.
The model divides into two distinct processes, segmentation ("Segmenter") and
recognition ("Recognizer"). These processes operate in real time, analyzing the musical
data as it is received. The segmentation process uses a preprocessing module that
provides the model with a representation of the performance and also serves as a
short-term perceptual memory used by the other processes. It thus extracts the musically
relevant fragments from the individual voices. The recognition process employs the
"dynamic timewarp" algorithm, which was originally developed for the time alignment
and comparison of patterns in speech and images. The algorithm allows the recognition
module to categorize the contour of a new melodic fragment by comparing it to a
catalogue of previously recognized melodies. Good results were obtained in the
segmentation and recognition of melodic fragments in music ranging from Bach to
bebop jazz.
Computers cannot at present "listen" to music the way humans do. It is nevertheless
possible and instructive to build systems that address specific aspects of musical
intelligence. By becoming aware of the possibilities and the limits of such systems, we
can learn how musical listening works while gradually improving the musical quality
and usefulness of computer music systems.
Acknowledgements
I would like first to thank my advisor, Dr. Bruce Pennycook, for his support and supervision
during this project. I would also like to thank the following people who have helped me over
the years: Dr. Robert Rowe, Dr. Gilbert Soulodre, Dr. W. Andrew Schloss, Dr. Ichiro
Fujinaga, Sean Terriah, Sean Ferguson, Anne Holloway, and Jason Vantomme. A special
thanks goes to René Quesnel for translating my abstract into French, and Geoff Mitchell for
performing the musical examples.
I would like to thank my employer, RealNetworks, Inc., for granting me the time off so that
I could complete this dissertation.
Finally, I would like to thank my wife, Kimm Brockett Stammen, for her love and support.
She provided me with the inspiration and confidence to complete this dissertation.
This research was, in part, supported by a research grant from the Social Sciences and
Humanities Research Council of Canada.
Table of Contents
Abstract .............................................................................................. 2
Résumé ............................................................................................... 3
Acknowledgements ............................................................................ 4
Table of Contents ............................................................................... 5
1. Introduction .................................................................................... 8
   1.1 Introduction ............................................................................... 8
   1.2 Program Overview ..................................................................... 9
   1.3 Problem Specification ................................................................ 10
   1.4 Implementations of the Listener Model ..................................... 15
   1.5 Dissertation Roadmap ................................................................ 18
2. The Computer as Listener .............................................................. 19
   2.1 The Birth of the Computer Musician ......................................... 19
   2.2 Lejaren Hiller: The Computer as Composer .............................. 20
   2.3 Stanley Gill: The Self-Correcting Computer Musician .............. 23
   2.4 R. S. Ledley: Criticism and the Need for a Grammar ................ 24
   2.5 G. M. Koenig: The First Artificial Intelligence Computer Musician ... 25
   2.6 Noam Chomsky: The Structured Computer Musician ............... 25
   2.7 Otto Laske: The Listening Computer Musician ......................... 26
   2.8 Marvin Minsky: "Why do we like Music?" ............................... 30
   2.9 Current Implementations of the Computer Musician ................. 35
       2.9.1 Roger Dannenberg: The Computer Musician as Performer ... 35
       2.9.2 David Rosenthal: The Computer Musician as a Listener ....... 36
       2.9.3 Robert Rowe: The Computer Musician as Listener, Composer,
             Performer and Critic ............................................................ 37
   2.10 How do We Recognize and Remember Melodies? .................. 38
   2.11 Conclusion ............................................................................... 41
3. The Segmenter ................................................................................ 43
   3.1 Introduction ............................................................................... 43
   3.2 Grouping Preference Rules ........................................................ 46
       3.2.1 GPR 1 Avoid Small Groups ............................................... 47
       3.2.2 GPR 2 Proximity Rules ...................................................... 47
       3.2.3 GPR 3 Change Rules ......................................................... 48
   3.3 Application of GPR to a Live Performance ................................ 50
   3.4 Event Memory ........................................................................... 58
   3.5 Determining Rhythmic Types .................................................... 59
   3.6 Segmentation Rule Evaluation ................................................... 63
   3.7 RT Segmenter Rules .................................................................. 63
   3.8 N4 Rules .................................................................................... 65
   3.9 N3 Rules .................................................................................... 67
   3.10 Handling Incorrect Segmentation Choices ............................... 67
   3.11 Segment Length ....................................................................... 70
   3.12 Display of Event Data and Segmenter Information .................. 70
   3.13 Tenor Madness: A Real-time Example .................................... 72
   3.14 Summary ................................................................................. 85
       3.14.1 Limitations and Future Work ........................................... 85
4. The Recognizer ............................................................................... 87
   4.1 Introduction ............................................................................... 87
   4.2 Use of Melodic Contour ............................................................ 88
   4.3 Applications of Dynamic Programming to Music ...................... 89
       4.3.1 Other Melodic Fragment Recognition Systems .................. 90
   4.4 Implementation of the Dynamic Timewarp Algorithm ............... 91
   4.5 Real-time Listening Example ..................................................... 97
   4.6 Summary ................................................................................... 112
5. Real-time Segmentation and Recognition of Vertical Sonorities .... 113
   5.1 Introduction ............................................................................... 113
   5.2 Real-time Chord Recognition .................................................... 113
   5.3 Revision of Parncutt's Model .................................................... 117
   5.4 Real-time Example .................................................................... 121
   5.5 Summary ................................................................................... 125
6. Conclusions .................................................................................... 126
Appendix A. The Event Record .......................................................... 128
References .......................................................................................... 133
1. Introduction
1.1 Introduction
In 1981, Marvin Minsky asked the simple question, "Why do we like certain tunes?"
(Minsky, 1981). In response to his own question, Minsky offered two possible answers: we
like melodies because they have certain structural features or we like them because they
resemble other tunes we like. The first answer has to do with the laws and rules that make
tunes pleasant. The second answer forces us to look not at the tune itself, but at ourselves
and how we perceive music. It causes us to look at how we actually listen to music.
However, Minsky's use of the word resemble forces us to ask another seemingly simple
question: How do we know if two tunes resemble each other? As we shall see, this simple
question will prove to be difficult to answer.
During a live musical performance, the sound that is generated by an ensemble travels
through the air to a listener's ears and stimulates various parts of her brain. She somehow
manages to organize this stream of sound into the notes played by each individual
instrument. She may then group the notes into motives and the motives into phrases. She
may then compare these melodic fragments with other fragments that she has previously
heard. As she listens, she becomes familiar with the music and may even think that she has
heard the piece before. After the performance, the listener may even be able to hum a few of
the melodies from the performance. If she were to hear the piece in a future performance,
chances are she would quickly recognize that she had heard it before. But what are the
perceptual and cognitive processes that enable her to do all of this with little or no effort on
her part?
How does the listener segment the music into melodic fragments and then recognize if she
has previously heard any of these fragments? Traditional musicology, through the analysis
of musical scores and historical documents, cannot help us find the answer to this question
(Laske, 1989). During the past few decades, researchers of musical perception have studied
how listeners recognize and remember melodies. The fields of cognitive musicology and
music perception now offer theories on the recognition process. But how can we test if
these theories are valid?
1.2 Program Overview
This thesis describes a computer program called Timewarp, which implements a model of
the real-time processes used to segment and recognize melodic fragments while listening to
a musical performance. The model, called the Listener, accepts as input a stream of MIDI
data that represents the sound produced by the musical performance. MIDI, or Musical
Instrument Digital Interface, is a standard way of recording performance information from
electronic instruments. A MIDI representation of a performance is called a sequence and
can be stored in a standard MIDI file (SMF). The MIDI recording contains basic
performance information including the pitch of each note (MIDI note number), the onset
and release times for each note, and how loudly the note was played (MIDI velocity). This
information contains enough detail to accurately reproduce a musical performance.
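The note-level content of such a MIDI recording can be summarized by a simple record.
The following sketch is illustrative only; the field names are hypothetical, and the actual
Event record used by Timewarp is described in Appendix A.

    from dataclasses import dataclass

    @dataclass
    class NoteEvent:
        """One note of a MIDI performance (illustrative; field names are hypothetical)."""
        pitch: int       # MIDI note number, 0-127 (60 = middle C)
        velocity: int    # key velocity, 0-127; a rough measure of loudness
        onset_ms: int    # time the key was pressed, in milliseconds
        release_ms: int  # time the key was released, in milliseconds

        @property
        def duration_ms(self) -> int:
            # Performed duration: release time minus onset time.
            return self.release_ms - self.onset_ms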
The output of the model is a graphical presentation of the performance that includes the
segmentation of the music into melodic fragments and the recognized melodic fragments.
The Timewarp application is not a complete melodic recognition system as it does not
attempt to recognize entire melodies or works of music. Instead the system looks at the
problem of segmenting a stream of music data into musically significant melodic fragments
of 4 to 16 notes in length.
Figure 1.1: The Listener Model
The Listener model is divided into two discrete processes, the Segmenter and the
Recognizer. These processes are shown in Figure 1.1. All of the processes operate in real-
time and analyze the musical data as it is performed. The Segmenter contains a preprocessor
that provides the model with an internal representation of the performance and serves as a
short-term perceptual memory that is used by the other processes. The Segmenter parses
individual voices in the performance into musically relevant fragments. The Recognizer uses
the dynamic timewarp algorithm (DTW) (Itakura, 1975), which was originally developed for
time alignment and comparison of speech and image patterns. The DTW enables the
Recognizer to compare and categorize the contour of a new melodic fragment with a
collection of previously recognized fragments.
1.3 Problem Specification
The process of listening to a stream of music may be broken down into a set of simpler
tasks. Each of these tasks proves to be a complex problem for a computer. As we shall see,
programming a computer to perform the simple task of listening to a monophonic melody is
anything but simple.
1.3.1 Source Separation
When listening to the performance of a musical ensemble, a human listener is able to
identify the various instruments in the ensemble or, in the case of a Bach fugue, the
individual voices of the fugue. This task is called source separation and it is a
crucial step in listening to music. Source separation allows the listener to connect the
individual notes in the music to each other, thereby enabling the listener to hear the
individual parts of the musical score.
Source separation, especially from acoustic or digital sources, has proven to be an extremely
difficult task for computers. Several researchers have devoted entire dissertations and books
to the subject. Good examples of this work are found in Schloss (1985), Bregman (1990)
and Ellis (1992). For the purposes of this research, the Listener model sidesteps the
complexities of source separation by accepting MIDI as its input. MIDI allows each
instrument or voice in the ensemble to be assigned a specific MIDI channel. A Listener is
then allocated to listen to each channel, thereby ensuring that each Listener receives a
monophonic input representing a single voice in the ensemble.
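A minimal sketch of this channel-based source separation follows, assuming a hypothetical
Listener class and a router that receives (channel, event) pairs; in the actual implementation,
one melshape object is simply attached to each MIDI channel in Max (see Section 1.4).

    # Illustrative sketch: route each MIDI channel to its own Listener so that
    # every Listener receives a monophonic stream (all names are hypothetical).
    class Listener:
        def __init__(self, channel: int):
            self.channel = channel

        def handle(self, note_event) -> None:
            ...  # segmentation and recognition of this voice happen here

    listeners: dict[int, Listener] = {}

    def route(channel: int, note_event) -> None:
        if channel not in listeners:
            listeners[channel] = Listener(channel)  # allocate a Listener per channel
        listeners[channel].handle(note_event)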
1.3.2 Segmentation
One of the most important tasks in the Listener model is the segmentation of the musical
data into melodic fragments. These fragments must make sense musically in that they must
be aligned with motivic or phrase boundaries. The location of the end of a melodic fragment
is called a group boundary for it denotes the end of one fragment and the start of the next
fragment. An example of a group boundary is shown in Figure 1.2. This example uses four
notes, N1, N2, N3 and N4, to locate the group boundary at N2. N2 is perceived as a group
boundary due to the length of the half note in relation to the surrounding quarter notes.
Figure 1.2: Detecting a group boundary
Figure 1.3 shows the segmentation of a short musical example. In this fugue subject there
are four melodic fragments. The Segmenter must locate these group boundaries in real-time
and pass the fragments on to the Recognizer. The Recognizer is then responsible for
recognizing that fragments 1, 2 and 3 are similar. Chapter 3 will discuss the operation of the
Segmenter in greater detail.
Figure 1.3: Segmentation into melodic fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
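The rule illustrated in Figure 1.2 can be stated as a simple test over a four-note window.
The sketch below is a crude illustration only; the 1.5 ratio is an invented threshold, and the
Segmenter's actual rules are developed in Chapter 3.

    def is_group_boundary(n1: float, n2: float, n3: float, n4: float,
                          ratio: float = 1.5) -> bool:
        # Flag a boundary after the second note (N2) when its duration is
        # markedly longer than those of its neighbours, as with the half note
        # among quarter notes in Figure 1.2. The 1.5 ratio is an illustrative
        # assumption, not the Segmenter's actual rule.
        return n2 > ratio * max(n1, n3, n4)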
1.3.3 Quantization
Before the Segmenter can determine the location of group boundaries, it must examine
certain characteristics of every note. The duration of each note is an especially important
feature. The Segmenter examines individual note durations to determine the location of a
group boundary. The model would therefore prefer that all quarter notes were performed
with the exact same duration. For example, Figure 1.4 illustrates a precise performance of 5
quarter notes where each quarter note is 500 milliseconds (ms) in length. The Segmenter
would easily detect that all of these notes are quarter notes.
Figure 1.4: Exact performance (onsets at 0, 500, 1000, 1500 and 2000 ms)
However, performance of music, even by highly trained musicians, rarely provides for the
exact execution of durations. Such a performance would be perceived as robotic and
expressionless. Musicians vary the duration of each note in the performance for expressive
purposes and also due to the fact that there is an innate variability to our motor functions. A
"typical" performance of the quarter notes may be similar to the one shown in Figure 1.5. In
this example only the fourth quarter note is 500 ms in duration. The others vary from 450
ms to 570 ms in duration.
Figure 1.5: Typical performance (onsets at 0, 480, 1050, 1600 and 2100 ms)
This variability problem is exacerbated when accelerandos or ritardandos are performed. In
these situations, each individual quarter note is longer or shorter than the previous note. In
the example shown in Figure 1.6, the accelerando creates successive quarter notes that are
approximately 10% shorter in duration than the previous quarter note. Situations such as
these are difficult for computers to handle. The quarter notes must be quantized, or made
equivalent, before the Segmenter can properly examine them. Chapter 3 describes how the
use of a short-term memory in the Segmenter handles the variability of note duration as well
as other musical features.
Figure 1.6: Accelerando performance (onsets at 0, 500, 950, 1350 and 1700 ms)
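One simple way to make performed durations comparable is to express each one as a
multiple of a running beat estimate that adapts to the local tempo. The sketch below is
illustrative only; the exponential smoothing constant is an invented parameter, and
Chapter 3 describes the short-term memory the Segmenter actually uses.

    def classify_durations(durations_ms: list[float], alpha: float = 0.3) -> list[float]:
        # Express each performed duration as a multiple of a smoothed beat
        # estimate. Because the estimate tracks gradual tempo changes, the
        # successively shorter quarters of Figure 1.6 all remain closer to one
        # beat than to a half beat. Alpha is an illustrative assumption.
        beat = durations_ms[0]  # seed the beat estimate with the first duration
        ratios = []
        for d in durations_ms:
            ratios.append(d / beat)                # duration in beats
            beat = (1 - alpha) * beat + alpha * d  # adapt to the local tempo
        return ratios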
1.3.4 Recognition
Once the Segmenter has detected the location of a group boundary, the melodic fragment is
sent to the Recognizer. The Recognizer uses the dynamic timewarp algorithm (DTW),
which was originally developed for time alignment and comparison of speech and image
patterns. For example, if you were to hear the word ball spoken quickly, "ball" or slowly,
"baaaall", you would have no problem understanding the word even though the sonic
characteristics of the word have been temporally altered between enunciations. This same
situation exists in music performance. Even though the notes on the musical score are
precisely notated, their realization by a human performer will contain many variations from
the score due to musical interpretation and technical accuracy. Composers will also modify
and develop motives and phrases by varying their note structures. For example, the Bach
fugue subject presented in Figure 1.3 contains three iterations of the same motive where
each iteration has been slightly modified. The task for the Recognizer is to detect that these
three fragments are in fact musically similar.
As will be presented in Chapters 2 and 4, research into melodic perception suggests that
melodic contour is one of the most salient features used to remember and recognize melodic
sequences (Bartlett and Dowling, 1980). In order to model this behavior, the Recognizer
uses the DTW to match the melodic contour of a new melodic fragment with a set of
previously recognized contours.
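The core of this matching step is ordinary dynamic programming. The following is a
minimal sketch of a DTW comparison between two contour sequences, not the Recognizer's
implementation (Chapter 4 describes that); the absolute-difference local cost and the
symmetric step pattern are assumptions.

    def dtw_distance(a: list[float], b: list[float]) -> float:
        # Dynamic timewarp distance between two melodic contours, each given
        # as a sequence of pitch intervals. The algorithm finds the monotonic
        # alignment of the two sequences that minimizes the summed local cost,
        # so fragments that differ mainly in timing (stretched or compressed)
        # still compare as similar.
        inf = float("inf")
        n, m = len(a), len(b)
        d = [[inf] * (m + 1) for _ in range(n + 1)]  # d[i][j]: best cost of a[:i] vs b[:j]
        d[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                d[i][j] = cost + min(d[i - 1][j],      # a stretches
                                     d[i][j - 1],      # b stretches
                                     d[i - 1][j - 1])  # step together
        return d[n][m]

For example, dtw_distance([2, 2, -1, -2], [2, 2, 2, -1, -2]) evaluates to 0.0: the extra
repeated interval in the second contour is absorbed by a vertical step in the alignment,
exactly the behavior needed to match the quick and slow renditions of "ball" described
above.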
1.4 Implementations of the Listener Model
The Listener model has been implemented as a computer program called Timewarp. The
program runs both as a stand-alone Macintosh computer application called Timewarp and
as a Max (Puckette and Zicarelli, 1990) external object called melshape. The Max version of
Timewarp is shown in Figure 1.7. In this Max patch, there are three melshape objects, each
listening to a separate voice of the Bach fugue. The separate voices of the fugue have been
assigned to separate MIDI channels. The fugue is realized in real-time by the playSMF
object which plays Standard MIDI files.
The results of the listening session are shown in a display window (Figure 1.8). In this
window, the location of the group boundaries are labeled by the Segmenter. The recognized
melodic contours are drawn above each fragment. Using this display, it is possible to verify
the performance of the Listener model.
Figure 1.7: The Listener Model in the Max Environment
Figure 1.8: Timewarp application display window
1.5 Dissertation Roadmap
Chapter 2 presents previous work by various researchers that is relevant to this dissertation.
Chapter 3 discusses in detail the operation of the Segmenter component. Chapter 4 presents
the Recognizer component and describes its use of the DTW. Chapter 5 presents a chord
recognition component. While this component is not directly part of this thesis, it provides
a real-time process for the segmentation and recognition of vertical sonorities.
Concluding remarks are presented in Chapter 6. Appendix A describes the structure of the
Event record used by the Segmenter.
2. The Computer as Listener
"How do we know if two tunes resemble each other?" The concept of the Listener model
developed from the effort to answer this question. The Listener rnodel emulates the
processes used to segment and recognize melodic fragments while listening to a real-time
musical performance. The idea of a computer as a musical listener has developed over the
past sevenl decades as researchers in many diverse fields have tried to answer similar
musical questions. Can a computer be programmed to compose music? Or analyze a piece
of music? From our vantage point ai the end of the 2dh century, it is possible to look back
over the previous four decades and trace the growth of what 1 will cal1 a "computer
musician" (Le. a computer system that is able to compose, analyze or listen to music). In
this chapter, 1 will follow the development of the computer musician betweeti 1956 and the
present. More importantly, 1 will examine the computer's ability to model human perception
Q of music. The research presented has proven to be most intluential in the development of the
Listener model.
2.1 The Birth of the Computer Musician
The idea of the computer musician began with a very simple question. "What makes the
melodies of simple nursery tunes so appealing?" As we shall see, this seemingly simple
question is not easily answered. Richard Pinkerton asked this question in a 1956 Scientific
American article entitled "Information Theory and Melody" (Pinkerton, 1956). In asking
this question, Pinkerton was searching for a universal set of rules or features that could
define "appealing" melodies. But do these universal rules even exist?
Pinkerton chose to define music as a form of communication. By using this definition, it
was then possible to apply communication theory to music. Communication theory is based
on the concept of entropy, a numerical index of disorder. When there is a lot of uncertainty
and disorder, entropy is high. Likewise, when there is much symmetry or patterned
arrangement, the entropy is low. According to Pinkerton, a composer must make the entropy
of a melody low enough to give it some sort of pattern and at the same time high enough so
that it has sufficient complexity to be interesting. Maximum entropy would mean that all the
notes would have equal probability of being chosen. By applying information theory to
music, it would be possible to calculate the entropy or average information per note for
certain kinds of elementary melodies. This in turn would give an indication of the meaning
or information that could be expressed by such melodies. Pinkerton discovered that a certain
amount of redundancy or repetition is necessary in order to have tuneful melodies.
Pinkerton's article concluded that melody, rhythm and harmony could all be fitted into a
statistical scheme, and that it was therefore possible to build machines that could compose
music. A set of tables could be constructed that would "compose Mozartian melodies or
themes that would out-Shostakovich Shostakovich" (Pinkerton, 1956, p. 86).
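Pinkerton's "average information per note" is Shannon's entropy applied to the
probabilities of the notes. A worked illustration (the note distribution is invented for the
example):

    H = -\sum_{i} p_i \log_2 p_i

For a melody that uses C half the time, G a quarter of the time, and E and A an eighth of
the time each, H = 0.5(1) + 0.25(2) + 2 \times 0.125(3) = 1.75 bits per note, below the 2-bit
maximum that four equally probable notes would give. The redundancy Pinkerton found
necessary for tunefulness is precisely this gap between actual and maximum entropy.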
Pinkerton's ideas, while being overly optimistic, did in fact outline some of the early hopes
for a computer musician. From the very beginning, one of the first musical applications of
the computer was composition. But contrary to Pinkerton's claims, we still do not have
computers that can "out-Shostakovich Shostakovich". Moreover, Pinkerton's definition of
"meaning" as the degree of entropy of a melody really does not bring us any closer to
understanding why and how our minds like certain tunes.
2.2 Lejaren Hiller: The Computer as Composer
In 1959 another researcher asked the more ambitious question, "Can a computer compose a
symphony?" (Hiller, 1959). Lejaren Hiller believed it was possible based on the following
reasons:
1) Music is a sensible form governed by the laws of organization which permit fairly exact
codification. Therefore computer-produced music which is "meaningful" is possible as
far as the laws of music organization are codifiable.
2) Computers can be used to create a random universe in accordance with imposed rules,
musical or otherwise.
3) Since the process of creative composition may be thought of as an imposition of order
upon an infinite variety of possibilities, a fairly close approximation of the composing
process may be done with a computer.
Hiller, like Pinkerton, viewed music as "a compromise between chaos and monotony" and
as an "ordered disorder lying somewhere between complete randomness and complete
redundancy" (Hiller, 1959, p. 110). Hiller recognized that the appreciation of music involved
not only psychological needs and responses, but meanings imported into the musical
experience by reference to its cultural context. However, Hiller chose not to include this side
of music in his version of the computer musician. Instead, he looked at what he called the
objective side of music which he defined as existing in the score apart from the composer
and listener. The information encoded there relates to such quantitative entities as pitch and
time and "is therefore accessible to rational and ultimately mathematical analysis" (Hiller,
1959, p. 110). From this reasoning it is apparent that Hiller believed universal rules of
music composition existed, and that these rules could be separated from the "human" aspect
of music composition.
Hiller was ahead of his time in his consideration of the aesthetic aspects of music composed
by a computer. Hiller believed the aesthetic significance or value of a music composition
depended considerably upon its relationship to our inner mental and emotional transitions.
This relationship is largely perceived in music through the articulation of musical form or
the semantic content of music, and this in turn could best be understood in terms of the
technical problems of musical composition. Since the articulation of musical forms is the
primary problem faced by the composer, it seemed most logical to Hiller to start his
investigation of computer composition by attempting to restate the techniques used by
composers in terms both compatible with information theory and translatable into computer
programs utilizing sequential-choice operations as a basis for music generation. Using this
objective viewpoint, Hiller derived five basic principles involved in music composition
(Hiller and Isaacson, 1959):
1) The formation of a piece is an ordering process in which specified musical elements are
selected and arranged from an infinite variety of possibilities, i.e. from chaos.
2) Both order and chaos contribute to the musical structure.
3) The two most important dimensions of music on which a greater or lesser order can be
imposed are pitch and time.
4) Because music exists in time, memory and instantaneous perception are required in the
understanding of musical structures.
5) Tonality, a significant ordering concept, is considered the result of establishing pitch
order in terms of memory recall.
Hiller's process of generating computer music was divided into two basic operations. In the
first operation, the computer generated random sequences of integers which were equated to
the notes of the scale, rhythmic patterns, dynamics, etc. In the second operation, each
random number was screened through a series of arithmetic tests expressing various rules
of composition and was either used or rejected depending on which rules were in effect. If
accepted, the random integer was used to build up a "composition". If it was rejected, a new
integer was generated and examined. The process was repeated until a satisfactory note was
found or until it became evident that no such note existed, in which case the composition
thus far was erased to allow a fresh start. Using this method, the Illiac computer at the
University of Illinois composed the Illiac Suite for string quartet (Hiller and Isaacson,
1959).
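Hiller's two operations amount to a generate-and-test loop. The sketch below is
illustrative only; the placeholder rule stands in for the Illiac Suite's actual counterpoint
tests, and the pitch range and retry limit are invented parameters.

    import random

    def compose(length: int, rules, pitches=tuple(range(60, 73)), max_tries: int = 200):
        # Generate-and-test in the style Hiller describes: random candidates
        # are screened by rule functions; a rejected candidate triggers a new
        # random draw, and if no acceptable note turns up the piece so far is
        # erased for a fresh start, as in Hiller's procedure.
        piece: list[int] = []
        while len(piece) < length:
            for _ in range(max_tries):
                candidate = random.choice(pitches)
                if all(rule(piece, candidate) for rule in rules):
                    piece.append(candidate)
                    break
            else:              # no satisfactory note was found:
                piece.clear()  # erase the composition and start afresh
        return piece

    # Example placeholder rule: forbid melodic leaps larger than a fifth.
    no_big_leaps = lambda piece, note: not piece or abs(note - piece[-1]) <= 7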
2.3 Stanley Gill: The Self-Correcting Computer Musician
Stanley Gill (1964) pointed out that the main difficulty with the process of alternate random
generation and selection in computer composition was that the computer may lead itself into
a dead end. In other words, the computer musician would continue in one direction until the
rules that governed its selection process made it impossible for the composition process to
continue. This was a problem also acknowledged by Hiller. It was therefore desirable to
allow the computer to backtrack so it could re-examine alternative choices at an earlier point
in the composition. Gill used a technique that retained at any moment not one, but eight
competitive versions of the partial composition, each completely specified up to a certain
point, but not necessarily the same length. The generation process took one of these partial
compositions, or sequences, at random and extended it according to the compositional rules
and criteria. Its value was then compared with the other existing sequences. The weakest
sequence was then rejected and the whole process repeated. Each sequence was linked
backwards in time from the end to the beginning. At the end of the composition, the
sequence with the highest value was chosen. As we shall see, this concept of competing
versions is a central idea in Minsky's theories. Gill's ideas gave his computer musician a
limited ability to correct itself.
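In modern terms, Gill's scheme is a beam search over partial compositions. A minimal
sketch with a beam of eight, assuming hypothetical extend and value functions standing in
for Gill's compositional rules and evaluation criteria:

    import random

    BEAM_WIDTH = 8  # Gill retained eight competing partial compositions

    def gill_compose(length: int, extend, value):
        # extend(seq) returns seq with one rule-abiding note appended;
        # value(seq) scores a partial composition (including an empty one).
        # One partial composition is extended at random, then the weakest of
        # the resulting nine is discarded, so unpromising lines of development
        # die off instead of leading the composer into a dead end.
        beam = [[] for _ in range(BEAM_WIDTH)]
        while min(len(seq) for seq in beam) < length:
            seq = random.choice(beam)          # pick one partial composition
            beam.append(extend(list(seq)))     # extend a copy of it
            beam.remove(min(beam, key=value))  # reject the weakest sequence
        return max(beam, key=value)            # keep the highest-valued piece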
2.4 R. S. Ledley: Criticism and the Need for a Grammar
The question now arises whether composition of music by computer is really creativity. R.
S. Ledley briefly described the use of a computer in musical composition as part of his
discussion on programming a computer to achieve intelligence. In particular, he referred to
the use of computers for creative purposes. Creativity "produces structure out of disorder,
form out of chaos, but structure and form must meet aesthetic requirements as well"
(Ledley, 1962, p. 371). In Ledley's opinion there was a lack, in computer-generated music, of
some of the necessary ingredients of creativity including:
1) Over-all planning or direction was missing, leading to a sense of incompleteness.
2) The resulting music was digressive, lacking symmetry and the recursive building of
ideas.
Ledley identified two problems computers have in the composition process:
1) The problem of imparting form, direction and unity to a particular musical composition.
2) The problem of comprehending an over-all, across-the-board characterization of a style,
i.e. an abstraction that is recognizable in collections of compositions by a single
composer.
Ledley believed that the solutions to these problems would probably involve the notion of
syntactical concept formation, which would give the computer the ability to comprehend
musical abstractions and to use such abstractions as a guide to creativity (Ledley, 1962, p.
375). In essence, Ledley's criticism indicated that the current computer musician was unable
to generate satisfactory structures because it did not have full use of a grammar.
2.5 G. M. Koenig: The First Artificial Intelligence Computer Musician
Between 1965 and 1970, two programs written by G. M. Koenig, Project One and Project
Two (Koenig, 1970a; 1970b), represented a first step toward an artificial intelligence (AI)
view of a computer musician (Laske, 1981). These programs embodied a composition theory
about the processes used by composers to compose a musical work. These programs were
the first knowledge-based systems for composition in which a composer could define a
series of steps or rules that led from the overall design of a composition to the more detailed
specification of the musical surface. Koenig's programs offered the first computer-assisted
composition system for composers, where a composer guided the programs to the final
musical output by specifying a wide variety of parameters to the system. The human
composer guided the computer musician to generate music that had the potential to
overcome some of Ledley's criticisms. The need for a Listener, a process to listen to and
evaluate the output of the computer musician, is evident from Koenig's work.
2.6 Noam Chomsky: The Structured Computer Musician
In 1965, Noam Chomsky presented his concept of a genentive grammar. Chomsky defines
a generative language as a "system of rules that can iterate to generate an indefinitely large
number of structures" (Chomsky, 1965, pp. 15-16). Chomsky's system of rules for a
generative grammar was divided into three major components. These were called the
syntactical, phonological, and semantic components. According to Chomsky, the syntactical
component specified an infinite set of abstract formal objects, each of which incorponted al1
information relevant to a single interpretation of a particular sentence. The phonological
component of a grammar determined the phonetic form of a sentence generated by the
syntactic rules. Finally, the semantic component determined the semantic meaning of a
sentence. This semantic component related a structure generated by the syntactic component
to a certain semantic representdtion (Chomsky, 1965). The syntactic component of a
grammar must specify, for each sentence, a "deep structure that determines the syntactic
interpretation and a surface structure that determines it phonetic interpretation" (Chomsky,
1965, p.15-16). The deep structure would be interpreted by the semantic component and the
surface structure by the phonetic component.
Looking back at Ledley's criticisms in view of Chomsky's theory, it was a lack of deep
structure that was missing from the compositions generated by an information theory-based
system. The computer musician did have a syntactical component for generating sequences.
This syntax was the rules used in the selection of note values from the randomly generated
numbers. However, the computer was unable to interpret the deep structure of the sequences
it generated, and therefore had no way in which to evaluate the structures it composed. It
was totally incapable of any semantic interpretation. Chomsky's theories of generative
grammars proved to be quite influential on the later development of a computer musician.
2.7 Otto Laske: The Listening Computer Musician
Otto Laske regarded semantic processing as a matter of reconstruction. A "listener who
perceives a musical event may be said to understand that event if he is capable of specifying
how it may be reproduced" (Smoliar, 1976, p. 112). The problem of constructing semantic
structures may be regarded as decompilation. The acoustic signal, as perceived by the
listener, is the "lowest-level language" representation of musical information. The computer
musician would need to decompose the acoustic signal into a "machine-level" representation
of the information. A representation in terms of notes may be said to be a high-level
reconstruction of the machine-level information. This "higher level language must
incorporate some sort of model of musical perception, since reconstruction can only arise
from the listener's perceptual activities" (Smoliar, 1976, p. 113). The computer musician of
the 1970's could not decompose and reconstruct semantic meanings from its own
compositions.
Notice that in this discussion of a computer musician's ability to compose music, we are
suddenly talking about listening! In order for a computer musician to be able to compose or
perform like a human, it must be aware of perceptual processes. This need for a perceptual
model was expressed by several researchers in the early 1970's. For example, in 1971,
Barry Vercoe stated:
"We seem to be without a sufficiently well-defined 'theory' of
music that could provide that logically consistent set of
relationships between the elements which is necessary in
order to program, and thus specify, a meaningful substitute
for our own cognitive processes." (Vercoe, 1971, p. 324)
During the early years of the 1970's, Otto Laske formed many of his theories of human
cognition of music. Since 1970, Laske has adopted a procedural view of music, regarding it
as a set of cognitive tasks people are able to perform. According to Laske, theories of music
needed to acknowledge this task-dependency. Studies in computer-aided composition of
music would be relevant both for providing tools for creative activity and for understanding
such an activity. Laske (1978) discussed the parallels between computer music systems and
a model of human memory. From a cognitive point of view, one similarity is provided by the
concept of an information-processing system as a distributed memory system. Cognitive
psychology views the human mind as a set of distinct but procedurally-interrelated buffers.
The defining elements of these buffers are parameters like time-constant of storage, transfer
rate, access-time of a buffer, capacity in number of symbols, and others. These are the same
attributes that define computer memories. According to Laske, intelligent computer music
systems needed to be based on an understanding of this parallelism.
Laske described a model for human memory as a chain of submemories consisting of
echoic, perceptual, short-term, working, contextual and long-term memories. Human
memory primarily functions by storing the temporal structure of occurring events and
constructing successive internal representations of these events. Sonic event data enters our
minds and is first stored in our echoic memory, a temporary memory buffer that contains
the most recent sounds that we have heard. The contents of the echoic memory are not
perceived by the listener until they have been stored in the perceptual memory. This
perceptual memory is capable of storing up to several seconds of event data. The transfer of
event data from the perceptual memory to the short-term memory is called the conscious-
time. During this time the listener may become aware of certain perceptual configurations
such as pitch, timbre, duration, meter and loudness. It is at this level of perception where we
perceive the musical present. Any information pushed out of this memory will be lost if it
has not been transferred to long-term memory.
Laske theorizes that there is also a higher cognitive level which he labels the "interpretive-
time". It is at this level "where the illusion of lasting time is created by memory through
interpretations of events on a high level of abstraction" (Laske, 1978, p. 40). These high-
level events will be stored in long-term memory. It is our long-term memory that allows us
to remember the musical past.
According to Laske, our working memory handles all of the interactions between our
perceptual, short-term and contextual memories. The contents of the working memory are
the musical present of which we are conscious. Our musical past may be divided into two
types, the cultural past and the immediate past. The immediate past is often referred to as
our "musical context", which may be considered to be a semantic model of the current
auditory world of a listener. This musical context is thought to be stored in a portion of our
long-term memory called the contextual memory. This contextual memory is the currently
active portion of our long-term memory. Laske believes that this contextual memory can
function in either a syntactical or semantic mode.
The syntactic mode is able to define structural representations of music, thus making it
possible to distinguish different levels of musical structure. Music syntactic networks are
comparable to tree representations of the hierarchy of the structural levels of music.
Semantic concepts are the listener's interpretation of the musical structure and structural
levels. They are bound to a music's past, both in its music tradition and the immediate past.
A semantic network may be thought of as a linked-list of musical interpretations. A listener
can switch between these two modes while listening to a musical performance.
In a later article Laske (1980) developed a cognitive theory of the music listening process.
According to Laske, music understanding occurs through the mapping of musical structures
into memory, where musical pasts are stored. The "perception of music is made possible by
concepts in memory that represent musical pasts that act as precedents to which new sonic
events may be matched" (Laske, 1980, p. 75). The listener therefore generates a sequence of
current pasts of the music. A musical experience is "the total of all current pasts a listener
has construed during a listening session" (Laske, 1980, p. 77).
In Laske's view of the listening process, the listener makes an initial musical interpretation.
As the music progresses, the listener maintains this interpretation as long as it continues to
hold true. However, there will gradually emerge another possible interpretation as a result of
a newly arrived set of perceptual features. The listener must gradually unlearn the old
interpretation while acquiring the new one. Listening is therefore a continuous process of
learning and unlearning in which previous interpretations are replaced by succeeding ones
as demanded by the new perceptual findings.
It is important to note that by the mid-1970's Laske had formulated a cohesive theory of
music cognition. Laske's theories pointed out what was wrong with the information theory
model of human perception and offered many potential solutions. Laske realized that a
computer musician needed to incorporate a working model of perceptual processes. A
computer musician must be able to process information in a manner similar to the processes
of the human mind. Laske's ideas offered computer music researchers a theoretical model
that could have helped them create a computer musician able to more closely model human
musical perception and understanding. But it appears that Laske's theory had little influence
on a computer musician outside the realm of research into linguistic models of music
perception (Roads, 1979). To a large extent, Laske's theory remained a theory without an
actual working model. Laske's theories did have a large influence on the development of the
Listener model presented in this thesis. However, the theories of Marvin Minsky appear to
have been a stronger influence on the current versions of the computer musician.
2.8 Marvin Minsky: "Why do we like Music?"
As mentioned in Chapter 1, Marvin Minsky asked the question "Why do we like Music?"
Minsky believed that one of the problems with music theory is that it was afraid to ask
questions such as these. Music theory was not just about music, but how people processed
it: "To understand any art, we must look below its surface into the psychological detail of its
creation and absorption" (Minsky, 1981, p. 29). Music theory was unable to help a computer
musician understand music, for it had become stuck trying to find universal truths. This was
the same problem information theorists such as Pinkerton and Hiller had, for they had
believed it was possible to reduce music to a collection of rules that would enable a
computer musician to compose as well as humans. Minsky felt that we cannot find any such
universal laws of thought.
"Both memory and thinking interact and grow together. We
do not just leam things, we leam ways to think about things;
then we leam to think about thinking itself. Before long, our
ways of thinking become so complicated that we cannot
expect to understand their details in terms of their surface
operation, but we might undentand the principles that guide
their growth." (Minsky, 1981, p.28)
Music is recognized and understood by a listener because the music engages the previously
acquired knowledge of the listener. But after a listening session, most of the listener's
memories of the music fade away. However, if the listener were to hear the same music
again, he or she would recognize it almost immediately. Minsky believes that something
must remain in the mind to cause this and suggests that perhaps what we learn is not the
music itself, but a way of listening to it.
Before attempting to answer the question "Why do we like music?" Minsky asked several
other questions. One such question is "What is the difference between merely knowing (or
remembering, or memorizing) and understanding?" To understand something we must
know what it means. However, an idea seems meaningful only when we have several
different ways to represent it. We must have several different perspectives and associations
for an idea. Understanding is therefore the process of looking at an idea from many
different perspectives. Something has "meaning" only if we are able to examine it in several
different ways. Minsky theorized that this is why those who seek "real meanings" never
find them.
Minsky also asked the question, "Why do we like certain tunes?" This question is
essentially the same as the one Pinkerton asked in 1956. Minsky offered two possible
answers. We like certain tunes because they have certain structural features or we like
certain tunes because they resemble other tunes we like. The first answer has to do with the
laws and rules that make tunes pleasant, which were what Pinkerton, Hiller, etc. were
searching for with information theory. However, this answer implies the existence of a
universal set of essential features which Minsky feels are impossible to discover. The
second answer forces us to look not at the tune itself, but at ourselves and how we perceive
tunes. However, the verb 'resemble' forces us to define the rules of musical resemblance.
How do we know if two tunes resemble each other? These rules are dependent upon how
melodies are represented in each individual's mind. In Laske's theories, representations of
these melodies were stored in a long-term memory. In Minsky's theory we store melodies in
a "society of agents".
According to Minsky, our minds consist of a network or "society of agents". Minsky
defines an agent as "any part or process of the mind that by itself is simple enough to
understand, even though the interactions among groups of such agents may produce
phenomena which are much harder to understand" (Minsky, 1986, p.326). Each agent
knows what happens to some of the others, but little of what happens to the rest. Thinking
consists of making these mind-agents work together. Productive thought is the process of
breaking problems into different parts and then assigning these parts to the agents that
handle them best. When one listens to music, various facts of what is heard activate various
agents. These agents are connected in various ways, which affects how the listener
processes the music.
In Minsky's view of memory, cognitive processes and memory structures are not separated
from each other. This differs from Laske's theory of many separate buffers and a working
memory that interprets the music to achieve understanding of the structure. According to
Minsky:
"We often speak of memory ris though the things we know
were stored away in boxes of the mind, like the objects we
keep in the closets of our homes. But this raises many
questions. How is knowledge represented? How is it stored?
How is it retrieved? How is it used? ...[ Our theory of
memory] tries to answer al1 these questions at once by
suggesting that we keep each tl~ing rtu Iearn close 10 the
agents tl~ar learn in tlrefirstplace." (Minsky, 1981, p.28)
Our knowledge, stored in this manner, becomes easy to reach and easy to use. This theory
is based on the idea of a type of agent called a "Knowledge-line" or "K-line". Whenever
one gets a good idea, solves a problem, or wants to remember something, he or she activates
a K-line to represent it. A K-line is a "wirelike structure that attaches itself to whatever
mental agents are active when you solve a problem or have a good idea" (Minsky, 1986,
p. 82). Later, when one activates this K-line, the agents attached to it are also activated. This
puts the person into a "mental state" much like the state the mind was in when it received the
original input. This makes it possible for us to solve new, but similar problems or recognize
similar situations. This is how we are able to remember music and know if one musical
fragment resembles another. Two similar fragments of music will have a similar effect on
the mind, because they will tend to activate the same agents. When one hears a familiar piece
of music, one's mental state (i.e. the set of currently active agents or K-lines) will be similar
to the mental state induced by a previous hearing.
Each perceptual experience activates a structure called a frame. A frame is "a representation
based on a set of terminals to which other structures may be attached. Normally, each
terminal is connected to a default assumption which is easily displaced by more specific
information" (Minsky, 1986, p. 328). Minsky described a frame as being like an application
form with many blanks or slots to be filled. These blanks are the terminals referred to by the
above definition. Terminals are used as connection points to which we can attach other
kinds of information. Any type of agent can be attached to a frame-terminal, including K-
lines. The mind remembers millions of frames, each representing a stereotyped previous
experience.
Using these agents, K-lines and frames, Minsky described a theory of music listening.
When listening to music we only have access to the musical present, the notes currently
being played. We use the rhythm as "synchronization pulses" to match new phrases against
older ones. These phrases are examined for difference and change. As differences and
changes are sensed, the rhythmic frames fade from our awareness. This process of
matching allows our minds to "see" things, from different times, together (Minsky, 1981).
The concept of K-lines and frames proved to be quite influential in the development of the
DTW-based melodic fragment recognition system. As presented in Chapter 4, the
Recognizer matches an unknown melodic fragment with a collection of known melodic
fragments referred to as templates. The templates are a set of features used to represent a
fragment. The DTW algorithm used to compare fragments provides a distance measure that
indicates the degree of similarity between two fragments. This allows a fragment to be
considered similar to a variety of templates. As fragments are added to the system, they are
clustered into areas of similar fragments. This clustering allows new fragments to be
recognized as similar to entire sets of recognized fragments. The musical present
represented by a new melodic fragment is quickly matched with a large collection of
fragments representing the Listener model's musical past.
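A minimal sketch of this nearest-template recognition follows, reusing the dtw_distance
function from the sketch in Section 1.3.4. It is not the Recognizer's implementation: the
distance threshold is an invented parameter, and Chapter 4 describes the actual clustering.

    class FragmentMemory:
        # Illustrative nearest-template store. A new contour is matched against
        # every stored template; if the best DTW distance falls within the
        # threshold it joins that template's cluster (the musical present is
        # matched to a stored musical past), otherwise it seeds a new cluster.
        def __init__(self, threshold: float = 4.0):
            self.threshold = threshold
            self.clusters: list[list[list[float]]] = []  # each cluster holds similar contours

        def recognize(self, contour: list[float]) -> int:
            best, best_dist = None, float("inf")
            for idx, cluster in enumerate(self.clusters):
                for template in cluster:
                    dist = dtw_distance(contour, template)
                    if dist < best_dist:
                        best, best_dist = idx, dist
            if best is not None and best_dist <= self.threshold:
                self.clusters[best].append(contour)  # similar: join the cluster
                return best
            self.clusters.append([contour])          # novel: start a new cluster
            return len(self.clusters) - 1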
2.9 Current Implementations of the Computer Musician
Minsky's ideas have also had an enormous influence on recent versions of the
computer musician. I will now discuss these influences on the various versions of the
computer musician as created by Dannenberg, Rosenthal, and Rowe.
2.9.1 Roger Dannenberg: The Computer Musician as Performer
Roger Dannenberg's (1984) version of the computer musician created a real-time
accompanist for live performers. The computer was given a score containing parts for the
soloist and for the corresponding accompaniment and was assigned the task of performing
the accompaniment in synchronization with the live performer, just as a human accompanist
would do. This approach required the computer to have the ability to follow the soloist.
Dannenberg identified three problems the computer had in fulfilling its role of an
accompanist:
1) Detecting and processing input from the performer
2) Matching this input against a score of expected input
3) Generating the timing information necessary to control the generation of the
accompaniment.
Dannenberg expected that the live solo performance would contain mistakes or be
imperfectly detected by the computer, so it was necessary to allow for these performance
errors when matching the actual solo against the score. The normal variations in tempo
resulting from human musical expression could also easily confuse the computer
accompanist. He presented an efficient dynamic programming algorithm for finding the best
match between the solo performance and the score (Dannenberg, 1984). In
producing the accompaniment, it was also necessary to generate a time reference that varied
in speed according to the soloist. The computer's concept of time was derived from
differences between the arrival times of performed events and the expected times indicated
by the score.
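The kind of dynamic programming match Dannenberg describes can be sketched as a
longest-common-subsequence alignment between performed pitches and score pitches. The
scoring below is an assumption for illustration, not Dannenberg's published algorithm.

    def best_match(performance: list[int], score: list[int]) -> int:
        # Align performed pitches against score pitches while tolerating
        # errors: wrong, missing and extra notes are simply skipped, and the
        # alignment that maximizes the number of matched notes wins.
        n, m = len(performance), len(score)
        best = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if performance[i - 1] == score[j - 1]:
                    best[i][j] = best[i - 1][j - 1] + 1   # notes match
                else:
                    best[i][j] = max(best[i - 1][j],      # extra performed note
                                     best[i][j - 1])      # skipped score note
        return best[n][m]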
With his computer accompanist, Dannenberg created a computer musician that was able to
match the musical present (solo performance) with a musical past (the stored version of the
score). However, the algorithms for his program appeared to have been influenced more by
computer science than cognitive theory. His accompanist had no higher-level understanding
of its role as an accompanist, and was unaware of the musical form, style, etc. of the piece it
was performing. In fact, the computer would become easily lost if the performer made too
many errors. In reality, this model was not much different from those created by the
information theorists. Dannenberg did address a few of these shortcomings in a later
version of his accompanist (Dannenberg and Mont-Reynaud, 1987; Dannenberg, 1989;
Grubb and Dannenberg, 1994).
2.9.2 David Rosenthal: The Computer Musician as a Listener
David Rosenthal (1989) of M.I.T. presented a computer model of the process of listening to
simple rhythm. The model consisted of:
1) A way of dividing the rhythm into appropriate chunks
2) A means of constructing recognizers for the chunks
3) An organization of the recognizers into a hierarchical structure.
This model derived its conception of the mind's workings from Marvin Minsky's The Society of Mind (Minsky, 1986). Specifically, Rosenthal attempted to create a society of agents capable of recognizing simple rhythmic patterns. The program listened to a series of events and constructed recognizers (agents) which attempted to recognize certain patterns of events. As the events were processed, the program decided whether the event was part of a recognized pattern or was a new pattern. If the program decided the pattern was new, it constructed a new recognizer capable of recognizing future occurrences of the pattern. Rosenthal also claimed that his program had the limited ability to construct higher-level recognizers which could recognize patterns of other recognizers (Rosenthal, 1992).
Rosenthal's model, while limited in scope, is an interesting working model of Minsky's
theories. This version of the computer musician was capable of constructing new
recognizers as it listened to music. Rosenthal's computer musician was able to expand its
understanding of the music, perhaps in a manner similar to our own minds.
2.9.3 Robert Rowe: The Computer Musician as Listener, Composer, Performer and Critic
Robert Rowe (1991) developed one of the most advanced versions of the computer
musician. His program Cypher was capable of using musical concepts in listening,
composing and performing. His implementation was strongly based on Minsky's societal
architecture in which "many small, relatively independent agents cooperate to realize
complex behaviors" (Rowe, 1991, p. 12). The program had two main components, a listener
and a ployer. Rowe claimed that his program was capable of classifying several perceptual
features arising from a musical context, tracking the changes in these features over t h e , and
attaching compositional responses to them. It could also identify tonal regions, harmonic
progressions and likely beat periods. The program also had a limited ability to examine and
crkicize the quality of its own musical output.
Rowe's work had a profound influence on the development of the Segmenter and
Recognizer presented in this dissertation. His Listener model was adapted by Pennycook
and Stammen (1993; 1994) to create a working model of a jazz improviser. Pennycook and Stammen felt that the feature set used by Rowe (density, register, speed, dynamics, duration and harmony) was not sufficient for a real-time jazz improviser. They modified the key identification agent in the Listener to improve the recognition of jazz harmonic structures (Stammen, Pennycook and Parncutt, 1994). The implementation of the real-time harmonic recognition system is described in Chapter 5. The phrase detection methods used by Rowe were replaced with the Segmenter presented in Chapter 3. Finally, Rowe's pattern matching algorithms, based on a string matching algorithm, led to the development of the DTW algorithm for matching melodic contours presented in Chapter 4. It is interesting to note that
Rowe (1994) added the Segmenter and Recognizer algorithms to his Cypher application.
Rowe extended the DTW based melodic recognition component to develop a program
capable of recognizing important sequential structures in an arbitrary stream of MIDI data.
Rowe's work is perhaps the most complete implementation of a cognitive music theory to
date. Its ability to derive higher levels of musical understanding from low level specialists
offers some evidence as to the validity of Minsky's theories and the ideas presented by this
dissertation. He has also implemented the idea of a listening composer that was desired by
Ledley, Vercoe and Laske in the early 1970's. Rowe's work points the way to future
versions of computer musicians that will be constructed from many levels of agents.
2.10 How do We Recognize and Remember Melodies?
We now return to the question, "How do we know if two tunes resemble each other?"
Recent research into melodic perception suggests that melodic contour is one of the most
salient features used to remember and recognize melodic sequences. Melodic contour may be defined as a set of directional relationships between successive notes of a melody (Dowling and Fujitani, 1971). Contour represents the overall pattern or shape of a melody. This pattern of ups and downs characterizes a particular melody, allowing one to recognize a melody even if it has been altered in some way. This suggests that the melodic contour is an important part of what is remembered when one remembers a melody. Melodic contour also
contributes strongly to the recognition of transposed melodies.
Davies and Jennings (1977) undertook a novel experiment to test memory for tonal
sequences. Groups of trained and untrained musicians were asked to represent the contour
of a tonal sequence by drawing it on a piece of paper. They were also asked to draw the
melodies according to the interval sizes. Although the musicians were generally superior,
there was little difference between musicians and non-musicians in terms of perception of
melodic contour. Both groups performed at a much lower level when estimating interval
sizes. This suggests that pitch intervals are not normally coded in terms of magnitude.
Dowling (1978) separated the use of contour and scale information in the recognition of
melodies. Dowling defined scale in terms of the tonality of the melody, and in his study he
made comparisons between tonal and atonal melodies. Inexperienced musicians with less
than two years of musical training relied more on contour information than scale or key
information to remember and recognize melodies. Experienced musicians were found to use
both contour and scale information.
Several studies by Dowling (1971; 1978; Bartlett and Dowling, 1980) have found that
melodic contour is an abstraction from melody that can be remembered independently of
pitches. According to Dowling, the contours of brief, novel atonal melodies can be retrieved
from short-term memory even when the sequence of exact intervals cannot. In addition to
being preserved in short-term memory, melodic contours also seem to be retrievable from
long-term memory independent of interval sizes (Dowling, 1978).
Melodic contour has also been investigated in studies that aimed to evaluate a bi-dimensional model of pitch. This model proposes that both tone height (the overall pitch level of a tone) and tone chroma (the position of the tone within the octave) are important in melody perception. Massaro (1980) assessed how contour and tone chroma information are used in the identification of familiar melodies and recently learned melodies. The results suggested that tone chroma alone is not a sufficient cue for identification and must therefore be accompanied by contour information. Contour and chroma together contribute to accurate identification of melodies. Tone chroma allows a given note of a melody to be transposed up or down by one or more octaves without having much effect on the recognizability of a melody. Contour alone can be used to identify a melody only if listeners have some knowledge of the set of tunes from which to select their answers.
Dyson (1984) revealed perceptually salient aspects of contour which may be interpreted as contour features. She called these features contour reversals, or locations where the melody changes direction. The relative importance of contour reversals is independent of the magnitude of pitch change or the general shape of the melody. The findings of this study show that reversals in melodic contour are in some way similar to visual features in that they are treated as areas of high information by the listener. Thus, in a single hearing, listeners may be attempting to extract as much information as possible by aiming for the points of high information value, the corners, or reversals. These features may therefore be thought of as defining the shape of the melody while the slopes or non-reversals fill in the detail in between the reversals. Dyson concluded that melodic contours provide a figural description of novel tone sequences. Contour reversals serve as features contributing to a perceptual
representation that gives a global outline of the melody to which further detail may be
added.
In view of the evidence that contour is one of the most salient features used to remember
and recognize melodic sequences, the Recognizer described in Chapter 4 uses an algorithm
that compares the melodic contours of two melodic fragments. The Recognizer uses the dynamic timewarp algorithm (DTW) (Itakura, 1975), which was originally developed for time alignment and comparison of speech and image patterns. The DTW enables the
Recognizer to compare and categorize the contour of a new melodic fragment with a
collection of previously recognized melodies.
2.11 Conclusion
It is now time to attempt to answer a few of the questions that began this chapter. "Can a
computer compose, analyze or listen as well as a human being?" Looking at its current level
of development, one would have to answer no. The computer musician is still a newborn child that has barely entered into the real world. Current versions have given it the tiniest of minds, minds that contain no musical past, no musical culture. The computer musician is still incapable of understanding music as we do. However, we must acknowledge that this primitive child has indeed made progress. In light of its most recent versions and the advancements made in computer technology, the future looks promising.
A potential hindrance to the development of an advanced computer musician is voiced by Bo
Alphonce:
"Music theory is one of the oldest disciplines in the litcrate
history of civilization; still, systematic, coherent, well-
developcd and cornputable music theory is both very young
and mther scarce." (Alphonce, 1980, p. 26)
The development of a computer musician is most dependent on our own understanding of music. While the theories of Laske and Minsky offer some intriguing insights into the potential workings of the human mind, they are incomplete. "A music analyst approaching the computer quickly realizes that running out of theory means running out of program code" (Alphonce, 1980, p. 26). At present, program code is the life blood of a computer musician. Without the continued development of cognitive music theories, a computer musician will remain grossly inferior even to untrained music listeners.
3. The Segmenter
3.1 Introduction
The first component we will examine in the Listener model is the Segmenter. As shown in
Figure 3.1, the Segmenter receives as input a stream of MIDI data, which represents a single monophonic voice in a live musical performance. The Segmenter is responsible for dividing the musical stream into melodic fragments at the motive or phrase level. These melodic fragments are 4 to 16 notes in length. When the Segmenter detects a group boundary, the new grouping is sent to a recognition process that attempts to match the new segment with previously recognized fragments. The Segmenter is the most important component in the Listener model, for it must accurately identify, in real-time, the melodic fragments that will be sent to the Recognizer.
Figure 3.1: The Segmenter
In this chapter, I will be referring to the first eleven measures of an improvisation on the tune "Tenor Madness" by Sonny Rollins (Rollins, 1957) to demonstrate the operations
performed by the Segmenter. The notes of the improvisation are shown in Figures 3.2a and
3.2b. An exact transcription of the performance is shown in Figure 3.2a and a quantized
version is shown in 3.2b.
Figure 3.2a: Improvisation on Tenor Madness, mm. 1-11
(as performed)
Figure 3.2b: Improvisation on Tenor Madness, mm. 1-11
(quantized version)
The role of the Segmenter is to listen to the improvisation in real-time and group the notes into short melodic fragments as shown in Figure 3.3.
Figure 3.3: Melodic fragments created by the Segmenter
3.2 Grouping Preference Rules
The Segmenter is largely based on the Grouping Preference Rules (GPR) proposed by
Lerdahl and Jackendoff (1983). Their grouping theory consists of a set of rules that
describe the organization of the musical surface into groups. According to Lerdahl and
Jackendoff, the grouping of a musical surface is "an auditory analog of the partitioning of
the visual field into objects, parts of objects, and parts of parts of objects" (Lerdahl and
Jackendoff, p. 36). Their rules for grouping appear to be idiom-independent in that a
listener needs to know little about a musical genre in order to assign grouping structure to
pieces in that idiom. This idiom-independent nature and the well-defined structure of the
GPR made them an excellent starting point for the development of the real-time Segmenter.
The GPR form a set of preference rules that determine group boundaries by examining the
local detail of a monophonic stream of music. The GPR detect changes in attack points,
articulation, dynamics, duration and register that could lead to the perception of a group
boundary. The Segmenter has adapted several of the GPR for use in detecting group boundaries or segmentation points in a real-time stream of music. Given a sequence of four notes, N1 N2 N3 N4, as shown in Figure 3.4, it is possible, using one or more of these
rules, to detect a segmentation point between the notes N2 and N3. The GPR focus on the
transitions from note to note and pick out the transitions that are more distinctive than the
surrounding ones. These more distinctive boundaries are the ones that a listener will favor
as group boundaries.
In order to locate a group boundary, the GPR examine a sequence of four notes containing three transitions, N1/N2, N2/N3, and N3/N4. The transition N2/N3 is a candidate for a group boundary if it differs from the surrounding transitions N1/N2 and N3/N4. The group boundary between N2 and N3 therefore defines two groups, with one group ending at N2 and the other starting with N3. The GPR used by the Segmenter in the Listener model are
described below.
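To make the shared pattern of these rules concrete, the following sketch (an illustration, not the thesis code) expresses the generic four-note test in Python; f1-f4 stand for any per-note feature, such as IOI, pitch, velocity or duration, that the individual rules below examine.

    def gpr_boundary_at_n2_n3(f1, f2, f3, f4):
        # A boundary is favored when the change across the N2/N3 transition
        # is larger than the change across both neighboring transitions.
        t12 = abs(f2 - f1)  # N1/N2 transition
        t23 = abs(f3 - f2)  # N2/N3 transition (candidate boundary)
        t34 = abs(f4 - f3)  # N3/N4 transition
        return t23 > t12 and t23 > t34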
Figure 3.4: GPR segmentation point (group boundary between N2 and N3)
3.2.1 GPR 1 Avoid Small Groups
GPR 1 states that groups containing a single event or very few notes should be avoided.
In order to avoid groups of two or fewer notes, GPR 1 has been implemented by the Segmenter. The Segmenter will also create a group boundary after 16 notes to avoid very large groups.
3.2.2 GPR 2 Proximity Rules
The GPR Proximity rules are used to detect breaks in the musical flow which are perceived as group boundaries. Given a sequence of four notes, N1 N2 N3 N4, the transition between notes N2 and N3 may be heard as a group boundary if the time interval from the end of N2 to the beginning of N3 is greater than that from the end of N1 to the beginning of N2 and that from the end of N3 to the beginning of N4. This rule is known as GPR 2a or the Slur/Rest rule. Examples are shown in Figures 3.5 and 3.6. The Slur/Rest rule is useful for detecting group boundaries at ends of phrases or points of rest in the music.
Figure 3.5: GPR 2a Slur/Rest rule
Figure 3.6: GPR 2a Slur/Rest rule
Another GPR proximity rule is called GPR 2b or the Attack-Point rule, and is shown in
Figure 3.7. The Attack-Point rule measures the interval of time between the attack points of N2 and N3. This time interval is called the inter-onset interval or IOI. If the IOI between N2 and N3 is greater than the IOI between N1 and N2 and between N3 and N4, then there exists a group boundary between N2 and N3.
Figure 3.7: GPR 2b Attack-Point rule
3.2.3 GPR 3 Change Rules
Group boundaries may also occur at transition points where there is a distinct change in
register, dynamics, articulation or length. The GPR 3 Change rules examine the N2/N3 transition and compare it with the N1/N2 and N3/N4 transitions. The Register rule, GPR 3a, detects a group boundary at N2/N3 if the interval distance between N2 and N3 is greater than both N1/N2 and N3/N4. GPR 3a is shown in Figure 3.8.
Figure 3.8: GPR 3a Register rule
The Dynamics rule, GPR 3b, involves a change in dynamics between N2 and N3, but not between N1 and N2, or N3 and N4. The Dynamics rule is shown in Figure 3.9.
Figure 3.9: GPR 3b Dynamics rule
Figure 3.10 demonstrates GPR 3c, the Articulation rule. This rule requires a change in
articulation between N2 and N3. There must be no change in articulation in the N1/N2 and N3/N4 transitions.
Figure 3.10: GPR 3c Articulation rule
The final change rule, known as GPR 3d or the Length rule, is shown in Figure 3.11. The Length rule requires a change in note length between notes N2 and N3. Notes N1 and N2 must not differ in length, and notes N3 and N4 must also not differ in length.
Figure 3.11: GPR 3d Length rule
It is interesting to note that several of the GPR may occur at the same location in the music, thereby reinforcing each other. Group boundaries determined by the GPR may also occur in conflicting positions in the music, such as the example shown in Figure 3.12. In this example, the GPR 2a Slur/Rest rule and the GPR 2b Attack-Point rule are in conflict. In order to resolve these conflicting group boundaries, it would be desirable to assign each rule a numerical degree of strength (Lerdahl and Jackendoff, 1983, p. 47). However, Lerdahl and Jackendoff preferred not to assign these weights. As we shall see later on in this chapter, the Segmenter uses a set of weights to resolve these conflicting group boundary situations.
Figure 3.12: GPR group boundary conflicts
3.3 Application of GPR to a Live Performance
We will now attempt to apply the GPR to a real-time musical performance. As mentioned above, we will be using the first twelve measures of an improvisation on the tune "Tenor Madness" by Sonny Rollins. The application of the GPR to the first two measures is shown in Figure 3.13.
Figure 3.13: Application of GPR to a live performance
Using the GPR, the first group boundary will not be detected until the fifth note is played. This results in a considerable latency in the recognition of the first group boundary, as the GPR must wait for the notes N3 and N4 to be played after the 3 beats of rest. In this example, the latency is equal to four beats. The GPR are therefore unsuitable for a real-time implementation where group boundaries must be located during a musical performance.
As shown above in Figure 3.4, the Grouping Preference Rules (GPR) require four notes, N1, N2, N3, and N4 to determine whether a group boundary occurred at N2. As a result, the GPR must wait for both N3 and N4 to be performed before a group boundary decision can be made at N2. The two note delay required by the GPR is therefore unsuitable for a real-time implementation where group boundaries must be located during a musical performance. Clearly, some modifications to the application of the GPR were needed.
Figure 3.14: The Segmenter Processes
In order to overcome the real-time limitations of the GPR, the Segmenter is subdivided into three independent segmentation processes. These processes are shown in Figure 3.14. Two of these processes are adaptations of the GPR and have one and two note latencies. We will refer to these processes as the N3 and N4 segmenters, for they determine that a group boundary occurred at N2 when event N3 or N4 is received by the Segmenter. The other process is called the Real-Time (RT) segmenter. It locates group boundaries in real-time. The N4 segmenter uses events N1-N4 to determine the group boundary while the N3 segmenter uses only events N1-N3. The rules and MIDI features used by the N3/N4 segmenters and their equivalent GPR rules and musical features are shown in Figure 3.15.
Examination of the MIDI features in Figure 3.15 suggests that the processing for each GPR/Segmenter rule occurs at a specific temporal location in an event. For example, GPR 2a, 2b, 3a, 3b, 3c and their equivalent Segmenter rules can all be evaluated immediately each time a note-on arrives at the Segmenter (Figure 3.16). GPR 3d can be tested immediately with the arrival of each note-off (Figure 3.17). GPR 2a and 3c are mapped onto N3/N4 segmenter rule 1 because both of these rules examine the amount of time between an event's note-off and the next event's note-on. The N3 and N4 segmenters therefore treat slurs and articulation as the same feature. GPR 2a and 3c add an additional latency to the segmenters. These rules actually require the arrival of event N5 to calculate the articulation feature for N4 (see Figure 3.18). In other words, the Segmenter must wait for the N5 note-on to evaluate the N4 note-off to note-on value. This results in a three note latency before the N2-N3 transition can be tested. This three note latency is clearly unsuitable for a real-time segmenter.
GPR      Feature         N3/N4 Rule   MIDI Feature
GPR 2a   Slur/Rest       Rule 1       note-off to next note-on
GPR 2b   Attack Point    Rule 2       note-on to next note-on
GPR 3a   Register        Rule 3       MIDI note number
GPR 3b   Dynamics        Rule 5       MIDI velocity
GPR 3c   Articulation    Rule 1       note-off to next note-on
GPR 3d   Length          Rule 4       note-on to note-off
Figure 3.15: GPR to N3/N4 Segmenter Rules
For each note-on event evaluate: GPR 2a Slur/Rest, GPR 2b Attack Point, GPR 3a Register, GPR 3b Dynamics, GPR 3c Articulation.
Figure 3.16: Note-on rule evaluations
For each note-off event evaluate: GPR 3d Length.
Figure 3.17: Note-off rule evaluations
Figure 3.18: Three note latency of GPR 2a and 3c
In order to reduce the latency of the GPR, the RT segmenter tests for a group boundary as each event arrives at the Segmenter. The RT segmenter uses two internal clocks to detect a group boundary and marks a boundary when one of the following situations occurs:
1) The performer is holding a long note;
2) The performer has stopped playing (a rest);
3) A pre-defined maximum number of total notes for a segment has been exceeded.
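A minimal sketch of these three conditions is given below. The clock behavior is simulated here by polling timestamps rather than by real interrupt-driven clocks, and the class name, method names and default thresholds are illustrative assumptions rather than the original implementation.

    class RTSegmenter:
        def __init__(self, long_note_ms=500, long_rest_ms=500, max_notes=16):
            self.long_note_ms = long_note_ms   # note-on clock duration
            self.long_rest_ms = long_rest_ms   # note-off clock duration
            self.max_notes = max_notes         # maximum segment length
            self.note_count = 0                # notes in the current segment
            self.last_on_ms = None             # note-on awaiting its note-off
            self.last_off_ms = None            # most recent note-off (a rest)

        def note_on(self, now_ms):
            self.last_on_ms, self.last_off_ms = now_ms, None
            self.note_count += 1
            if self.note_count >= self.max_notes:    # condition 3
                return self._mark_boundary()
            return False

        def note_off(self, now_ms):
            self.last_on_ms, self.last_off_ms = None, now_ms
            return False

        def poll(self, now_ms):
            # Stands in for the expiring note-on and note-off clocks.
            if (self.last_on_ms is not None
                    and now_ms - self.last_on_ms > self.long_note_ms):
                return self._mark_boundary()         # condition 1: long note
            if (self.last_off_ms is not None
                    and now_ms - self.last_off_ms > self.long_rest_ms):
                return self._mark_boundary()         # condition 2: long rest
            return False

        def _mark_boundary(self):
            self.note_count = 0
            self.last_on_ms = self.last_off_ms = None
            return True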
Figure 3.19: Segmentation example
Figure 3.20: The Segmenter Processes
The operation of the RT, N3 and N4 segmenters is shown in Figures 3.19 and 3.20. When
the note-on of N1 arrives at the RT segmenter, an internal clock is set to go off at a predetermined time in the future. Since N2 arrives before the clock has expired, the clock is reset and no group boundary is marked by the RT segmenter. The N3 segmenter then tests the N1-N2 transition and does not mark a group boundary. During the sustain of N2, the RT note-on clock expires before the arrival of N3, thereby causing the RT segmenter to mark a group boundary. When the note-on of N3 arrives at the Segmenter, the N3 segmenter confirms the group boundary at N2, while the N4 segmenter tests the N1-N2 transition. When N4 arrives at the Segmenter, the N4 segmenter confirms the previous N3
and RT segmentation at the N2-N3 transition. This same procedure is followed for each
note-off arriving at the Segmenter. This chain of segmenters allows for a higher degree of accuracy in the detection of musical groupings. The performance of the RT segmenter is always verified one note later by the N3 segmenter. The N4 segmenter, having available both future and past data around the group boundary, verifies the performance of both the N3 and RT segmenters. When either the N3 or N4 segmenter detects an error made by the RT segmenter, it updates the settings of the RT segmenter's internal clocks. This allows the RT segmenter to be dynamically adjusted to changing musical situations.
As mentioned above, the GPR introduce a two to three note latency before a group boundary is detected. The RT, N3 and N4 Segmenters attempt to reduce this latency to zero depending on the musical situation. The latency of the various Segmenters is shown in Figure 3.21. The RT Segmenter attempts to determine a group boundary immediately and therefore has a zero note latency. The N3 Segmenter waits for notes N1, N2 and N3 and has a one note latency. The N4 Segmenter must wait for notes N1, N2, N3 and N4 and therefore has a two note latency.
Figure 3.21: Segmenter Rule Latency (RT: 0 notes; N3: 1 note; N4: 2 notes)
In order to avoid segmentation at every feature transition, each segmenter assigns a weight to each feature that increases with the size of the feature transition. In this manner, larger changes in register, dynamics, duration, and articulation, as well as group length, have a greater contribution to the detection of a musical boundary. While Lerdahl and Jackendoff avoided the assignment of weights to the GPR, I found it necessary to utilize a weighting system. The ranking of the rules is in general accordance with the ranking reported by Deliège (1987) (Figure 3.22). Since the Listener model currently uses MIDI as input to the system, the Segmenter is unable to consider timbre (Deliège, Rule 7). Our use of weights to decide between potential group boundaries is similar to the phrase detection system used by Rowe (1993).
Lerdahl/Jackendoff        Stammen/Pennycook   Deliège
GPR 2a (Slur/Rest)        Rule 1              Rule 1
GPR 2b (Attack Point)     Rule 2              Rule 2
GPR 3a (Register)         Rule 3              Rule 3
GPR 3b (Dynamics)         Rule 5              Rule 4
GPR 3c (Articulation)     Rule 1              Rule 5
GPR 3d (Length)           Rule 4              Rule 6
Figure 3.22: Ranking of GPR
Although Lerdahl and Jackendoff have stated that the GPR could not provide a computable procedure for determining musical analyses, the GPR have proven to be useful for the automatic detection of lower-level motive and phrase boundaries. The GPR were not intended to completely reflect the cognitive processing that occurs when listening to music in real-time, as they possess a two or three note latency in the detection of a group boundary. In this real-time adaptation of the GPR, it has been determined that the cognitive processing of the various GPR may occur at different temporal locations in an event.
3.4 Event Memory
The GPR compare certain features of notes N1-N4 to determine if a temporal or acoustical change occurred at the N2-N3 transition. The GPR require that a given feature at N2-N3 be different from the same feature at N1-N2 and N3-N4. This implies that there be a high degree of similarity between the N1-N2 and N3-N4 features. However, when one considers the considerable variability of a particular feature during a live performance, these comparisons become much more difficult to make. There arises the need for some type of quantization, especially for comparisons between temporal features.
The Segmenter responds to MIDI note-on, note-off and velocity messages. When a note-on
event is received, the data pertaining to it is stored in an Event structure. This structure
contains information such as the note's pitch, velocity, interval from its previous note, start time (in milliseconds from the beginning of the performance), end time, duration (in milliseconds) and offset time from the start time of the previous note to the current note. It
also contains four fields which specify which rhythmic groups the note belongs to and
several fields containing the results of segmentation rule evaluation which will be explained
in detail below. The complete Event structure is listed in Appendix A.
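The following sketch approximates the Event structure in Python; the field names are assumptions chosen for readability, since the authoritative field list is the one given in Appendix A.

    from dataclasses import dataclass

    @dataclass
    class Event:
        pitch: int                   # MIDI note number
        velocity: int                # MIDI velocity
        interval: int                # semitones from the previous note
        start_time: int              # ms from the start of the performance
        end_time: int                # ms
        duration: int                # ms, note-on to note-off
        offset_time: int             # ms from the previous note's start time
        duration_type: int = -999    # rhythmic type (-999 = unassigned)
        onset_type: int = -999       # rhythmic type of the inter-onset interval
        articulation_type: str = ""  # 'staccato', 'legato' or 'rest'
        rules: int = 0               # flags for segmentation rules found true
        N4_sum_of_weights: int = 0   # summed rule weights (see Section 3.8)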
The Segmenter stores the most recently received Events in a circular buffer. The length of
this buffer is currently 32 Events and may be adjusted by the user. This buffer will be
referred to as the short-term memory (STM). The Segmenter uses the STM to determine segmentation points. Once a group boundary has been detected by the Segmenter, it is
placed into another region of memory called the long-term memory (LTM). The LTM
contains shapes and motives that have been recognized or learned from previous listening
sessions. The data is stored in external files and may be read into the Segmenter's long-term
memory prior to a listening session.
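Expressed in Python, the short-term memory can be sketched as a bounded double-ended queue; the 32-Event default is the one stated above, while the function name is illustrative.

    from collections import deque

    STM_LENGTH = 32                              # default, user-adjustable
    short_term_memory = deque(maxlen=STM_LENGTH)

    def store_event(event):
        # Appending to a full deque drops the oldest Event automatically,
        # giving the circular-buffer behavior described above.
        short_term_memory.append(event)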
3.5 Determining Rhythmic Types
In order to properly separate the incoming events into coherent groups, the Segmenter must
know something about the rhythm and articulation of the notes. The Segmenter examines
three features of the Event:
1) the duration of the note in milliseconds (ms). This is the duration of the note from its
attack to release or MIDI note-on to note-off (Figure 3.23).
Figure 3.23: Note duration (note-on to note-off)
2) the inter-onset interval (IOI) or time from the beginning of a note to the beginning of the next note (Figure 3.24).
Figure 3.24: Inter-onset interval (note-on to note-on)
3) the articulation or time from the release of a note to the beginning of the next note. This
is the duration of the interval between the note-off and the next note-on (Figure 3.25).
Figure 3.25: Note-off to note-on interval
The raw millisecond values are of little use to the Segmenter, for it must be able to classify
the note as being of a certain rhythmic type in order to compare the rhythm of one event
with that of another.
To simplify matters, this discussion will begin by looking at only duration, or note-on to note-off time. The concept of grouping the note durations using traditional rhythmic values (i.e. quarter note, eighth note, etc.) was rejected due to its complexity and due to the
realization that this information was not needed by the Segmenter. Instead, the Segmenter
determines if a note belongs to one of ten rhythmic types. The process of beat induction
(Desain and Honing, 1995) in which a regular pattern (the beat) is activated in the listener
while listening to music is therefore not required by the Segmenter.
The Segmenter begins with a 'blank slate' where the duration of the first event will determine the value of the first rhythmic type (Figure 3.26). Unassigned rhythmic types are assigned the value -999. If the duration of the next note is close enough to the value of the first rhythmic type, it will be assigned that type. If the duration of this note is different from anything seen before, a new rhythmic type will be 'created' or initialized. In the Segmenter's use of the short-term memory, the rhythmic types need not be related to each other in any way; for example, one type need not be twice the value of another. In order to limit the number of different rhythmic types, a duration exceeding a certain maximum limit will simply be labeled as a "long note" type and its millisecond value will not be stored
as a rhythmic type.
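A sketch of this 'blank slate' classification is shown below. The tolerance and long-note limit are assumed values; as described in the following paragraphs, the real thresholds are adjusted dynamically.

    UNASSIGNED = -999
    MAX_TYPES = 10
    # usage: types = [UNASSIGNED] * MAX_TYPES

    def classify_duration(duration_ms, types, tolerance_ms=50, long_note_ms=2000):
        # Returns the index of the matching rhythmic type, creating a new
        # type when nothing matches; -1 denotes a 'long note'.
        if duration_ms > long_note_ms:
            return -1                          # labeled 'long note'; not stored
        for i, t in enumerate(types):
            if t != UNASSIGNED and abs(duration_ms - t) <= tolerance_ms:
                return i                       # close enough to an existing type
        for i, t in enumerate(types):
            if t == UNASSIGNED:
                types[i] = duration_ms         # 'create' a new rhythmic type
                return i
        return -1                              # table full; threshold is raised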
Figure 3.26: Empty short-term memory (all rhythmic types unassigned at -999; LongNoteClock = 500 ms, LongRestClock = 250 ms)
Figure 3.27: Active short-term memory (rhythmic types assigned from observed durations; LongNoteClock = 682 ms, LongRestClock = 608 ms)
Tempo changes, or accelerandos and ritardandos, cause problems when dealing with rhythm in this manner. Ten rhythmic types can soon become insufficient, and two notes with equivalent durations (i.e. two quarter notes) at different sections of the song may not be assigned the same rhythmic type due to their different millisecond duration times (Figure 3.27).
The solution to this problem is to update the average millisecond value of each rhythmic type every time a new note comes in. Instead of simply averaging the value depending on the time values of the notes determined to be of that type, the Segmenter gives the durational value of the most recent Events of that type a higher weight in the overall type average than those seen further in the past. Previous notes move further back in the short-term memory so that even when an accelerando is played, we can still tell what the quarter note is. With this method, the average value of a rhythmic type will decrease with an accelerando or increase with a ritardando.
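One way to realize this recency weighting is an exponential moving average, sketched below; the smoothing factor alpha is an assumption, as the text does not state the exact weighting scheme.

    def update_type_average(old_avg_ms, new_duration_ms, alpha=0.5):
        # Recent Events of a type pull the average toward the current
        # tempo, tracking an accelerando or ritardando.
        return (1 - alpha) * old_avg_ms + alpha * new_duration_ms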
If a rhythmic type is not assigned to any Events within the short-term memory, the type is taken out of the array of observed rhythmic types to make room for new ones. If too many different rhythmic types are seen, the minimum millisecond difference between the types will be increased. The Segmenter will then go back through its short-term memory and re-calculate the different rhythmic averages of the Events using the new threshold. The threshold will be increased until the number of rhythmic types is reduced. Every time a rhythmic group is updated, the Segmenter will also check for types which have become too close to one another. It will then temporarily raise the threshold and redo the rhythmic averages in order to make one type out of the two.
This method of determining and averaging rhythmic types is used for all three aspects of rhythm: duration (note-on to note-off time); inter-onset interval (note-on to note-on time); and articulation (note-off to note-on time). The rhythmic array of average articulation times is only used for updating the segmenting clock times. The Event is actually assigned one of three articulation types: staccato, legato or rest. The articulation type is determined by the ratio of the duration of the Event to the note-off to note-on time of the Event. For example, if the note-off to note-on time of the Event is less than or equal to its duration, the note is labeled as legato.
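A sketch of the three-way classification follows. Only the legato condition is stated explicitly above, so the staccato/rest split and its threshold are assumptions.

    def classify_articulation(duration_ms, gap_ms, rest_threshold_ms=500):
        # gap_ms is the note-off to next note-on time of the Event.
        if gap_ms <= duration_ms:
            return "legato"
        if gap_ms >= rest_threshold_ms:
            return "rest"
        return "staccato"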
3.6 Segmentation Rule Evaluation
Once the rhythmic type of a note has been determined, the Segmenter will check to see whether or not it is valid to segment after the note. It does so by evaluating whether certain rules hold true or not.
3.7 RT Segmenter Rules
The following three rules are referred to as the real-time (RT) rules because a group boundary will be detected during or immediately after a note is played. If any of these three rules holds true, the Segmenter will assign a group boundary after this note.
1) If the Segmenter has received a certain number of notes without having yet located a group boundary, one will be automatically created. This rule exists to avoid groups with too many notes. The threshold for the maximum number of notes can be input by the user. The default is a length of 16 notes, corresponding to GPR 1.
2) If the current note exceeds a certain duration (note-on to note-off), a group boundary will be created. The default threshold is 250 ms, and the threshold is also dynamically adjusted in the short-term memory. This strategy is an implementation of GPR 2b.
3) If there is a long rest after the current note (long note-off to note-on), a group boundary will be created. The default duration is 500 ms, and the duration threshold is dynamically adjusted in the short-term memory. This strategy is an implementation of GPR 2a.
The RT Segmenter detects group boundaries using two real-time clocks. These clocks are called the note-on clock and the note-off clock. In order to implement RT rule 2 (GPR 2b), the Segmenter has an internal note-on clock which is set when the current Event's note-on is received. The clock will expire after a specified long duration time if the Event's note-off is not received before this time. The operation of the note-on clock is shown in Figure 3.28.
Figure 3.28: Operation of the note-on clock (On Clock: 600 ms; Tempo: quarter note = 120)
In this example, the note-on clock is currently set for 600 ms. When the note-on event for the first quarter note is received, the note-on clock is set to expire 600 ms in the future. Since the note-on event for the second quarter note arrives before the clock expires, the note-on clock is reset. The half note arrives before the clock expires, so once again the note-on clock is reset. After 600 ms has elapsed, the note-on clock expires before the next note-on event has occurred and the half note is still being held, so a group boundary is detected at the half note.
To implement RT rule 3 (GPR 2a), the note-off clock is set with each Event's note-off, and unset if a new note-on is seen before the specified long test time. If the note-off clock expires before a new note-on, then a group boundary will be marked after the preceding note. The operation of the note-off clock is shown in Figure 3.29.
Figure 3.29: Operation of the note-off clock (Off Clock: 500 ms; Tempo: quarter note = 120, so a quarter note lasts approximately 500 ms)
3.8 N4 Rules
The Segmenter also uses a set of rules derived from the GPR. The GPR compare the data of two notes before the segmentation point with that of the two notes following the point. These rules therefore do not operate in real-time, since the group boundary can only be determined two notes after the note being tested as a possible segmentation point. Because the GPR require a total of four notes to determine a segmentation point, these rules are referred to as 'N4' rules.
The N4 rules determine whether there has been one or more of the following occurrences
between the two notes preceding the possible segment point and the two notes following it:
Feature                       GPR      Segmenter Rule
1) a change in articulation   GPR 3c   Rule 1
2) a change in attack point   GPR 2b   Rule 2
3) a large intervallic leap   GPR 3a   Rule 3
4) a change in duration       GPR 3d   Rule 4
5) a change in dynamics       GPR 3b   Rule 5
Segmenter rules 1, 2, 3, and 5 can test for a group boundary between N2 and N3 as soon as the note-on for N4 is received. Rule 4, however, can only be tested after note four's note-off.
Each rule is graded according to its importance and is assigned a weight. These weights are exponential, with the most accurate rule (change in articulation) having a weight of 128 and the least accurate rule (change in velocity) having a weight of 8. Every Event record contains a field labeled rules. For each Segmenter rule that is true for the Event, a flag is set to indicate the presence of the rule.
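The weight table and flag field can be sketched as follows. Only the top weight (128) and bottom weight (8) are stated above; the intermediate values are assumptions consistent with an exponential scale, ordered by the ranking in Figure 3.22.

    RULE_WEIGHTS = {
        1: 128,  # change in articulation / slur-rest (GPR 2a/3c)
        2: 64,   # change in attack point (GPR 2b)
        3: 32,   # large intervallic leap (GPR 3a)
        4: 16,   # change in duration (GPR 3d)
        5: 8,    # change in dynamics (GPR 3b)
    }

    def set_rule_flag(event, rule_number):
        event.rules |= 1 << rule_number  # record that the rule held true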
In addition, the amount of deviation seen between the two sets of two notes is calculated to determine a weight multiplier. For example, a large interval will have a higher weight multiplier than a smaller interval, just as a very long note will have a higher weight multiplier than a shorter one. Therefore, the higher the multiplier, the more obvious the group boundary. The weight multiplier is multiplied by the standard rule weight discussed above. Every Event record also contains a field labeled N4sum-of-weights. It stores the sum of the total weights of all the rules that were determined to be true for segmentation after that particular note. If this sum is over a certain threshold, the Segmenter will choose to segment after that note. The fields of the Event record are shown in Appendix A.
Figure 3.30: Summary of N3/N4 rules (duration, note-on to note-off: GPR 3d Length, Rule 4; note-off to next note-on: GPR 2a Slur/Rest and GPR 3c Articulation, Rule 1; MIDI velocity: GPR 3b Dynamics, Rule 5; MIDI note number: GPR 3a Register, Rule 3)
The Segmenter increases the note-on and note-off clock durations by assigning them the value of the next larger rhythmic type in the short-term memory. The note-on clock will be assigned the duration of the next larger rhythmic type. The note-off clock will be assigned the note-off to next note-on value of the larger rhythmic type. If no larger rhythmic type exists in the short-term memory, the note-on and note-off clock durations will be assigned the appropriate values of the current Event, provided this value does not exceed two times the current durations for the clocks. If the new values are more than twice as long, the current note-on and note-off clock durations will simply be doubled. In this manner, the N4 Segmenter is able to both confirm segments produced by the RT Segmenter and dynamically adjust the clock durations to ensure more accurate real-time segmentation for Events.
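The strategy for lengthening a clock can be sketched as below, with the growth capped at doubling as described above; the function name and argument layout are assumptions.

    def increase_clock(current_ms, rhythmic_types, event_value_ms):
        larger = [t for t in rhythmic_types if t > current_ms]
        if larger:
            return min(larger)                      # next larger rhythmic type
        return min(event_value_ms, 2 * current_ms)  # cap growth at doubling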
When the N3 Segmenter detects a group boundary with the N4 duration rule, and this
group boundary has not been detected by the RT Segmenter, the duration used to set the
note-on clock is considered too long and will be decreased. The same strategy is used when
the N4 Segmenter detects a group boundary at a rest that has not been detected by the note-
off clock in the RT Segmenter. The N4 Segmenter decreases the duration of the note-on
and note-off clocks using values from the next smallest rhythmic type in the short-term
memory. Once again, if no such rhythmic type exists, the duration of the current Event is
used.
Because the N3 rules look only at one Event following the segmentation point, the segments detected by the N3 rules are less accurate and must also be confirmed by the N4 rules. Therefore, if it is determined that an N3 segment does not have a high enough N4 weight, it is removed.
3.11 Segment Length
One feature missing from the GPR is the total length of a segment. The Segmenter
considers the length of a segment when deciding where to place a group boundary. The
Segmenter adds 8 times the segment length to the current Event's segmentation weight in
order to encourage long groups of notes to segment sooner and to give longer segments
priority over those that are shorter. The Segmenter tests a potential segment's length before
creating a group boundary to be sure the segment contains a minimum number of events. If
it does not, the previous segment is compared with the current candidate to determine which
grouping is more valid. Since the N4 rules are the most reliable, priority is given to the
segment with the highest N4 segmentation value. For example, a potential N4 segment
whose segmentation value is higher than that of its previous segment will cause the previous
group boundary to be removed. The notes of the previous segment will then be added to the
current segment.
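The length bias itself can be sketched in one line; the factor of 8 is the one stated above, while the names are illustrative.

    LENGTH_WEIGHT = 8

    def weighted_segmentation_value(event, notes_in_segment):
        # Longer running segments receive a growing bonus, encouraging a
        # boundary and giving longer segments priority over shorter ones.
        return event.N4_sum_of_weights + LENGTH_WEIGHT * notes_in_segment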
3.12 Display of Event Data and Segmenter Information
The Event data contained in the Event queue of the short-term memory can be graphically displayed in a window. The 32 events in the queue are displayed in piano roll notation format, with MIDI note numbers along the y-axis ranging from 36 to 96 (i.e. the typical 5 octave range found on a MIDI keyboard). The Event numbers are listed along the x-axis (Figure 3.31). Each of the 32 Events is drawn into the window as a black horizontal line with the length of the line determined by the Event's duration. If an Event is also a segmentation point (i.e. the Segmenter detected a group boundary at that note) it is
displayed in one of the following colors:
Red: real-time segmentation occurred (RT)
Green: N3 rules segmentation occurred (N3)
Blue: N4 rules segmentation occurred (N4)
Figure 3.31: The Segmenter display window
Along with the color indication of the segmentation point, the segmentation weight and segmenter rules are also displayed. The Segmenter window is updated each time the segmenter detects a new segment point. When a real-time segment is detected, the event will be displayed with a red RT and a segmentation weight of 0. When an N4 segmentation occurs, it will be displayed as a blue N4 along with its corresponding segmentation weight. The same will occur in green for an N3 segmentation.
With this display, it is possible to view the accuracy of the RT Segmenter by examining
how well the red RT segmentation points line up with the blue N4 segmentation points. As
mentioned earlier, each red RT segmentation should be confirmed by a corresponding green
N3 and blue N4 segmentation. As new forms of rules are added to the Segmenter, or
existing ones adjusted, it will be possible to view their performance by modifying the
Segmenter display.
3.13 Tenor Madness: A Real-time Example
It is now time to take a look at a real-time example of the operation of the Segmenter. As mentioned at the beginning of this chapter, the input to the Segmenter will be the first twelve measures of an improvisation on the tune "Tenor Madness" by Sonny Rollins. This improvisation is shown in Figure 3.32. The improvisation is input as MIDI into the Listener model. The MIDI representation of the solo is shown in Figure 3.33 and the Segmenter piano roll notation display is shown in Figure 3.34.
Each note in the MIDI stream is first examined by the RT Segmenter. When the first note-on event arrives at the RT Segmenter, the note-on clock is set for a duration of 500 ms. When the first note-off event arrives at the segmenter, the note-off clock is set for a duration of 750 ms. The results of the RT Segmenter are shown in Figures 3.35 and 3.36. Each group boundary detected by the RT Segmenter is marked by an RT. Three group boundaries are marked with a box to indicate mistakes made by the RT Segmenter. In measures 3 and 9, the RT Segmenter marked two RT segments in a row. In measure 8, the RT Segmenter did not detect one of the group boundaries. These errors resulted from the incorrect durations used to set the note-on and note-off clocks. Without validation from the N3 and N4 Segmenters, the RT clocks will continue to make errors, since the other Segmenters are used to adjust the duration of the clocks.
Figure 3.32: Improvisation on Tenor Madness, mm. 1-12
[Table: event times in milliseconds, MIDI status bytes (144), velocities, note numbers, and note-on/note-off events for the opening of the solo]
Figure 3.33: MIDI representation of Tenor Madness improvisation
Figure 3.34: Segmenter representation of "Tenor Madness" improvisation
Figure 3.35: RT Segmenter results
Figure 3.36: RT Segmenter results
Figures 3.37 and 3.38 show the segmentation results using only the N3 Segmenter. In this example, only one error occurred, in measure 9. The N3 Segmenter chose the first quarter note instead of the second quarter note. Since the N3 Segmenter only examines notes N1, N2 and N3, it is possible for it to segment too soon in certain situations. The N4 Segmenter, with more information available, can correct these mistakes by the N3 segmenter.
The results of using only the N4 Segmenter are shown in Figures 3.39 and 3.40. In this example, it is clear that the N4 Segmenter provides a reliable means of detecting group boundaries. The N4 Segmenter has successfully chosen the correct group boundaries for measures 3, 8 and 9, where the other Segmenters made errors. The cost of this increased accuracy is the two to three note latency resulting from the GPR. The N4 segmenter is therefore used mainly to confirm segmentation decisions made by the other Segmenters and to monitor and adjust the durations used by the RT clocks.
The performance of all Segmenters together is shown in Figures 3.41 and 3.42. There is now a high correlation of segmentation decisions across all Segmenters. The mistakes made by the RT and N3 Segmenters in measures 3, 8 and 9 have been corrected by the N4 Segmenter. The N4 Segmenter has also dynamically adjusted the durations of the RT note-on and note-off clocks to align them with the true tempo of the performance.
Figure 3.37: N3 Segmentation of Tenor Madness improvisation
Figure 3.38: N3 Segmentation of Tenor Madness improvisation
Figure 3.39: N4 Segmentation of Tenor Madness improvisation
Figure 3.40: N4 Segmentation of Tenor Madness improvisation
Figure 3.41: Segmentation using RT, N3 and N4 Segmenters
Figure 3.42: Segmentation using RT, N3 and N4 Segmenters
3.14 Summary
The Segmenter used by the Listener model consists of three independent segmentation
processes:
1. The RT Segmenter attempts to locate group boundaries as close as possible to real-time.
2. The N3 Segmenter uses notes N1, N2 and N3 to locate group boundaries.
3. The N4 Segmenter uses notes N1, N2, N3 and N4 to locate group boundaries.
The N3 and N4 Segmenters use an adaptation of the GPR and are used to verify the
performance of the RT Segmenter. The RT Segmenter, by itself, does not always select
musically accurate group boundaries as there may not be enough information available to
the Listener at that point in the music to make a proper decision.
Close analysis of the GPR and the results produced by the Segmenter suggest that listeners
determine group boundaries at various temporal locations in the music. Most of the features
that determine a group boundary are present at the start of a note. However, the length
feature (GPR 3d) cannot be evaluated until the note is released. Certain rules such as
Slur/Rest (GPR 2a) and Articulation (GPR 3c) may not be reliably determined until two to three notes after the group boundary.
The Segmenter, in its current implementation, has proven to be a useful tool for detecting
musically significant melodic fragments. Variations of the Segmenter have been used by
Pennycook and Stammen (1993; 1994), Rowe (1994) and Rolland (1998).
3.14.1 Limitations and Future Work
Despite its definite usefulness, there are certain limitations to the current implementation of
the Segmenter. One major limitation is that the Segmenter only works on a monophonic stream of MIDI data. This requires the voices of a musical performance to be assigned to individual MIDI channels. Also, separate Segmenters must be assigned to each voice in the ensemble. In the current model, these individual instances of the Segmenter do not share information concerning the performance. This lack of a global memory means that the preprocessors in each of the Segmenters may develop different theories on the various rhythmic types assigned. A future version of the Segmenter needs to listen to the entire ensemble in order to have a global interpretation of the musical performance.
The operation of the Segmenter is dependent on the assignment of weights and thresholds
to determine if a group boundary is present. These weights were developed after much
experimentation with several styles of music. Further work is required to determine the
validity of these weights and to improve the accuracy of the Segmenter.
4. The Recognizer
4.1 Introduction
The final component we will examine in the Listener model is the Recognizer. As shown in Figure 4.1, the Recognizer receives as input a melodic fragment from the Segmenter. This melodic fragment will vary from 4 to 16 notes in length. The Recognizer is responsible for comparing the new fragment with a collection of previously recognized fragments.
The Recognizer uses a version of the dynamic timewarp algorithm (DTW) (Itakura, 1975;
Sakoe and Chiba, 1978) to match the melodic contour of the new fragment with the closest matching fragment in its internal memory. The DTW is a widely used technique for time
alignment and comparison of speech and image patterns. The Recognizer uses a modified
implementation of the DTW that allows the comparison of the pitch and rhythmic contours
of an unknown melodic fragment with a collection of previously recognized fragments. The
DTW can compare melodic fragments of different sizes. The timewarping characteristic of
the DTW easily accommodates the expressive timing inherent in human performance,
thereby eliminating any need for rhythmic quantization. The recognized fragments are
stored in a long-term memory so they can be used for future listening sessions.
Figure 4.1: The Recognizer
Figure 4.2 shows an expanded view of the operation of the Recognizer. When the
Recognizer receives a melodic fragment from the Segmenter, the fragment is first passed to
the DTW algorithm. The algorithm compares the new fragment with a number of templates
contained in the template storage area. When a close match is found the results are sent to
the Evaluator, which decides if the new fragment should be considered similar enough to the
template. If the new fragment is too different from all of the templates, the fragment will become a new template to be matched with future fragments. Otherwise the new fragment
will be added to a list of fragments that are similar to the matching template. In this manner,
the Evaluator and Template storage build clusters of similar melodic fragments. The number
of templates in the Recognizer can expand dynamically as new fragments are received by
the Recognizer.
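A sketch of this decision logic is given below, assuming a dtw_distance helper like the one sketched at the end of Section 4.4 and an illustrative distance threshold.

    def recognize(fragment, templates, clusters, threshold=1.0):
        # templates[i] is paired with clusters[i], the list of fragments
        # judged similar to that template.
        if templates:
            distances = [dtw_distance(fragment, t) for t in templates]
            best = min(range(len(templates)), key=lambda i: distances[i])
            if distances[best] <= threshold:
                clusters[best].append(fragment)   # join an existing cluster
                return best
        templates.append(fragment)                # too different: new template
        clusters.append([fragment])
        return len(templates) - 1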
Figure 4.2: The Recognizer (DTW, Evaluator and Template storage)
4.2 Use of Melodic Contour
As discussed in Chapter 2, recent research into melodic perception suggests that melodic
contour is one of the most salient features used to remember and recognize melodic
sequences. In view of this evidence, a computer program designed for melodic recognition
should be able to analyze melodic contours. Further consideration of melodic recognition
theories suggests that melodic recognition may be viewed as a pattern recognition task. The
contour of an unknown melodic fragment represents a pattern of rising and falling intervals
which in turn characterizes the general shape of the fragment. Melodic contour recognition
therefore involves the comparison of an unidentified melodic fragment with previously
recognized fragments. A measurement of similarity between any two melodic fragments needs to consider the differences between their melodic contours as well as their rhythmic content. However, comparison becomes more complicated when one considers the variability of melodic fragments, especially the differences in their lengths. To overcome this difficulty, the dynamic time warping algorithm (DTW) has been implemented as a method to compare melodic fragments of different lengths.
4.3 Applications of Dynamic Programming to Music
The technique of dynamic time warping is based on the dynamic programming path-finding algorithm. The theory of dynamic programming (DP) was introduced by Bellman (1957) to solve mathematical problems arising from multi-decision processes. Elastic pattern matching involving the comparison of sequences of different lengths can also be modeled using DP. This technique is commonly referred to as dynamic time warping (DTW).
Dannenberg (1984) described one of the first applications of DP to music. His score-
follower algorithm utilized DP for real-time comparison of performed pitches with a score
stored on a computer. The algorithm handled performance deviations from the score by
allowing for insertions, deletions, and replacements. Insertions occurred when the performer
added one or more notes to the score. Deletions occurred when notes were left out of the
performance. Replacements resulted when a performer substituted another note for one in
the score. Through the use of DP, the algorithm was able to match moderately inaccurate
performances with a fixed score.
Mongeau and Sankoff (1990) used DP to compare the overall similarity between two
musical scores. Their algorithm not only accommodated musical insertions, deletions, and
replacements but also consolidation and fragmentation of the musical material.
Consolidation involves the replacement of several elements by a single one, while
fragmentation is the replacement of one element by several. Their implementation of DP
searched for similarities in melodic lines despite gross differences in key, mode, or tempo.
However, their use of interval weights and tonic relationships limited their analyses to
Western tonal music. Their system was intended to compare entire musical works or large
sections of works and is therefore unsuitable for a real-time implementation. Another use of
the DTW for melodic recognition is found in Hiraga (1996).
4.3.1 Other Melodic Fragment Recognition Systems
David Cope's Experiments in Musical Intelligence (EMI) (Cope, 1991; 1992) attempted to determine fundamental elements of a composer's style by locating melodic patterns or "signatures" that appear in more than one composition. The EMI program analyzed MIDI files of a given composer's works using a pattern matching algorithm to locate these
patterns. Signatures were then stored in a database to be used later by the EMI composition
functions in the creation of a new work in the style of the composer. Cope's pattern
matching functions search the input compositions for matching interval sequences (Cope,
1990). The size of the motives to be located, as well as the tolerance for differing intervals in
the sequences may be specified by the user. Cope's exhaustive search algorithms were
inappropriate for a real-time implementation of a Recognizer.
Pierre-Yves Rolland (1994; 1998) presented a comprehensive melodic fragment recognition
algorithm called FlExPat (Flexible Extraction of Patterns). He used this algorithm for the
automated extraction of prominent motives in jazz. Along with motive extraction, Rolland's
system also performed statistical and structural analyses that examined how the motives
were used in relation to the harmonic progression of the piece. Rolland's extensive analysis
algorithm for locating significant melodic fragments in a jazz solo was not designed to be
used in a real-time implementation.
4.4 Implementation of the Dynamic Timewarp Algorithm
The Recognizer is modeled after a discrete word recognition system (DWR) first proposed
by Itakura (1975) and is shown in Figure 4.2. In this system, a monophonic MIDI stream is
segmented into candidate musical fragments by a Segmenter process (Pennycook et al.,
1993). Each fragment produced by the Segmenter consists of 4-16 MIDI notes. The notes
in a fragment are converted to an interval set to remove pitch differences caused by
transposition. For example, fragment 1 shown in Figure 4.3 is represented by the interval
set {0, -1, +1, -5, +1}.
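The conversion is a simple first-difference operation, as the following C sketch shows (function name hypothetical); a pitch sequence such as {72, 71, 72, 67, 68} yields the interval set {0, -1, +1, -5, +1} given above.

/* A sketch of the interval-set conversion: the first element is 0 and
   each later element is the signed distance in semitones from the
   previous note. */
void notes_to_intervals(const char *pitch, int n, int *interval)
{
    interval[0] = 0;                     /* no previous note */
    for (int i = 1; i < n; i++)
        interval[i] = pitch[i] - pitch[i - 1];
}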
While not implemented in this current study, the rhythm of a fragment could be converted
into a set of duration ratios, where each ratio is a note's duration divided by the previous
note's duration. The rhythm of the first melodic fragment in Figure 4.3 could be
represented by the duration ratio set {0, 1, 2, 1, 1}. However, in a live performance of this
fragment, the duration ratio set would not consist of exact integers and might instead
contain values such as {0, 0.9, 2.1, 0.97, 1.1}. The DTW is well suited to accommodate these local
rhythmic variations, thereby removing the need for rhythmic quantization. The duration ratio
representation of rhythm also allows for easy comparison of rhythmic contours that are
related by augmentation or diminution. Each melodic fragment is therefore transformed into
a two-dimensional feature vector that combines the pitch interval set and duration ratio set.
Other feature sets such as velocity could be added to the vector to create an n-dimensional
representation of the musical fragment. The total set of feature vectors becomes the
candidate template representing the unknown fragment.
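A sketch of the duration-ratio conversion described above follows (function name hypothetical); durations are taken in milliseconds, as in the Event record of Appendix A, and the leading 0 follows the same convention as the interval set.

/* A sketch of the duration-ratio conversion:
   ratio[i] = duration[i] / duration[i-1], with ratio[0] = 0. For
   example, the durations {250, 250, 500, 500, 500} produce the set
   {0, 1, 2, 1, 1}. */
void durations_to_ratios(const long *duration, int n, double *ratio)
{
    ratio[0] = 0.0;                      /* no previous note */
    for (int i = 1; i < n; i++)
        ratio[i] = (duration[i - 1] > 0)
                 ? (double)duration[i] / (double)duration[i - 1]
                 : 0.0;                  /* guard against a zero duration */
}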
Figure 4.3: Feature set representation of melodic fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
In order to recognize the unknown fragment, the candidate's feature template is compared to
a database of reference templates. The DTW compares the unknown candidate with each
template in the database and assigns a distance value that indicates the degree of similarity
between the candidate and reference templates. The Recognizer matches the candidate with
the reference template that results in the lowest distance measure. When several close
matches occur, an Evaluator is used to select the best match. If the best match distance is
higher than a pre-defined threshold, the candidate's template is considered to be unique and
can be added to the database of reference templates.
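The selection logic reduces to a minimum search with a uniqueness threshold, as the following C sketch illustrates; the threshold value is illustrative, and the Evaluator's tie-breaking among several close matches is omitted.

/* A sketch of the matching decision: dist[k] holds the DTW distance
   between the candidate and the k-th reference template. */
#define UNIQUE_THRESHOLD 4.0             /* illustrative, not from the thesis */

/* Returns the index of the best-matching reference template, or -1 if
   the candidate is unique and should be added to the database. */
int best_match(const double *dist, int k_templates)
{
    int best = -1;
    double best_dist = 1e30;
    for (int k = 0; k < k_templates; k++)
        if (dist[k] < best_dist) { best_dist = dist[k]; best = k; }
    if (best < 0 || best_dist > UNIQUE_THRESHOLD)
        return -1;                       /* unique: add as a new reference */
    return best;
}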
The DTW template matching process is illustrated in Figure 4.4. For a more complete
introduction to the DTW, the reader is referred to Silverman and Morgan (1990) and
Sankoff and Kruskal (1983). The horizontal axis represents the [1..m] elements of the
unknown candidate C, where m is the number of feature vectors contained in the candidate.
The vertical axis represents the feature vectors [1..n] of a reference template R, n being the
number of feature vectors in R. Each grid intersection point (i,j) represents a possible match
between elements Ci and Rj. A local distance measure d(i,j) is computed at each grid
intersection. This distance measure is a function of the two feature vectors Ci and Rj and
describes the dissimilarity of these two vectors. The DTW in the Recognizer uses the
simple Euclidean distance measure.
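For two feature vectors, the local distance can be computed as follows; this sketch assumes the two features described above (pitch interval and duration ratio), but it works for any n-dimensional feature vector.

#include <math.h>

/* Local distance d(i,j): the Euclidean distance between candidate
   vector Ci and reference vector Rj. Here n_features = 2, but any
   n-dimensional feature vector works. */
double local_distance(const double *ci, const double *rj, int n_features)
{
    double sum = 0.0;
    for (int f = 0; f < n_features; f++) {
        double diff = ci[f] - rj[f];
        sum += diff * diff;
    }
    return sqrt(sum);
}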
After all of the local distances have been calculated, the DTW attempts to trace a path from
endpoint (1,1) to endpoint (m,n) that results in the lowest accumulated distance. As shown in
Figure 4.4, any monotonic path from endpoint (1,1) to endpoint (m,n) represents a possible
mapping or warp of the candidate template onto the reference template. The accuracy of
such a mapping can be measured by summing all of the distances d(i,j) along the path. For a
given candidate/reference pair, the goal is to find the monotonic path through the array that
minimizes the accumulated distance between the endpoints. Instead of tracing every possible
path through the DTW array, we add local and global path restraints to limit the total
number of possible paths. A local restraint used in computing the optimal path through the
DTW array is shown in Figure 4.5. This restraint limits the number of predecessors at a
given grid intersection. As indicated in Figure 4.5, the only possible predecessors of the point
(i,j) are (i-1,j), (i-1,j-1), and (i-1,j-2). The accumulated distance D(i,j) at any point (i,j) on the
grid becomes:

D(i,j) = d(i,j) + min[ D(i-1,j), D(i-1,j-1), D(i-1,j-2) ]
This local path restraint is known as the Itakura local restraint. The Itakura restraint limits
the possible paths through the array to slopes between 0.5 and 2.0. Other local path
restraints are possible and several examples are listed in Silverman and Morgan (1990).
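In C, the accumulated-distance computation under the Itakura restraint can be sketched as follows. This is an illustration of the recurrence above, not the Timewarp source; array bounds are sized for fragments of 4-16 notes, and cells that no legal path can reach keep the sentinel value BIG.

/* d[i][j] holds the local distances, indexed from 1 as in the text. */
#define MAXM 17
#define MAXN 17
#define BIG  1e30

double dtw_accumulate(double d[MAXM][MAXN], int m, int n)
{
    double D[MAXM][MAXN];
    for (int i = 0; i <= m; i++)
        for (int j = 0; j <= n; j++)
            D[i][j] = BIG;
    D[1][1] = d[1][1];                       /* every path starts at (1,1) */
    for (int i = 2; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            double best = D[i - 1][j];                          /* (i-1,j) */
            if (j >= 2 && D[i - 1][j - 1] < best) best = D[i - 1][j - 1];
            if (j >= 3 && D[i - 1][j - 2] < best) best = D[i - 1][j - 2];
            if (best < BIG)
                D[i][j] = best + d[i][j];    /* D(i,j) = d(i,j) + min(...) */
        }
    return D[m][n];                          /* lowest accumulated distance */
}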
The DTW algorithm has a complexity of mn, where m and n are the number of feature
vectors in the candidate and reference templates. It is therefore desirable to further reduce
the number of distance and path calculations required to determine the degree of similarity
between a candidate and a reference template. Global path restraints such as the one shown in
Figure 4.6 require that the optimal path lie within a certain region of the DTW array. Paths
that run outside of this region are rejected. The use of a parallelogram search space (Rabiner
et al., 1978) reduces the complexity of the DTW to approximately nm/3.
Figure 4.4: DTW matching algorithm
Figure 4.5: Itakura Local Restraint
Figure 4.6 illustrates the region of this search space. The local and global path
restraints allow the DTW to compare an unknown candidate with templates that vary from
0.5 to 2.0 times its length. To compare a candidate with K templates, the complexity
becomes K(nm/3). Real-time performance is therefore dependent on the size of K.
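The parallelogram can be expressed as a pair of bounds on j for each column i: one line of slope 0.5 and one of slope 2.0 through each endpoint. The following sketch of these bounds (function names hypothetical) follows from that geometry; the DTW loop then evaluates only j = j_lower(i,m,n) .. j_upper(i,m,n).

/* Rows j within slopes 0.5 and 2.0 of both endpoints (1,1) and (m,n). */
int j_lower(int i, int m, int n)
{
    int a = 1 + i / 2;                   /* slope 0.5 line from (1,1) */
    int b = n - 2 * (m - i);             /* slope 2.0 line into (m,n) */
    int lo = (a > b) ? a : b;
    return (lo < 1) ? 1 : lo;
}

int j_upper(int i, int m, int n)
{
    int a = 2 * i - 1;                   /* slope 2.0 line from (1,1) */
    int b = n - (m - i + 1) / 2;         /* slope 0.5 line into (m,n) */
    int hi = (a < b) ? a : b;
    return (hi > n) ? n : hi;
}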
The DTW compares a candidate template with every reference template contained in the
template database. These reference templates may be loaded into the database in a number
of ways. Templates may be pre-loaded by playing the desired musical fragments into the
Segmenter. The Segmenter converls each fragment into a feature vector template and stores
it into the template database. Templates may be loaded from a data file containing templates
recognized during previous sessions. Templates may also be created by the DTW
Evaluator. With this method, melodic fragments that are not recognized by the DTW are
entered into the template database. It is therefore possible to start with an empty database
and have the system construct a template database that contains templates representing the
unique musical fragments contained in a single musical work. A graphical editor allows for
the display and editing of the recognized fragments and also the correction of incorrectly
segmented or recognized fragments. The flexibility of the template database allows the
system to be used in a variety of melodic recognition tasks. The system can be configured
to respond to only one feature such as pitch or rhythm. Likewise, new features may be
defined and added to the DTW recognition process.
As melodic fragments are recognized by the Recognizer, they are added to the template
memory. A new fragment that does not match any other fragment is labeled a motive. Any
future fragments that are similar to the motive fragment are added to a list of submotives
belonging to the fragment. The motive fragments therefore become the set of templates used
to match a new melodic fragment. Once the closest matching template has been discovered,
the list of related submotives can be searched for an even closer match. This strategy
reduces the size of the search space as the template database grows over repeated listening
sessions.
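The organization described above can be pictured as a two-level linked structure. The following C sketch is hypothetical in its field names (the actual structures appear only as opaque pointers in Appendix A) but reflects the motive/submotive relationship described in the text.

/* Unmatched fragments become motives; fragments recognized as a motive
   are appended to its submotive list, so the DTW scans only the motive
   templates before searching the winner's submotives. */
struct template;                         /* feature vector template (opaque) */

struct submotive {
    struct template  *tmpl;              /* template of this fragment          */
    double            dist;              /* DTW distance to the parent motive  */
    struct submotive *next;              /* next submotive in the list         */
};

struct motive {
    struct template  *tmpl;              /* reference template used by the DTW  */
    struct submotive *submotives;        /* fragments recognized as this motive */
    struct motive    *next;              /* next motive in the template memory  */
};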
Figure 4.6: DTW search space
4.5 Real-time Listening Example
It is now time to examine the real-time operation of the Recognizer. Figure 4.7 shows the
Timewarp application embedded in a Max patch (Puckette and Zicarelli, 1990). The
Timewarp objects are labeled melshape. The music for the real-time listening session is a
MIDI recording of a performance of the Fugue in C minor, BWV 847 by J. S. Bach. The
fugue is played by the playSMF object, which sends the MIDI notes of the fugue in real-
time to the melshape objects. Each instance of the melshape object listens to a single voice of
the fugue. The melshape objects are labeled soprano, alto and bass respectively. The
melshape object contains both the Segmenter and Recognizer objects. The analysis
performed by each of the melshape objects may be viewed by double-clicking on the
melshape object. The analysis display window for the soprano voice is shown in Figure 4.8,
and the bass voice in Figure 4.9.
Figures 4.8 and 4.9 show the operation of the Segmenter and the Recognizer. Each melodic
fragment is labeled with the Segmenter rules and weights that determined the group
boundary. The shape of the best matching template is drawn above each fragment. Five
recognition templates are used in this example to represent the five basic melodic contours
(Scheidt, 1985):
Rising
Falling
Rising/Falling
Falling/Rising
Flat
The shapes of these templates are shown in Figures 4.10 to 4.14. Each template is specified
as a set of 15 interval offsets, enough to represent a fragment of up to 16 notes. The interval
offsets for each template are shown in Figure 4.15.
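In code, such templates are simply short arrays of interval offsets. The values below are illustrative stand-ins, not the contents of Figure 4.15; they merely show the shapes the five contour templates might take.

#define TEMPLATE_LEN 15

/* Illustrative contour templates, one offset per interval. */
static const int rising[TEMPLATE_LEN]    = {  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
static const int falling[TEMPLATE_LEN]   = { -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1 };
static const int rise_fall[TEMPLATE_LEN] = {  1, 1, 1, 1, 1, 1, 1, 0,-1,-1,-1,-1,-1,-1,-1 };
static const int fall_rise[TEMPLATE_LEN] = { -1,-1,-1,-1,-1,-1,-1, 0, 1, 1, 1, 1, 1, 1, 1 };
static const int flat[TEMPLATE_LEN]      = {  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };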
During this listening session, the Segmenter in each melshape object sends melodic
fragments to its Recognizer. The results of listening to the alto voice of the fugue are shown
in Figures 4.16 to 4.19. The display shown in Figure 4.16 shows all of the rising fragments
found in the alto voice. The normalized versions of these fragments are shown in Figure
4.17. Results for rising/falling fragments are shown in Figures 4.18 and 4.19. Results for
the other template shapes are shown in Figures 4.20 to 4.24. These results show that the
Recognizer effectively clusters melodic fragments with similar melodic contours.
Figure 4.7: Timewarp objects in Max environment
Figure 4.8: Segmenter/Recognizer Display
(Input: Soprano voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.9: Segmenter/Recognizer Display
(Input: Bass voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.10: Rising template
Figure 4.11: Falling template
Figure 4.12: Rising/Falling template
Figure 4.13: Falling/Rising template
Figure 4.14: Flat template
Figure 4.15: Recognition Template Editor
Figure 4.16: Recognition Display/Editor showing rising fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.17: Rising melodic fragments (normalized)
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.18: Recognition Display/Editor showing rising/falling fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.19: Rising/falling melodic fragments (normalized)
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.20: Examples of rising fragment recognition
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.21: Examples of falling fragment recognition
(Input: Bass voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.22: Examples of rising/falling recognition
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.23: Examples of falling/rising recognition
Figure 4.24: Examples of flat fragment recognition
4.6 Summary
For this implementation of the Recognizer in the Listener model, the DTW proved to be an
effective algorithm for the recognition of melodic fragments. The system was capable of
recognizing short melodic fragments in real-time. Good results were obtained in the
recognition of melodic fragments from music ranging from Bach fugues to bebop jazz. In
general, the Recognizer was capable of recognizing melodic fragments without specific
knowledge of a musical style. The contour-matching capabilities of the DTW allowed for
recognition of similar melodic fragments by accommodating the effects of insertion,
deletion, replacement, consolidation and fragmentation of elements.
The DTW-based melodic recognition system has been implemented as a Max external
object, a T-Max object for implementation on a parallel array of INMOS T-805 transputers
(Pennycook and Lea, 1991), and as a Macintosh program called Timewarp. The DTW may
also be used as a melodic recognition process for music analysis of printed scores. For
example, the output of an optical music recognition system (Fujinaga, 1992; 1996) may be
used as input to a DTW recognition system. This would enable users to search large
musical databases for similarities of certain musical features. This would also allow DTW
reference template databases to be constructed directly from printed musical scores. Future
research will examine level-building DTW algorithms (Silverman and Morgan, 1990) used
in continuous speech recognition to create higher-level hierarchies from recognized
fragments. As the performance of the recognition system is dependent upon the accuracy of
the Segmenter, continuous speech recognition techniques may allow for continuous
recognition of fragments and remove the need for a segmentation process.
5. Real-time Segmentation and Recognition of Vertical Sonorities
5.1 Introduction
In this dissertation I have presented a model for the segmentation and recognition of
melodic fragments. As part of the research in creating this system I also examined the
problem of real-time segmentation and recognition of vertical sonorities. This resulted in the
realization of a real-time chord recognition component. The chord recognizer was part of a
computer model of a jazz improviser (Pennycook et al., 1993). The jazz improviser model
consisted of two real-time, large-grain parallel processes, called the Listener and the Player
(Pennycook and Lea, 1991). One of the tasks of the Listener was to provide the Player with
information about the ensemble's position in a chord progression. The Listener must
therefore be able to derive the root of a chord given a collection of notes performed by an
ensemble of musicians. The Listener must also determine the quality of a given chord so
that the Player can fit its improvisation to the current harmonic situation.
5.2 Real-time Chord Recognition
The chord recognition process may be broken down into at least two separate processes.
First, notes must be grouped together into vertical sonorities. Second, the vertical sonorities
must be analyzed to determine the root and chord type. Rosenthal (1992) introduced a
chord finder based on the principles of stream segregation. Unfortunately, his method was
not designed to operate in real-time. Rowe (1993) presented a connectionist model for the
real-time recognition of chord roots. His model grouped a stream of notes into vertical
sonorities and determined the central pitch of a local harmonic area. However, Rowe's model
was designed to detect simple tertian structures and was therefore unsuitable for
determining the roots of the harmonic structures found in jazz.
Parncutt (1988) revised Terhardt's (1982) octave-generalized model of the root of a musical
chord, which involves assigning appropriate weights to a set of "root-support intervals" (in
descending order of importance: P1/P8, P5, M3, m7, and M2/M9) in a subharmonic
matching routine. The model reliably predicted the roots of typical chords in 18th- and 19th-
century music, including the traditionally problematic minor triad. We chose to incorporate
Parncutt's model into the Listener because it was not limited to simple tertian harmonies and
could be easily adapted for real-time operation.
Figure 5.1: Chord recognition process
The chord recognition model operates in real-time on a stream of MIDI data (Figure 5.2).
The operation of the model is shown in Figure 5.1. The model first segments the MIDI data
stream into vertical sonorities by grouping together notes with attack times that are within 80
ms of each other (Figure 5.4). The grouping of the chord candidates is shown in Figure 5.3.
When a vertical sonority of two or more notes has been detected by the chord segmenter,
the notes are sent to the chord recognizer. The chord recognizer uses a modified version of
the algorithm described in Parncutt (1988). The recognizer first calculates the pitch-class
weights for all possible roots of the chord. These pitch-class weights are then used to
estimate the "root ambiguity" of the chord. Finally, the calculated root ambiguity value is
used to convert the pitch weights into absolute estimates of salience (Parncutt, 1988). The
model selects the root with the highest pitch-class weight and reports to the Player the root
of the current vertical sonority, along with its chord type and root ambiguity.
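The root-finding step can be sketched as a subharmonic matching loop over the twelve possible roots. The weights below are illustrative stand-ins, not the values of Tables 5.1 and 5.2, and the root-ambiguity and salience steps are omitted; the values are chosen only so that the sketch reproduces the behavior discussed in Section 5.3 and Figure 5.6, where Parncutt-style weights resolve C-Eb-G-Bb narrowly to Eb while the revised jazz-style weights resolve it to C.

#include <stdio.h>

/* weight[k]: support a pitch class lends to a candidate root lying k
   semitones below it. Values are illustrative; in the revised model
   they are user-adjustable (Figure 5.5). */
static const double original[12] = {
    10.0, 0.0, 1.0, 0.0, 3.0, 0.0,   /* P1, -,  M2, -,  M3, - */
     0.0, 5.0, 0.0, 0.0, 2.0, 0.0    /* -,  P5, -,  -,  m7, - */
};
static const double revised[12] = {
     8.0, 0.0, 1.0, 2.0, 4.0, 0.0,   /* P1, -,  M2, m3, M3, -  */
     1.0, 4.0, 0.0, 0.0, 4.0, 1.0    /* TT, P5, -,  -,  m7, M7 */
};

/* Return the root (0 = C .. 11 = B) with the highest pitch-class weight
   for a sonority given as a 12-element presence mask. */
static int estimate_root(const int present[12], const double weight[12])
{
    int best_root = 0;
    double best_w = -1.0;
    for (int root = 0; root < 12; root++) {
        double w = 0.0;
        for (int pc = 0; pc < 12; pc++)
            if (present[pc])
                w += weight[(pc - root + 12) % 12];
        if (w > best_w) { best_w = w; best_root = root; }
    }
    return best_root;
}

int main(void)
{
    int cmin7[12] = { 0 };
    cmin7[0] = cmin7[3] = cmin7[7] = cmin7[10] = 1;            /* C Eb G Bb */
    printf("original: %d\n", estimate_root(cmin7, original));  /* 3 = Eb */
    printf("revised:  %d\n", estimate_root(cmin7, revised));   /* 0 = C  */
    return 0;
}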
Figure 5.2: Piano roll representation of a MIDI performance
(Input: Tenor Madness piano improvisation)
Figure 5.3: Notes grouped into chord candidates
(Input: Tenor Madness piano improvisation)
Figure 5.4: Creating chord candidates in real-time
5.3 Revision of Parncutt's Model
We discovered that Parncutt's model often failed to predict the roots of jazz keyboard
voicings, especially those in which roots or fifths are absent. In Parncutt's version, the
intervals P1/P8 and P5 are weighted quite heavily relative to the other intervals (see Table
5.1). The new weights for the model are shown in Table 5.2. We have reduced the emphasis
on the intervals P1/P8 and P5, and increased the weights for the intervals m3, M3, and m7.
We have also added the root-support intervals TT and M7, which were absent from Terhardt's
and Parncutt's original versions. We have implemented the model in such a way that the
user may easily adjust the interval weights according to the harmonic style of the music to
be analyzed (Figure 5.5).
Table 5.1: Original weights (after Terhardt, 1982)
Table 5.2: Revised Jazz Weights
To compare the models, consider the chord C-Eb-G-Bb in any inversion (Figure 5.6).
Parncutt's model predicts that the root of this chord is ambiguous, with Eb winning by a
small margin. The revised model calculates the root to be C and is thus more consistent with
jazz harmonies and the musical styles in question in this research, in which the interpretation
ii7-V7-I is preferred to IV6/5-V-I. A similar calculation for the Do7 chord is shown in
Figure 5.7. It appears that the relative importance of the root-support intervals depends not
only on experience of the harmonic structure of single complex tones in speech and music,
but also on experience of specific musical styles.
Figure 5.5: Adjustable subharmonic weights
Figure 5.6: Root Calculation for Cmin7 Chord
Figure 5.7: Root Calculation for Do7 Chord
5.4 Real-time Example
The real-time chord recognition system has been implemented as a Max external object
called chord. The use of the chord object in a Max patch is shown in Figure 5.8. In this
example, the chord recognizer is presented with a piano improvisation on the tune Tenor
Madness by Sonny Rollins. The chord progression for this tune is shown in Figure 5.9.
The recognized chords are shown in the chord recognizer display in Figure 5.10. The
recognizer is capable of determining the chord root and chord label, including chord
extensions.
Figure 5.8: Max chord object
Figure 5.10: Recognized Chords
(Input: Tenor Madness piano improvisation)
5.5 Summary
The flexibility of the real-time chord recognizer has been verified with harmonic analyses of
music ranging from Bach to bebop jazz. When the chord recognizer selects a root that is
not the same as the notated harmonic progression, in most cases the root selected is an
appropriate chord substitute. As with Parncutt's original model, the revised model treats
chords as isolated vertical phenomena and does not consider each chord as an element in a
harmonic progression. Future work will involve adapting the model to consider previously
identified roots in its identification of the current chord root.
6. Conclusions
This dissertation presents a working model of melodic fragment recognition. The Listener
model attempts and in many ways succeeds in modeling the perceptual processes used by
human listeners while listening to a musical performance. In many different musical
situations, the model performed the musical recognition tasks accurately. This lends support
to the validity of the hypotheses represented by the model. This computer model of a
listener demonstrated the use of applied musicology for the testing of hypotheses
formulated by perceptual research (hske , 1989). By examining perceptual research into
human melodic recognition, we can eventually develop a working computer model for the
red-time recognition of melodies.
If theories of human musical perception can be validated by computer models, then these
same theories can be used to improve the quality of computer music systems (Laske, 1978).
Computer music systems are any combination of computer hardware and software used to
compose, analyze or perform music. Human perceptual processes must be considered in the
design of these systems (Vercoe, 1992). Valid, perceptually-based theories of music are in
fact essential to computer music systems for "any attempts to simulate the ... abilities of
humans will probably not succeed until in fact the musical models and plans that humans
use are described and modeled" (Moorer, 1972, p.104). Theory is the lifeblood of a
computer music system and "a music analyst approaching the computer quickly realizes that
running out of theory means running out of program code" (Alphonce, 1980, p.26). While
music theory is one of the oldest disciplines in the literate history of civilization, "systematic,
coherent, well-developed and computable music theory is both very young and rather
scarce" (Alphonce, 1980, p.26).
In humans, the formation of appropriate responses to musical stimuli draws upon a vast and
complex network of concepts, learned skills and innate behaviors that represent a formidable
challenge to precise analysis (Dannenberg, 1987). Refined and specialized listening skills
are required of composers, music analysts, and performers. However, considerable portions
of a musical "expert's" abilities may be equally found in "naive" listeners, and even these
non-experts don't seem so naive when one looks more closely at their listening skills. Their
ability to segment and recognize melodic fragments and also to group these fragments into
melodies requires that a large number of complex tasks operate together in real-time. The
ability to perceive melodic patterns at various levels of organization is needed for a full
understanding of a piece of music. It seems that music understanding may rely heavily on
pattern recognition capabilities as well as logical reasoning and problem-solving techniques.
At the present time, computers should not be expected to understand music the way humans
do. But it is possible to build computer music systems that attempt to deal with specific
aspects of musical intelligence. By becoming more aware of both the possibilities and the
limitations of such systems, we may learn how human listeners do what they do while
gradually raising the level of the simulated computer musician's musical abilities.
Appendix A. The Event Record
The following describes the fields of the Event record, a structure used to define each note
received by the Listener model. The Event record groups into one location the MIDI note-
on and note-off information for a given note. The record also contains all of the information
needed to make decisions on group boundaries. Information about the recognized melodic
fragment is also contained in the Event record.
long songNoteNum    a number assigned to the note when it is received.
long begin          the start time of the note in milliseconds. This is the time that the note-on event was received.
long end            the end time of the event in milliseconds. This is the time that the note-off event was received.
long offset         the time in milliseconds from the note-on of this note to the note-on time of the next note.
long duration       the duration of the note in milliseconds. This is the time from the note-on to the note-off.
char segLength      the length of the melodic fragment starting from this note. Value will be 0 if this note is not the start of a fragment.
char class          the Scheidt (1985) class assigned to the fragment if this is the first note of the fragment. Values assigned are:
                    0 - Rising
                    1 - Falling
                    2 - Rising/Falling
                    3 - Falling/Rising
                    4 - Flat
char dist           if this note is the start of a fragment, the dist value is the measure of similarity assigned by the DTW algorithm.
char pitch          the MIDI note value for this note.
char velocity       the note-on MIDI velocity of this note.
char interval       the interval in semitones between this note and the previous note.
char onToOntype the note-on to next note-on rhythmic type value assigned to
the note by the Short Term Memory (STM). The value
ranges from 0-9 based on which rhythmic type is assigned
by the STM.
char onToOfftype the note-on to note-off rhythmic type value assigned to
the note by the Short Term Memory (STM). The value
ranges from 0-9 based on which rhythmic type is assigned
by the STM.
char offToOntype used only for updating the note-off clock in the RT Segmenter.
The note-off to next note-on rhythmic type assigned by the
STM. Values range from 0-9.
char offToOnArtictype the articulation type assigned to the note. Value is one of:
0 - Slur
1 - Staccato
3 - Rest
Boolean beginOfSeg TRUE if this is the beginning of a segment, else FALSE
Boolean N4yesSeg TRUE if the N4 Segmenter determined a group boundary at
this note.
Boolean N3yesSeg TRUE if the N3 Segmenter determined a group boundary at
this note.
Boolean realyesseg TRUE if the RT Segmenter determined a group boundary at
this note.
short N4rules a set of binary flags indicating which N4 Segmenter rule is
in effect at the group boundary.
short N4sum-of-weights the total of all weights used by the N4 Segmenter to
determine the group boundary.
short N3rules a set of binary flags indicating which N3 Segmenter rule is
in effect at the group boundary.
short N3sum-of-weights the total of all weights used by the N3 Segmenter to
determine the group boundary.
short realrules a set of binary flags indicating which RT Segmenter rule is
in effect at the group boundary.
short realsum-of-weights the total of all weights used by the RT Segmenter to
determine the group boundary.
struct event *segEnd if this event is the beginning of a segment, this field is a
pointer to the event at the end of the segment.
struct event *segBegin if this event is the end of a segment, this field is a pointer to
the event at the beginning of the segment.
struct motive *motive if this event is the start of a fragment, this field is a pointer to
the motive assigned to this fragment.
struct submotive *submotive if this event is the start of a fragment, this field is a pointer to
the submotive assigned to this fragment.
struct event *next a pointer to the next Event structure in the linked list of
Events.
struct event *prev a pointer to the previous Event structure in the linked list of
Events.
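For reference, the fields above assemble into a C struct along the following lines; hyphenated names are rendered with underscores, and the Boolean typedef is an assumption.

typedef char Boolean;

struct event {
    long    songNoteNum;           /* number assigned when the note is received  */
    long    begin, end;            /* note-on and note-off times (ms)            */
    long    offset;                /* note-on to next note-on (ms)               */
    long    duration;              /* note-on to note-off (ms)                   */
    char    segLength;             /* fragment length; 0 if not a fragment start */
    char    class;                 /* Scheidt (1985) contour class               */
    char    dist;                  /* DTW similarity, if a fragment start        */
    char    pitch, velocity, interval;
    char    onToOntype, onToOfftype, offToOntype, offToOnArtictype;
    Boolean beginOfSeg, N4yesSeg, N3yesSeg, realyesseg;
    short   N4rules, N4sum_of_weights;
    short   N3rules, N3sum_of_weights;
    short   realrules, realsum_of_weights;
    struct event     *segEnd;      /* end of segment, if this begins one      */
    struct event     *segBegin;    /* beginning of segment, if this ends one  */
    struct motive    *motive;      /* motive assigned, if a fragment start    */
    struct submotive *submotive;   /* submotive assigned, if a fragment start */
    struct event     *next, *prev; /* linked list of Events                   */
};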
References
Alphonce, B. 1980. Music Analysis by Computer - a field for Theory Formation. Computer hfusic Journal. 4,2: 26-35.
Ames, C. and M. Domino. 1992. Cybernetic Composer: An Overview. Understanding Music with AI. Cambridge: AAAI Press. 186-205.
Bartlett, James C. and Jay W. Dowling. 1980. Recognition of Transposed Melodies: A Key-Distance Effect in Developmental Perspective. Journal of Experimental Psychology: Human Perception and Performance. 6(3): 501-515.
Bellman, R. E. 1957. Dynamic Programming. Princeton: Princeton University Press.
Bregman, A. 1990. Auditory Scene Analysis. Cambridge: The MIT Press.
Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge: The MIT Press.
Cope, David. 1992. Computer Modeling of Musical Intelligence in EMI. Computer Music Journal. 16,2: 69-83.
Cope, David. 1991. Computers and Musical Style. Madison: A-R Editions.
Cope, David. 1990. Pattern Matching as an Engine for the Computer Simulation of Musical Style. Proceedings of the 1990 International Computer Music Conference. 288-291.
Croonen, W. L. M., and P. F. M. Kop. 1989. Tonality, Tonal Scheme, and Contour in Delayed Recognition of Tone Sequences. Music Perception. 7,1: 49-68.
Dannenberg, Roger B. and Bernard Mont-Reynaud. 1987. Following an Improvisation in Real Time. Proceedings of the International Computer Music Conference. 241-247.
Dannenberg, Roger B. 1984. An On-Line Algorithm for Real-Time Accompaniment. Proceedings of the 1984 International Computer Music Conference. 193-198.
Davies, J. B. and J. Jennings. 1977. Reproduction of familiar melodies and the perception of tonal sequences. Journal of the Acoustical Society of America. 61,2: 534-541.
Deliège, I. 1987. Grouping Conditions in Listening to Music: An Approach to Lerdahl and Jackendoff's Grouping Preference Rules. Music Perception. 4,4: 325-360.
Desain, Peter and Henkjan Honing. 1995. Computational models of beat induction: the rule-based approach. Artificial Intelligence and Music. 14th International Conference on Artificial Intelligence. 1-10.
Dowling, W. Jay. 1978. Scale and Contour: Two Components of a Theory of Memory for Melodies. Psychological Review. 85(4): 341-354.
Dowling, W. J. 1973. Rhythmic Groups and Subjective Chunks in Memory for Melodies. Perception and Psychophysics. 14,1: 37-40.
Dowling, W. J. and Diane Fujitani. 1971. Contour, Interval, and Pitch Recognition in Memory for Melodies. The Journal of the Acoustical Society of America. 49(2): 524-531.
Dyson, Mary C. and Anthony J. Watkins. 1984. A Figural Approach to the Role of Melodic Contour in Melody Recognition. Perception and Psychophysics. 35,5: 477-485.
Ellis, D. 1992. A Perceptual Representation of Audio. MS Thesis, EECS, Media Laboratory, Massachusetts Institute of Technology.
Fujinaga, I. 1996. Adaptive Optical Music Recognition. Ph.D. dissertation, McGill University.
Fujinaga, I., B. Alphonce, and Bruce Pennycook. 1992. Interactive Optical Music Recognition. Proceedings of the 1992 International Computer Music Conference. 117-120.
Gill, Stanley. 1964. A Technique for Composition of Music in a Computer. The Computer Journal. 6: 129-133.
Grubb, Lorin and Roger B. Dannenberg. 1993. Pattern Processing in Music. Proceedings of the 1994 International Computer Music Conference. 63-69.
Hiller, Lejaren A. 1959. Computer Music. Scientific American. 201(6): 109-120.
Hiller, Lejaren A. and Leonard M. Isaacson. 1959. Experimental Music: Composition with an Electronic Computer. New York: McGraw-Hill.
Hiraga, Yuzuru. 1996. A Cognitive Model of Pattern Matching in Music. Proceedings of the 1996 International Computer Music Conference. 248-250.
Itakura, Fumitada. 1975. Minimum Prediction Residual Principle Applied to Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. ASSP-23: 67-72.
Koenig, G. M. 1970a. Project One. Electronic Music Report 3. Utrecht: Institute of Sonology. (Reprinted 1977, Amsterdam: Swets and Zeitlinger.)
Koenig, G. M. 1970b. Project Two. Electronic Music Report 2. Utrecht: Institute of Sonology. (Reprinted 1977, Amsterdam: Swets and Zeitlinger.)
Laske, Otto. 1989. Introduction to Cognitive Musicology. Journal of Musicology. 9: 1-22.
Laske, Otto. 1981. Composition Theory in Koenig's Project One and Project Two. Computer Music Journal. 5,4: 54-65.
Laske, Otto. 1980. Towards an Explicit Cognitive Theory of Musical Listening. Computer Music Journal. 4,2: 73-83.
Laske, Otto. 1978. Considering Human Memory in Designing User Interfaces for Computer Music. Computer Music Journal. 2,4: 39-45.
Ledley, Robert Steven. 1962. Programming and Utilizing Digital Computers. New York: McGraw-Hill.
Lerdahl, Fred and Ray Jackendoff. 1983. A Generative Theory of Tonal Music. Cambridge: The MIT Press.
Massaro, Dominic W., Howard J. Kallman, and Janet L. Kelly. 1980. The Role of Tone Height, Melodic Contour, and Tone Chroma in Melody Recognition. Journal of Experimental Psychology: Human Learning and Memory. 6(1): 77-90.
Minsky, Marvin. 1986. The Society of Mind. New York: Simon and Schuster.
Minsky, Marvin. 1981. Music, Mind, and Meaning. Computer Music Journal. 5,3: 28-44.
Mongeau, Marcel and David Sankoff. 1990. Comparison of Musical Sequences. Computers and the Humanities. 24: 161-175.
Moorer, James Anderson. 1972. Music and Computer Composition. Communications of the ACM. 15(2): 104.
Parncutt, R. 1988. Revision of Terhardt's Psychoacoustic Model of the Root(s) of a Musical Chord. Music Perception. 6,1: 65-94.
Pennycook, Bruce and Dale Stammen. 1994. A Model of Tonal Jazz Improvisation. Proceedings of the 3rd International Conference on Music Perception and Cognition. 61-62.
Pennycook, Bruce, Dale Stammen, and Debbie Reynolds. 1993. Toward a Computer Model of a Jazz Improviser. Proceedings of the 1993 International Computer Music Conference. 228-231.
Pennycook, Bruce and Chris Lea. 1991. T-MAX: A Parallel Processing Development System for MAX. Proceedings of the 1991 International Computer Music Conference. 229-233.
Pinkerton, Richard C. 1956. Information Theory and Melody. Scientific American. 191: 77-86.
Puckette, M. and D. Zicarelli. 1990. Max: An Interactive Graphical Programming Environment. Menlo Park: Opcode Systems.
Rabiner, L. R., A. E. Rosenberg, and S. E. Levinson. 1978. Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing. ASSP-26: 575-582.
Roads, Curtis. 1979. Grammars as Representations for Music. Computer Music Journal. 3,1: 45-55.
Rolland, Pierre-Yves. 1998. Découverte Automatique de Régularités dans les Séquences et Application à l'Analyse Musicale. Ph.D. Dissertation, Université Paris.
Rolland, Pierre-Yves. 1994. Automated Extraction of Prominent Motives in Jazz Solo Corpuses. Proceedings of the 3rd International Conference on Music Perception and Cognition. 491-495.
Rollins, Sonny. 1957. Tenor Madness. Berkeley: Prestige Music Inc.
Rosenthal, D. 1992. Machine Rhythm: Computer Emulation of Human Rhythm Perception. Ph.D. dissertation, Massachusetts Institute of Technology.
Rosenthal, David. 1992. Emulation of Human Rhythm Perception. Computer Music Journal. 16,1: 64-76.
Rosenthal, David. 1989. A Model of the Process of Listening to Simple Rhythms. Music Perception. 6,3: 315-328.
Rowe, Robert and Tang-Chun Li. 1994. Pattern Processing in Music. Proceedings of the 1994 International Computer Music Conference. 60-62.
Rowe, Robert. 1993. Interactive Music Systems: Machine Listening and Composing. Cambridge, Massachusetts: The MIT Press.
Rowe, Robert. 1991. Machine Listening and Composing: Making Sense of Music with Cooperating Real-Time Agents. Ph.D. dissertation, Massachusetts Institute of Technology.
Sakoe, Hiroaki and Seibi Chiba. 1978. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on ASSP. 26,1.
Sankoff, David and Joseph B. Kruskal (Eds.). 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, Massachusetts: Addison-Wesley.
Scheidt, Daniel J. A. 1985. A Prototype Implementation of a Generative Mechanism for Music Composition. Masters Thesis, Queen's University, Kingston, Ontario.
Schloss, A. 1985. On the Automatic Transcription of Percussive Music - from Acoustic Signal to High-Level Analysis. Ph.D. Thesis, CCRMA, Department of Music, Stanford University.
Silverman, Harvey F. and David P. Morgan. 1990. The Application of Dynamic Programming to Connected Speech Recognition. IEEE ASSP Magazine. 7,3: 6-25.
Smoliar, Stephen W. 1992. Representing Listening Behavior: Problems and Prospects. In Mira Balaban, et al. (Eds.): Understanding Music with AI. Cambridge, Massachusetts: AAAI Press. 53-63.
Smoliar, Stephan W. 1976. Music Programs: An Approach to Music Theory through Computational Linguistics. Journal of Music Theory. 20: 105-131.
Stammen, Dale and Bruce Pennycook. 1994. Real-time Segmentation of Music Using an Adaptation of Lerdahl and Jackendoff's Grouping Principles. Proceedings of the 3rd International Conference on Music Perception and Cognition. 269-270.
Stammen, D., B. Pennycook and R. Parncutt. 1994. A Revision of Parncutt's Psychoacoustical Model of the Root of a Musical Chord. Proceedings of the 3rd International Conference on Music Perception and Cognition. 357-358.
Stammen, Dale and Bruce Pennycook. 1993. Real-time Recognition of Melodic Fragments Using the Dynamic Timewarp Algorithm. Proceedings of the 1993 International Computer Music Conference. 232-235.
Terhardt, E., G. Stoll and M. Seewann. 1982. Algorithm for extraction of pitch and pitch salience from complex tonal signals. Journal of the Acoustical Society of America. 71: 679-688.
Vercoe, Barry. 1992. A Realtime Auditory Model of Rhythm Perception and Cognition. Paper presented at the Second International Conference on Music Perception and Cognition. Los Angeles, California.
Vercoe, Barry. 1985. The Synthetic Performer in the Context of Live Music. Proceedings of the 1984 International Computer Music Conference. 199-200.
Vercoe, Barry. 1971. Harry B. Lincoln, The Computer and Music, and Barry S. Brook, Musicology and the Computer. Perspectives of New Music. 9,1: 323-330.