Abstract
This thesis describes a computer program called Timewarp, which implements a model of
the processes used to segment and recognize melodic fragments while listening to a real-
time musical performance. The model, called the Listener, accepts as input a stream of MIDI
data. The output of the model is a representation of the performance data that includes the
segmentation of the music into melodic fragments and a listing of the recognized melodic
fragments.
The Listener model is divided into two discrete processes: the Segmenter and the
Recognizer. All of the processes operate in real-time and analyze the musical data as it is
performed. The Segmenter uses a preprocessor to provide the model with an interna1
representation of the performance and serves as a short-term perceptual memory that is used
by the other processes. The Segmenter parses individual voices in the performance into
musiwlly relevant fragments. The Recognizer uses the dynamic timewarp algorithm (DTW)
(Itakura, 1975), which was originally developed for time alignment and comparison of
speech and image patterns. The DTW enables the Recognizer to compare and wtegorize the
contour of a new melodic fragment with a collection of previously recognized melodies.
Good results were obtained in the segmentation and recognition of melodic fragments in
music ranging from Bach fugues to bebop jazz.
At the present time, computers cannot be expected to listen to music the way humans do.
But it is possible and instructive to build systems that attempt to deal with specific aspects
of musical intelligence. By becoming more aware of both the possibilities and the
limitations of such systems, one may learn how people listen to music while gradually
raising the musical quality and usefulness of computer music systems.
Résumé
This thesis presents the program "Timewarp," which models the processes used to
segment and recognize melodic fragments while listening to a piece of music. The
model receives a stream of MIDI data and returns a representation of the musical
performance that includes a list of the recognized melodic fragments.
The model divides into two distinct processes, segmentation ("Segmenter") and
recognition ("Recognizer"). These processes operate in real time, analyzing the musical
data as it is received. The segmentation process uses a preprocessing module that
provides the model with a representation of the performance and also serves as a
short-term perceptual memory used by the other processes. It thus extracts the musically
relevant fragments from the individual voices. The recognition process employs the
"dynamic timewarp" algorithm, which was originally developed for the time alignment
and comparison of patterns in speech and images. The algorithm allows the recognition
module to categorize the contour of a new melodic fragment by comparing it to a
catalogue of previously recognized melodies. Good results were obtained in the
segmentation and recognition of melodic fragments in music ranging from Bach to
bebop jazz.
Computers cannot at present "listen" to music the way humans do. It is nevertheless
possible and instructive to build systems that address specific aspects of musical
intelligence. By becoming aware of the possibilities and the limits of such systems, we
can learn how musical listening works while gradually improving the musical quality
and usefulness of computer music systems.
Acknowledgements
I would like first to thank my advisor, Dr. Bruce Pennycook, for his support and supervision
during this project. I would also like to thank the following people who have helped me over
the years: Dr. Robert Rowe, Dr. Gilbert Soulodre, Dr. W. Andrew Schloss, Dr. Ichiro
Fujinaga, Sean Terriah, Sean Ferguson, Anne Holloway, and Jason Vantomme. A special
thanks goes to René Quesnel for translating my abstract into French, and Geoff Mitchell for
performing the musical examples.
I would like to thank my employer, RealNetworks, Inc., for granting me the time off so that
I could complete this dissertation.
Finally, I would like to thank my wife, Kimm Brockett Stammen, for her love and support.
She provided me with the inspiration and confidence to complete this dissertation.
This research was, in part, supported by a research grant from the Social Sciences and
Humanities Research Council of Canada.
Table of Contents
Abstract .............................................................................................. 2
Résumé ............................................................................................... 3
Acknowledgements ............................................................................ 4
Table of Contents ............................................................................... 5
1. Introduction .................................................................................... 8
   1.1 Introduction ............................................................................... 8
   1.2 Program Overview ..................................................................... 9
   1.3 Problem Specification ................................................................ 10
   1.4 Implementations of the Listener Model ..................................... 15
   1.5 Dissertation Roadmap ................................................................ 18
2. The Computer as Listener .............................................................. 19
   2.1 The Birth of the Computer Musician ......................................... 19
   2.2 Lejaren Hiller: The Computer as Composer .............................. 20
   2.3 Stanley Gill: The Self-Correcting Computer Musician .............. 23
   2.4 R. S. Ledley: Criticism and the Need for a Grammar ................ 24
   2.5 G. M. Koenig: The First Artificial Intelligence Computer Musician ... 25
   2.6 Noam Chomsky: The Structured Computer Musician ............... 25
   2.7 Otto Laske: The Listening Computer Musician ......................... 26
   2.8 Marvin Minsky: "Why do we like Music?" ............................... 30
   2.9 Current Implementations of the Computer Musician ................. 35
       2.9.1 Roger Dannenberg: The Computer Musician as Performer ... 35
       2.9.2 David Rosenthal: The Computer Musician as a Listener ....... 36
       2.9.3 Robert Rowe: The Computer Musician as Listener, Composer,
             Performer and Critic ............................................................ 37
   2.10 How do We Recognize and Remember Melodies? .................. 38
   2.11 Conclusion ............................................................................... 41
3. The Segmenter ................................................................................ 43
   3.1 Introduction ............................................................................... 43
   3.2 Grouping Preference Rules ........................................................ 46
       3.2.1 GPR 1 Avoid Small Groups ............................................... 47
       3.2.2 GPR 2 Proximity Rules ...................................................... 47
       3.2.3 GPR 3 Change Rules ......................................................... 48
   3.3 Application of GPR to a Live Performance ................................ 50
   3.4 Event Memory ........................................................................... 58
   3.5 Determining Rhythmic Types .................................................... 59
   3.6 Segmentation Rule Evaluation ................................................... 63
   3.7 RT Segmenter Rules .................................................................. 63
   3.8 N4 Rules .................................................................................... 65
   3.9 N3 Rules .................................................................................... 67
   3.10 Handling Incorrect Segmentation Choices ............................... 67
   3.11 Segment Length ....................................................................... 70
   3.12 Display of Event Data and Segmenter Information .................. 70
   3.13 Tenor Madness: A Real-time Example .................................... 72
   3.14 Summary ................................................................................. 85
       3.14.1 Limitations and Future Work ........................................... 85
4. The Recognizer ............................................................................... 87
   4.1 Introduction ............................................................................... 87
   4.2 Use of Melodic Contour ............................................................ 88
   4.3 Applications of Dynamic Programming to Music ...................... 89
       4.3.1 Other Melodic Fragment Recognition Systems .................. 90
   4.4 Implementation of the Dynamic Timewarp Algorithm ............... 91
   4.5 Real-time Listening Example ..................................................... 97
   4.6 Summary ................................................................................... 112
5. Real-time Segmentation and Recognition of Vertical Sonorities .... 113
   5.1 Introduction ............................................................................... 113
   5.2 Real-time Chord Recognition .................................................... 113
   5.3 Revision of Parncutt's Model .................................................... 117
   5.4 Real-time Example .................................................................... 121
   5.5 Summary ................................................................................... 125
6. Conclusions .................................................................................... 126
Appendix A. The Event Record .......................................................... 128
References .......................................................................................... 133
1. Introduction
1.1 Introduction
In 1981, Marvin Minsky asked the simple question, "Why do we like certain tunes?"
(Minsky, 1981). In response to his own question, Minsky offered two possible answers: we
like melodies because they have certain structural features or we like them because they
resemble other tunes we like. The first answer has to do with the laws and rules that make
tunes pleasant. The second answer forces us to look not at the tune itself, but at ourselves
and how we perceive music. It causes us to look at how we actually listen to music.
However, Minsky's use of the word resemble forces us to ask another seemingly simple
question: How do we know if two tunes resemble each other? As we shall see, this simple
question will prove to be difficult to answer.
During a live musical performance, the sound that is generated by an ensemble travels
through the air to a listener's ears and stimulates various parts of her brain. She somehow
manages to organize this stream of sound into the notes played by each individual
instrument. She may then group the notes into motives and the motives into phrases. She
may then compare these melodic fragments with other fragments that she has previously
heard. As she listens, she becomes familiar with the music and may even think that she has
heard the piece before. After the performance, the listener may even be able to hum a few of
the melodies from the performance. If she were to hear the piece in a future performance,
chances are she would quickly recognize that she had heard it before. But what are the
perceptual and cognitive processes that enable her to do all of this with little or no effort on
her part?
How does the listener segment the music into melodic fragments and then recognize if she
has previously heard any of these fragments? Traditional musicology, through the analysis
of musical scores and historical documents, cannot help us find the answer to this question
(Laske, 1989). During the past few decades, researchers of musical perception have studied
how listeners recognize and remember melodies. The fields of cognitive musicology and
music perception now offer theories on the recognition process. But how can we test if
these theories are valid?
1.2 Program Overview
This thesis describes a computer program called Timewarp, which implements a model of
the real-time processes used to segment and recognize melodic fragments while listening to
a musical performance. The model, called the Listener, accepts as input a stream of MIDI
data that represents the sound produced by the musical performance. MIDI, or Musical
Instrument Digital Interface, is a standard way of recording performance information from
electronic instruments. A MIDI representation of a performance is called a sequence and
can be stored in a standard MIDI file (SMF). The MIDI recording contains basic
performance information including the pitch of each note (MIDI note number), the onset
and release times for each note, and how loudly the note was played (MIDI velocity). This
information contains enough detail to accurately reproduce a musical performance.
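The note-level content of such a MIDI recording can be summarized by a simple record.
The following sketch is illustrative only; the field names are hypothetical, and the actual
Event record used by Timewarp is described in Appendix A.

    from dataclasses import dataclass

    @dataclass
    class NoteEvent:
        """One note of a MIDI performance (illustrative; field names are hypothetical)."""
        pitch: int       # MIDI note number, 0-127 (60 = middle C)
        velocity: int    # key velocity, 0-127; a rough measure of loudness
        onset_ms: int    # time the key was pressed, in milliseconds
        release_ms: int  # time the key was released, in milliseconds

        @property
        def duration_ms(self) -> int:
            # Performed duration: release time minus onset time.
            return self.release_ms - self.onset_ms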
The output of the model is a graphical presentation of the performance that includes the
segmentation of the music into melodic fragments and the recognized melodic fragments.
The Timewarp application is not a complete melodic recognition system as it does not
attempt to recognize entire melodies or works of music. Instead the system looks at the
problem of segmenting a stream of music data into musically significant melodic fragments
of 4 to 16 notes in length.
Figure 1.1: The Listener Model
The Listener model is divided into two discrete processes, the Segmenter and the
Recognizer. These processes are shown in Figure 1.1. All of the processes operate in real-
time and analyze the musical data as it is performed. The Segmenter contains a preprocessor
that provides the model with an internal representation of the performance and serves as a
short-term perceptual memory that is used by the other processes. The Segmenter parses
individual voices in the performance into musically relevant fragments. The Recognizer uses
the dynamic timewarp algorithm (DTW) (Itakura, 1975), which was originally developed for
time alignment and comparison of speech and image patterns. The DTW enables the
Recognizer to compare and categorize the contour of a new melodic fragment with a
collection of previously recognized fragments.
1.3 Problem Specification
The process of listening to a stream of music may be broken down into a set of simpler
tasks. Each of these tasks proves to be a complex problem for a computer. As we shall see,
programming a computer to perform the simple task of listening to a monophonic melody is
anything but simple.
1.3.1 Source Separation
When listening to the performance of a musical ensemble, a human listener is able to
identify the various instruments in the ensemble or, in the case of a Bach fugue, the
individual voices of the fugue. This task is called source separation and it is a
crucial step in listening to music. Source separation allows the listener to connect the
individual notes in the music to each other, thereby enabling the listener to hear the
individual parts of the musical score.
Source separation, especially from acoustic or digital sources, has proven to be an extremely
difficult task for computers. Several researchers have devoted entire dissertations and books
to the subject. Good examples of this work are found in Schloss (1985), Bregman (1990)
and Ellis (1992). For the purposes of this research, the Listener model sidesteps the
complexities of source separation by accepting MIDI as its input. MIDI allows each
instrument or voice in the ensemble to be assigned a specific MIDI channel. A Listener is
then allocated to listen to each channel, thereby ensuring that each Listener receives a
monophonic input representing a single voice in the ensemble.
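A minimal sketch of this channel-based source separation follows, assuming a hypothetical
Listener class and a router that receives (channel, event) pairs; in the actual implementation,
one melshape object is simply attached to each MIDI channel in Max (see Section 1.4).

    # Illustrative sketch: route each MIDI channel to its own Listener so that
    # every Listener receives a monophonic stream (all names are hypothetical).
    class Listener:
        def __init__(self, channel: int):
            self.channel = channel

        def handle(self, note_event) -> None:
            ...  # segmentation and recognition of this voice happen here

    listeners: dict[int, Listener] = {}

    def route(channel: int, note_event) -> None:
        if channel not in listeners:
            listeners[channel] = Listener(channel)  # allocate a Listener per channel
        listeners[channel].handle(note_event)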
1.3.2 Segmentation
One of the most important tasks in the Listener model is the segmentation of the musical
data into melodic fragments. These fragments must make sense musically in that they must
be aligned with motivic or phrase boundaries. The location of the end of a melodic fragment
is called a group boundary for it denotes the end of one fragment and the start of the next
fragment. An example of a group boundary is shown in Figure 1.2. This example uses four
notes, N1, N2, N3 and N4, to locate the group boundary at N2. N2 is perceived as a group
boundary due to the length of the half note in relation to the surrounding quarter notes.
Figure 1.2: Detecting a group boundary
Figure 1.3 shows the segmentation of a short musical example. In this fugue subject there
are four melodic fragments. The Segmenter must locate these group boundaries in real-time
and pass the fragments on to the Recognizer. The Recognizer is then responsible for
recognizing that fragments 1, 2 and 3 are similar. Chapter 3 will discuss the operation of the
Segmenter in greater detail.
Figure 1.3: Segmentation into melodic fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
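The rule illustrated in Figure 1.2 can be stated as a simple test over a four-note window.
The sketch below is a crude illustration only; the 1.5 ratio is an invented threshold, and the
Segmenter's actual rules are developed in Chapter 3.

    def is_group_boundary(n1: float, n2: float, n3: float, n4: float,
                          ratio: float = 1.5) -> bool:
        # Flag a boundary after the second note (N2) when its duration is
        # markedly longer than those of its neighbours, as with the half note
        # among quarter notes in Figure 1.2. The 1.5 ratio is an illustrative
        # assumption, not the Segmenter's actual rule.
        return n2 > ratio * max(n1, n3, n4)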
1.3.3 Quantization
Before the Segmenter can determine the location of group boundaries, it must examine
certain characteristics of every note. The duration of each note is an especially important
feature. The Segmenter examines individual note durations to determine the location of a
group boundary. The model would therefore prefer that all quarter notes were performed
with the exact same duration. For example, Figure 1.4 illustrates a precise performance of 5
quarter notes where each quarter note is 500 milliseconds (ms) in length. The Segmenter
would easily detect that all of these notes are quarter notes.
Figure 1.4: Exact performance (onsets at 0, 500, 1000, 1500 and 2000 ms)
However, performance of music, even by highly trained musicians, rarely provides for the
exact execution of durations. Such a performance would be perceived as robotic and
expressionless. Musicians vary the duration of each note in the performance for expressive
purposes and also due to the fact that there is an innate variability to our motor functions. A
"typical" performance of the quarter notes may be similar to the one shown in Figure 1.5. In
this example only the fourth quarter note is 500 ms in duration. The others vary from 450
ms to 570 ms in duration.
Figure 1.5: Typical performance (onsets at 0, 480, 1050, 1600 and 2100 ms)
This variability problem is exacerbated when accelerandos or ritardandos are performed. In
these situations, each individual quarter note is longer or shorter than the previous note. In
the example shown in Figure 1.6, the accelerando creates successive quarter notes that are
approximately 10% shorter in duration than the previous quarter note. Situations such as
these are difficult for computers to handle. The quarter notes must be quantized, or made
equivalent, before the Segmenter can properly examine them. Chapter 3 describes how the
use of a short-term memory in the Segmenter handles the variability of note duration as well
as other musical features.
Figure 1.6: Accelerando performance (onsets at 0, 500, 950, 1350 and 1700 ms)
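One simple way to make performed durations comparable is to express each one as a
multiple of a running beat estimate that adapts to the local tempo. The sketch below is
illustrative only; the exponential smoothing constant is an invented parameter, and
Chapter 3 describes the short-term memory the Segmenter actually uses.

    def classify_durations(durations_ms: list[float], alpha: float = 0.3) -> list[float]:
        # Express each performed duration as a multiple of a smoothed beat
        # estimate. Because the estimate tracks gradual tempo changes, the
        # successively shorter quarters of Figure 1.6 all remain closer to one
        # beat than to a half beat. Alpha is an illustrative assumption.
        beat = durations_ms[0]  # seed the beat estimate with the first duration
        ratios = []
        for d in durations_ms:
            ratios.append(d / beat)                # duration in beats
            beat = (1 - alpha) * beat + alpha * d  # adapt to the local tempo
        return ratios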
1.3.4 Recognition
Once the Segmenter has detected the location of a group boundary, the melodic fragment is
sent to the Recognizer. The Recognizer uses the dynamic timewarp algorithm (DTW),
which was originally developed for time alignment and comparison of speech and image
patterns. For example, if you were to hear the word ball spoken quickly, "ball" or slowly,
"baaaall", you would have no problem understanding the word even though the sonic
characteristics of the word have been temporally altered between enunciations. This same
situation exists in music performance. Even though the notes on the musical score are
precisely notated, their realization by a human performer will contain many variations from
the score due to musical interpretation and technical accuracy. Composers will also modify
and develop motives and phrases by varying their note structures. For example, the Bach
fugue subject presented in Figure 1.3 contains three iterations of the same motive where
each iteration has been slightly modified. The task for the Recognizer is to detect that these
three fragments are in fact musically similar.
As will be presented in Chapters 2 and 4, research into melodic perception suggests that
melodic contour is one of the most salient features used to remember and recognize melodic
sequences (Bartlett and Dowling, 1980). In order to model this behavior, the Recognizer
uses the DTW to match the melodic contour of a new melodic fragment with a set of
previously recognized contours.
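The core of this matching step is ordinary dynamic programming. The following is a
minimal sketch of a DTW comparison between two contour sequences, not the Recognizer's
implementation (Chapter 4 describes that); the absolute-difference local cost and the
symmetric step pattern are assumptions.

    def dtw_distance(a: list[float], b: list[float]) -> float:
        # Dynamic timewarp distance between two melodic contours, each given
        # as a sequence of pitch intervals. The algorithm finds the monotonic
        # alignment of the two sequences that minimizes the summed local cost,
        # so fragments that differ mainly in timing (stretched or compressed)
        # still compare as similar.
        inf = float("inf")
        n, m = len(a), len(b)
        d = [[inf] * (m + 1) for _ in range(n + 1)]  # d[i][j]: best cost of a[:i] vs b[:j]
        d[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                d[i][j] = cost + min(d[i - 1][j],      # a stretches
                                     d[i][j - 1],      # b stretches
                                     d[i - 1][j - 1])  # step together
        return d[n][m]

For example, dtw_distance([2, 2, -1, -2], [2, 2, 2, -1, -2]) evaluates to 0.0: the extra
repeated interval in the second contour is absorbed by a vertical step in the alignment,
exactly the behavior needed to match the quick and slow renditions of "ball" described
above.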
1.4 Implementations of the Listener Model
The Listener model has been implemented as a computer program called Timewarp. The
program runs both as a stand-alone Macintosh computer application called Timewarp and
as a Max (Puckette and Zicarelli, 1990) external object called melshape. The Max version of
Timewarp is shown in Figure 1.7. In this Max patch, there are three melshape objects, each
listening to a separate voice of the Bach fugue. The separate voices of the fugue have been
assigned to separate MIDI channels. The fugue is realized in real-time by the playSMF
object which plays Standard MIDI files.
The results of the listening session are shown in a display window (Figure 1.8). In this
window, the location of the group boundaries are labeled by the Segmenter. The recognized
melodic contours are drawn above each fragment. Using this display, it is possible to verify
the performance of the Listener model.
Figure 1.7: The Listener Model in the Max Environment
Figure 1.8: Timewarp application display window
1.5 Dissertation Roadmap
Chapter 2 presents previous work by various researchers that is relevant to this dissertation.
Chapter 3 discusses in detail the operation of the Segmenter component. Chapter 4 presents
the Recognizer component and describes its use of the DTW. Chapter 5 presents a chord
recognition component. While this component is not directly part of this thesis, it provides
a real-time process for the segmentation and recognition of vertical sonorities.
Concluding remarks are presented in Chapter 6. Appendix A describes the structure of the
Event record used by the Segmenter.
2. The Computer as Listener
"How do we know if two tunes resemble each other?" The concept of the Listener model
developed from the effort to answer this question. The Listener rnodel emulates the
processes used to segment and recognize melodic fragments while listening to a real-time
musical performance. The idea of a computer as a musical listener has developed over the
past sevenl decades as researchers in many diverse fields have tried to answer similar
musical questions. Can a computer be programmed to compose music? Or analyze a piece
of music? From our vantage point ai the end of the 2dh century, it is possible to look back
over the previous four decades and trace the growth of what 1 will cal1 a "computer
musician" (Le. a computer system that is able to compose, analyze or listen to music). In
this chapter, 1 will follow the development of the computer musician betweeti 1956 and the
present. More importantly, 1 will examine the computer's ability to model human perception
Q of music. The research presented has proven to be most intluential in the development of the
Listener model.
2.1 The Birth of the Computer Musician
The idea of the computer musician began with a very simple question. "What makes the
melodies of simple nursery tunes so appealing?" As we shall see, this seemingly simple
question is not easily answered. Richard Pinkerton asked this question in a 1956 Scientific
American article entitled "Information Theory and Melody" (Pinkerton, 1956). In asking
this question, Pinkerton was searching for a universal set of rules or features that could
define "appealing" melodies. But do these universal rules even exist?
Pinkerton chose to define music as a form of communication. By using this definition, it
was then possible to apply communication theory to music. Communication theory is based
on the concept of entropy, a numerical index of disorder. When there is a lot of uncertainty
and disorder, entropy is high. Likewise, when there is much symmetry or patterned
arrangement, the entropy is low. According to Pinkerton, a composer must make the entropy
of a melody low enough to give it some sort of pattern and at the same time high enough so
that it has sufficient complexity to be interesting. Maximum entropy would mean that all the
notes would have equal probability of being chosen. By applying information theory to
music, it would be possible to calculate the entropy or average information per note for
certain kinds of elementary melodies. This in turn would give an indication of the meaning
or information that could be expressed by such melodies. Pinkerton discovered that a certain
amount of redundancy or repetition is necessary in order to have tuneful melodies.
Pinkerton's article concluded that melody, rhythm and harmony could all be fitted into a
statistical scheme, and that it was therefore possible to build machines that could compose
music. A set of tables could be constructed that would "compose Mozartian melodies or
themes that would out-Shostakovich Shostakovich" (Pinkerton, 1956, p. 86).
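Pinkerton's "average information per note" is Shannon's entropy applied to the
probabilities of the notes. A worked illustration (the note distribution is invented for the
example):

    H = -\sum_{i} p_i \log_2 p_i

For a melody that uses C half the time, G a quarter of the time, and E and A an eighth of
the time each, H = 0.5(1) + 0.25(2) + 2 \times 0.125(3) = 1.75 bits per note, below the 2-bit
maximum that four equally probable notes would give. The redundancy Pinkerton found
necessary for tunefulness is precisely this gap between actual and maximum entropy.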
Pinkerton's ideas, while being overly optimistic, did in fact outline some of the early hopes
for a computer musician. From the very beginning, one of the first musical applications of
the computer was composition. But contrary to Pinkerton's claims, we still do not have
computers that can "out-Shostakovich Shostakovich". Moreover, Pinkerton's definition of
"meaning" as the degree of entropy of a melody really does not bring us any closer to
understanding why and how our minds like certain tunes.
2.2 Lejaren Hiller: The Computer as Composer
In 1959 another researcher asked the more ambitious question, "Can a computer compose a
symphony?" (Hiller, 1959). Lejaren Hiller believed it was possible based on the following
reasons:
1) Music is a sensible form governed by the laws of organization which permit fairly exact
codification. Therefore computer-produced music which is "meaningful" is possible as
far as the laws of music organization are codifiable.
2) Computers can be used to create a random universe in accordance with imposed rules,
musical or otherwise.
3) Since the process of creative composition may be thought of as an imposition of order
upon an infinite variety of possibilities, a fairly close approximation of the composing
process may be done with a computer.
Hiller, like Pinkerton, viewed music as "a compromise between chaos and monotony" and
as an "ordered disorder lying somewhere between complete randomness and complete
redundancy" (Hiller, 1959, p. 110). Hiller recognized that the appreciation of music involved
not only psychological needs and responses, but meanings imported into the musical
experience by reference to its cultural context. However, Hiller chose not to include this side
of music in his version of the computer musician. Instead, he looked at what he called the
objective side of music which he defined as existing in the score apart from the composer
and listener. The information encoded there relates to such quantitative entities as pitch and
time and "is therefore accessible to rational and ultimately mathematical analysis" (Hiller,
1959, p. 110). From this reasoning it is apparent that Hiller believed universal rules of
music composition existed, and that these rules could be separated from the "human" aspect
of music composition.
Hiller was ahead of his time in his consideration of the aesthetic aspects of music composed
by a computer. Hiller believed the aesthetic significance or value of a music composition
depended considerably upon its relationship to our inner mental and emotional transitions.
This relationship is largely perceived in music through the articulation of musical form or
the semantic content of music, and this in turn could best be understood in terms of the
technical problems of musical composition. Since the articulation of musical forms is the
primary problem faced by the composer, it seemed most logical to Hiller to start his
investigation of computer composition by attempting to restate the techniques used by
composers in terms both compatible with information theory and translatable into computer
programs utilizing sequential-choice operations as a basis for music generation. Using this
objective viewpoint, Hiller derived five basic principles involved in music composition
(Hiller and Isaacson, 1959):
1) The formation of a piece is an ordering process in which specified musical elements are
selected and arranged from an infinite variety of possibilities, i.e. from chaos.
2) Both order and chaos contribute to the musical structure.
3) The two most important dimensions of music on which a greater or lesser order can be
imposed are pitch and time.
4) Because music exists in time, memory and instantaneous perception are required in the
understanding of musical structures.
5) Tonality, a significant ordering concept, is considered the result of establishing pitch
order in terms of memory recall.
Hiller's process of generating computer music was divided into two basic operations. In the
first operation, the computer generated random sequences of integers which were equated to
the notes of the scale, rhythmic patterns, dynamics, etc. In the second operation, each
random number was screened through a series of arithmetic tests expressing various rules
of composition and was either used or rejected depending on which rules were in effect. If
accepted, the random integer was used to build up a "composition". If it was rejected, a new
integer was generated and examined. The process was repeated until a satisfactory note was
found or until it became evident that no such note existed, in which case the composition
thus far was erased to allow a fresh start. Using this method, the Illiac computer at the
University of Illinois composed the Illiac Suite for string quartet (Hiller and Isaacson,
1959).
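Hiller's two operations amount to a generate-and-test loop. The sketch below is
illustrative only; the placeholder rule stands in for the Illiac Suite's actual counterpoint
tests, and the pitch range and retry limit are invented parameters.

    import random

    def compose(length: int, rules, pitches=tuple(range(60, 73)), max_tries: int = 200):
        # Generate-and-test in the style Hiller describes: random candidates
        # are screened by rule functions; a rejected candidate triggers a new
        # random draw, and if no acceptable note turns up the piece so far is
        # erased for a fresh start, as in Hiller's procedure.
        piece: list[int] = []
        while len(piece) < length:
            for _ in range(max_tries):
                candidate = random.choice(pitches)
                if all(rule(piece, candidate) for rule in rules):
                    piece.append(candidate)
                    break
            else:              # no satisfactory note was found:
                piece.clear()  # erase the composition and start afresh
        return piece

    # Example placeholder rule: forbid melodic leaps larger than a fifth.
    no_big_leaps = lambda piece, note: not piece or abs(note - piece[-1]) <= 7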
2.3 Stanley Gill: The Self-Correcting Computer Musician
Stanley Gill (1964) pointed out that the main difficulty with the process of alternate random
generation and selection in computer composition was that the computer may lead itself into
a dead end. In other words, the computer musician would continue in one direction until the
rules that governed its selection process made it impossible for the composition process to
continue. This was a problem also acknowledged by Hiller. It was therefore desirable to
allow the computer to backtrack so it could re-examine alternative choices at an earlier point
in the composition. Gill used a technique that retained at any moment not one, but eight
competitive versions of the partial composition, each completely specified up to a certain
point, but not necessarily the same length. The generation process took one of these partial
compositions, or sequences, at random and extended it according to the compositional rules
and criteria. Its value was then compared with the other existing sequences. The weakest
sequence was then rejected and the whole process repeated. Each sequence was linked
backwards in time from the end to the beginning. At the end of the composition, the
sequence with the highest value was chosen. As we shall see, this concept of competing
versions is a central idea in Minsky's theories. Gill's ideas gave his computer musician a
limited ability to correct itself.
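In modern terms, Gill's scheme is a beam search over partial compositions. A minimal
sketch with a beam of eight, assuming hypothetical extend and value functions standing in
for Gill's compositional rules and evaluation criteria:

    import random

    BEAM_WIDTH = 8  # Gill retained eight competing partial compositions

    def gill_compose(length: int, extend, value):
        # extend(seq) returns seq with one rule-abiding note appended;
        # value(seq) scores a partial composition (including an empty one).
        # One partial composition is extended at random, then the weakest of
        # the resulting nine is discarded, so unpromising lines of development
        # die off instead of leading the composer into a dead end.
        beam = [[] for _ in range(BEAM_WIDTH)]
        while min(len(seq) for seq in beam) < length:
            seq = random.choice(beam)          # pick one partial composition
            beam.append(extend(list(seq)))     # extend a copy of it
            beam.remove(min(beam, key=value))  # reject the weakest sequence
        return max(beam, key=value)            # keep the highest-valued piece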
2.4 R. S. Ledley: Criticism and the Need for a Grammar
The question now arises whether composition of music by computer is really creativity. R.
S. Ledley briefly described the use of a computer in musical composition as part of his
discussion on programming a computer to achieve intelligence. In particular, he referred to
the use of computers for creative purposes. Creativity "produces structure out of disorder,
form out of chaos, but structure and form must meet aesthetic requirements as well"
(Ledley, 1962, p. 371). In Ledley's opinion there was a lack, in computer-generated music, of
some of the necessary ingredients of creativity including:
1) Over-all planning or direction was missing, leading to a sense of incompleteness.
2) The resulting music was digressive, lacking symmetry and the recursive building of
ideas.
Ledley identified two problems computers have in the composition process:
1) The problem of imparting form, direction and unity to a particular musical composition.
2) The problem of comprehending an over-all, across-the-board characterization of a style,
i.e. an abstraction that is recognizable in collections of compositions by a single
composer.
Ledley believed that the solutions to these problems would probably involve the notion of
syntactical concept formation, which would give the computer the ability to comprehend
musical abstractions and to use such abstractions as a guide to creativity (Ledley, 1962, p.
375). In essence, Ledley's criticism indicated that the current computer musician was unable
to generate satisfactory structures because it did not have full use of a grammar.
2.5 G. M. Koenig: The First Artificial Intelligence Computer Musician
Between 1965 and 1970, two programs written by G. M. Koenig, Project One and Project
Two (Koenig, 1970a; 1970b), represented a first step toward an artificial intelligence (AI)
view of a computer musician (Laske, 1981). These programs embodied a composition theory
about the processes used by composers to compose a musical work. These programs were
the first knowledge-based systems for composition in which a composer could define a
series of steps or rules that led from the overall design of a composition to the more detailed
specification of the musical surface. Koenig's programs offered the first computer-assisted
composition system for composers, where a composer guided the programs to the final
musical output by specifying a wide variety of parameters to the system. The human
composer guided the computer musician to generate music that had the potential to
overcome some of Ledley's criticisms. The need for a Listener, a process to listen to and
evaluate the output of the computer musician, is evident from Koenig's work.
2.6 Noam Chomsky: The Structured Computer Musician
In 1965, Noam Chomsky presented his concept of a genentive grammar. Chomsky defines
a generative language as a "system of rules that can iterate to generate an indefinitely large
number of structures" (Chomsky, 1965, pp. 15-16). Chomsky's system of rules for a
generative grammar was divided into three major components. These were called the
syntactical, phonological, and semantic components. According to Chomsky, the syntactical
component specified an infinite set of abstract formal objects, each of which incorponted al1
information relevant to a single interpretation of a particular sentence. The phonological
component of a grammar determined the phonetic form of a sentence generated by the
syntactic rules. Finally, the semantic component determined the semantic meaning of a
sentence. This semantic component related a structure generated by the syntactic component
to a certain semantic representdtion (Chomsky, 1965). The syntactic component of a
grammar must specify, for each sentence, a "deep structure that determines the syntactic
interpretation and a surface structure that determines it phonetic interpretation" (Chomsky,
1965, p.15-16). The deep structure would be interpreted by the semantic component and the
surface structure by the phonetic component.
Looking back at Ledley's criticisms in view of Chomsky's theory, it was a lack of deep
structure that was missing from the compositions generated by an information theory-based
system. The computer musician did have a syntactical component for generating sequences.
This syntax was the rules used in the selection of note values from the randomly generated
numbers. However, the computer was unable to interpret the deep structure of the sequences
it generated, and therefore had no way in which to evaluate the structures it composed. It
was totally incapable of any semantic interpretation. Chomsky's theories of generative
grammars proved to be quite influential on the later development of a computer musician.
2.7 Otto Laske: The Listening Computer Musician
Otto Laske regarded semantic processing as a matter of reconstruction. A "listener who
perceives a musical event may be said to understand that event if he is capable of specifying
how it may be reproduced" (Smoliar, 1976, p. 112). The problem of constructing semantic
structures may be regarded as decompilation. The acoustic signal, as perceived by the
listener, is the "lowest-level language" representation of musical information. The computer
musician would need to decompose the acoustic signal into a "machine-level" representation
of the information. A representation in terms of notes may be said to be a high-level
reconstruction of the machine-level information. This "higher level language must
incorporate some sort of model of musical perception, since reconstruction can only arise
from the listener's perceptual activities" (Smoliar, 1976, p. 113). The computer musician of
the 1970's could not decompose and reconstruct semantic meanings from its own
compositions.
Notice that in this discussion of a computer musician's ability to compose music, we are
suddenly talking about listening! In order for a computer musician to be able to compose or
perform like a human, it must be aware of perceptual processes. This need for a perceptual
model was expressed by several researchers in the early 1970's. For example, in 1971,
Barry Vercoe stated:
"We seem to be without a sufficiently well-defined 'theory' of
music that could provide that logically consistent set of
relationships between the elements which is necessary in
order to program, and thus specify, a meaningful substitute
for our own cognitive processes." (Vercoe, 1971, p. 324)
During the early years of the 1970's, Otto Laske formed many of his theories of human
cognition of music. Since 1970, Laske has adopted a procedural view of music, regarding it
as a set of cognitive tasks people are able to perform. According to Laske, theories of music
needed to acknowledge this task-dependency. Studies in computer-aided composition of
music would be relevant both for providing tools for creative activity and for understanding
such an activity. Laske (1978) discussed the parallels between computer music systems and
a model of human memory. From a cognitive point of view, one similarity is provided by the
concept of an information-processing system as a distributed memory system. Cognitive
psychology views the human mind as a set of distinct but procedurally-interrelated buffers.
The defining elements of these buffers are parameters like time-constant of storage, transfer
rate, access-time of a buffer, capacity in number of symbols, and others. These are the same
attributes that define computer memories. According to Laske, intelligent computer music
systems needed to be based on an understanding of this parallelism.
Laske described a model for human memory as a chain of submemories consisting of
echoic, perceptual, short-term, working, contextual and long-term memories. Human
memory primarily functions by storing the temporal structure of occurring events and
constructing successive internal representations of these events. Sonic event data enters our
minds and is first stored in our echoic memory, a temporary memory buffer that contains
the most recent sounds that we have heard. The contents of the echoic memory are not
perceived by the listener until they have been stored in the perceptual memory. This
perceptual memory is capable of storing up to several seconds of event data. The transfer of
event data from the perceptual memory to the short-term memory is called the conscious-
time. During this time the listener may become aware of certain perceptual configurations
such as pitch, timbre, duration, meter and loudness. It is at this level of perception where we
perceive the musical present. Any information pushed out of this memory will be lost if it
has not been transferred to long-term memory.
Laske theorizes that there is also a higher cognitive level which he labels the "interpretive-
time". It is at this level "where the illusion of lasting time is created by memory through
interpretations of events on a high level of abstraction" (Laske, 1978, p. 40). These high-
level events will be stored in long-term memory. It is our long-term memory that allows us
to remember the musical past.
According to Laske, our working memory handles all of the interactions between our
perceptual, short-term and contextual memories. The contents of the working memory are
the musical present of which we are conscious. Our musical past may be divided into two
types, the cultural past and the immediate past. The immediate past is often referred to as
our "musical context", which may be considered to be a semantic model of the current
auditory world of a listener. This musical context is thought to be stored in a portion of our
long-term memory called the contextual memory. This contextual memory is the currently
active portion of our long-term memory. Laske believes that this contextual memory can
function in either a syntactical or semantic mode.
The syntactic mode is able to define structural representations of music, thus making it
possible to distinguish different levels of musical structure. Music syntactic networks are
comparable to tree representations of the hierarchy of the structural levels of music.
Semantic concepts are the listener's interpretation of the musical structure and structural
levels. They are bound to a music's past, both in its music tradition and the immediate past.
A semantic network may be thought of as a linked-list of musical interpretations. A listener
can switch between these two modes while listening to a musical performance.
In a later article Laske (1980) developed a cognitive theory of the music listening process.
According to Laske, music understanding occurs through the mapping of musical structures
into memory, where musical pasts are stored. The "perception of music is made possible by
concepts in memory that represent musical pasts that act as precedents to which new sonic
events may be matched" (Laske, 1980, p. 75). The listener therefore generates a sequence of
current pasts of the music. A musical experience is "the total of all current pasts a listener
has construed during a listening session" (Laske, 1980, p. 77).
In Laske's view of the listening process, the listener makes an initial musical interpretation.
As the music progresses, the listener maintains this interpretation as long as it continues to
hold true. However, there will gradually emerge another possible interpretation as a result of
a newly arrived set of perceptual features. The listener must gradually unlearn the old
interpretation while acquiring the new one. Listening is therefore a continuous process of
learning and unlearning in which previous interpretations are replaced by succeeding ones
as demanded by the new perceptual findings.
It is important to note that by the mid-1970's Laske had formulated a cohesive theory of
music cognition. Laske's theories pointed out what was wrong with the information theory
model of human perception and offered many potential solutions. Laske realized that a
computer musician needed to incorporate a working model of perceptual processes. A
computer musician must be able to process information in a manner similar to the processes
of the human mind. Laske's ideas offered computer music researchers a theoretical model
that could have helped them create a computer musician able to more closely model human
musical perception and understanding. But it appears that Laske's theory had little influence
on a computer musician outside the realm of research into linguistic models of music
perception (Roads, 1979). To a large extent, Laske's theory remained a theory without an
actual working model. Laske's theories did have a large influence on the development of the
Listener model presented in this thesis. However, the theories of Marvin Minsky appear to
have been a stronger influence on the current versions of the computer musician.
2.8 Marvin Minsky: "Why do we like Music?"
As mentioned in Chapter 1, Marvin Minsky asked the question "Why do we like Music?"
Minsky believed that one of the problems with music theory is that it was afraid to ask
questions such as these. Music theory was not just about music, but how people processed
it: "To understand any art, we must look below its surface into the psychological detail of its
creation and absorption" (Minsky, 1981, p. 29). Music theory was unable to help a computer
musician understand music, for it had become stuck trying to find universal truths. This was
the same problem information theorists such as Pinkerton and Hiller had, for they had
believed it was possible to reduce music to a collection of rules that would enable a
computer musician to compose as well as humans. Minsky felt that we cannot find any such
universal laws of thought.
"Both memory and thinking interact and grow together. We
do not just leam things, we leam ways to think about things;
then we leam to think about thinking itself. Before long, our
ways of thinking become so complicated that we cannot
expect to understand their details in terms of their surface
operation, but we might undentand the principles that guide
their growth." (Minsky, 1981, p.28)
Music is recognized and understood by a listener because the music engages the previously
acquired knowledge of the listener. But after a listening session, most of the listener's
memories of the music fade away. However, if the listener were to hear the same music
again, he or she would recognize it almost immediately. Minsky believes that something
must remain in the mind to cause this and suggests that perhaps what we learn is not the
music itself, but a way of listening to it.
Before attempting to answer the question "Why do we like music?" Minsky asked several
other questions. One such question is "What is the difference between merely knowing (or
remembering, or memorizing) and understanding?" To understand something we must
know what it means. However, an idea seems meaningful only when we have several
different ways to represent it. We must have several different perspectives and associations
for an idea. Understanding is therefore the process of looking at an idea from many
different perspectives. Something has "meaning" only if we are able to examine it in several
different ways. Minsky theorized that this is why those who seek "real meanings" never
find them.
Minsky also asked the question, "Why do we like certain tunes?" This question is
essentially the same as the one Pinkerton asked in 1956. Minsky offered two possible
answers. We like certain tunes because they have certain structural features or we like
certain tunes because they resemble other tunes we like. The first answer has to do with the
laws and rules that make tunes pleasant, which were what Pinkerton, Hiller, etc. were
searching for with information theory. However, this answer implies the existence of a
universal set of essential features which Minsky feels are impossible to discover. The
second answer forces us to look not at the tune itself, but at ourselves and how we perceive
tunes. However, the verb 'resemble' forces us to define the rules of musical resemblance.
How do we know if two tunes resemble each other? These rules are dependent upon how
melodies are represented in each individual's mind. In Laske's theories, representations of
these melodies were stored in a long-term memory. In Minsky's theory we store melodies in
a "society of agents".
According to Minsky, our minds consist of a network or "society of agents". Minsky
defines an agent as "any part or process of the mind that by itself is simple enough to
understand, even though the interactions among groups of such agents may produce
phenomena which are much harder to understand" (Minsky, 1986, p.326). Each agent
knows what happens to some of the others, but little of what happens to the rest. Thinking
consists of making these mind-agents work together. Productive thought is the process of
breaking problems into different parts and then assigning these parts to the agents that
handle them best. When one listens to music, various facts of what is heard activate various
agents. These agents are connected in various ways, which affects how the listener
processes the music.
In Minsky's view of memory, cognitive processes and memory structures are not separated
from each other. This differs from Laske's theory of many separate buffers and a working
memory that interprets the music to achieve understanding of the structure. According to
Minsky:
"We often speak of memory ris though the things we know
were stored away in boxes of the mind, like the objects we
keep in the closets of our homes. But this raises many
questions. How is knowledge represented? How is it stored?
How is it retrieved? How is it used? ...[ Our theory of
memory] tries to answer al1 these questions at once by
suggesting that we keep each tl~ing rtu Iearn close 10 the
agents tl~ar learn in tlrefirstplace." (Minsky, 1981, p.28)
Our knowledge, stored in this manner, becomes easy to reach and easy to use. This theory
is based on the idea of a type of agent called a "Knowledge-line" or "K-line". Whenever
one gets a good idea, solves a problem, or wants to remember something, he or she activates
a K-line to represent it. A K-line is a "wirelike structure that attaches itself to whatever
mental agents are active when you solve a problem or have a good idea" (Minsky, 1986,
p. 82). Later, when one activates this K-line, the agents attached to it are also activated. This
puts the person into a "mental state" much like the state the mind was in when it received the
original input. This makes it possible for us to solve new, but similar problems or recognize
similar situations. This is how we are able to remember music and know if one musical
fragment resembles another. Two similar fragments of music will have a similar effect on
the mind, because they will tend to activate the same agents. When one hears a familiar piece
of music, one's mental state (i.e. the set of currently active agents or K-lines) will be similar
to the mental state induced by a previous hearing.
Each perceptual experience activates a structure called a frame. A frame is "a representation
based on a set of terminals to which other structures may be attached. Normally, each
terminal is connected to a default assumption which is easily displaced by more specific
information" (Minsky, 1986, p. 328). Minsky described a frame as being like an application
form with many blanks or slots to be filled. These blanks are the terminals referred to by the
above definition. Terminals are used as connection points to which we can attach other
kinds of information. Any type of agent can be attached to a frame-terminal, including K-
lines. The mind remembers millions of frames, each representing a stereotyped previous
experience.
Using these agents, K-lines and frames, Minsky described a theory of music listening.
When listening to music we only have access to the musical present, the notes currently
being played. We use the rhythm as "synchronization pulses" to match new phrases against
older ones. These phrases are examined for difference and change. As differences and
changes are sensed, the rhythmic frames fade from our awareness. This process of
matching allows our minds to "see" things, from different times, together (Minsky, 1981).
The concept of K-lines and frames proved to be quite influential in the development of the
DTW-based melodic fragment recognition system. As presented in Chapter 4, the
Recognizer matches an unknown melodic fragment with a collection of known melodic
fragments referred to as templates. The templates are a set of features used to represent a
fragment. The DTW algorithm used to compare fragments provides a distance measure that
indicates the degree of similarity between two fragments. This allows a fragment to be
considered similar to a variety of templates. As fragments are added to the system, they are
clustered into areas of similar fragments. This clustering allows new fragments to be
recognized as similar to entire sets of recognized fragments. The musical present
represented by a new melodic fragment is quickly matched with a large collection of
fragments representing the Listener model's musical past.
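A minimal sketch of this nearest-template recognition follows, reusing the dtw_distance
function from the sketch in Section 1.3.4. It is not the Recognizer's implementation: the
distance threshold is an invented parameter, and Chapter 4 describes the actual clustering.

    class FragmentMemory:
        # Illustrative nearest-template store. A new contour is matched against
        # every stored template; if the best DTW distance falls within the
        # threshold it joins that template's cluster (the musical present is
        # matched to a stored musical past), otherwise it seeds a new cluster.
        def __init__(self, threshold: float = 4.0):
            self.threshold = threshold
            self.clusters: list[list[list[float]]] = []  # each cluster holds similar contours

        def recognize(self, contour: list[float]) -> int:
            best, best_dist = None, float("inf")
            for idx, cluster in enumerate(self.clusters):
                for template in cluster:
                    dist = dtw_distance(contour, template)
                    if dist < best_dist:
                        best, best_dist = idx, dist
            if best is not None and best_dist <= self.threshold:
                self.clusters[best].append(contour)  # similar: join the cluster
                return best
            self.clusters.append([contour])          # novel: start a new cluster
            return len(self.clusters) - 1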
2.9 Current Implementations of the Computer Musician
Minsky's ideas have also had an enormous influence on recent versions of the
computer musician. I will now discuss these influences on the various versions of the
computer musician as created by Dannenberg, Rosenthal, and Rowe.
2.9.1 Roger Dannenberg: The Computer Musician as Performer
Roger Dannenberg's (1984) version of the computer musician created a real-time
accompanist for live performers. The computer was given a score containing parts for the
soloist and for the corresponding accompaniment and was assigned the task of performing
the accompaniment in synchronization with the live performer, just as a human accompanist
would do. This approach required the computer to have the ability to follow the soloist.
Dannenberg identified three problems the computer had in fulfilling its role of an
accompanist:
1) Detecting and processing input from the performer
2) Matching this input against a score of expected input
3) Generating the timing information necessary to control the generation of the
accompaniment.
Dannenberg expected that the live solo performance would contain mistakes or be
imperfectly detected by the computer, so it was necessary to allow for these performance
errors when matching the actual solo against the score. The normal variations in tempo
resulting from human musical expression could also easily confuse the computer
accompanist. He presented an efficient dynamic programming algorithm for finding the best
match between the solo performance and the score (Dannenberg, 1984). In
producing the accompaniment, it was also necessary to generate a time reference that varied
in speed according to the soloist. The computer's concept of time was derived from
differences between the arrival times of performed events and the expected times indicated
by the score.
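The kind of dynamic programming match Dannenberg describes can be sketched as a
longest-common-subsequence alignment between performed pitches and score pitches. The
scoring below is an assumption for illustration, not Dannenberg's published algorithm.

    def best_match(performance: list[int], score: list[int]) -> int:
        # Align performed pitches against score pitches while tolerating
        # errors: wrong, missing and extra notes are simply skipped, and the
        # alignment that maximizes the number of matched notes wins.
        n, m = len(performance), len(score)
        best = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if performance[i - 1] == score[j - 1]:
                    best[i][j] = best[i - 1][j - 1] + 1   # notes match
                else:
                    best[i][j] = max(best[i - 1][j],      # extra performed note
                                     best[i][j - 1])      # skipped score note
        return best[n][m]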
With his computer accompanist, Dannenberg created a computer musician that was able to
match the musical present (solo performance) with a musical past (the stored version of the
score). However, the algorithms for his program appeared to have been influenced more by
computer science than cognitive theory. His accompanist had no higher-level understanding
of its role as an accompanist, and was unaware of the musical form, style, etc. of the piece it
was performing. In fact, the computer would become easily lost if the performer made too
many errors. In reality, this model was not much different from those created by the
information theorists. Dannenberg did address a few of these shortcomings in a later
version of his accompanist (Dannenberg and Mont-Reynaud, 1987; Dannenberg, 1989;
Grubb and Dannenberg, 1994).
2.9.2 David Rosenthal: The Computer Musician as a Listener
David Rosenthal (1989) of M.I.T. presented a computer model of the process of listening to
simple rhythm. The model consisted of:
1) A way of dividing the rhythm into appropriate chunks
2) A means of constructing recognizers for the chunks
3) An organization of the recognizers into a hierarchical structure.
This model derived its conception of the mind's workings from Marvin Minsky's The Society of Mind (Minsky, 1986). Specifically, Rosenthal attempted to create a society of agents capable of recognizing simple rhythmic patterns. The program listened to a series of events and constructed recognizers (agents) which attempted to recognize certain patterns of events. As the events were processed, the program decided whether the event was part of a recognized pattern or was a new pattern. If the program decided the pattern was new, it constructed a new recognizer capable of recognizing future occurrences of the pattern. Rosenthal also claimed that his program had the limited ability to construct higher-level recognizers which could recognize patterns of other recognizers (Rosenthal, 1992).
Rosenthal's model, while limited in scope, is an interesting working model of Minsky's
theories. This version of the computer musician was capable of constructing new
recognizers as it listened to music. Rosenthal's computer musician was able to expand its
understanding of the music, perhaps in a manner similar to our own minds.
2.9.3 Robert Rowe: The Computer Musician as Listener, Composer, Performer and Critic
Robert Rowe (1991) developed one of the most advanced versions of the computer
musician. His program Cypher was capable of using musical concepts in listening,
composing and performing. His implementation was strongly based on Minsky's societal
architecture in which "many small, relatively independent agents cooperate to realize
complex behaviors" (Rowe, 1991, p. 12). The program had two main components, a listener
and a ployer. Rowe claimed that his program was capable of classifying several perceptual
features arising from a musical context, tracking the changes in these features over t h e , and
attaching compositional responses to them. It could also identify tonal regions, harmonic
progressions and likely beat periods. The program also had a limited ability to examine and
crkicize the quality of its own musical output.
Rowe's work had a profound influence on the development of the Segmenter and
Recognizer presented in this dissertation. His Listener model was adapted by Pennycook
and Stammen (1993; 1994) to create a working model of a jazz improviser. Pennycook and Stammen felt that the feature set used by Rowe (density, register, speed, dynamics, duration and harmony) was not sufficient for a real-time jazz improviser. They modified the key identification agent in the Listener to improve the recognition of jazz harmonic structures (Stammen, Pennycook and Parncutt, 1994). The implementation of the real-time harmonic recognition system is described in Chapter 5. The phrase detection methods used by Rowe were replaced with the Segmenter presented in Chapter 3. Finally, Rowe's pattern matching algorithms, based on a string matching algorithm, led to the development of the DTW algorithm for matching melodic contours presented in Chapter 4. It is interesting to note that
Rowe (1994) added the Segmenter and Recognizer algorithms to his Cypher application.
Rowe extended the DTW based melodic recognition component to develop a program
capable of recognizing important sequential structures in an arbitrary stream of MIDI data.
Rowe's work is perhaps the most complete implementation of a cognitive music theory to
date. Its ability to derive higher levels of musical understanding from low level specialists
offers some evidence as to the validity of Minsky's theories and the ideas presented by this
dissertation. He has also implemented the idea of a listening composer that was desired by
Ledley, Vercoe and Laske in the early 1970's. Rowe's work points the way to future
versions of computer musicians that will be constructed from many levels of agents.
2.10 How do We Recognize and Remember Melodies?
We now return to the question, "How do we know if two tunes resemble each other?"
Recent research into melodic perception suggests that melodic contour is one of the most
salient features used to remember and recognize melodic sequences. Melodic contour may be defined as a set of directional relationships between successive notes of a melody (Dowling and Fujitani, 1971). Contour represents the overall pattern or shape of a melody. This pattern of ups and downs characterizes a particular melody, allowing one to recognize a melody even if it has been altered in some way. This suggests that the melodic contour is an important part of what is remembered when one remembers a melody. Melodic contour also
contributes strongly to the recognition of transposed melodies.
Davies and Jennings (1977) undertook a novel experiment to test memory for tonal
sequences. Groups of trained and untrained musicians were asked to represent the contour
of a tonal sequence by drawing it on a piece of paper. They were also asked to draw the
melodies according to the interval sizes. Although the musicians were generally superior,
there was little difference between musicians and non-musicians in terms of perception of
melodic contour. Both groups performed at a much lower level when estimating interval
sizes. This suggests that pitch intervals are not normally coded in terms of magnitude.
Dowling (1978) separated the use of contour and scale information in the recognition of
melodies. Dowling defined scale in terms of the tonality of the melody, and in his study he
made comparisons between tonal and atonal melodies. Inexperienced musicians with less
than two years of musical training relied more on contour information than scale or key
information to remember and recognize melodies. Experienced musicians were found to use
both contour and scale information.
Several studies by Dowling (1971; 1978; Bartlett and Dowling, 1980) have found that
melodic contour is an abstraction from melody that can be remembered independently of
pitches. According to Dowling, the contours of brief, novel atonal melodies can be retrieved
from short-term memory even when the sequence of exact intervals cannot. In addition to
being preserved in short-term memory, melodic contours also seem to be retrievable from
long-term memory independent of interval sizes (Dowling, 1978).
Melodic contour has also been investigated in studies that aimed to evaluate a bi-dimensional model of pitch. This model proposes that both tone height (the overall pitch level of a tone) and tone chroma (the position of the tone within the octave) are important in melody perception. Massaro (1980) assessed how contour and tone chroma information are used in the identification of familiar melodies and recently learned melodies. The results suggested that tone chroma alone is not a sufficient cue for identification and must therefore be accompanied by contour information. Contour and chroma together contribute to accurate identification of melodies. Tone chroma allows a given note of a melody to be transposed up or down by one or more octaves without having much effect on the recognizability of a melody. Contour alone can be used to identify a melody only if listeners have some knowledge of the set of tunes from which to select their answers.
Dyson (1984) revealed perceptually salient aspects of contour which may be interpreted as contour features. She called these features contour reversals, or locations where the melody changes direction. The relative importance of contour reversals is independent of the magnitude of pitch change or the general shape of the melody. The findings of this study show that reversals in melodic contour are in some way similar to visual features in that they are treated as areas of high information by the listener. Thus, in a single hearing, listeners may be attempting to extract as much information as possible by aiming for the points of high information value, the corners, or reversals. These features may therefore be thought of as defining the shape of the melody while the slopes or non-reversals fill in the detail in between the reversals. Dyson concluded that melodic contours provide a figural description of novel tone sequences. Contour reversals serve as features contributing to a perceptual
representation that gives a global outline of the melody to which further detail may be
added.
In view of the evidence that contour is one of the most salient features used to remember
and recognize melodic sequences, the Recognizer described in Chapter 4 uses an algorithm
that compares the melodic contours of two melodic fragments. The Recognizer uses the dynamic timewarp algorithm (DTW) (Itakura, 1975), which was originally developed for time alignment and comparison of speech and image patterns. The DTW enables the
Recognizer to compare and categorize the contour of a new melodic fragment with a
collection of previously recognized melodies.
2.11 Conclusion
It is now time to attempt to answer a few of the questions that began this chapter. "Can a
computer compose, analyze or listen as well as a human being?" Looking at its current level
of development, one would have to answer no. The computer musician is still a newborn child that has barely entered into the real world. Current versions have given it the tiniest of minds, minds that contain no musical past, no musical culture. The computer musician is still incapable of understanding music as we do. However, we must acknowledge that this primitive child has indeed made progress. In light of its most recent versions and the advancements made in computer technology, the future looks promising.
A potential hindrance to the development of an advanced computer musician is voiced by Bo
Alphonce:
"Music theory is one of the oldest disciplines in the litcrate
history of civilization; still, systematic, coherent, well-
developcd and cornputable music theory is both very young
and mther scarce." (Alphonce, 1980, p. 26)
The development of a computer musician is most dependent on our own understanding of music. While the theories of Laske and Minsky offer some intriguing insights into the potential workings of the human mind, they are incomplete. "A music analyst approaching the computer quickly realizes that running out of theory means running out of program code" (Alphonce, 1980, p. 26). At present, program code is the life blood of a computer musician. Without the continued development of cognitive music theories, a computer musician will remain grossly inferior even to untrained music listeners.
3. The Segmenter
3.1 Introduction
The first component we will examine in the Listener model is the Segmenter. As shown in
Figure 3.1, the Segmenter receives as input a stream of MIDI data, which represents a single monophonic voice in a live musical performance. The Segmenter is responsible for dividing the musical stream into melodic fragments at the motive or phrase level. These melodic fragments are 4 to 16 notes in length. When the Segmenter detects a group boundary, the new grouping is sent to a recognition process that attempts to match the new segment with previously recognized fragments. The Segmenter is the most important component in the Listener model, for it must accurately identify, in real-time, the melodic fragments that will be sent to the Recognizer.
Figure 3.1: The Segmenter
In this chapter, I will be referring to the first eleven measures of an improvisation on the tune "Tenor Madness" by Sonny Rollins (Rollins, 1957) to demonstrate the operations
performed by the Segmenter. The notes of the improvisation are shown in Figures 3.2a and
3.2b. An exact transcription of the performance is shown in Figure 3.2a and a quantized
version is shown in 3.2b.
Figure 3.2a: Improvisation on Tenor Madness, mm. 1-11
(as performed)
Figure 3.2b: Improvisation on Tenor Madness, mm. 1-11
(quantized version)
The role of the Segmenter is to listen to the improvisation in real-time and group the notes into short melodic fragments as shown in Figure 3.3.
Figure 3.3: Melodic fragments created by the Segmenter
3.2 Grouping Preference Rules
The Segmenter is largely based on the Grouping Preference Rules (GPR) proposed by
Lerdahl and Jackendoff (1983). Their grouping theory consists of a set of rules that
describe the organization of the musical surface into groups. According to Lerdahl and
Jackendoff, the grouping of a musical surface is "an auditory analog of the partitioning of
the visual field into objects, parts of objects, and parts of parts of objects" (Lerdahl and
Jackendoff, p. 36). Their rules for grouping appear to be idiom-independent in that a
listener needs to know little about a musical genre in order to assign grouping structure to
pieces in that idiom. This idiom-independent nature and the well-defined structure of the
GPR made them an excellent starting point for the development of the real-time Segmenter.
The GPR form a set of preference rules that determine group boundaries by examining the
local detail of a monophonic stream of music. The GPR detect changes in attack points,
articulation, dynamics, duration and register that could lead to the perception of a group
boundary. The Segmenter has adapted several of the GPR for use in detecting group boundaries or segmentation points in a real-time stream of music. Given a sequence of four notes, N1 N2 N3 N4, as shown in Figure 3.4, it is possible, using one or more of these
rules, to detect a segmentation point between the notes N2 and N3. The GPR focus on the
transitions from note to note and pick out the transitions that are more distinctive than the
surrounding ones. These more distinctive boundaries are the ones that a listener will favor
as group boundaries.
In order to locate a group boundary, the GPR examine a sequence of four notes containing three transitions, N1/N2, N2/N3, and N3/N4. The transition N2/N3 is a candidate for a group boundary if it differs from the surrounding transitions N1/N2 and N3/N4. The group boundary between N2 and N3 therefore defines two groups, with one group ending at N2 and the other starting with N3. The GPR used by the Segmenter in the Listener model are
described below.
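To make the shared pattern of these rules concrete, the following sketch (an illustration, not the thesis code) expresses the generic four-note test in Python; f1-f4 stand for any per-note feature, such as IOI, pitch, velocity or duration, that the individual rules below examine.

    def gpr_boundary_at_n2_n3(f1, f2, f3, f4):
        # A boundary is favored when the change across the N2/N3 transition
        # is larger than the change across both neighboring transitions.
        t12 = abs(f2 - f1)  # N1/N2 transition
        t23 = abs(f3 - f2)  # N2/N3 transition (candidate boundary)
        t34 = abs(f4 - f3)  # N3/N4 transition
        return t23 > t12 and t23 > t34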
Figure 3.4: GPR segmentation point (group boundary between N2 and N3)
3.2.1 GPR 1 Avoid Small Groups
GPR 1 states that groups containing a single event or very few notes should be avoided.
In order to avoid groups of two or fewer notes, GPR 1 has been implemented by the Segmenter. The Segmenter will also create a group boundary after 16 notes to avoid very large groups.
3.2.2 GPR 2 Proximity Rules
The GPR Proximity rules are used to detect breaks in the musical flow which are perceived as group boundaries. Given a sequence of four notes, N1 N2 N3 N4, the transition between notes N2 and N3 may be heard as a group boundary if the time interval from the end of N2 to the beginning of N3 is greater than that from the end of N1 to the beginning of N2 and that from the end of N3 to the beginning of N4. This rule is known as GPR 2a or the Slur/Rest rule. Examples are shown in Figures 3.5 and 3.6. The Slur/Rest rule is useful for detecting group boundaries at ends of phrases or points of rest in the music.
Figure 3.5: GPR 2a Slur/Rest rule
Figure 3.6: GPR 2a Slur/Rest rule
Another GPR proximity rule is called GPR 2b or the Attack-Point rule, and is shown in
Figure 3.7. The Attack-Point rule measures the interval of time between the attack points of N2 and N3. This time interval is called the inter-onset interval or IOI. If the IOI between N2 and N3 is greater than the IOI between N1 and N2 and between N3 and N4, then there exists a group boundary between N2 and N3.
Figure 3.7: GPR 2b Attack-Point rule
3.2.3 GPR 3 Change Rules
Group boundaries may also occur at transition points where there is a distinct change in
register, dynamics, articulation or length. The GPR 3 Change rules examine the N2/N3 transition and compare it with the N1/N2 and N3/N4 transitions. The Register rule, GPR 3a, detects a group boundary at N2/N3 if the interval distance between N2 and N3 is greater than both N1/N2 and N3/N4. GPR 3a is shown in Figure 3.8.
Figure 3.8: GPR 3a Register rule
The Dynamics rule, GPR 3b, involves a change in dynamics between N2 and N3, but not between N1 and N2, or N3 and N4. The Dynamics rule is shown in Figure 3.9.
Figure 3.9: GPR 3b Dynamics rule
Figure 3.10 demonstrates GPR 3c, the Articulation rule. This rule requires a change in
articulation between N2 and N3. There must be no change in articulation in the N1/N2 and N3/N4 transitions.
Figure 3.10: GPR 3c Articulation rule
The final change rule, known as GPR 3d or the Length rule, is shown in Figure 3.11. The Length rule requires a change in note length between notes N2 and N3. Notes N1 and N2 must not differ in length, and notes N3 and N4 must also not differ in length.
Figure 3.11: GPR 3d Length rule
It is interesting to note that several of the GPR may occur at the same location in the music, thereby reinforcing each other. Group boundaries determined by the GPR may also occur in conflicting positions in the music, such as the example shown in Figure 3.12. In this example, the GPR 2a Slur/Rest rule and the GPR 2b Attack-Point rule are in conflict. In order to resolve these conflicting group boundaries, it would be desirable to assign each rule a numerical degree of strength (Lerdahl and Jackendoff, 1983, p. 47). However, Lerdahl and Jackendoff preferred not to assign these weights. As we shall see later on in this chapter, the Segmenter uses a set of weights to resolve these conflicting group boundary situations.
Figure 3.12: GPR group boundary conflicts
3.3 Application of GPR to a Live Performance
We will now attempt to apply the GPR to a real-time musical performance. As mentioned above, we will be using the first twelve measures of an improvisation on the tune "Tenor Madness" by Sonny Rollins. The application of the GPR to the first two measures is shown in Figure 3.13.
Figure 3.13: Application of GPR to a live performance
Using the GPR, the first group boundary will not be detected until the fifth note is played. This results in a considerable latency in the recognition of the first group boundary, as the GPR must wait for the notes N3 and N4 to be played after the 3 beats of rest. In this example, the latency is equal to four beats. The GPR are therefore unsuitable for a real-time implementation where group boundaries must be located during a musical performance.
As shown above in Figure 3.4, the Grouping Preference Rules (GPR) require four notes, N1, N2, N3, and N4 to determine whether a group boundary occurred at N2. As a result, the GPR must wait for both N3 and N4 to be performed before a group boundary decision can be made at N2. The two note delay required by the GPR is therefore unsuitable for a real-time implementation where group boundaries must be located during a musical performance. Clearly, some modifications to the application of the GPR were needed.
Figure 3.14: The Segmenter Processes
In order to overcome the real-time limitations of the GPR, the Segmenter is subdivided into three independent segmentation processes. These processes are shown in Figure 3.14. Two of these processes are adaptations of the GPR and have one and two note latencies. We will refer to these processes as the N3 and N4 segmenters, for they determine that a group boundary occurred at N2 when event N3 or N4 is received by the Segmenter. The other process is called the Real-Time (RT) segmenter. It locates group boundaries in real-time. The N4 segmenter uses events N1-N4 to determine the group boundary while the N3 segmenter uses only events N1-N3. The rules and MIDI features used by the N3/N4 segmenters and their equivalent GPR rules and musical features are shown in Figure 3.15.
Examination of the MIDI features in Figure 3.15 suggests that the processing for each GPR/Segmenter rule occurs at a specific temporal location in an event. For example, GPR 2a, 2b, 3a, 3b, 3c and their equivalent Segmenter rules can all be evaluated immediately each time a note-on arrives at the Segmenter (Figure 3.16). GPR 3d can be tested immediately with the arrival of each note-off (Figure 3.17). GPR 2a and 3c are mapped onto N3/N4 segmenter rule 1 because both of these rules examine the amount of time between an event's note-off and the next event's note-on. The N3 and N4 segmenters therefore treat slurs and articulation as the same feature. GPR 2a and 3c add an additional latency to the segmenters. These rules actually require the arrival of event N5 to calculate the articulation feature for N4 (see Figure 3.18). In other words, the Segmenter must wait for the N5 note-on to evaluate the N4 note-off to note-on value. This results in a three note latency before the N2-N3 transition can be tested. This three note latency is clearly unsuitable for a real-time segmenter.
GPR      Feature         N3/N4 Rule   MIDI Feature
GPR 2a   Slur/Rest       Rule 1       note-off to next note-on
GPR 2b   Attack Point    Rule 2       note-on to next note-on
GPR 3a   Register        Rule 3       MIDI note number
GPR 3b   Dynamics        Rule 5       MIDI velocity
GPR 3c   Articulation    Rule 1       note-off to next note-on
GPR 3d   Length          Rule 4       note-on to note-off
Figure 3.15: GPR to N3/N4 Segmenter Rules
For each note-on event evaluate: GPR 2a Slur/Rest, GPR 2b Attack Point, GPR 3a Register, GPR 3b Dynamics, GPR 3c Articulation.
Figure 3.16: Note-on rule evaluations
For each note-off event evaluate: GPR 3d Length.
Figure 3.17: Note-off rule evaluations
Figure 3.18: Three note latency of GPR 2a and 3c
In order to reduce the latency of the GPR, the RT segmenter tests for a group boundary as each event arrives at the Segmenter. The RT segmenter uses two internal clocks to detect a group boundary and marks a boundary when one of the following situations occurs:
1) The performer is holding a long note;
2) The performer has stopped playing (a rest);
3) A pre-defined maximum number of total notes for a segment has been exceeded.
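A minimal sketch of these three conditions is given below. The clock behavior is simulated here by polling timestamps rather than by real interrupt-driven clocks, and the class name, method names and default thresholds are illustrative assumptions rather than the original implementation.

    class RTSegmenter:
        def __init__(self, long_note_ms=500, long_rest_ms=500, max_notes=16):
            self.long_note_ms = long_note_ms   # note-on clock duration
            self.long_rest_ms = long_rest_ms   # note-off clock duration
            self.max_notes = max_notes         # maximum segment length
            self.note_count = 0                # notes in the current segment
            self.last_on_ms = None             # note-on awaiting its note-off
            self.last_off_ms = None            # most recent note-off (a rest)

        def note_on(self, now_ms):
            self.last_on_ms, self.last_off_ms = now_ms, None
            self.note_count += 1
            if self.note_count >= self.max_notes:    # condition 3
                return self._mark_boundary()
            return False

        def note_off(self, now_ms):
            self.last_on_ms, self.last_off_ms = None, now_ms
            return False

        def poll(self, now_ms):
            # Stands in for the expiring note-on and note-off clocks.
            if (self.last_on_ms is not None
                    and now_ms - self.last_on_ms > self.long_note_ms):
                return self._mark_boundary()         # condition 1: long note
            if (self.last_off_ms is not None
                    and now_ms - self.last_off_ms > self.long_rest_ms):
                return self._mark_boundary()         # condition 2: long rest
            return False

        def _mark_boundary(self):
            self.note_count = 0
            self.last_on_ms = self.last_off_ms = None
            return True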
Figure 3.19: Segmentation example
Figure 3.20: The Segmenter Processes
The operation of the RT, N3 and N4 segmenters is shown in Figures 3.19 and 3.20. When
the note-on of N1 arrives at the RT segmenter, an internal clock is set to go off at a predetermined time in the future. Since N2 arrives before the clock has expired, the clock is reset and no group boundary is marked by the RT segmenter. The N3 segmenter then tests the N1-N2 transition and does not mark a group boundary. During the sustain of N2, the RT note-on clock expires before the arrival of N3, thereby causing the RT segmenter to mark a group boundary. When the note-on of N3 arrives at the Segmenter, the N3 segmenter confirms the group boundary at N2, while the N4 segmenter tests the N1-N2 transition. When N4 arrives at the Segmenter, the N4 segmenter confirms the previous N3
and RT segmentation at the N2-N3 transition. This same procedure is followed for each
note-off arriving at the Segmenter. This chain of segmenters allows for a higher degree of accuracy in the detection of musical groupings. The performance of the RT segmenter is always verified one note later by the N3 segmenter. The N4 segmenter, having available both future and past data around the group boundary, verifies the performance of both the N3 and RT segmenters. When either the N3 or N4 segmenter detects an error made by the RT segmenter, it updates the settings of the RT segmenter's internal clocks. This allows the RT segmenter to be dynamically adjusted to changing musical situations.
As mentioned above, the GPR introduce a two to three note latency before a group boundary is detected. The RT, N3 and N4 Segmenters attempt to reduce this latency to zero depending on the musical situation. The latency of the various Segmenters is shown in Figure 3.21. The RT Segmenter attempts to determine a group boundary immediately and therefore has a zero note latency. The N3 Segmenter waits for notes N1, N2 and N3 and has a one note latency. The N4 Segmenter must wait for notes N1, N2, N3 and N4 and therefore has a two note latency.
Figure 3.21: Segmenter Rule Latency (RT: 0 notes; N3: 1 note; N4: 2 notes)
In order to avoid segmentation at every feature transition, each segmenter assigns a weight to each feature that increases with the size of the feature transition. In this manner, larger changes in register, dynamics, duration, and articulation, as well as group length, have a greater contribution to the detection of a musical boundary. While Lerdahl and Jackendoff avoided the assignment of weights to the GPR, I found it necessary to utilize a weighting system. The ranking of the rules is in general accordance with the ranking reported by Deliège (1987) (Figure 3.22). Since the Listener model currently uses MIDI as input to the system, the Segmenter is unable to consider timbre (Deliège, Rule 7). Our use of weights to decide between potential group boundaries is similar to the phrase detection system used by Rowe (1993).
Lerdahl/Jackendoff        Stammen/Pennycook   Deliège
GPR 2a (Slur/Rest)        Rule 1              Rule 1
GPR 2b (Attack Point)     Rule 2              Rule 2
GPR 3a (Register)         Rule 3              Rule 3
GPR 3b (Dynamics)         Rule 5              Rule 4
GPR 3c (Articulation)     Rule 1              Rule 5
GPR 3d (Length)           Rule 4              Rule 6
Figure 3.22: Ranking of GPR
Although Lerdahl and Jackendoff have stated that the GPR could not provide a computable procedure for determining musical analyses, the GPR have proven to be useful for the automatic detection of lower-level motive and phrase boundaries. The GPR were not intended to completely reflect the cognitive processing that occurs when listening to music in real-time, as they possess a two or three note latency in the detection of a group boundary. In this real-time adaptation of the GPR, it has been determined that the cognitive processing of the various GPR may occur at different temporal locations in an event.
3.4 Event Memory
The GPR compare certain features of notes N1-N4 to determine if a temporal or acoustical change occurred at the N2-N3 transition. The GPR require that a given feature at N2-N3 be different from the same feature at N1-N2 and N3-N4. This implies that there be a high degree of similarity between the N1-N2 and N3-N4 features. However, when one considers the considerable variability of a particular feature during a live performance, these comparisons become much more difficult to make. There arises the need for some type of quantization, especially for comparisons between temporal features.
The Segmenter responds to MIDI note-on, note-off and velocity messages. When a note-on
event is received, the data pertaining to it is stored in an Event structure. This structure
contains information such as the note's pitch, velocity, interval from its previous note, start time (in milliseconds from the beginning of the performance), end time, duration (in milliseconds) and offset time from the start time of the previous note to the current note. It
also contains four fields which specify which rhythmic groups the note belongs to and
several fields containing the results of segmentation rule evaluation which will be explained
in detail below. The complete Event structure is listed in Appendix A.
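The following sketch approximates the Event structure in Python; the field names are assumptions chosen for readability, since the authoritative field list is the one given in Appendix A.

    from dataclasses import dataclass

    @dataclass
    class Event:
        pitch: int                   # MIDI note number
        velocity: int                # MIDI velocity
        interval: int                # semitones from the previous note
        start_time: int              # ms from the start of the performance
        end_time: int                # ms
        duration: int                # ms, note-on to note-off
        offset_time: int             # ms from the previous note's start time
        duration_type: int = -999    # rhythmic type (-999 = unassigned)
        onset_type: int = -999       # rhythmic type of the inter-onset interval
        articulation_type: str = ""  # 'staccato', 'legato' or 'rest'
        rules: int = 0               # flags for segmentation rules found true
        N4_sum_of_weights: int = 0   # summed rule weights (see Section 3.8)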
The Segmenter stores the most recently received Events in a circular buffer. The length of
this buffer is currently 32 Events and may be adjusted by the user. This buffer will be
referred to as the short-term memory (STM). The Segmenter uses the STM to determine segmentation points. Once a group boundary has been detected by the Segmenter, it is
placed into another region of memory called the long-term memory (LTM). The LTM
contains shapes and motives that have been recognized or learned from previous listening
sessions. The data is stored in external files and may be read into the Segmenter's long-term
memory prior to a listening session.
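Expressed in Python, the short-term memory can be sketched as a bounded double-ended queue; the 32-Event default is the one stated above, while the function name is illustrative.

    from collections import deque

    STM_LENGTH = 32                              # default, user-adjustable
    short_term_memory = deque(maxlen=STM_LENGTH)

    def store_event(event):
        # Appending to a full deque drops the oldest Event automatically,
        # giving the circular-buffer behavior described above.
        short_term_memory.append(event)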
3.5 Determining Rhythmic Types
In order to properly separate the incoming events into coherent groups, the Segmenter must
know something about the rhythm and articulation of the notes. The Segmenter examines
three features of the Event:
1) the duration of the note in milliseconds (ms). This is the duration of the note from its
attack to release or MIDI note-on to note-off (Figure 3.23).
Figure 3.23: Note duration (note-on to note-off)
2) the inter-onset interval (IOI) or time from the beginning of a note to the beginning of the next note (Figure 3.24).
Figure 3.24: Inter-onset interval (note-on to note-on)
3) the articulation or time from the release of a note to the beginning of the next note. This
is the duration of the interval between the note-off and the next note-on (Figure 3.25).
Figure 3.25: Note-off to note-on interval
The raw millisecond values are of little use to the Segmenter, for it must be able to classify
the note as being of a certain rhythmic type in order to compare the rhythm of one event
with that of another.
To simplify matters, this discussion will begin by looking at only duration, or note-on to note-off time. The concept of grouping the note durations using traditional rhythmic values (i.e. quarter note, eighth note, etc.) was rejected due to its complexity and due to the
realization that this information was not needed by the Segmenter. Instead, the Segmenter
determines if a note belongs to one of ten rhythmic types. The process of beat induction
(Desain and Honing, 1995) in which a regular pattern (the beat) is activated in the listener
while listening to music is therefore not required by the Segmenter.
The Segmenter begins with a 'blank slate' where the duration of the first event will determine the value of the first rhythmic type (Figure 3.26). Unassigned rhythmic types are assigned the value -999. If the duration of the next note is close enough to the value of the first rhythmic type, it will be assigned that type. If the duration of this note is different from anything seen before, a new rhythmic type will be 'created' or initialized. In the Segmenter's use of the short-term memory, the rhythmic types need not be related to each other in any way; for example, one type need not be twice the value of another. In order to limit the number of different rhythmic types, a duration exceeding a certain maximum limit will simply be labeled as a "long note" type and its millisecond value will not be stored
as a rhythmic type.
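A sketch of this 'blank slate' classification is shown below. The tolerance and long-note limit are assumed values; as described in the following paragraphs, the real thresholds are adjusted dynamically.

    UNASSIGNED = -999
    MAX_TYPES = 10
    # usage: types = [UNASSIGNED] * MAX_TYPES

    def classify_duration(duration_ms, types, tolerance_ms=50, long_note_ms=2000):
        # Returns the index of the matching rhythmic type, creating a new
        # type when nothing matches; -1 denotes a 'long note'.
        if duration_ms > long_note_ms:
            return -1                          # labeled 'long note'; not stored
        for i, t in enumerate(types):
            if t != UNASSIGNED and abs(duration_ms - t) <= tolerance_ms:
                return i                       # close enough to an existing type
        for i, t in enumerate(types):
            if t == UNASSIGNED:
                types[i] = duration_ms         # 'create' a new rhythmic type
                return i
        return -1                              # table full; threshold is raised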
Figure 3.26: Empty short-term memory (all rhythmic types unassigned at -999; LongNoteClock = 500 ms, LongRestClock = 250 ms)
Figure 3.27: Active short-term memory (rhythmic types assigned from observed durations; LongNoteClock = 682 ms, LongRestClock = 608 ms)
Tempo changes, or accelerandos and ritardandos, cause problems when dealing with rhythm in this manner. Ten rhythmic types can soon become insufficient, and two notes with equivalent durations (i.e. two quarter notes) at different sections of the song may not be assigned the same rhythmic type due to their different millisecond duration times (Figure 3.27).
The solution to this problem is to update the average millisecond value of each rhythmic type every time a new note comes in. Instead of simply averaging the value depending on the time values of the notes determined to be of that type, the Segmenter gives the durational value of the most recent Events of that type a higher weight in the overall type average than those seen further in the past. Previous notes move further back in the short-term memory so that even when an accelerando is played, we can still tell what the quarter note is. With this method, the average value of a rhythmic type will decrease with an accelerando or increase with a ritardando.
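One way to realize this recency weighting is an exponential moving average, sketched below; the smoothing factor alpha is an assumption, as the text does not state the exact weighting scheme.

    def update_type_average(old_avg_ms, new_duration_ms, alpha=0.5):
        # Recent Events of a type pull the average toward the current
        # tempo, tracking an accelerando or ritardando.
        return (1 - alpha) * old_avg_ms + alpha * new_duration_ms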
If a rhythmic type is not assigned to any Events within the short-term memory, the type is taken out of the array of observed rhythmic types to make room for new ones. If too many different rhythmic types are seen, the minimum millisecond difference between the types will be increased. The Segmenter will then go back through its short-term memory and re-calculate the different rhythmic averages of the Events using the new threshold. The threshold will be increased until the number of rhythmic types is reduced. Every time a rhythmic group is updated, the Segmenter will also check for types which have become too close to one another. It will then temporarily raise the threshold and redo the rhythmic averages in order to make one type out of the two.
This method of determining and averaging rhythmic types is used for all three aspects of rhythm: duration (note-on to note-off time); inter-onset interval (note-on to note-on time); and articulation (note-off to note-on time). The rhythmic array of average articulation times is only used for updating the segmenting clock times. The Event is actually assigned one of three articulation types: staccato, legato or rest. The articulation type is determined by the ratio of the duration of the Event to the note-off to note-on time of the Event. For example, if the note-off to note-on time of the Event is less than or equal to its duration, the note is labeled as legato.
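A sketch of the three-way classification follows. Only the legato condition is stated explicitly above, so the staccato/rest split and its threshold are assumptions.

    def classify_articulation(duration_ms, gap_ms, rest_threshold_ms=500):
        # gap_ms is the note-off to next note-on time of the Event.
        if gap_ms <= duration_ms:
            return "legato"
        if gap_ms >= rest_threshold_ms:
            return "rest"
        return "staccato"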
3.6 Segmentation Rule Evaluation
Once the rhythmic type of a note has been determined, the Segmenter will check to see whether or not it is valid to segment after the note. It does so by evaluating whether certain rules hold true or not.
3.7 RT Segmenter Rules
The following three rules are referred to as the real-time (RT) rules because a group boundary will be detected during or immediately after a note is played. If any of these three rules holds true, the Segmenter will assign a group boundary after this note.
1) If the Segmenter has received a certain number of notes without having yet located a group boundary, one will be automatically created. This rule exists to avoid groups with too many notes. The threshold for the maximum number of notes can be input by the user. The default is a length of 16 notes, corresponding to GPR 1.
2) If the current note exceeds a certain duration (note-on to note-off), a group boundary will be created. The default threshold is 250 ms, and the threshold is also dynamically adjusted in the short-term memory. This strategy is an implementation of GPR 2b.
3) If there is a long rest after the current note (long note-off to note-on), a group boundary will be created. The default duration is 500 ms, and the duration threshold is dynamically adjusted in the short-term memory. This strategy is an implementation of GPR 2a.
The RT Segmenter detects group boundaries using two real-time clocks. These clocks are called the note-on clock and the note-off clock. In order to implement RT rule 2 (GPR 2b), the Segmenter has an internal note-on clock which is set when the current Event's note-on is received. The clock will expire after a specified long duration time if the Event's note-off is not received before this time. The operation of the note-on clock is shown in Figure 3.28.
Figure 3.28: Operation of the note-on clock (On Clock: 600 ms; Tempo: quarter note = 120)
In this example, the note-on clock is currently set for 600 ms. When the note-on event for the first quarter note is received, the note-on clock is set to expire 600 ms in the future. Since the note-on event for the second quarter note arrives before the clock expires, the note-on clock is reset. The half note arrives before the clock expires, so once again the note-on clock is reset. After 600 ms has elapsed, the note-on clock expires before the next note-on event has occurred and the half note is still being held, so a group boundary is detected at the half note.
To implement RT rule 3 (GPR 2a), the note-off clock is set with each Event's note-off, and unset if a new note-on is seen before the specified long test time. If the note-off clock expires before a new note-on, then a group boundary will be marked after the preceding note. The operation of the note-off clock is shown in Figure 3.29.
Figure 3.29: Operation of the note-off clock (Off Clock: 500 ms; Tempo: quarter note = 120, so a quarter note lasts approximately 500 ms)
3.8 N4 Rules
The Segmenter also uses a set of rules derived from the GPR. The GPR compare the data of two notes before the segmentation point with that of the two notes following the point. These rules therefore do not operate in real-time, since the group boundary can only be determined two notes after the note being tested as a possible segmentation point. Because the GPR require a total of four notes to determine a segmentation point, these rules are referred to as 'N4' rules.
The N4 rules determine whether there has been one or more of the following occurrences
between the two notes preceding the possible segment point and the two notes following it:
Feature                       GPR      Segmenter Rule
1) a change in articulation   GPR 3c   Rule 1
2) a change in attack point   GPR 2b   Rule 2
3) a large intervallic leap   GPR 3a   Rule 3
4) a change in duration       GPR 3d   Rule 4
5) a change in dynamics       GPR 3b   Rule 5
Segmenter rules 1, 2, 3, and 5 can test for a group boundary between N2 and N3 as soon as the note-on for N4 is received. Rule 4, however, can only be tested after note four's note-off.
Each rule is graded according to its importance and is assigned a weight. These weights are exponential, with the most accurate rule (change in articulation) having a weight of 128 and the least accurate rule (change in velocity) having a weight of 8. Every Event record contains a field labeled rules. For each Segmenter rule that is true for the Event, a flag is set to indicate the presence of the rule.
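The weight table and flag field can be sketched as follows. Only the top weight (128) and bottom weight (8) are stated above; the intermediate values are assumptions consistent with an exponential scale, ordered by the ranking in Figure 3.22.

    RULE_WEIGHTS = {
        1: 128,  # change in articulation / slur-rest (GPR 2a/3c)
        2: 64,   # change in attack point (GPR 2b)
        3: 32,   # large intervallic leap (GPR 3a)
        4: 16,   # change in duration (GPR 3d)
        5: 8,    # change in dynamics (GPR 3b)
    }

    def set_rule_flag(event, rule_number):
        event.rules |= 1 << rule_number  # record that the rule held true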
In addition, the amount of deviation seen between the two sets of two notes is calculated to determine a weight multiplier. For example, a large interval will have a higher weight multiplier than a smaller interval, just as a very long note will have a higher weight multiplier than a shorter one. Therefore, the higher the multiplier, the more obvious the group boundary. The weight multiplier is multiplied by the standard rule weight discussed above. Every Event record also contains a field labeled N4sum-of-weights. It stores the sum of the total weights of all the rules that were determined to be true for segmentation after that particular note. If this sum is over a certain threshold, the Segmenter will choose to segment after that note. The fields of the Event record are shown in Appendix A.
Figure 3.30: Summary of N3/N4 rules (duration, note-on to note-off: GPR 3d Length, Rule 4; note-off to next note-on: GPR 2a Slur/Rest and GPR 3c Articulation, Rule 1; MIDI velocity: GPR 3b Dynamics, Rule 5; MIDI note number: GPR 3a Register, Rule 3)
The Segmenter increases the note-on and note-off clock durations by assigning them the value of the next larger rhythmic type in the short-term memory. The note-on clock will be assigned the duration of the next larger rhythmic type. The note-off clock will be assigned the note-off to next note-on value of the larger rhythmic type. If no larger rhythmic type exists in the short-term memory, the note-on and note-off clock durations will be assigned the appropriate values of the current Event, provided this value does not exceed two times the current durations for the clocks. If the new values are more than twice as long, the current note-on and note-off clock durations will simply be doubled. In this manner, the N4 Segmenter is able to both confirm segments produced by the RT Segmenter and dynamically adjust the clock durations to ensure more accurate real-time segmentation for Events.
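The strategy for lengthening a clock can be sketched as below, with the growth capped at doubling as described above; the function name and argument layout are assumptions.

    def increase_clock(current_ms, rhythmic_types, event_value_ms):
        larger = [t for t in rhythmic_types if t > current_ms]
        if larger:
            return min(larger)                      # next larger rhythmic type
        return min(event_value_ms, 2 * current_ms)  # cap growth at doubling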
When the N3 Segmenter detects a group boundary with the N4 duration rule, and this
group boundary has not been detected by the RT Segmenter, the duration used to set the
note-on clock is considered too long and will be decreased. The same strategy is used when
the N4 Segmenter detects a group boundary at a rest that has not been detected by the note-
off clock in the RT Segmenter. The N4 Segmenter decreases the duration of the note-on
and note-off clocks using values from the next smallest rhythmic type in the short-term
memory. Once again, if no such rhythmic type exists, the duration of the current Event is
used.
Because the N3 rules look only at one Event following the segmentation point, the segments detected by the N3 rules are less accurate and must also be confirmed by the N4 rules. Therefore, if it is determined that an N3 segment does not have a high enough N4 weight, it is removed.
3.11 Segment Length
One feature missing from the GPR is the total length of a segment. The Segmenter
considers the length of a segment when deciding where to place a group boundary. The
Segmenter adds 8 times the segment length to the current Event's segmentation weight in
order to encourage long groups of notes to segment sooner and to give longer segments
priority over those that are shorter. The Segmenter tests a potential segment's length before
creating a group boundary to be sure the segment contains a minimum number of events. If
it does not, the previous segment is compared with the current candidate to determine which
grouping is more valid. Since the N4 rules are the most reliable, priority is given to the
segment with the highest N4 segmentation value. For example, a potential N4 segment
whose segmentation value is higher than that of its previous segment will cause the previous
group boundary to be removed. The notes of the previous segment will then be added to the
current segment.
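The length bias itself can be sketched in one line; the factor of 8 is the one stated above, while the names are illustrative.

    LENGTH_WEIGHT = 8

    def weighted_segmentation_value(event, notes_in_segment):
        # Longer running segments receive a growing bonus, encouraging a
        # boundary and giving longer segments priority over shorter ones.
        return event.N4_sum_of_weights + LENGTH_WEIGHT * notes_in_segment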
3.12 Display of Event Data and Segmenter Information
The Event data contained in the Event queue of the short-term memory can be graphically displayed in a window. The 32 events in the queue are displayed in piano roll notation format, with MIDI note numbers along the y-axis ranging from 36 to 96 (i.e. the typical 5 octave range found on a MIDI keyboard). The Event numbers are listed along the x-axis (Figure 3.31). Each of the 32 Events is drawn into the window as a black horizontal line with the length of the line determined by the Event's duration. If an Event is also a segmentation point (i.e. the Segmenter detected a group boundary at that note) it is
displayed in one of the following colors:
Red: real-time segmentation occurred (RT)
Green: N3 rules segmentation occurred (N3)
Blue: N4 rules segmentation occurred (N4)
Figure 3.31: The Segmenter display window
Along with the color indication of the segmentation point, the segmentation weight and segmenter rules are also displayed. The Segmenter window is updated each time the segmenter detects a new segment point. When a real-time segment is detected, the event will be displayed with a red RT and a segmentation weight of 0. When an N4 segmentation occurs, it will be displayed as a blue N4 along with its corresponding segmentation weight. The same will occur in green for an N3 segmentation.
With this display, it is possible to view the accuracy of the RT Segmenter by examining
how well the red RT segmentation points line up with the blue N4 segmentation points. As
mentioned earlier, each red RT segmentation should be confirmed by a corresponding green
N3 and blue N4 segmentation. As new forms of rules are added to the Segmenter, or
existing ones adjusted, it will be possible to view their performance by modifying the
Segmenter display.
3.13 Tenor Madness: A Real-time Example
It is now time to take a look at a real-time example of the operation of the Segmenter. As mentioned at the beginning of this chapter, the input to the Segmenter will be the first twelve measures of an improvisation on the tune "Tenor Madness" by Sonny Rollins. This improvisation is shown in Figure 3.32. The improvisation is input as MIDI into the Listener model. The MIDI representation of the solo is shown in Figure 3.33 and the Segmenter piano roll notation display is shown in Figure 3.34.
Each note in the MIDI stream is first examined by the RT Segmenter. When the first note-on event arrives at the RT Segmenter, the note-on clock is set for a duration of 500 ms. When the first note-off event arrives at the segmenter, the note-off clock is set for a duration of 750 ms. The results of the RT Segmenter are shown in Figures 3.35 and 3.36. Each group boundary detected by the RT Segmenter is marked by an RT. Three group boundaries are marked with a box to indicate mistakes made by the RT Segmenter. In measures 3 and 9, the RT Segmenter marked two RT segments in a row. In measure 8, the RT Segmenter did not detect one of the group boundaries. These errors resulted from the incorrect durations used to set the note-on and note-off clocks. Without validation from the N3 and N4 Segmenters, the RT clocks will continue to make errors, since the other Segmenters are used to adjust the duration of the clocks.
Figure 3.32: Improvisation on Tenor Madness, mm. 1-12
[Table: event times in milliseconds, MIDI status bytes (144), velocities, note numbers, and note-on/note-off events for the opening of the solo]
Figure 3.33: MIDI representation of Tenor Madness improvisation
Figure 3.34: Segmenter representation of "Tenor Madness" improvisation
Figure 3.35: RT Segmenter results
Figure 3.36: RT Segmenter results
Figures 3.37 and 3.38 show the segmentation results using only the N3 Segmenter. In this example, only one error occurred, in measure 9. The N3 Segmenter chose the first quarter note instead of the second quarter note. Since the N3 Segmenter only examines notes N1, N2 and N3, it is possible for it to segment too soon in certain situations. The N4 Segmenter, with more information available, can correct these mistakes by the N3 segmenter.
The results of using only the N4 Segmenter are shown in Figures 3.39 and 3.40. In this example, it is clear that the N4 Segmenter provides a reliable means of detecting group boundaries. The N4 Segmenter has successfully chosen the correct group boundaries for measures 3, 8 and 9, where the other Segmenters made errors. The cost of this increased accuracy is the two to three note latency resulting from the GPR. The N4 segmenter is therefore used mainly to confirm segmentation decisions made by the other Segmenters and to monitor and adjust the durations used by the RT clocks.
The performance of all Segmenters together is shown in Figures 3.41 and 3.42. There is now a high correlation of segmentation decisions across all Segmenters. The mistakes made by the RT and N3 Segmenters in measures 3, 8 and 9 have been corrected by the N4 Segmenter. The N4 Segmenter has also dynamically adjusted the durations of the RT note-on and note-off clocks to align them with the true tempo of the performance.
Figure 3.37: N3 Segmentation of Tenor Madness improvisation
Figure 3.38: N3 Segmentation of Tenor Madness improvisation
Figure 3.39: N4 Segmentation of Tenor Madness improvisation
Figure 3.40: N4 Segmentation of Tenor Madness improvisation
Figure 3.41: Segmentation using RT, N3 and N4 Segmenters
Figure 3.42: Segmentation using RT, N3 and N4 Segmenters
3.14 Summary
The Segmenter used by the Listener model consists of three independent segmentation
processes:
1. The RT Segmenter attempts to locate group boundaries as close as possible to real-time.
2. The N3 Segmenter uses notes N1, N2 and N3 to locate group boundaries.
3. The N4 Segmenter uses notes N1, N2, N3 and N4 to locate group boundaries.
The N3 and N4 Segmenters use an adaptation of the GPR and are used to verify the
performance of the RT Segmenter. The RT Segmenter, by itself, does not always select
musically accurate group boundaries as there may not be enough information available to
the Listener at that point in the music to make a proper decision.
Close analysis of the GPR and the results produced by the Segmenter suggest that listeners
determine group boundaries at various temporal locations in the music. Most of the features
that determine a group boundary are present at the start of a note. However, the length
feature (GPR 3d) cannot be evaluated until the note is released. Certain rules such as
Slur/Rest (GPR 2a) and Articulation (GPR 3c) may not be reliably determined until two to three notes after the group boundary.
The Segmenter, in its current implementation, has proven to be a useful tool for detecting
musically significant melodic fragments. Variations of the Segmenter have been used by
Pennycook and Stammen (1993; 1994), Rowe (1994) and Rolland (1998).
3.14.1 Limitations and Future Work
Despite its definite usefulness, there are certain limitations to the current implementation of
the Segmenter. One major limitation is that the Segmenter only works on a monophonic stream of MIDI data. This requires the voices of a musical performance to be assigned to individual MIDI channels. Also, separate Segmenters must be assigned to each voice in the ensemble. In the current model, these individual instances of the Segmenter do not share information concerning the performance. This lack of a global memory means that the preprocessors in each of the Segmenters may develop different theories on the various rhythmic types assigned. A future version of the Segmenter needs to listen to the entire ensemble in order to have a global interpretation of the musical performance.
The operation of the Segmenter is dependent on the assignment of weights and thresholds
to determine if a group boundary is present. These weights were developed after much
experimentation with several styles of music. Further work is required to determine the
validity of these weights and to improve the accuracy of the Segmenter.
4. The Recognizer
4.1 Introduction
The final component we will examine in the Listener model is the Recognizer. As shown in Figure 4.1, the Recognizer receives as input a melodic fragment from the Segmenter. This melodic fragment will vary from 4 to 16 notes in length. The Recognizer is responsible for comparing the new fragment with a collection of previously recognized fragments.
The Recognizer uses a version of the dynamic timewarp algorithm (DTW) (Itakura, 1975;
Sakoe and Chiba, 1978) to match the melodic contour of the new fragment with the closest matching fragment in its internal memory. The DTW is a widely used technique for time
alignment and comparison of speech and image patterns. The Recognizer uses a modified
implementation of the DTW that allows the comparison of the pitch and rhythmic contours
of an unknown melodic fragment with a collection of previously recognized fragments. The
DTW can compare melodic fragments of different sizes. The timewarping characteristic of
the DTW easily accommodates the expressive timing inherent in human performance,
thereby eliminating any need for rhythmic quantization. The recognized fragments are
stored in a long-term memory so they can be used for future listening sessions.
Figure 4.1: The Recognizer
Figure 4.2 shows an expanded view of the operation of the Recognizer. When the
Recognizer receives a melodic fragment from the Segmenter, the fragment is first passed to
the DTW algorithm. The algorithm compares the new fragment with a number of templates
contained in the template storage area. When a close match is found the results are sent to
the Evaluator, which decides if the new fragment should be considered similar enough to the
template. If the new fragment is too different from all of the templates, the fragment will become a new template to be matched with future fragments. Otherwise the new fragment
will be added to a list of fragments that are similar to the matching template. In this manner,
the Evaluator and Template storage build clusters of similar melodic fragments. The number
of templates in the Recognizer can expand dynamically as new fragments are received by
the Recognizer.
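A sketch of this decision logic is given below, assuming a dtw_distance helper like the one sketched at the end of Section 4.4 and an illustrative distance threshold.

    def recognize(fragment, templates, clusters, threshold=1.0):
        # templates[i] is paired with clusters[i], the list of fragments
        # judged similar to that template.
        if templates:
            distances = [dtw_distance(fragment, t) for t in templates]
            best = min(range(len(templates)), key=lambda i: distances[i])
            if distances[best] <= threshold:
                clusters[best].append(fragment)   # join an existing cluster
                return best
        templates.append(fragment)                # too different: new template
        clusters.append([fragment])
        return len(templates) - 1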
Figure 4.2: The Recognizer (DTW, Evaluator and Template storage)
4.2 Use of Melodic Contour
As discussed in Chapter 2, recent research into melodic perception suggests that melodic
contour is one of the most salient features used to remember and recognize melodic
sequences. In view of this evidence, a computer program designed for melodic recognition
should be able to analyze melodic contours. Further consideration of melodic recognition
theories suggests that melodic recognition may be viewed as a pattern recognition task. The
contour of an unknown melodic fragment represents a pattern of rising and falling intervals
which in turn characterizes the general shape of the fragment. Melodic contour recognition
therefore involves the comparison of an unidentified melodic fragment with previously
recognized fragments. A measurement of similarity between any two melodic fragments needs to consider the differences between their melodic contours as well as their rhythmic content. However, comparison becomes more complicated when one considers the variability of melodic fragments, especially the differences in their lengths. To overcome this difficulty, the dynamic time warping algorithm (DTW) has been implemented as a method to compare melodic fragments of different lengths.
4.3 Applications of Dynamic Programming to Music
The technique of dynamic time warping is based on the dynamic programming path-finding algorithm. The theory of dynamic programming (DP) was introduced by Bellman (1957) to solve mathematical problems arising from multi-decision processes. Elastic pattern matching involving the comparison of sequences of different lengths can also be modeled using DP. This technique is commonly referred to as dynamic time warping (DTW).
Dannenberg (1984) described one of the first applications of DP to music. His score-
follower algorithm utilized DP for real-time comparison of performed pitches with a score
stored on a computer. The algorithm handled performance deviations from the score by
allowing for insertions, deletions, and replacements. Insertions occurred when the performer
added one or more notes to the score. Deletions occurred when notes were left out of the
performance. Replacements resulted when a performer substituted another note for one in
the score. Through the use of DP, the algorithm was able to match moderately inaccurate
performances with a fixed score.
Mongeau and Sankoff (1990) used DP to compare the overall similarity between two
musical scores. Their algorithm not only accommodated musical insertions, deletions, and
replacements but also consolidation and fragmentation of the musical material.
Consolidation involves the replacement of several elements by a single one, while
fragmentation is the replacement of one element by several. Their implementation of DP
searched for similarities in melodic lines despite gross differences in key, mode, or tempo.
However, their use of interval weights and tonic relationships limited their analyses to
Western tonal music. Their system was intended to compare entire musical works or large
sections of works and is therefore unsuitable for a real-time implementation. Another use of
the DTW for melodic recognition is found in Hiraga (1996).
4.3.1 Other Melodic Fragment Recognition Systems
David Cope's Experiments in Musical Intelligence (EMI) (Cope, 1991; 1992) attempted to determine fundamental elements of a composer's style by locating melodic patterns or "signatures" that appear in more than one composition. The EMI program analyzed MIDI files of a given composer's works using a pattern matching algorithm to locate these
patterns. Signatures were then stored in a database to be used later by the EMI composition
functions in the creation of a new work in the style of the composer. Cope's pattern
matching functions search the input compositions for matching interval sequences (Cope,
1990). The size of the motives to be located, as well as the tolerance for differing intervals in
the sequences may be specified by the user. Cope's exhaustive search algorithms were
inappropriate for a real-time implementation of a Recognizer.
Pierre-Yves Rolland (1994; 1998) presented a comprehensive melodic fragment recognition
algorithm called FlExPat (Flexible Extraction of Patterns). He used this algorithm for the
automated extraction of prominent motives in jazz. Along with motive extraction, Rolland's
system also performed statistical and structural analyses that examined how the motives
were used in relation to the harmonic progression of the piece. Rolland's extensive analysis
algorithm for locating significant melodic fragments in a jazz solo was not designed to be
used in a real-time implementation.
4.4 Implementation of the Dynamic Timewarp Algorithm
The Recognizer is modeled after a discrete word recognition system (DWR) first proposed
by Itakura (1975) and is shown in Figure 4.2. In this system, a monophonic MIDI stream is
segmented into candidate musical fragments by a Segmenter process (Pennycook et al.,
1993). Each fragment produced by the Segmenter consists of 4-16 MIDI notes. The notes
in a fragment are converted to an interval set to remove pitch differences caused by
transposition. For example, fragment 1 shown in Figure 4.3 is represented by the interval
set {0, -1, +1, -5, +1}.
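The conversion is a simple first-difference operation, as the following C sketch shows (function name hypothetical); a pitch sequence such as {72, 71, 72, 67, 68} yields the interval set {0, -1, +1, -5, +1} given above.

/* A sketch of the interval-set conversion: the first element is 0 and
   each later element is the signed distance in semitones from the
   previous note. */
void notes_to_intervals(const char *pitch, int n, int *interval)
{
    interval[0] = 0;                     /* no previous note */
    for (int i = 1; i < n; i++)
        interval[i] = pitch[i] - pitch[i - 1];
}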
While not implemented in this current study, the rhythm of a fragment could be converted
into a set of duration ratios, where each ratio is a note's duration divided by the previous
note's duration. The rhythm of the first melodic fragment in Figure 4.3 could be
represented by the duration ratio set {0, 1, 2, 1, 1}. However, in a live performance of this
fragment, the duration ratio set would not consist of exact integers and might instead
contain values such as {0, 0.9, 2.1, 0.97, 1.1}. The DTW is well suited to accommodate these local
rhythmic variations, thereby removing the need for rhythmic quantization. The duration ratio
representation of rhythm also allows for easy comparison of rhythmic contours that are
related by augmentation or diminution. Each melodic fragment is therefore transformed into
a two-dimensional feature vector that combines the pitch interval set and duration ratio set.
Other feature sets such as velocity could be added to the vector to create an n-dimensional
representation of the musical fragment. The total set of feature vectors becomes the
candidate template representing the unknown fragment.
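A sketch of the duration-ratio conversion described above follows (function name hypothetical); durations are taken in milliseconds, as in the Event record of Appendix A, and the leading 0 follows the same convention as the interval set.

/* A sketch of the duration-ratio conversion:
   ratio[i] = duration[i] / duration[i-1], with ratio[0] = 0. For
   example, the durations {250, 250, 500, 500, 500} produce the set
   {0, 1, 2, 1, 1}. */
void durations_to_ratios(const long *duration, int n, double *ratio)
{
    ratio[0] = 0.0;                      /* no previous note */
    for (int i = 1; i < n; i++)
        ratio[i] = (duration[i - 1] > 0)
                 ? (double)duration[i] / (double)duration[i - 1]
                 : 0.0;                  /* guard against a zero duration */
}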
Figure 4.3: Feature set representation of melodic fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
In order to recognize the unknown fragment, the candidate's feature template is compared to
a database of reference templates. The DTW compares the unknown candidate with each
template in the database and assigns a distance value that indicates the degree of similarity
between the candidate and reference templates. The Recognizer matches the candidate with
the reference template that results in the lowest distance measure. When several close
matches occur, an Evaluator is used to select the best match. If the best match distance is
higher than a pre-defined threshold, the candidate's template is considered to be unique and
can be added to the database of reference templates.
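The selection logic reduces to a minimum search with a uniqueness threshold, as the following C sketch illustrates; the threshold value is illustrative, and the Evaluator's tie-breaking among several close matches is omitted.

/* A sketch of the matching decision: dist[k] holds the DTW distance
   between the candidate and the k-th reference template. */
#define UNIQUE_THRESHOLD 4.0             /* illustrative, not from the thesis */

/* Returns the index of the best-matching reference template, or -1 if
   the candidate is unique and should be added to the database. */
int best_match(const double *dist, int k_templates)
{
    int best = -1;
    double best_dist = 1e30;
    for (int k = 0; k < k_templates; k++)
        if (dist[k] < best_dist) { best_dist = dist[k]; best = k; }
    if (best < 0 || best_dist > UNIQUE_THRESHOLD)
        return -1;                       /* unique: add as a new reference */
    return best;
}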
The DTW template matching process is illustrated in Figure 4.4. For a more complete
introduction to the DTW, the reader is referred to Silverman and Morgan (1990) and
Sankoff and Kruskal (1983). The horizontal axis represents the [1..m] elements of the
unknown candidate C, where m is the number of feature vectors contained in the candidate.
The vertical axis represents the feature vectors [1..n] of a reference template R, n being the
number of feature vectors in R. Each grid intersection point (i,j) represents a possible match
between elements Ci and Rj. A local distance measure d(i,j) is computed at each grid
intersection. This distance measure is a function of the two feature vectors Ci and Rj and
describes the dissimilarity of these two vectors. The DTW in the Recognizer uses the
simple Euclidean distance measure.
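For two feature vectors, the local distance can be computed as follows; this sketch assumes the two features described above (pitch interval and duration ratio), but it works for any n-dimensional feature vector.

#include <math.h>

/* Local distance d(i,j): the Euclidean distance between candidate
   vector Ci and reference vector Rj. Here n_features = 2, but any
   n-dimensional feature vector works. */
double local_distance(const double *ci, const double *rj, int n_features)
{
    double sum = 0.0;
    for (int f = 0; f < n_features; f++) {
        double diff = ci[f] - rj[f];
        sum += diff * diff;
    }
    return sqrt(sum);
}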
After all of the local distances have been calculated, the DTW attempts to trace a path from
endpoint (1,1) to endpoint (m,n) that results in the lowest accumulated distance. As shown in
Figure 4.4, any monotonic path from endpoint (1,1) to endpoint (m,n) represents a possible
mapping or warp of the candidate template onto the reference template. The accuracy of
such a mapping can be measured by summing all of the distances d(i,j) along the path. For a
given candidate/reference pair, the goal is to find the monotonic path through the array that
minimizes the accumulated distance between the endpoints. Instead of tracing every possible
path through the DTW array, we add local and global path restraints to limit the total
number of possible paths. A local restraint used in computing the optimal path through the
DTW array is shown in Figure 4.5. This restraint limits the number of predecessors at a
given grid intersection. As indicated in Figure 4.5, the only possible predecessors of the point
(i,j) are (i-1,j), (i-1,j-1), and (i-1,j-2). The accumulated distance D(i,j) at any point (i,j) on the
grid becomes:

D(i,j) = d(i,j) + min[ D(i-1,j), D(i-1,j-1), D(i-1,j-2) ]
This local path restraint is known as the Itakura local restraint. The Itakura restraint limits
the possible paths through the array to slopes between 0.5 and 2.0. Other local path
restraints are possible and several examples are listed in Silverman and Morgan (1990).
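In C, the accumulated-distance computation under the Itakura restraint can be sketched as follows. This is an illustration of the recurrence above, not the Timewarp source; array bounds are sized for fragments of 4-16 notes, and cells that no legal path can reach keep the sentinel value BIG.

/* d[i][j] holds the local distances, indexed from 1 as in the text. */
#define MAXM 17
#define MAXN 17
#define BIG  1e30

double dtw_accumulate(double d[MAXM][MAXN], int m, int n)
{
    double D[MAXM][MAXN];
    for (int i = 0; i <= m; i++)
        for (int j = 0; j <= n; j++)
            D[i][j] = BIG;
    D[1][1] = d[1][1];                       /* every path starts at (1,1) */
    for (int i = 2; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            double best = D[i - 1][j];                          /* (i-1,j) */
            if (j >= 2 && D[i - 1][j - 1] < best) best = D[i - 1][j - 1];
            if (j >= 3 && D[i - 1][j - 2] < best) best = D[i - 1][j - 2];
            if (best < BIG)
                D[i][j] = best + d[i][j];    /* D(i,j) = d(i,j) + min(...) */
        }
    return D[m][n];                          /* lowest accumulated distance */
}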
The DTW algorithm has a complexity of mn, where m and n are the number of feature
vectors in the candidate and reference templates. It is therefore desirable to further reduce
the number of distance and path calculations required to determine the degree of similarity
between a candidate and a reference template. Global path restraints such as the one shown in
Figure 4.6 require that the optimal path lie within a certain region of the DTW array. Paths
that run outside of this region are rejected. The use of a parallelogram search space (Rabiner
et al., 1978) reduces the complexity of the DTW to approximately nm/3.
Figure 4.4: DTW matching algorithm
Figure 4.5: Itakura Local Restraint
Figure 4.6 illustrates the region of this search space. The local and global path
restraints allow the DTW to compare an unknown candidate with templates that vary from
0.5 to 2.0 times its length. To compare a candidate with K templates, the complexity
becomes K(nm/3). Real-time performance is therefore dependent on the size of K.
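The parallelogram can be expressed as a pair of bounds on j for each column i: one line of slope 0.5 and one of slope 2.0 through each endpoint. The following sketch of these bounds (function names hypothetical) follows from that geometry; the DTW loop then evaluates only j = j_lower(i,m,n) .. j_upper(i,m,n).

/* Rows j within slopes 0.5 and 2.0 of both endpoints (1,1) and (m,n). */
int j_lower(int i, int m, int n)
{
    int a = 1 + i / 2;                   /* slope 0.5 line from (1,1) */
    int b = n - 2 * (m - i);             /* slope 2.0 line into (m,n) */
    int lo = (a > b) ? a : b;
    return (lo < 1) ? 1 : lo;
}

int j_upper(int i, int m, int n)
{
    int a = 2 * i - 1;                   /* slope 2.0 line from (1,1) */
    int b = n - (m - i + 1) / 2;         /* slope 0.5 line into (m,n) */
    int hi = (a < b) ? a : b;
    return (hi > n) ? n : hi;
}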
The DTW compares a candidate template with every reference template contained in the
template database. These reference templates may be loaded into the database in a number
of ways. Templates may be pre-loaded by playing the desired musical fragments into the
Segmenter. The Segmenter converls each fragment into a feature vector template and stores
it into the template database. Templates may be loaded from a data file containing templates
recognized during previous sessions. Templates may also be created by the DTW
Evaluator. With this method, melodic fragments that are not recognized by the DTW are
entered into the template database. It is therefore possible to start with an empty database
and have the system construct a template database that contains templates representing the
unique musical fragments contained in a single musical work. A graphical editor allows for
the display and editing of the recognized fragments and also the correction of incorrectly
segmented or recognized fragments. The flexibility of the template database allows the
system to be used in a variety of melodic recognition tasks. The system can be configured
to respond to only one feature such as pitch or rhythm. Likewise, new features may be
defined and added to the DTW recognition process.
As melodic fragments are recognized by the Recognizer, they are added to the template
memory. A new fragment that does not match any other fragment is labeled a motive. Any
future fragments that are similar to the motive fragment are added to a list of submotives
belonging to the fragment. The motive fragments therefore become the set of templates used
to match a new melodic fragment. Once the closest matching template has been discovered,
the list of related submotives can be searched for an even closer match. This strategy
reduces the size of the search space as the template database grows over repeated listening
sessions.
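The organization described above can be pictured as a two-level linked structure. The following C sketch is hypothetical in its field names (the actual structures appear only as opaque pointers in Appendix A) but reflects the motive/submotive relationship described in the text.

/* Unmatched fragments become motives; fragments recognized as a motive
   are appended to its submotive list, so the DTW scans only the motive
   templates before searching the winner's submotives. */
struct template;                         /* feature vector template (opaque) */

struct submotive {
    struct template  *tmpl;              /* template of this fragment          */
    double            dist;              /* DTW distance to the parent motive  */
    struct submotive *next;              /* next submotive in the list         */
};

struct motive {
    struct template  *tmpl;              /* reference template used by the DTW  */
    struct submotive *submotives;        /* fragments recognized as this motive */
    struct motive    *next;              /* next motive in the template memory  */
};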
Figure 4.6: DTW search space
4.5 Real-time Listening Example
It is now time to examine the real-time operation of the Recognizer. Figure 4.7 shows the
Timewarp application embedded in a Max patch (Puckette and Zicarelli, 1990). The
Timewarp objects are labeled melshape. The music for the real-time listening session is a
MIDI recording of a performance of the Fugue in C minor, BWV 847 by J. S. Bach. The
fugue is played by the playSMF object, which sends the MIDI notes of the fugue in real-
time to the melshape objects. Each instance of the melshape object listens to a single voice of
the fugue. The melshape objects are labeled soprano, alto and bass respectively. The
melshape object contains both the Segmenter and Recognizer objects. The analysis
performed by each of the melshape objects may be viewed by double-clicking on the
melshape object. The analysis display window for the soprano voice is shown in Figure 4.8,
and the bass voice in Figure 4.9.
Figures 4.8 and 4.9 show the operation of the Segmenter and the Recognizer. Each melodic
fragment is labeled with the Segmenter rules and weights that determined the group
boundary. The shape of the best matching template is drawn above each fragment. Five
recognition templates are used in this example to represent the five basic melodic contours
(Scheidt, 1985):
Rising
Falling
Rising/Falling
Falling/Rising
Flat
The shapes of these templates are shown in Figures 4.10 to 4.14. Each template is specified
as a set of 15 interval offsets, enough to represent a fragment of up to 16 notes. The interval
offsets for each template are shown in Figure 4.15.
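In code, such templates are simply short arrays of interval offsets. The values below are illustrative stand-ins, not the contents of Figure 4.15; they merely show the shapes the five contour templates might take.

#define TEMPLATE_LEN 15

/* Illustrative contour templates, one offset per interval. */
static const int rising[TEMPLATE_LEN]    = {  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
static const int falling[TEMPLATE_LEN]   = { -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1 };
static const int rise_fall[TEMPLATE_LEN] = {  1, 1, 1, 1, 1, 1, 1, 0,-1,-1,-1,-1,-1,-1,-1 };
static const int fall_rise[TEMPLATE_LEN] = { -1,-1,-1,-1,-1,-1,-1, 0, 1, 1, 1, 1, 1, 1, 1 };
static const int flat[TEMPLATE_LEN]      = {  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };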
During this listening session, the Segmenter in each melshape object sends melodic
fragments to its Recognizer. The results of listening to the alto voice of the fugue are shown
in Figures 4.16 to 4.19. The display shown in Figure 4.16 shows all of the rising fragments
found in the alto voice. The normalized versions of these fragments are shown in Figure
4.17. Results for rising/falling fragments are shown in Figures 4.18 and 4.19. Results for
the other template shapes are shown in Figures 4.20 to 4.24. These results show that the
Recognizer effectively clusters melodic fragments with similar melodic contours.
Figure 4.7: Timewarp objects in Max environment
Figure 4.8: Segmenter/Recognizer Display
(Input: Soprano voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.9: Segmenter/Recognizer Display
(Input: Bass voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.10: Rising template
Figure 4.11: Falling template
Figure 4.12: Rising/Falling template
Figure 4.13: Falling/Rising template
Figure 4.14: Flat template
Figure 4.15: Recognition Template Editor
Figure 4.16: Recognition Display/Editor showing rising fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.17: Rising melodic fragments (normalized)
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.18: Recognition Display/Editor showing rising/falling fragments
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.19: Rising/falling melodic fragments (normalized)
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.20: Examples of rising fragment recognition
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.21: Examples of falling fragment recognition
(Input: Bass voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.22: Examples of rising/falling recognition
(Input: Alto voice, Fugue No. 2 in C Minor, BWV 847, J. S. Bach)
Figure 4.23: Examples of falling/rising recognition
Figure 4.24: Examples of flat fragment recognition
4.6 Summary
For this implementation of the Recognizer in the Listener model, the DTW proved to be an
effective algorithm for the recognition of melodic fragments. The system was capable of
recognizing short melodic fragments in real-time. Good results were obtained in the
recognition of melodic fragments from music ranging from Bach fugues to bebop jazz. In
general, the Recognizer was capable of recognizing melodic fragments without specific
knowledge of a musical style. The contour-matching capabilities of the DTW allowed for
recognition of similar melodic fragments by accommodating the effects of insertion,
deletion, replacement, consolidation and fragmentation of elements.
The DTW-based melodic recognition system has been implemented as a Max external
object, a T-Max object for implementation on a parallel array of INMOS T-805 transputers
(Pennycook and Lea, 1991), and as a Macintosh program called Timewarp. The DTW may
also be used as a melodic recognition process for music analysis of printed scores. For
example, the output of an optical music recognition system (Fujinaga, 1992; 1996) may be
used as input to a DTW recognition system. This would enable users to search large
musical databases for similarities of certain musical features. This would also allow DTW
reference template databases to be constructed directly from printed musical scores. Future
research will examine level-building DTW algorithms (Silverman and Morgan, 1990) used
in continuous speech recognition to create higher-level hierarchies from recognized
fragments. As the performance of the recognition system is dependent upon the accuracy of
the Segmenter, continuous speech recognition techniques may allow for continuous
recognition of fragments and remove the need for a segmentation process.
5. Real-time Segmentation and Recognition of Vertical Sonorities
5.1 Introduction
In this dissertation I have presented a model for the segmentation and recognition of
melodic fragments. As part of the research in creating this system I also examined the
problem of real-time segmentation and recognition of vertical sonorities. This resulted in the
realization of a real-time chord recognition component. The chord recognizer was part of a
computer model of a jazz improviser (Pennycook et al., 1993). The jazz improviser model
consisted of two real-time, large-grain parallel processes, called the Listener and the Player
(Pennycook and Lea, 1991). One of the tasks of the Listener was to provide the Player with
information about the ensemble's position in a chord progression. The Listener must
therefore be able to derive the root of a chord given a collection of notes performed by an
ensemble of musicians. The Listener must also determine the quality of a given chord so
that the Player can fit its improvisation to the current harmonic situation.
5.2 Real-time Chord Recognition
The chord recognition process may be broken down into at least two separate processes.
First, notes must be grouped together into vertical sonorities. Second, the vertical sonorities
must be analyzed to determine the root and chord type. Rosenthal (1992) introduced a
chord finder based on the principles of stream segregation. Unfortunately, his method was
not designed to operate in real-time. Rowe (1993) presented a connectionist model for the
real-time recognition of chord roots. His model grouped a stream of notes into vertical
sonorities and determined the central pitch of a local harmonic area. However, Rowe's model
was designed to detect simple tertian structures and was therefore unsuitable for
determining the roots of the harmonic structures found in jazz.
Parncutt (1988) revised Terhardt's (1982) octave-generalized model of the root of a musical
chord, which involves assigning appropriate weights to a set of "root-support intervals" (in
descending order of importance: P1/P8, P5, M3, m7, and M2/M9) in a subharmonic
matching routine. The model reliably predicted the roots of typical chords in 18th- and 19th-
century music, including the traditionally problematic minor triad. We chose to incorporate
Parncutt's model into the Listener because it was not limited to simple tertian harmonies and
could be easily adapted for real-time operation.
Figure 5.1: Chord recognition process
The chord recognition model operates in real-time on a stream of MIDI data (Figure 5.2).
The operation of the model is shown in Figure 5.1. The model first segments the MIDI data
stream into vertical sonorities by grouping together notes with attack times that are within 80
ms of each other (Figure 5.4). The grouping of the chord candidates is shown in Figure 5.3.
When a vertical sonority of two or more notes has been detected by the chord segmenter,
the notes are sent to the chord recognizer. The chord recognizer uses a modified version of
the algorithm described in Parncutt (1988). The recognizer first calculates the pitch-class
weights for all possible roots of the chord. These pitch-class weights are then used to
estimate the "root ambiguity" of the chord. Finally, the calculated root ambiguity value is
used to convert the pitch weights into absolute estimates of salience (Parncutt, 1988). The
model selects the root with the highest pitch-class weight and reports to the Player the root
of the current vertical sonority, along with its chord type and root ambiguity.
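The root-finding step can be sketched as a subharmonic matching loop over the twelve possible roots. The weights below are illustrative stand-ins, not the values of Tables 5.1 and 5.2, and the root-ambiguity and salience steps are omitted; the values are chosen only so that the sketch reproduces the behavior discussed in Section 5.3 and Figure 5.6, where Parncutt-style weights resolve C-Eb-G-Bb narrowly to Eb while the revised jazz-style weights resolve it to C.

#include <stdio.h>

/* weight[k]: support a pitch class lends to a candidate root lying k
   semitones below it. Values are illustrative; in the revised model
   they are user-adjustable (Figure 5.5). */
static const double original[12] = {
    10.0, 0.0, 1.0, 0.0, 3.0, 0.0,   /* P1, -,  M2, -,  M3, - */
     0.0, 5.0, 0.0, 0.0, 2.0, 0.0    /* -,  P5, -,  -,  m7, - */
};
static const double revised[12] = {
     8.0, 0.0, 1.0, 2.0, 4.0, 0.0,   /* P1, -,  M2, m3, M3, -  */
     1.0, 4.0, 0.0, 0.0, 4.0, 1.0    /* TT, P5, -,  -,  m7, M7 */
};

/* Return the root (0 = C .. 11 = B) with the highest pitch-class weight
   for a sonority given as a 12-element presence mask. */
static int estimate_root(const int present[12], const double weight[12])
{
    int best_root = 0;
    double best_w = -1.0;
    for (int root = 0; root < 12; root++) {
        double w = 0.0;
        for (int pc = 0; pc < 12; pc++)
            if (present[pc])
                w += weight[(pc - root + 12) % 12];
        if (w > best_w) { best_w = w; best_root = root; }
    }
    return best_root;
}

int main(void)
{
    int cmin7[12] = { 0 };
    cmin7[0] = cmin7[3] = cmin7[7] = cmin7[10] = 1;            /* C Eb G Bb */
    printf("original: %d\n", estimate_root(cmin7, original));  /* 3 = Eb */
    printf("revised:  %d\n", estimate_root(cmin7, revised));   /* 0 = C  */
    return 0;
}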
Figure 5.2: Piano roll representation of a MIDI performance
(Input: Tenor Madness piano improvisation)
Figure 5.3: Notes grouped into chord candidates
(Input: Tenor Madness piano improvisation)
Figure 5.4: Creating chord candidates in real-time
5.3 Revision of Parncutt's Model
We discovered that Parncutt's model often failed to predict the roots of jazz keyboard
voicings, especially those in which roots or fifths are absent. In Parncutt's version, the
intervals P1/P8 and P5 are weighted quite heavily relative to the other intervals (see Table
5.1). The new weights for the model are shown in Table 5.2. We have reduced the emphasis
on the intervals P1/P8 and P5, and increased the weights for the intervals m3, M3, and m7.
We have also added the root-support intervals TT and M7, which were absent from Terhardt's
and Parncutt's original versions. We have implemented the model in such a way that the
user may easily adjust the interval weights according to the harmonic style of the music to
be analyzed (Figure 5.5).
Table 5.1: Original weights (after Terhardt, 1982)
Table 5.2: Revised Jazz Weights
To compare the models, consider the chord C-Eb-G-Bb in any inversion (Figure 5.6).
Parncutt's model predicts that the root of this chord is ambiguous, with Eb winning by a
small margin. The revised model calculates the root to be C and is thus more consistent with
jazz harmonies and the musical styles in question in this research, in which the interpretation
ii7-V7-I is preferred to IV6/5-V-I. A similar calculation for the Do7 chord is shown in
Figure 5.7. It appears that the relative importance of the root-support intervals depends not
only on experience of the harmonic structure of single complex tones in speech and music,
but also on experience of specific musical styles.
Figure 5.5: Adjustable subharmonic weights
Figure 5.6: Root Calculation for Cmin7 Chord
Figure 5.7: Root Calculation for Do7 Chord
5.4 Real-time Example
The real-time chord recognition system has been implemented as a Max external object
called chord. The use of the chord object in a Max patch is shown in Figure 5.8. In this
example, the chord recognizer is presented with a piano improvisation on the tune Tenor
Madness by Sonny Rollins. The chord progression for this tune is shown in Figure 5.9.
The recognized chords are shown in the chord recognizer display in Figure 5.10. The
recognizer is capable of determining the chord root and chord label, including chord
extensions.
Figure 5.8: Max chord object
Figure 5.10: Recognized Chords
(Input: Tenor Madness piano improvisation)
5.5 Summary
The flexibility of the real-time chord recognizer has been verified with harmonic analyses of
music ranging from Bach to bebop jazz. When the chord recognizer selects a root that is
not the same as the notated harmonic progression, in most cases the root selected is an
appropriate chord substitute. As with Parncutt's original model, the revised model treats
chords as isolated vertical phenomena and does not consider each chord as an element in a
harmonic progression. Future work will involve adapting the model to consider previously
identified roots in its identification of the current chord root.
6. Conclusions
This dissertation presents a working model of melodic fragment recognition. The Listener
model attempts and in many ways succeeds in modeling the perceptual processes used by
human listeners while listening to a musical performance. In many different musical
situations, the model performed the musical recognition tasks accurately. This lends support
to the validity of the hypotheses represented by the model. This computer model of a
listener demonstrated the use of applied musicology for the testing of hypotheses
formulated by perceptual research (hske , 1989). By examining perceptual research into
human melodic recognition, we can eventually develop a working computer model for the
red-time recognition of melodies.
If theories of human musical perception can be validated by computer models, then these
same theories can be used to improve the quality of computer music systems (Laske, 1978).
Computer music systems are any combination of computer hardware and software used to
compose, analyze or perform music. Human perceptual processes must be considered in the
design of these systems (Vercoe, 1992). Valid, perceptually-based theories of music are in
fact essential to computer music systems for "any attempts to simulate the ... abilities of
humans will probably not succeed until in fact the musical models and plans that humans
use are described and modeled" (Moorer, 1972, p.104). Theory is the lifeblood of a
computer music system and "a music analyst approaching the computer quickly realizes that
running out of theory means running out of program code" (Alphonce, 1980, p.26). While
music theory is one of the oldest disciplines in the literate history of civilization, "systematic,
coherent, well-developed and computable music theory is both very young and rather
scarce" (Alphonce, 1980, p.26).
In humans, the formation of appropriate responses to musical stimuli draws upon a vast and
complex network of concepts, learned skills and innate behaviors that represent a formidable
challenge to precise analysis (Dannenberg, 1987). Refined and specialized listening skills
are required of composers, music analysts, and performers. However, considerable portions
of a musical "expert's" abilities may be equally found in "naive" listeners, and even these
non-experts don't seem so naive when one looks more closely at their listening skills. Their
ability to segment and recognize melodic fragments and also to group these fragments into
melodies requires that a large number of complex tasks operate together in real-time. The
ability to perceive melodic patterns at various levels of organization is needed for a full
understanding of a piece of music. It seems that music understanding may rely heavily on
pattern recognition capabilities as well as logical reasoning and problem-solving techniques.
At the present time, computers should not be expected to understand music the way humans
do. But it is possible to build computer music systems that attempt to deal with specific
aspects of musical intelligence. By becoming more aware of both the possibilities and the
limitations of such systems, we may learn how human listeners do what they do while
gradually raising the level of the simulated computer musician's musical abilities.
Appendix A. The Event Record
The following describes the fields of the Event record, a structure used to define each note
received by the Listener model. The Event record groups into one location the MIDI note-
on and note-off information for a given note. The record also contains all of the information
needed to make decisions on group boundaries. Information about the recognized melodic
fragment is also contained in the Event record.
long songNoteNum    a number assigned to the note when it is received.
long begin          the start time of the note in milliseconds. This is the time that the note-on event was received.
long end            the end time of the event in milliseconds. This is the time that the note-off event was received.
long offset         the time in milliseconds from the note-on of this note to the note-on time of the next note.
long duration       the duration of the note in milliseconds. This is the time from the note-on to the note-off.
char segLength      the length of the melodic fragment starting from this note. Value will be 0 if this note is not the start of a fragment.
char class          the Scheidt (1985) class assigned to the fragment if this is the first note of the fragment. Values assigned are:
                    0 - Rising
                    1 - Falling
                    2 - Rising/Falling
                    3 - Falling/Rising
                    4 - Flat
char dist           if this note is the start of a fragment, the dist value is the measure of similarity assigned by the DTW algorithm.
char pitch          the MIDI note value for this note.
char velocity       the note-on MIDI velocity of this note.
char interval       the interval in semitones between this note and the previous note.
char onToOntype the note-on to next note-on rhythmic type value assigned to
the note by the Short Term Memory (STM). The value
ranges from 0-9 based on which rhythmic type is assigned
by the STM.
char onToOfftype the note-on to note-off rhythmic type value assigned to
the note by the Short Term Memory (STM). The value
ranges from 0-9 based on which rhythmic type is assigned
by the STM.
char offToOntype used only for updating the note-off clock in the RT Segmenter.
The note-off to next note-on rhythmic type assigned by the
STM. Values range from 0-9.
char offToOnArtictype the articulation type assigned to the note. Value is one of:
0 - Slur
1 - Staccato
3 - Rest
Boolean beginOfSeg TRUE if this is the beginning of a segment, else FALSE
Boolean N4yesSeg TRUE if the N4 Segmenter determined a group boundary at
this note.
Boolean N3yesSeg TRUE if the N3 Segmenter determined a group boundary at
this note.
Boolean realyesseg TRUE if the RT Segmenter determined a group boundary at
this note.
short N4rules a set of binary flags indicating which N4 Segmenter rule is
in effect at the group boundary.
short N4sum-of-weights the total of all weights used by the N4 Segmenter to
determine the group boundary.
short N3rules a set of binary flags indicating which N3 Segmenter rule is
in effect at the group boundary.
short N3sum-of-weights the total of all weights used by the N3 Segmenter to
determine the group boundary.
short realrules a set of binary flags indicating which RT Segmenter rule is
in effect at the group boundary.
short realsum-of-weights the total of all weights used by the RT Segmenter to
determine the group boundary.
struct event *segEnd if this event is the beginning of a segment, this field is a
pointer to the event at the end of the segment.
struct event *segBegin if this event is the end of a segment, this field is a pointer to
the event at the beginning of the segment.
struct motive *motive if this event is the start of a fragment, this field is a pointer to
the motive assigned to this fragment.
struct submotive *submotive if this event is the start of a fragment, this field is a pointer to
the submotive assigned to this fragment.
struct event *next a pointer to the next Event structure in the linked list of
Events.
struct event *prev a pointer to the previous Event structure in the linked list of
Events.
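For reference, the fields above assemble into a C struct along the following lines; hyphenated names are rendered with underscores, and the Boolean typedef is an assumption.

typedef char Boolean;

struct event {
    long    songNoteNum;           /* number assigned when the note is received  */
    long    begin, end;            /* note-on and note-off times (ms)            */
    long    offset;                /* note-on to next note-on (ms)               */
    long    duration;              /* note-on to note-off (ms)                   */
    char    segLength;             /* fragment length; 0 if not a fragment start */
    char    class;                 /* Scheidt (1985) contour class               */
    char    dist;                  /* DTW similarity, if a fragment start        */
    char    pitch, velocity, interval;
    char    onToOntype, onToOfftype, offToOntype, offToOnArtictype;
    Boolean beginOfSeg, N4yesSeg, N3yesSeg, realyesseg;
    short   N4rules, N4sum_of_weights;
    short   N3rules, N3sum_of_weights;
    short   realrules, realsum_of_weights;
    struct event     *segEnd;      /* end of segment, if this begins one      */
    struct event     *segBegin;    /* beginning of segment, if this ends one  */
    struct motive    *motive;      /* motive assigned, if a fragment start    */
    struct submotive *submotive;   /* submotive assigned, if a fragment start */
    struct event     *next, *prev; /* linked list of Events                   */
};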
References
Alphonce, B. 1980. Music Analysis by Computer - a field for Theory Formation. Computer hfusic Journal. 4,2: 26-35.
Ames, C. and M. Domino. 1992. Cybernetic Composer: An Overview. Understanding Music with AI. Cambridge: AAAI Press. 186-205.
Bartlett, James C. and Jay W. Dowling. 1980. Recognition of Transposed Melodies: A Key-Distance Effect in Developmental Perspective. Journal of Experimental Psychology: Human Perception and Performance. 6(3): 501-515.
Bellman, R. E. 1957. Dynamic Programming. Princeton: Princeton University Press.
Bregman, A. 1990. Auditory Scene Analysis. Cambridge: The MIT Press.
Chomsky, N. 1965. Aspects of the Theory of Syntax. Cambridge: The MIT Press.
Cope, David. 1992. Computer Modeling of Musical Intelligence in EMI. Computer Music Journal. 16,2: 69-83.
Cope, David. 1991. Computers and Musical Style. Madison: A-R Editions.
Cope, David. 1990. Pattern Matching as an Engine for the Computer Simulation of Musical Style. Proceedings of the 1990 International Computer Music Conference. 288-291.
Croonen, W. L. M., and P. F. M. Kop. 1989. Tonality, Tonal Scheme, and Contour in Delayed Recognition of Tone Sequences. Music Perception. 7,1: 49-68.
Dannenberg, Roger B. and Bernard Mont-Reynaud. 1987. Following an Improvisation in Real Time. Proceedings of the International Computer Music Conference. 241-247.
Dannenberg, Roger B. 1984. An On-Line Algorithm for Real-Time Accompaniment. Proceedings of the 1984 International Computer Music Conference. 193-198.
Davies, J. B. and J. Jennings. 1977. Reproduction of familiar melodies and the perception of tonal sequences. Journal of the Acoustical Society of America. 61,2: 534-541.
Deliège, I. 1987. Grouping Conditions in Listening to Music: An Approach to Lerdahl and Jackendoff's Grouping Preference Rules. Music Perception. 4,4: 325-360.
Desain, Peter and Henkjan Honing. 1995. Computational models of beat induction: the rule-based approach. Artificial Intelligence and Music. 14th International Conference on Artificial Intelligence. 1-10.
Dowling, W. Jay. 1978. Scale and Contour: Two Components of a Theory of Memory for Melodies. Psychological Review. 85(4): 341-354.
Dowling, W. J. 1973. Rhythmic Groups and Subjective Chunks in Memory for Melodies. Perception and Psychophysics. 14,1: 37-40.
Dowling, W. J. and Diane Fujitani. 1971. Contour, Interval, and Pitch Recognition in Memory for Melodies. The Journal of the Acoustical Society of America. 49(2): 524-531.
Dyson, Mary C. and Anthony J. Watkins. 1984. A Figural Approach to the Role of Melodic Contour in Melody Recognition. Perception and Psychophysics. 35,5: 477-485.
Ellis, D. 1992. A Perceptual Representation of Audio. MS Thesis, EECS, Media Laboratory, Massachusetts Institute of Technology.
Fujinaga, I. 1996. Adaptive Optical Music Recognition. Ph.D. dissertation, McGill University.
Fujinaga, I., B. Alphonce, and Bruce Pennycook. 1992. Interactive Optical Music Recognition. Proceedings of the 1992 International Computer Music Conference. 117-120.
Gill, Stanley. 1964. A Technique for Composition of Music in a Computer. The Computer Journal. 6: 129-133.
Grubb, Lorin and Roger B. Dannenberg. 1993. Pattern Processing in Music. Proceedings of the 1994 International Computer Music Conference. 63-69.
Hiller, Lejaren A. 1959. Computer Music. Scientific American. 201(6): 109-120.
Hiller, Lejaren A. and Leonard M. Isaacson. 1959. Experimental Music: Composition with an Electronic Computer. New York: McGraw-Hill.
Hiraga, Yuzuru. 1996. A Cognitive Model of Pattern Matching in Music. Proceedings of the 1996 International Computer Music Conference. 248-250.
Itakura, Fumitada. 1975. Minimum Prediction Residual Principle Applied to Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing. ASSP-23: 67-72.
Koenig, G. M. 1970a. Project One. Electronic Music Report 3. Utrecht: Institute of Sonology. (Reprinted 1977, Amsterdam: Swets and Zeitlinger.)
Koenig, G. M. 1970b. Project Two. Electronic Music Report 2. Utrecht: Institute of Sonology. (Reprinted 1977, Amsterdam: Swets and Zeitlinger.)
Laske, Otto. 1989. Introduction to Cognitive Musicology. Journal of Musicology. 9: 1-22.
Laske, Otto. 1981. Composition Theory in Koenig's Project One and Project Two. Computer Music Journal. 5,4: 54-65.
Laske, Otto. 1980. Towards an Explicit Cognitive Theory of Musical Listening. Computer Music Journal. 4,2: 73-83.
Laske, Otto. 1978. Considering Human Memory in Designing User Interfaces for Computer Music. Computer Music Journal. 2,4: 39-45.
Ledley, Robert Steven. 1962. Programming and Utilizing Digital Computers. New York: McGraw-Hill.
Lerdahl, Fred and Ray Jackendoff. 1983. A Generative Theory of Tonal Music. Cambridge: The MIT Press.
Massaro, Dominic W., Howard J. Kallman, and Janet L. Kelly. 1980. The Role of Tone Height, Melodic Contour, and Tone Chroma in Melody Recognition. Journal of Experimental Psychology: Human Learning and Memory. 6(1): 77-90.
Minsky, Marvin. 1986. The Society of Mind. New York: Simon and Schuster.
Minsky, Marvin. 1981. Music, Mind, and Meaning. Computer Music Journal. 5,3: 28-44.
Mongeau, Marcel and David Sankoff. 1990. Comparison of Musical Sequences. Computers and the Humanities. 24: 161-175.
Moorer, James Anderson. 1972. Music and Computer Composition. Communications of the ACM. 15(2): 104.
Parncutt, R. 1988. Revision of Terhardt's Psychoacoustic Model of the Root(s) of a Musical Chord. Music Perception. 6,1: 65-94.
Pennycook, Bruce and Dale Stammen. 1994. A Model of Tonal Jazz Improvisation. Proceedings of the 3rd International Conference on Music Perception and Cognition. 61-62.
Pennycook, Bruce, Dale Stammen, and Debbie Reynolds. 1993. Toward a Computer Model of a Jazz Improviser. Proceedings of the 1993 International Computer Music Conference. 228-231.
Pennycook, Bruce and Chris Lea. 1991. T-MAX: A Parallel Processing Development System for MAX. Proceedings of the 1991 International Computer Music Conference. 229-233.
Pinkerton, Richard C. 1956. Information Theory and Melody. Scientific American. 191: 77-86.
Puckette, M. and D. Zicarelli. 1990. Max: An Interactive Graphical Programming Environment. Menlo Park: Opcode Systems.
Rabiner, L. R., A. E. Rosenberg, and S. E. Levinson. 1978. Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing. ASSP-26: 575-582.
Roads, Curtis. 1979. Grammars as Representations for Music. Computer Music Journal. 3,1: 45-55.
Rolland, Pierre-Yves. 1998. Découverte Automatique de Régularités dans les Séquences et Application à l'Analyse Musicale. Ph.D. Dissertation, Université Paris.
Rolland, Pierre-Yves. 1994. Automated Extraction of Prominent Motives in Jazz Solo Corpuses. Proceedings of the 3rd International Conference on Music Perception and Cognition. 491-495.
Rollins, Sonny. 1957. Tenor Madness. Berkeley: Prestige Music Inc.
Rosenthal, D. 1992. Machine Rhythm: Computer Emulation of Human Rhythm Perception. Ph.D. dissertation, Massachusetts Institute of Technology.
Rosenthal, David. 1992. Emulation of Human Rhythm Perception. Computer Music Journal. 16,1: 64-76.
Rosenthal, David. 1989. A Model of the Process of Listening to Simple Rhythms. Music Perception. 6,3: 315-328.
Rowe, Robert and Tang-Chun Li. 1994. Pattern Processing in Music. Proceedings of the 1994 International Computer Music Conference. 60-62.
Rowe, Robert. 1993. Interactive Music Systems: Machine Listening and Composing. Cambridge, Massachusetts: The MIT Press.
Rowe, Robert. 1991. Machine Listening and Composing: Making Sense of Music with Cooperating Real-Time Agents. Ph.D. dissertation, Massachusetts Institute of Technology.
Sakoe, Hiroaki and Seibi Chiba. 1978. Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Transactions on ASSP. 26,1.
Sankoff, David and Joseph B. Kruskal (Eds.). 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, Massachusetts: Addison-Wesley.
Scheidt, Daniel J. A. 1985. A Prototype Implementation of a Generative Mechanism for Music Composition. Masters Thesis, Queen's University, Kingston, Ontario.
Schloss, A. 1985. On the Automatic Transcription of Percussive Music - from Acoustic Signal to High-Level Analysis. Ph.D. Thesis, CCRMA, Department of Music, Stanford University.
Silverman, Harvey F. and David P. Morgan. 1990. The Application of Dynamic Programming to Connected Speech Recognition. IEEE ASSP Magazine. 7,3: 6-25.
Smoliar, Stephen W. 1992. Representing Listening Behavior: Problems and Prospects. In Mira Balaban, et al. (Eds.): Understanding Music with AI. Cambridge, Massachusetts: AAAI Press. 53-63.
Smoliar, Stephan W. 1976. Music Programs: An Approach to Music Theory through Computational Linguistics. Journal of Music Theory. 20: 105-131.
Stammen, Dale and Bruce Pennycook. 1994. Real-time Segmentation of Music Using an Adaptation of Lerdahl and Jackendoff's Grouping Principles. Proceedings of the 3rd International Conference on Music Perception and Cognition. 269-270.
Stammen, D., B. Pennycook and R. Parncutt. 1994. A Revision of Parncutt's Psychoacoustical Model of the Root of a Musical Chord. Proceedings of the 3rd International Conference on Music Perception and Cognition. 357-358.
Stammen, Dale and Bruce Pennycook. 1993. Real-time Recognition of Melodic Fragments Using the Dynamic Timewarp Algorithm. Proceedings of the 1993 International Computer Music Conference. 232-235.
Terhardt, E., G. Stoll and M. Seewann. 1982. Algorithm for extraction of pitch and pitch salience from complex tonal signals. Journal of the Acoustical Society of America. 71: 679-688.
Vercoe, Barry. 1992. A Realtime Auditory Model of Rhythm Perception and Cognition. Paper presented at the Second International Conference on Music Perception and Cognition. Los Angeles, California.
Vercoe, Barry. 1985. The Synthetic Performer in the Context of Live Music. Proceedings of the 1984 International Computer Music Conference. 199-200.
Vercoe, Barry. 1971. Harry B. Lincoln, The Computer and Music, and Barry S. Brook, Musicology and the Computer. Perspectives of New Music. 9,1: 323-330.