Intonation Components in Short English Statements
Yi Xu
Haskins Laboratories
New Haven, Connecticut
Ching X. Xu
Department of Communication Sciences and Disorders
Northwestern University, Evanston
Running Title: Intonation Components in English
Address: Yi Xu, Haskins Laboratories, 270 Crown Street, New Haven, CT 06511, USA
Telephone: (203)865-6163
E-mail: xu@haskins.yale.edu
Yi Xu & Ching. X. Xu
2
ABSTRACT
In this study we attempt to identify the basic components of statement intonation as related to
focus, accent and lexical stress in General American English. Instead of viewing f0 contours as direct
acoustic correlates of intonation components, we regard them as the outcome of implementing
different functional components of intonation under various articulatory constraints. Eight
American English speakers were recorded while reading aloud short declarative sentences with or
without narrow focus at different locations. Results of analyses suggest that f0 contours in short
declarative sentences in English are determined by three separate specifications: local pitch target,
articulatory effort, and pitch range. Every syllable seems to be associated with a pitch target which
determines the ideal local pitch contour. Non-focused, non-final accents seem to carry a static
[high], and word-final accent under focus and sentence-final accent seem to carry a dynamic [fall].
Unaccented syllables, whether or not lexically stressed, probably carry a static [mid] rather than
being completely targetless. Articulatory effort determines how forcefully a local target is
implemented. The pitch targets of accented syllables seem to be implemented with strong efforts,
those of unaccented syllables with weak efforts, and those of lexically unstressed syllables with even weaker
efforts. Pitch range determines the height and span of f0 at which local pitch targets are
implemented. Focus appears to operate by expanding the pitch range of the on-focus stressed
syllables, suppressing the pitch range of all post-focus syllables, and leaving the pitch range of pre-
focus words intact. To account for the present data as well as other recent findings, a new model of
intonation is considered.
Intonation Components in English
3
1. INTRODUCTION
A major objective in studying intonation is to determine its basic components. This is by no means a
trivial task. The meandering f0 curve of a speech utterance could be viewed as being constructed in
various ways. Some studies treat pitch contours such as rising, falling, and more complex shapes as
the basic components (Pike, 1945, 1948; Abramson, 1978; Bolinger, 1951, 1986; Crystal, 1969;
't Hart, Collier, & Cohen, 1990; Taylor 1994). Some studies analyze tone and intonation into pitch
registers such as H (high) and L (low) (Woo, 1969; Leben, 1973; Gandour, 1974; Anderson, 1978;
Duanmu, 1994), and associate the H and L directly to the peaks and valleys in the f0 tracings
(Pierrehumbert, 1980; Pierrehumbert & Beckman, 1988; Arvaniti, Ladd, & Mennen, 1998; Ladd et
al., 1999; Ladd, Mennen & Schepman, 2000). In particular, there has been a long-standing debate
over whether contours or registers are the most basic components of tone and intonation (Anderson,
1978; Pierrehumbert & Beckman, 1988; Duanmu, 1994).
One of the reasons why these issues are not easily settled is that observed f0 contours do not always
correspond directly to real functional units of intonation. In the study of segmental units in speech
such as consonants and vowels, there has been a consensus that the acoustic forms of these units in
connected speech are usually variants of their canonical forms. It follows that there probably is also a
discrepancy between the canonical and surface forms of intonational units. The difficulty with
intonation research is that the canonical forms are often difficult to isolate. Nevertheless, some
studies have considered ways to reconcile the discrepancy between surface and underlying forms of
intonation components. The superposition theories of intonation (Fujisaki 1988; Gårding 1979;
Grønnum 1995) regard surface f0 contours as the outcome of local pitch contours superimposed on a
global intonation curve. The approach taken by researchers at the Institute of Perception Research
in the Netherlands regards observed f0 contours as consisting of perceptually (and communicatively)
relevant straight lines which are complicated by micro-variations due to phonetic overspecification.
They believe that perceptually relevant contours can be discovered by replacing observed f0 contours
with stylized straight lines that are perceptually indistinguishable from the raw f0 contours. This
account does not specify, however, how exactly these straight lines, assuming they are the underlying
forms of intonation components, become complicated surface f0 contours through "phonetic
overspecification."
The autosegmental and metrical (AM) approach, as represented by Pierrehumbert (1980, 2000) and
Pierrehumbert and Beckman (1988), assumes that, underlyingly, English intonation consists of only
two level tones — H and L, and that surface f0 contours are linked to them through a set of elaborate
phonetic implementation rules. The essence of these rules is interpolation, pitch range
modification, and pitch height readjustment, which we will discuss in more detail in 1.8.
Fujisaki (Fujisaki, 1983, 1988, 1992) adopted an approach that attempts to link surface f0 directly to
muscle commands. He proposes that surface f0 contours result from the responses of a second order
linear system to two types of underlying commands, accent commands and phrase commands. The
accent commands have idealized stepwise waveforms and the phrase commands have idealized
impulse waveforms (Fujisaki, 1988: 348). The responses to these commands generate critically
damped oscillation of f0 which rises exponentially in the direction of the commands and then falls
back exponentially to the baseline after the termination of the commands. Both types of commands
therefore generate critically damped curves that rise and fall at various rates. The output f0 curve
generated by this model is the arithmetic sum of the logarithmic representations of the curves
generated by the two types of commands. This model thus specifies an explicit connection between
complicated f0 contours and underlying commands that are rather simple in form.
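The command-response scheme described above can be sketched in a few lines of code. The sketch below assumes the commonly cited form of the model: an impulse response for phrase commands, a capped step response for accent commands, and summation in the log-f0 domain. The specific response equations, the constants alpha, beta and gamma, and all command values in the usage line are illustrative assumptions and are not taken from the present text.

```python
import math

def phrase_response(t, alpha=3.0):
    """Critically damped impulse response to a phrase command (alpha is assumed)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Critically damped step response to an accent command, capped at gamma (assumed values)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """Return f0 in Hz at time t (s).

    phrase_cmds: list of (onset_s, magnitude) impulses
    accent_cmds: list of (onset_s, offset_s, amplitude) steps
    """
    log_f0 = math.log(base_hz)  # baseline in the log domain
    for onset, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        # A step that is switched on at onset and off at offset
        log_f0 += amplitude * (accent_response(t - onset) - accent_response(t - offset))
    return math.exp(log_f0)

# Hypothetical utterance: one phrase command at 0 s, one accent command from 0.3 s to 0.6 s
contour = [fujisaki_f0(i / 100.0, 100.0, [(0.0, 0.3)], [(0.3, 0.6, 0.3)]) for i in range(150)]
```

Because both responses decay once a command ends, the generated f0 drifts back to the baseline value, which is the "restoring" behavior discussed in the next paragraph.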
The Fujisaki model makes two important assumptions. The first is that there exists a constant
“restoring” force that is always in the opposite direction of both the accent and phrase commands
(Fujisaki, 1992). Due to this restoring force, f0 always goes back toward the baseline after the
termination of an accent or phrase command. In the model, the restoring force comes from the
elasticity of the vocal folds which act like a pair of passive springs being stretched by the
cricothyroid muscle (CT) during phonation. There is a challenge to this assumption, however.
According to Hollien (1960) and Hollien and Moore (1960), the vocal folds are actually longest
at rest rather than during phonation, and at the onset of voicing the vocal folds always shorten.
During phonation, vocal fold length does increase with fundamental frequency, but it never exceeds
its length during rest. It is therefore unlikely that the vocal folds would snap back as soon as the CT
stops contracting, causing f0 to drop automatically.1 Another key assumption of the Fujisaki model is
that surface f0 is directly linked to muscle commands without an intermediate level of organization.
In this way, f0 generation is not linked to or constrained by supralaryngeal structures such as the
syllable. As will be discussed next, such a link and constraint are critical to our understanding of f0 of
Mandarin tones, and possibly to f0 contours in English as well.2
Despite questions regarding its basic assumptions, the Fujisaki model demonstrates the possibility that
complex surface f0 contours may be generated by an interaction between simple but linguistically
driven underlying events and an articulatory system that implements these events. Assuming that an
interaction of this nature does occur and it does play a critical role in pitch contour generation,
understanding the properties of both the linguistic events and the articulatory system then becomes
the key to the understanding of how pitch contours work in speech. In recent years, a number of
findings seem to have made such understanding easier than before. These include the findings about
contextual tonal variations, f0-syllable alignment, maximum speed of pitch change, the realization of
focus, downstep and declination, and the realization of purportedly toneless elements. In the
following, we will briefly review these findings and discuss how they may guide our understanding of
the data collected in the present study.
1.1. Contextual f0 variations in tone languages
In languages like Mandarin, Thai and Vietnamese, a lexical tone is carried by a syllable.3 A tone-
syllable combination can be said in isolation, e.g., as a monosyllabic utterance, or just as a
monosyllabic word or morpheme in citation form. Due to this property, the f0 contours of isolated
tones have been well established over the years (e.g., Bai, 1934; Pike, 1948; Chao, 1956, 1968;
Abramson, 1962, 1976; Lin, 1965, 1988; Howie, 1976; Shih, 1988). In recent years, much attention
has been given to variation of f0 contours of lexical tones when they are said in connected speech
(Han & Kim, 1974; Gandour, Potisuk & Dechongkit 1994; Lin & Yan, 1991; Wu, 1982, 1984,
1988, 1990; Xu, 1993, 1994, 1997, 1999). When a tone is produced next to other tones, its f0
contour deviates from the citation form, sometimes extensively. Figure 1 (a-c) shows examples of
carryover and anticipatory effects on H, R and F in Mandarin. Each graph in Figure 1 displays mean
f0 contours of four five-syllable sentences in which only the tone of the second syllable varies across
H, R, L and F. As can be seen, the f0 contour of the initial portion of the third syllable varies
extensively with the tone of the second syllable. In fact, they each seem to be transitions from the
end of the previous tone to the most appropriate f0 contour for the tone of the third syllable: high-
level for H, rising for R, and falling for F. As a result, the most proper contour of a tone is best
approximated in the later portion of the third syllable, while the influence of the preceding tone is
the most salient in the early portion of the syllable. Similar effects have been reported for other tone
languages (Gandour et al. 1994 for Thai, and Li & Lee, 2002 for Cantonese).
Insert Figure 1 about here
In Figure 1 the f0 contours of the first syllable also seem to vary to some extent with the tone of the
second syllable. But the variations are much smaller in amplitude than the carryover variation just
mentioned. Furthermore, these anticipatory variations are mostly dissimilatory in the sense that f0 is
raised by any tone of the second syllable that contains a low value: R, L or F. This kind of
anticipatory effect has been found in a number of languages (Gandour et al., 1992 and Gandour et al.
1994 for Thai, Hyman 1993 for Engenni, Mankon, and Kirimi, Laniran 1992 for Yoruba, Laniran &
Gerfen 1997 for Igbo; and Xu 1993 for Mandarin). The underlying mechanism of this effect is still
unclear, although there have been some hypotheses (Gandour et al., 1992; Gandour et al., 1994; Xu
1993, 1997). Thus the realization of a tone seems to vary asymmetrically with the surrounding
tones. The variation with the preceding tone appears assimilatory whereas the variation with the
following tone dissimilatory.
1.2. Maximum speed of pitch change — The key source of what makes f0 contours
nonequivalent to underlying targets
As can be seen in Figure 1, the assimilatory effect of a tone upon the following tone is largely in the
form of a seemingly long transition between the ending f0 of the preceding tone and underlying onset
pitch of the following tone. This may suggest that there is an articulatory constraint on how fast
speakers can change pitch. But it is also possible that speakers deliberately make these transitions
long. It would thus be helpful to know how much of each transition is indeed directly due to the
constraint of maximum speed of pitch change. For this purpose, Xu and Sun (2002) assessed how fast
speakers can make pitch changes voluntarily. In the study, native speakers of Mandarin and English
produced alternate high and low pitches as rapidly as possible by imitating a number of fast synthetic
pitch alternation patterns. It was found that, for both English and Mandarin subjects, the maximum
speed of pitch change is positively related to the magnitude of pitch change, i.e., the larger the
magnitude, the faster the maximum speed of pitch change. The minimum time it takes to complete a
pitch change was likewise found to be positively related to the magnitude of the pitch change. The
linear equations for the speed and time of pitch change (for all subjects) as functions of pitch
change magnitude are shown in (1) to (4).
s = 10.8 + 5.6 d (1)
s = 8.9 + 6.2 d (2)
t = 89.6 + 8.7 d (3)
t = 100.4 + 5.8 d (4)
where s is the average maximum speed of pitch change in semitones per second (st/s), t is the amount
of time it takes (in ms) to complete the pitch change and d is the size of pitch change in semitones.
Xu & Sun (2002) further compared the mean maximum speed of pitch change computed with
equations (1)-(4) to the maximum speed of pitch change reported for several languages, including
Mandarin, English, and Dutch. The two kinds of speed were found to be largely comparable for all
these languages, provided that the speed of pitch change was really the fastest possible in each case.
This finding indicates that on many occasions, the fastest speed of pitch change is indeed approached
in speech. This in turn suggests that our understanding of f0 contours in speech should always take
this articulatory constraint into consideration. For example, according to (3), it would take at least
142 ms to complete a 6-semitone pitch rise. Applying this to speech, it means that in a syllable with
a duration of 180 ms, the greater half of the f0 contour in the syllable would have to be used for
completing the pitch rise from L to H even if speakers used their maximum speed of pitch change.
This would suggest that the long f0 transitions due to carryover influence are probably largely due to a
physical limitation that Mandarin speakers cannot overcome. Nor should English speakers be able to
avoid similar transitions when they change pitch from one level to another, since their maximum
speed of voluntary pitch change is essentially the same as that of Mandarin speakers (Xu & Sun,
2002).
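The arithmetic behind the 142 ms example can be checked with a short sketch of equations (1) to (4). The text confirms only that (3) applies to a pitch rise; treating (1) and (3) as the rising equations and (2) and (4) as the falling ones is an assumption made here for illustration, as are the function and parameter names.

```python
def max_speed_st_per_s(d, rising=True):
    """Average maximum speed of pitch change (st/s) for a change of d semitones.

    Equation (1) is assumed to describe rises and equation (2) falls.
    """
    return 10.8 + 5.6 * d if rising else 8.9 + 6.2 * d

def min_time_ms(d, rising=True):
    """Minimum time (ms) needed to complete a pitch change of d semitones.

    Equation (3) describes rises (per the text); (4) is assumed to describe falls.
    """
    return 89.6 + 8.7 * d if rising else 100.4 + 5.8 * d

rise_ms = min_time_ms(6)           # 89.6 + 8.7 * 6 = 141.8 ms, i.e. about 142 ms
share_of_syllable = rise_ms / 180  # about 0.79 of a 180 ms syllable
```

Under these equations a 6-semitone rise occupies roughly 79% of a 180 ms syllable, which is the sense in which "the greater half" of the syllable is taken up by the transition.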
1.3. Pitch targets
Putting together the findings about contextual tonal variations and maximum speed of pitch change,
it becomes evident that observed f0 events in speech cannot be the underlying functional units per se.
Rather, they are more likely products of speakers’ efforts to implement some kind of underlying
pitch targets under various articulatory constraints. This view is summarized in Xu and Wang (2001)
as the pitch target implementation model of tone realization. According to this model, observed f0
contours are generated by interactions between underlying pitch targets and articulatory constraints.
The underlying targets can be either static or dynamic, as illustrated in Figure 2. The vertical lines in
Figure 2 indicate the onset and offset of two adjacent syllables. The dashed lines represent two
adjacent pitch targets: a dynamic [rise] and a static [low]. For Mandarin, these targets are assumed to
be associated with the R and L tones which are carried by the two syllables in the figure, respectively.
The solid curve represents the surface f0 contour, which is assumed to be the result of implementing
the pitch targets under various articulatory constraints, including the maximum speed of pitch
change. Due to the combined pressure to realize them both as rapidly as possible and as accurately as
possible, these targets are approached asymptotically, as is indicated by the shape of the solid curve
corresponding to either syllable 1 or syllable 2 in Figure 2.
Insert Figure 2 about here
Two other likely articulatory constraints are also incorporated in the model as illustrated in Figure 2.
First, although, at the abstract phonetic level, each target is assigned to a syllable without stringent
alignment requirement, its implementation nevertheless strictly coincides with the entire syllable,
i.e., starting at the syllable onset and ending at the syllable offset. This is due to a likely constraint
on synchronization of laryngeal and supralaryngeal movements (See 1.6. for more detailed
discussion). Second, due to inertia and friction, there is an acceleration period before the target-
approaching f0 movement reaches full speed. This is seen in the convex-up shape at the very
beginning of the solid curve in both syllables in Figure 2. The convex shape is more prominent in
syllable 2 because the inertia to be overcome has a positive velocity as a result of implementing the
target [rise] in the first syllable (Xu & Wang, 2001).
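One way to make the asymptotic approach and the carryover of velocity concrete is the sketch below. It is not the formulation of Xu and Wang (2001), which is described here only qualitatively; it assumes, purely for illustration, that f0 approaches a linear target (slope * t + intercept) as a critically damped second-order system, with the final f0 and velocity of one syllable passed on as the initial state of the next. The rate constant, the use of Hz, and all target values are hypothetical.

```python
import math

def approach_target(f0_start, vel_start, slope, intercept, rate, duration, steps=50):
    """Sample f0 over one syllable while approaching the target slope*t + intercept.

    Returns (samples, final_f0, final_velocity). A static target has slope 0.
    """
    c1 = f0_start - intercept              # initial distance from the target
    c2 = vel_start - slope + rate * c1     # chosen so the initial velocity is matched

    def f0_at(t):
        return slope * t + intercept + (c1 + c2 * t) * math.exp(-rate * t)

    def velocity_at(t):
        return slope + (c2 - rate * (c1 + c2 * t)) * math.exp(-rate * t)

    samples = [f0_at(duration * i / steps) for i in range(steps + 1)]
    return samples, f0_at(duration), velocity_at(duration)

# Syllable 1: dynamic [rise] from 90 to 110 Hz over 0.2 s, starting at 100 Hz with no velocity
syl1, f0_end, vel_end = approach_target(100.0, 0.0, slope=100.0, intercept=90.0,
                                        rate=40.0, duration=0.2)
# Syllable 2: static [low] at 85 Hz, inheriting the final state of syllable 1
syl2, _, _ = approach_target(f0_end, vel_end, slope=0.0, intercept=85.0,
                             rate=40.0, duration=0.2)
```

In this toy example the rising velocity inherited from the [rise] target makes f0 continue upward briefly at the start of syllable 2 before it turns toward [low], which corresponds to the more prominent convex shape in the second syllable.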
Note that in this model, unlike the Fujisaki model discussed earlier, there are no automatic forces
that return f0 to a neutral value. This is because empirical studies such as Gandour et al. (1992, 1994),
Xu (1993, 1997, 1998, 1999) and Li and Lee (2002) have found that the most appropriate f0 contour
of a tone is always best approximated in the final portion of a syllable, and that the subsequent f0
contour in the following syllable is always going toward the next tonal target rather than toward a
common neutral value, as can be seen in Figure 1.
1.4. Focus
Focus, i.e., discourse/pragmatics motivated emphasis, is also known as focal prominence, contrastive
stress, emphatic stress, sentence-level stress, etc. The acoustic realization of focus has been
investigated by many studies (Bruce 1977; Bruce & Touati 1992; Caspers & van Heuven 1993;
Cooper, Eady & Mueller 1985; D'Imperio, 2001; Eady & Cooper 1986; Eady et al. 1986; Gårding
1987; Jin 1996; Liberman & Pierrehumbert 1984; Pierrehumbert 1980; Prieto, van Santen &
Hirschberg 1995; Shih 1988). The general consensus has been that focus is conveyed mainly through
variations in f0. This may potentially be a problem for tone languages like Mandarin, because tones
are also conveyed mostly through f0. However, as found by Jin (1996) and Xu (1999), focus and
tones are realized concurrently in Mandarin by varying different aspects of f0 contours. In general,
tone identities are implemented as local pitch targets, while focus is implemented as regional pitch
range variations. As can be seen in Figure 3, the pitch range directly under focus is expanded; and the
pitch range after focus is suppressed (lowered and compressed). Furthermore, as can be also seen in
Figure 3, the pitch range before focus does not seem to deviate from the neutral-focus condition.
Insert Figure 3 about here
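The three-way pitch range pattern described above can be illustrated with a toy sketch. It treats focus as a rescaling of per-syllable pitch targets expressed in semitones relative to a reference level: targets before the focus are left alone, the focused target is expanded, and later targets are compressed and lowered. The scaling factors, the lowering amount, and the restriction of focus to a single syllable are all hypothetical simplifications, not values from the data.

```python
def apply_focus(targets_st, focus_index, expand=1.5, compress=0.5, lower_st=2.0):
    """Rescale pitch targets (semitones re. a reference) around a focused syllable.

    focus_index=None stands for the neutral-focus condition.
    """
    adjusted = []
    for i, st in enumerate(targets_st):
        if focus_index is None or i < focus_index:
            adjusted.append(st)                        # pre-focus: range unchanged
        elif i == focus_index:
            adjusted.append(st * expand)               # on-focus: range expanded
        else:
            adjusted.append(st * compress - lower_st)  # post-focus: compressed and lowered
    return adjusted

# Hypothetical alternating high/low targets with focus on the third syllable
focused = apply_focus([4.0, -4.0, 4.0, -4.0, 4.0], focus_index=2)
```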
Though there has not been a consensus on whether languages like English, too, implement focus with
three distinct pitch ranges, existing data suggest that this may be the case. For example,
Pierrehumbert (1980) examined the relative f0 heights of an early pitch accent and a later one in an
utterance as a function of focus location. Her data suggest that when there is an early focus in the
utterance, the f0 range in the later portion of the utterance is reduced, whereas the earlier f0 contour
is only slightly lowered when the focus is on a later pitch accent. In a series of studies by Cooper and
Eady and their colleagues (Cooper et al., 1985; Eady & Cooper, 1986; Eady et al., 1986), it was
found that the effect of a narrow focus4 in a declarative English sentence is to raise the f0 of the
focused word and to lower the f0 of the later words in the sentence. In contrast to the lowered f0 of
the post-focus words, however, f0 of the pre-focus words was found to remain much the same as in a
focus-neutral sentence. Gårding (1987) reported a similar asymmetry of f0 variation around focus.
1.5. Downstep and declination
Downstep refers to the phenomenon that in a tone string of H L H, the second H is lower in f0 than
the first H. It was first reported for African tone languages (Stewart, 1965, 1983; Meeussen, 1970;
Hyman, 1973). It was also reported for non-tone languages (e.g. Pierrehumbert, 1980 for English;
Poser, 1984 and Pierrehumbert & Beckman, 1988 for Japanese; and Prieto, Shih, & Nibert, 1996 for
Spanish). Declination refers to the phenomenon that the overall f0 level as well as the f0 peaks and
valleys becomes gradually lower over the course of an utterance. It has been found in both tone languages
and non-tone languages. For non-tone languages, the phenomenon was first reported by Cohen and
't Hart et al. (1967). For tone languages, it has been reported as downdrift, as opposed to downstep which is
more local (Hombert, 1974; Laniran & Gerfen, 1997). There have been various accounts of
declination. Some accounts attribute the effect to physiological factors, such as reduction of
subglottal pressure (Lieberman & Tseng, 1980). Other accounts attribute the effect to meaningful
linguistic structures. Liberman and Pierrehumbert (1984), for example, point out that many of the
physiological accounts were posited without analysis of the tonal components of the utterances. Xu
(1999) shows how observed downstep and declination can be decomposed into different contributing
factors when both tone and focus are systematically controlled. Through detailed analyses,
contributions of independent mechanisms can be identified. As shown in Figure 4, downstep seems to
stem from two mechanisms: anticipatory raising and carryover lowering. Both effects are exerted by
a L tone, which raises the f0 of the preceding H and lowers the f0 of the following H. The two effects
combined generate a negative tilt of the f0 surrounding the L. This effect can occur repeatedly when
there are more L tones intervening between the H tones. Such repeated application of anticipatory raising
and carryover lowering would thus generate a gradual f0 descent over the course of the entire
utterance. But downstep is only one of the sources of declination. Focus, with its characteristic on-
focus pitch range expansion and post-focus pitch range suppression, generates an additional
downtrend. This can be seen in Figure 4b where the rather steep downtrend seems to be due to both focus
and downstep. There is also another known factor that can potentially generate an even greater
downtrend than downstep and focus. As shown by Lehiste (1975) and Umeda (1982), the
introduction of a new topic at the beginning of a paragraph may introduce an initial f0 peak almost
one octave higher than later f0 peaks in an utterance.
Insert Figure 4 about here
1.6. f0-syllable alignment
A number of recent studies have reported that certain f0 events such as peaks and valleys have a
relatively stable alignment with the onset or offset of the syllable. These findings come from two
major sources, research on tone languages (Kim 1999; Xu 1998, 1999, 2001a) and research on non-
tone languages (Arvaniti, Ladd and Mennen 1998; Caspers & van Heuven 1993; Ladd, Mennen and
Schepman 2000; Prieto et al., 1995). For non-tone languages, it has been found that f0 peaks and
valleys are aligned with both the onset and offset of a syllable carrying a pitch accent. Caspers and
van Heuven (1993) find that the onset of an “accent-lending” f0 rise is always aligned with the
syllable onset. Arvaniti et al. (1998) report that in Greek an f0 maximum “is very precisely aligned
just after the beginning of the first postaccentual vowel” (p. 23). Ladd et al. (1999) find that in
English prenuclear accents, the f0 peak occurs around 40 ms after the offset of the stressed syllable at
normal speech rate. Ladd et al. (2000) observe that in Dutch, the rising prenuclear pitch accent has
two different alignment patterns for the phonologically long and short vowels. When the vowel in
the accented syllable is phonologically long, the f0 peak usually occurs at the end of the vowel. When
the vowel in the accented syllable is phonologically short, however, the f0 peak usually occurs in the
following consonant. More interestingly, when the accented syllable contains the vowel /i/ which is
phonologically long but phonetically similar in duration to the short vowel /I/, the f0 peak also
occurs in the following consonant, though the location of the peak is still significantly earlier than
that with /I/. This seems to be evidence that f0 contour alignment is determined both by phonological
vowel length and by articulatory constraint on how fast pitch can be changed.
For tone languages, earlier reports of the experimental results have put much emphasis on the finding
that certain f0 peaks and valleys are consistently aligned with the syllable offset (Kim 1999; Xu
1998, 1999, 2001a). For example, Kim (1999) reports that in Chichewa f0 peaks occur consistently
right after the offset of the H-bearing syllable if the syllable is pre-penult. Xu (1998, 1999, 2001a)
reports that in Mandarin the f0 peak associated with R and f0 valley associated with F remain close to
syllable offset, and that the f0 peak associated with H and f0 valley associated with L generally occur
before syllable offset but also remain close to syllable offset. In contrast, certain earlier turning
points, e.g., f0 valley in R and f0 peak in F, occur near the center of the syllable, as can be seen in
(b)-(d) in Figure 1. This emphasis on the f0 alignment with the syllable offset may have
overshadowed another important aspect of the same set of findings. That is, there is also strong
evidence that the onset of the movement toward each pitch target coincides with the onset of the
syllable. In Figure 1, for example, regardless of the tone of the second syllable in each graph, the
movements toward the high-level, rising and falling contours appropriate for H, R and F,
respectively, always start from the onset of the third syllable. Furthermore, in Figure 1c a valley
consistently occurs around the boundary between syllables 3 and 4, and in Figure 1d a peak
consistently occurs soon after the boundary between syllables 3 and 4. Given the likely underlying
targets of the adjacent tones in both cases, these turning points seem to be where the implementation
of the tone of syllable 3 ends and that of syllable 4 begins. Similar evidence has also been seen in
Yoruba (cf. discussion in Xu, 2002).
The consistent alignment of f0 events with segmental elements of the syllable has been interpreted in
different ways. On the one hand, it has been interpreted as evidence that these f0 events are
deliberately targeted at specific locations in the syllable (D’Imperio, 2002; Ladd et al., 1999; Ladd et
al., 2000; Ladd & Schepman, 2003). On the other hand, Xu and Wang (2001) have argued that these
patterns should be interpreted as evidence that the underlying pitch targets have to be synchronously
implemented with the syllable, presumably due to the biomechanical constraint that concurrent motor
movements have to be fully synchronized, especially when they continually reoccur at high speed
(Kelso, 1984; Kelso et al. 1981; Kelso, Southard & Goodman, 1979; Schmidt, Carello & Turvey,
1990). Note that the first interpretation is heavily dependent on the assumption that speakers have
the freedom to align the f0 turning points anywhere they want. Based on accumulating evidence, as
argued in Xu (2002), speakers do not have such freedom. Therefore, it is unlikely that f0 turning
points are the properties of the intonational components themselves. Rather, they are only evidence
for the properties of the underlying components.
1.7. Neutral tone
In Mandarin, besides the four full lexical tones, there is also a fifth tone often known as the neutral
tone. This tone is similar to the unstressed syllable in English in terms of pitch specification because
it is generally believed to be toneless (Chao, 1968; Yip, 2002). Its f0 is believed to be totally
dependent on the tonal context, and due specifically to either spreading from the preceding tone or
interpolation between the preceding and the following tones (Chao, 1968; Shih, 1988; Yip, 1990). A
recent study, however, found that neither spreading nor interpolation is likely to be the mechanism
responsible for the f0 contours of the neutral tone (Chen & Xu, 2002). Figure 5 shows f0 contours of the
neutral tone as compared to full tones in similar tonal contexts. In Figure 5a, the F tone in syllable 2
immediately follows four different tones in syllable 1. In Figure 5b, three neutral tones occur before
the F tone. As can be seen in Figure 5b, the f0 of the first neutral tone indeed varies substantially
with the preceding tone, but so does the F tone in Figure 5a. What is different is that, whereas the f0
contours of the F tone in Figure 5a fully converge in the final portion of the syllable, the contours
remain well separated by the end of the first neutral-tone syllable, and they do not fully converge
even by the end of the third neutral-tone syllable.5 Chen and Xu (2002) conclude that these patterns
demonstrate that the neutral tone is not totally targetless. Rather, it seems to be associated with a
static target [mid] (or [mid-low]), judging from the fact that its f0 contours converge toward a value
lower than the high level of the F tone but higher than the low level of the L tone. What makes the
neutral tone different from all other lexical tones is that its target seems to be implemented with a
rather weak articulatory effort, as is evident from the much slower convergence than in a full tone.
Despite the weak effort, nonetheless, the f0 of the neutral tone does not seem to be affected by the
following full tone. In fact, as can be also seen in Figure 5b, it is the offset f0 of the neutral tone that
seems to determine the onset f0 of the following full tone.
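The link between articulatory effort and speed of convergence can be illustrated numerically. The sketch below assumes, as a simplification not found in Chen and Xu (2002), that the difference among contours arising from different preceding tones shrinks exponentially within a syllable at a rate that stands in for articulatory effort. The rates, syllable duration, and initial spread are hypothetical.

```python
import math

def residual_spread(onset_spread_st, rate, duration_s):
    """Spread (in semitones) remaining among contours after duration_s seconds."""
    return onset_spread_st * math.exp(-rate * duration_s)

SYLLABLE_S = 0.15   # hypothetical syllable duration
ONSET_SPREAD = 6.0  # hypothetical spread at syllable onset, in semitones

strong = residual_spread(ONSET_SPREAD, rate=30.0, duration_s=SYLLABLE_S)          # full tone
weak_one = residual_spread(ONSET_SPREAD, rate=5.0, duration_s=SYLLABLE_S)         # one neutral tone
weak_three = residual_spread(ONSET_SPREAD, rate=5.0, duration_s=3 * SYLLABLE_S)   # three neutral tones
```

With these values the spread all but vanishes within one strongly implemented syllable, whereas a weakly implemented target leaves a visible spread even after three syllables, mirroring the qualitative pattern reported for Figure 5.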
In Figure 5b we can also see that the f0 peak occurs after the end of the H-tone syllable. This is in
contrast with earlier findings that the f0 peak associated with the H tone rarely occurs after the
syllable offset when followed by a full-tone syllable (Xu, 1999, 2001a). This "peak delay" is
understood as resulting from the neutral tone's weak ability to reverse the final f0 movement in the
preceding syllable due to its weak articulatory effort.
Given the seeming similarity between the Mandarin neutral tone and unstressed syllables in English, it
is conceivable that what has been found about the Mandarin neutral tone is applicable also to English.
Insert Figure 5 about here
1.8. The case of English
Currently, the most widely accepted phonological framework of American English intonation is the
Pierrehumbert model (Pierrehumbert, 1980; Pierrehumbert & Beckman, 1988), which is also known
as the autosegmental and metrical (AM) model (Ladd, 1996). The Pierrehumbert model assumes that,
underlyingly, English intonation consists of only two level tones — H and L. A string of H and L
tones is organized into pitch accents, which are strung together linearly to form intermediate
phrases, which are then organized into intonational phrases. An intermediate phrase is marked by a
phrase accent at its edges (H- or L-), and an intonational phrase is marked by boundary tones at its
right edge (H% or L%). Pitch accents, phrase accents, and boundary tones are all linearly ordered and
can be combined into various mono-tonal and bi-tonal combinations: H*, L*, H*+L, H+L*, L*+H,
L+H*.
In each pitch accent the "starred" tone is assumed to be aligned with the stressed syllable while the
non-starred tone(s) with the unstressed syllable(s). The model further assumes that non-accented
words do not carry tones, and their f0 comes from interpolation between adjacent accents. In fact, all
the surface f0 contours are assumed to result from phonetic interpolation of tones which are the f0
turning points such as peaks and valleys. The interpolation is either straight-line or curved. The
curved interpolation is the so-called "sagging" interpolation, which makes the f0 of the unaccented
syllable(s) between two pitch accents "sag" like a rope hung between two trees (Pierrehumbert, 1980,
1981). The Pierrehumbert model views intonation as strictly linear in two senses. First, all global
shapes of f0 are the results of sequentially ordered local f0 registers. Second, with the exception of
non-categorical factors such as overall effort and emotional state, there is no temporal overlap of
tonal components. Thus at any given time interval, there can be one and only one tone. The only
exception is that both the intermediate phrase boundary tone and the intonational phrase boundary
tone, as developed later in the theory, may influence the realization of all the tones within the same
phrase (Pierrehumbert & Beckman, 1988). Finally, the Pierrehumbert model does not reserve any
special status for nuclear accent other than referring to it as the last pitch accent in an intonational
phrase.
The Pierrehumbert model of English intonation is similar to our current understanding of Mandarin
tone and intonation (Xu, 2001b; Xu & Wang, 2001) in that both recognize that underlying tonal and
intonational units are not equivalent to surface f0 contours. The two differ from each other,
however, in terms of how the underlying units are linked to surface f0 contours. The Pierrehumbert
model assumes that tones correspond directly to the extreme f0 points, i.e., peaks and valleys, and
that the rest of the f0 contours come from interpolation between the extreme points. Our
understanding of Mandarin tone and intonation, on the other hand, is that the underlying tonal and
intonational units are linked to f0 contours through articulatory approximation of simple, linear
pitch targets at linguistically specified pitch ranges with linguistically specified amounts of effort, as
has been discussed in 1.3 and 1.7. Recent findings about the similarity between English and Mandarin
speakers in terms of maximum speed of pitch change (Xu & Sun, 2002) suggest that this
understanding is potentially applicable to English as well. First, since English speakers are also bound
by the same articulatory constraints that Mandarin speakers are subject to, surface f0 contours of
English, including the turning points, should also be treated only as evidence for the underlying pitch
targets rather than as the targets themselves. Second, our recent findings about the acoustic
manifestation of tone and focus in Mandarin demonstrate that it is possible for multiple categorical
components of tone and intonation to co-occur at the same location in a sentence: a lexical tone is
not eradicated whether it is on-focus, pre-focus or post-focus, while focus itself is also effectively
conveyed (Xu, 1999). Since there are apparently multiple layers of information that need to be
conveyed through intonation in English, it is possible that in English, too, different intonational
components can occur concurrently, i.e., overlapping with one another in time. There are at least
three kinds of prominence that are conveyed mainly or partially through f0 in English, namely,
lexical stress, focus and pitch accent. Lexical stress is the relative prominence of individual syllables
in a word, which is lexically specified. Focus, as discussed in 1.4, is discourse/pragmatics motivated
emphasis, whose occurrence is required by the information flow of the dialogue or monologue. Pitch
accents, which occur on certain words in an utterance and make them more prominent than other
words, have often been equated with focus (and focus has often been referred to as the nuclear accent;
cf. Ladd, 1996 for detailed discussion). Ladd (1996) shows, however, at least impressionistically, that
pitch accents do not always coincide with focus. Hirschberg (1993) demonstrates that the most
important predictor of pitch accents (although both nuclear and pre-nuclear accents are included) is
part of speech, which can predict three quarters of the human-labeled pitch accents. Part of speech is
apparently different from discourse/pragmatics motivated focus, as are most of the other factors
that Hirschberg (1993) found to further improve the prediction of pitch accents. Thus there is a need
to separate pitch accents from focus, and a need to find out whether and how the two are
differentially manifested through f0 contours.
1.9. Goal of the study
The foregoing discussion leads us to two critical questions about English intonation. First, what are
the underlying forms of pitch targets associated with local intonational components? Second, can
different types of intonational components co-occur in time without eradicating each other? And if
they can, how do they each manifest themselves effectively in terms of f0? The present study is
designed to address these questions at a rather rudimentary level. We will examine short
declarative sentences said with narrow focus at various locations in order to address the following
specific questions.
a) What are the pitch targets associated with local prominences in a declarative sentence: static
[high], or dynamic [rise] or [fall]?
b) Is focus realized with pitch specification only for the accented/stressed syllable or with pitch
specifications both for the accented/stressed syllable and for post-focus syllables (including the unstressed
syllable after the stressed syllable of the focused word)?
c) Do post-focus words have no pitch targets of their own and are thus implemented with only a
flat low f0 contour, or do they still have their own pitch targets, which are implemented with reduced
pitch range?
d) Do stressed syllables lose their original accents when under focus, or do they retain their
accents but with changed pitch range?
e) Do syllables between pitch accents carry any pitch targets, or is f0 only interpolated through
these syllables?
2. Method
2.1. Stimuli
The stimuli are short declarative sentences. To make extensive f0 alignment analysis possible, we
needed words with sonorant (preferably nasal) onsets and, if possible, no coda
consonants. The target sentences used are in the form of “Lee may know my niece.” The italicized
words vary in word length, stress pattern, phonological length of stressed syllable and focus status.
Word length varies from monosyllabic to trisyllabic. Stress pattern varies between word-final and
non-final. Phonological length of stressed syllable is either long or short. Focus status varies from
on-focus to pre-focus and/or post-focus, as focus location varies from sentence-initial to sentence-
medial to sentence-final. The following are the compositions of the stimulus sentences. Two words,
‘may’ and ‘my’, remain unchanged in all sentences, and they are usually unaccented (unless in special
contexts, which are not included in the present design). There are three sentence groups composed
for examining f0 contours at three locations in the sentence: beginning, middle, and end. In each
sentence group, the alternative words in the same location rotate to form different sentences.
Sentences in each group were produced in two focus conditions: no narrow focus, and focus on the
underscored word.
1. Lee / Nina / Lamar / Emily / Ramona may know my niece: 5 (words) × 2 (foci) × 7 (repetitions) = 70
2. Lee may lure / mimic / minimize my niece: 3 (words) × 2 (foci) × 7 (repetitions) = 42
3. Lee may know my niece / nanny / mummy: 3 (words) × 2 (foci) × 7 (repetitions) – 7 = 35 (see note 6)
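The token counts in the three sentence groups can be checked with a short enumeration. The sketch below is illustrative only: the names GROUPS, FOCI and tokens_per_group are hypothetical, and the subtraction of 7 tokens in group 3 is copied from the count given above without restating its rationale.

```python
from itertools import product

# Sentence groups as listed in 2.1; "{}" marks the slot whose word rotates.
GROUPS = {
    1: ("{} may know my niece", ["Lee", "Nina", "Lamar", "Emily", "Ramona"]),
    2: ("Lee may {} my niece", ["lure", "mimic", "minimize"]),
    3: ("Lee may know my {}", ["niece", "nanny", "mummy"]),
}
FOCI = ("none", "narrow")   # no narrow focus vs. focus on the rotating word
REPETITIONS = 7


def tokens_per_group(group3_correction=7):
    """Count words x foci x repetitions for each group (per speaking rate)."""
    counts = {}
    for group, (_frame, words) in GROUPS.items():
        trials = list(product(words, FOCI, range(REPETITIONS)))
        counts[group] = len(trials)
    # The text subtracts 7 tokens from group 3; we simply mirror that here.
    counts[3] -= group3_correction
    return counts
```

Under these assumptions the three groups sum to 147 tokens per subject per speaking rate.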
Focus is controlled by having subjects say the target sentences as answers to prompt questions that
ask about specific pieces of information available in the target sentences. This method has been used
successfully in previous studies (Cooper et al., 1985; Xu, 1999). The prompt questions are shown
below together with illustration of focus locations in exemplar target sentences.
Prompt:                             Target:
Who may know your niece?            Lee may know my niece.
What may Lee do to your niece?      Lee may lure my niece.
Who may Lee know?                   Lee may know my niece.
What did you say?                   Lee may know my niece.
The overall duration of these sentences was also manipulated by having subjects say the same
sentence at two different speaking rates: normal and fast. (A pilot test found that some speakers had
difficulty maintaining focus consistently at a slow speaking rate. So, only two speaking rates were
used.) This is to elicit a wide range of duration variation in order to make f0 alignment analysis more
reliable.
2.2. Subjects
Eight native speakers of American English, aged 20-35, participated as subjects. Four of them were
females, and the others males. They were recruited from the Northwestern University campus and
were paid for their participation. None of them reported having any speech disorders. They all spoke
General American English without noticeable accents.
2.3. Recording Procedure
Recording was conducted in a sound-treated booth at the Speech Acoustics Laboratory in the
Department of Communication Sciences and Disorders at Northwestern University. The subject was
seated comfortably in front of a computer monitor in the booth. The microphone was placed by the
side of the monitor, approximately 1 foot away from the subject's lips. In each trial, the subject
pressed the “Next” button displayed on the screen and the target sentence was displayed on the
screen. At the same time, a prompt question was played through a loudspeaker. The subject then read
aloud the displayed sentence as a response to the prompt question. The prompt questions were
recorded at two speaking rates, normal and fast. Subjects were instructed to say the target sentence at
a speaking rate similar to that of the prompt question. They were also instructed not to pause in the
middle of a sentence. In case a mistake was made as judged by the experimenter, the subject was asked
to repeat the sentence. The sentences were presented in random order, and a different order was used
for each subject. Before the start of the real trials, the subject went through a number of practice
trials until he/she was familiar with the procedure.
2.4. f0 extraction
The acoustic analysis procedure was similar to those used in Xu (1997, 1998, 1999, 2001a). First the
digitized signals were converted to a format readable by programs in the ESPS/waves+ signal
processing software package (Entropic Inc.). Then individual target sentences were extracted and
saved as separate ESPS signal files. The program epochs in the ESPS package was then run to mark
every vocal cycle in the target words. After that, the marked signals were labeled manually in the
ESPS xwaves program for the onset and offset of each segment (both consonants and vowels) of the
target words using the xlabel program. Manual editing was performed to correct spurious vocal pulse
labeling by the epochs program (such as double-marking or vocal-cycle skipping).
The vocal pulse markings and segment labels for each utterance were saved by the xlabel program in
a text file. Those text files were then processed by a set of custom-written computer programs.
These programs first converted the duration of vocal cycles into f0 values, and then smoothed the
resulting f0 curve using a trimming algorithm that eliminated abrupt bumps and sharp edges (cf. Xu,
1999 for details).
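The conversion from vocal cycle markings to f0 can be sketched as follows. This is a minimal illustration, not the custom programs used in the study: the function names are hypothetical, and the bump criterion (a point that departs from both neighbours in the same direction by more than max_jump Hz) is an assumed stand-in for the trimming algorithm described in Xu (1999).

```python
def cycles_to_f0(pulse_times):
    """Turn vocal pulse times (s) into (time, f0 in Hz) pairs, one per cycle."""
    points = []
    for start, end in zip(pulse_times, pulse_times[1:]):
        period = end - start
        if period > 0:
            # f0 is the reciprocal of the cycle duration, placed at mid-cycle.
            points.append(((start + end) / 2.0, 1.0 / period))
    return points


def trim_bumps(f0, max_jump=10.0):
    """Replace isolated one-point bumps with the mean of their neighbours.

    max_jump (Hz) is an assumed threshold, not a value from the paper.
    """
    out = list(f0)
    for i in range(1, len(out) - 1):
        prev_, here, next_ = out[i - 1], out[i], out[i + 1]
        spike_up = here - prev_ > max_jump and here - next_ > max_jump
        spike_down = prev_ - here > max_jump and next_ - here > max_jump
        if spike_up or spike_down:
            out[i] = (prev_ + next_) / 2.0
    return out
```

A single pass of this kind removes one-point spikes such as those left by double-marking; a gradual rise or fall is left untouched because it never exceeds both neighbours at once.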
3. Analysis and Results
Recognizing that acoustic patterns do not resemble underlying phonetic targets directly, as discussed
in the Introduction, our goal is not to find direct "acoustic correlates" of either pitch targets or focus.
Rather, the goal is to find acoustic evidence for the underlying pitch targets and pitch range
specifications that are associated with pitch accents and focus. The search for the evidence will
be guided by the following rationale, which is based mostly on what we have learned about articulatory
constraints on f0 production as discussed in the Introduction:
(1) It takes time to change pitch articulatorily. Thus a significant portion of observed f0 contours
must be transitions toward the intended underlying targets rather than being the targets themselves.
(2) Due to rigid coordination of laryngeal and supralaryngeal movements, there is little room for
speakers to make micro-adjustments of f0 alignment. Moreover, based on recent findings (Ladd et al.,
1999; Ladd & Schepman, 2003), we take it as our working assumption that in English the syllable is
also the unit of pitch target alignment, as in Mandarin, unless proven otherwise.
(3) Based on (1) and (2), the f0 contour in the early portion of a syllable is understood as mainly a
transition toward the pitch target associated with the syllable, whereas the later portion of the f0 in
the syllable will be viewed as more directly reflecting the underlying target, especially if the syllable is
sufficiently long.
(4) It also takes effort to change pitch. Thus less effort should lead to slower pitch changes. The
reverse should also be true. That is, slower pitch movements during transitions should be an indication of
weaker efforts rather than total absence of underlying targets.
Guided by these rationales, our data analysis attempts to find answers to the questions listed at the
end of the Introduction (1.9). Table 1 lists these questions again together with the specific f0
events we were looking for in order to answer the questions.
Insert Table 1 about here
The following analysis consists of two phases. In phase I we perform visual inspection of the f0
contours. In phase II, we perform various quantitative analyses.
3.1. Phase I — Visual Inspection of f0 Contours
The first step in visual inspection was to check for outliers. The purpose was to exclude sentences
that were said with apparently wrong focus. f0 contours of the 7 repetitions of each sentence with the
same focus and speaking rate were displayed as illustrated in Figure 6, which displays the f0 contours
of the sentence “Nina may know my niece” with no narrow focus, produced by all subjects at
“normal” rate. These curves are displayed using normalized time, i.e., with the same number of
points taken from each syllable at equal proportional interval, e.g., 0, 1/20, 2/20, 3/20, …, 20/20. As
can be seen, displayed in this way, the f0 curves by each subject, except subject 2 (whose case will be
discussed later), are highly consistent across the seven repetitions. When an inconsistency was
noticed, the following criteria were used to determine if an outlier was involved and if it should be
excluded.
A repetition is excluded if and only if
a) it is obviously different from the rest of the repetitions, and
b) it has the wrong focus as judged auditorily by the authors
A repetition is not excluded if
a) it differs from other repetitions only in pitch range but not in perceived focus
Insert Figure 6 about here
Altogether, four repetitions from subject 2 (1.4% of the total, all from different conditions) and
one from subject 4 (0.3% of the total) were excluded.
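The time normalization used for the displays above (the same number of points taken from each syllable at equal proportional intervals) can be sketched as below. The function names and the use of linear interpolation between measured f0 points are assumptions made for illustration; the paper does not state how values between measured points were obtained.

```python
def interpolate(times, values, t):
    """Linearly interpolate values at time t; clamp outside the sampled range."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            v0, v1 = values[i], values[i + 1]
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def time_normalize(times, f0, onset, offset, n_intervals=20):
    """Sample f0 at onset + k/n of the syllable duration, for k = 0..n."""
    samples = []
    for k in range(n_intervals + 1):
        t = onset + (offset - onset) * k / n_intervals
        samples.append(interpolate(times, f0, t))
    return samples
```

With n_intervals=20 each syllable contributes 21 points (0, 1/20, ..., 20/20), so repetitions with different durations can be overlaid point by point.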
After excluding the outliers, for each subject, the repetitions of each sentence at each speaking rate
were averaged to obtain a mean f0 curve. Then the mean duration of each syllable across the
repetitions was computed. This mean duration was used in displaying the f0 contours of each syllable
in the sentences in the same focus condition. In this way we could compare the tonal contours of
different sentences without losing sight of the actual duration of each syllable. Figure 7 displays mean
f0 curves of all sentences produced at normal rate by all subjects except subject 2. The f0 curves of subject
2 were not included in the mean f0 curves because of their apparent inconsistencies with those of
other subjects. The open squares, circles and diamonds on the f0 curves indicate syllable boundaries.
For syllables with initial sonorants, the boundaries are set at the point where the spectral pattern
makes an abrupt shift into a typical nasal or lateral pattern (cf. Xu, 1999 for a more detailed
description of the labeling procedure). For syllables with stops and fricatives, the boundaries are set at
the onset of closure or frication.
Insert Figure 7 about here
Through visual inspection, we made a number of direct observations on the f0 curves, most of which
are visible in Figure 7, but some, especially the individual differences, are not. First, we noticed fairly
consistent patterns in the height of f0 peaks of focused words as compared to the neutral focus
sentence:
1. The f0 peak of a word is consistently higher under a narrow focus than in the neutral-
focus sentence.
2. The f0 peaks of all words after a narrow focus are lower than in the neutral-focus
sentence.
3. The f0 peaks of words before a narrow focus are lower than in the neutral-focus sentence
for some subjects (mostly females) but not for others (mostly males).
Second, we observed the following patterns and trends in terms of the location of f0 peaks in and
around focused words:
4. In all accented syllables, f0 starts to rise near the beginning of the syllable.
5. If the lexical stress is word final (Lee, Lamar or lure), the f0 peak usually occurs within
but near the end of the stressed syllable.
6. If the lexical stress is not word final (Nina, Emily, Ramona, mimic, minimize, nanny or
mummy), the peak mostly occurs in the unstressed syllable following the stressed syllable.
7. In a final monosyllabic word (niece), the peak occurs around the middle of the stressed
syllable.
Third, we observed some further details, some of which overlap with observations 4-7.
8. f0 peak occurs earlier when the vowel of the stressed syllable is phonologically (and
phonetically) long (Lee, Lamar, Nina, Ramona, lure, nanny) than when the vowel is
short (Emily, mimic, minimize, mummy).
9. f0 drop after a stressed syllable is faster in a focused word than in a non-focused word —
this may suggest post-focus suppression as an active force.
10. The scope of post-focus suppression seems to include not only all post-focus words but
also post-accent unstressed syllable(s) in the focused word.
Finally, we noticed a number of individual differences.
11. For subject 2, the first and second f0 peaks in all sentences without narrow focus occur much
later than for other subjects.
12. For subjects 1, 2, 4, 6, the f0 peak occurs right before the offset of the accented syllable
in “Lee”, “Lamar”, and “lure” when they are under focus. For subjects 3, 5, 7, 8, however,
the f0 peak occurs well before the offset of the accented syllable in these words.
13. While there are apparent on-focus pitch range expansion and post-focus pitch range
suppression, for subjects 7 and 8 at least, there is also visible f0 movement
corresponding to the accented words in the post-focus region. Faint traces of post-focus
accents can also be seen in the f0 curves of subjects 1, 4, 5 (in “know”, f0 rises after
syllable onset).
3.2. Phase II — Quantitative Analyses
In this section, we first report results of statistical analyses performed to verify the observations
described in the previous section. We then report results of further quantitative analyses aimed at
finding out the underlying mechanisms of the observed patterns. The following measurements were
taken from individual f0 curves produced by all eight subjects using a set of custom-written C
programs.
• Minf0 (st) — lowest f0 in the stressed syllable of the accented words (or in all words for some
analyses), measured in semitone with the lowest f0 of each subject as the reference.
• Maxf0 (st) — highest f0 in the stressed syllable of the accented words (or in all words for
some analyses), measured in semitone with the lowest f0 of each subject as the reference.
• Rise size (st) — difference in semitone between maximum f0 and minimum f0 in the stressed
syllable of an accented word
• Accent-dur (ms) — duration of the stressed syllable in an accented word
• Rise time (ms) — time interval between f0 minimum and f0 maximum in the stressed syllable
of an accented word
• Rise speed (st/s) = 1000 * Rise size / Rise time
• Maxf0-to-C2 — time interval between f0 maximum and onset of the first post-accent syllable
• C1-to-maxf0 — time interval between onset of the accented syllable and f0 maximum
• Minf0-to-C1 — time interval between f0 minimum and onset of the accented syllable
• Peak location = 100 × C1-to-maxf0 / accent-dur
• Valley location = 100 × C1-to-minf0 / accent-dur
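The measurements listed above can be sketched in a few lines. This is an illustration, not the C programs used in the study. Three assumptions are ours: the stressed syllable is taken to end at the onset of the first post-accent syllable (so accent-dur = C2 minus C1), the f0 minimum is searched only up to the f0 maximum so that rise time is never negative, and Maxf0-to-C2 is positive when the peak precedes C2. All function and key names are hypothetical.

```python
import math


def to_semitones(f0_hz, ref_hz):
    """f0 in semitones relative to a reference (e.g. the speaker's lowest f0)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def accent_measurements(times_ms, f0_hz, c1_ms, c2_ms, ref_hz):
    """Compute the listed measures from f0 samples of one accented syllable."""
    st = [to_semitones(f, ref_hz) for f in f0_hz]
    i_max = max(range(len(st)), key=lambda i: st[i])
    # Assumption: look for the valley only at or before the peak.
    i_min = min(range(i_max + 1), key=lambda i: st[i])
    rise_size = st[i_max] - st[i_min]
    rise_time = times_ms[i_max] - times_ms[i_min]
    accent_dur = c2_ms - c1_ms  # assumption: syllable ends where the next begins
    return {
        "maxf0_st": st[i_max],
        "minf0_st": st[i_min],
        "rise_size_st": rise_size,
        "rise_time_ms": rise_time,
        # Rise speed (st/s) = 1000 * rise size / rise time; undefined if no rise.
        "rise_speed_st_per_s": 1000.0 * rise_size / rise_time if rise_time > 0 else None,
        "accent_dur_ms": accent_dur,
        "maxf0_to_c2_ms": c2_ms - times_ms[i_max],
        "c1_to_maxf0_ms": times_ms[i_max] - c1_ms,
        "peak_location_pct": 100.0 * (times_ms[i_max] - c1_ms) / accent_dur,
        "valley_location_pct": 100.0 * (times_ms[i_min] - c1_ms) / accent_dur,
    }
```

A peak location above 100 under this definition means that the f0 maximum falls after the offset of the accented syllable.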
3.2.1. Focus effect
We first address the issue of how focus is realized in terms of f0 of the accented syllable under focus.
Table 2 displays maxf0, minf0, rise size, rise speed, and accent-dur broken down according to focus
(on/none), speaking rate (normal/fast), accent location (word-final/word-nonfinal), and position
(word 1/word 3/word 5). Also displayed in the table are probability values resulting from four-factor
repeated-measures ANOVAs performed on the five measurements. (The effect of gender was found
to be non-significant for any of the dependent variables in a set of five-factor mixed-measure
ANOVAs. We therefore excluded it from the ANOVAs reported in Table 2.) As can be seen in Table
2, the effect of focus is highly significant for all dependent variables except minf0. Under focus,
maximum f0 becomes higher, the size of f0 rise becomes larger, the speed of f0 rise becomes faster,
and the duration of the accented syllable becomes longer. It is worth pointing out that although the
speed of f0 rise under focus increased drastically, it is still well below the maximum speed of pitch rise
reported by Xu and Sun (2002) for the corresponding rise size (23.4 st/s at 4.4 st vs. 10.8 + 5.6 × 4.4
= 35.4 st/s per Table VI in Xu & Sun (2002)). But this speed is similar to what was reported by Ladd
et al. (1999) and Ladd et al. (2000).
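The comparison with the maximum speed of pitch rise can be reproduced directly from the linear relation quoted above. The sketch below only restates that arithmetic; the function name is hypothetical, and the coefficients 10.8 and 5.6 are those cited in the text from Table VI of Xu and Sun (2002).

```python
def max_rise_speed(rise_size_st, intercept=10.8, slope=5.6):
    """Maximum speed of pitch rise (st/s) for a given rise size (st)."""
    return intercept + slope * rise_size_st


observed_speed = 23.4                 # st/s, mean rise speed under focus
limit = max_rise_speed(4.4)           # st/s, for the corresponding 4.4 st rise
fraction_of_limit = observed_speed / limit
```

Here limit evaluates to about 35.4 st/s, so the observed speed under focus is roughly two thirds of the reported maximum.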
Accent-dur is significantly longer at normal rate than at fast rate, as would be expected. The effect of
rate is also significant for rise size. But the difference between the two rates is very small.
Table 2 shows that the effect of accent location is significant on minf0, rise speed and accent-dur.
When the accent is word final, the duration of the accented syllable is increased by 66.6 ms, but the
speed of f0 rise is also increased. The increase in rise speed may seem to be related to the increase in
rise size, because, according to Xu and Sun (2002), rise speed is directly related to rise size. However,
the range of rise size increase in Table 2 is only 0.2 st, which, according to Table VI of Xu and Sun
(2002), can generate a speed difference of only 0.72 st/s, much smaller than the 2.7 st/s shown in
Table 2. This rise speed increase thus appears deliberate. That is, the underlying pitch target is more
like a [fall] in a word-final accent than in a word-nonfinal accent. However, there is a significant
interaction between accent location and position. The largest difference between word-final and
word-nonfinal accent is in word 5 (5.8 st/s), whereas in word 1 and word 3 the differences are 1.5 and
0.9 st/s, respectively. There is also a significant three-way interaction between focus, accent location
and position. The largest difference between word-final and word-nonfinal accent is in word 5 under
focus: 9.2 st/s, whereas in words 1 and 3, whether under focus or not, the largest difference is
2.3 st/s. Thus it seems that the sentence final position under focus is somewhat special. This will
become clearer in further analysis later.
The effect of position is significant for all dependent variables except rise size. As the position of
the accented syllable becomes later in a sentence, maximum and minimum f0 become lower, rise
speed becomes slower, and accent duration becomes longer.
Insert Table 2 about here
Overall, Table 2 shows that when the effects of rate, accent location and position are controlled,
under focus, the accented syllable becomes longer, the maximum f0 associated with the accented
syllable becomes higher, the size of the pitch rise becomes larger, and the speed of the pitch rise
becomes faster.
To examine the effect of focus on the f0 of post-focus words, a set of ANOVAs was performed on
the f0 of non-accented syllables and the results are displayed in Table 3.
Insert Table 3 about here
Table 3 shows the mean values of maximum f0 in each word after word 1 and word 3 when they are
under focus and when there is no narrow focus in the sentence. The post-focus words are also divided
depending on whether the accented syllable in the focused word is followed by an unstressed syllable
(adjacency: close for Lee, Lamar, lure; far for Nina, Ramona, Emily, mimic, minimize). As can be seen,
overall, the maximum f0 of words following the focused word is significantly lower than the same
words in the no-focus condition whether focus is on word 1 or word 3. Also, the maximum f0 of
post-focus words is higher when the first post-focus syllable immediately follows the accented syllable
in the focused word than when it is separated from it by one or more unstressed syllables in the focused word.
However, there are significant interactions between focus and adjacency, and between adjacency and
position (p = 0.0133 and p < 0.0001). As can be seen in Figure 8, it is only when adjacency is close
that maximum f0 of the word immediately following the focused word is higher in the focused
condition than in the no-focus condition. From what we know about the mechanisms of f0
production and from what can be seen in Figure 7, this difference is mainly attributable to the fact
that it takes time for the f0 raised by focus to drop to the desired post-focus level.
Insert Figure 8 about here
The fact that maximum f0 of the syllable immediately following the accented syllable in the focus
condition is higher than in the non-focus condition, as seen in Figure 8 and Table 3, may suggest that
the scope of the focus includes the following unaccented syllable. The curves in Figure 7 indicate,
however, that there is a sharp drop in f0 even in the unstressed syllable following the accented
syllable. Figure 9 shows the mean f0 in semitone at different locations in the post-accent syllable
broken down by focus and post-accent stress. Only word1 and word3 sentences are included, because
in word5 sentences, “niece” does not have any post-accent syllable. As can be seen in Figure 9, the
downward slope is shallower when the post-accent syllable is weak than when it is strong. At the same
time, post-accent f0 drop is faster when the accented syllable is focused than when it is not focused,
whether or not the post-accent syllable is an unstressed syllable within the focused word or a stressed
syllable in the following word. A four-factor (focus, accent location, position, location in syllable)
repeated-measures ANOVA finds a highly significant interaction between focus and position (p =
0.0068), confirming that f0 drops sharply within the post-accent syllable. (The effect of focus is
non-significant, but those of accent location and location in syllable are highly significant (p < 0.0001 for both).
There is also a significant interaction between focus and accent location (p < 0.0001).) Hence, the
high maximum f0 of the first post-focus syllable is immediately followed by a sharp fall toward a
much lower f0. And this fall seems to be due to speakers' attempt to lower f0 immediately after the
focused, accented syllable, whether or not the following syllable is part of the focused word.
Insert Figure 9 about here
Table 3 also shows that maximum f0 of a word is not significantly different whether or not it
precedes a focus. This is despite the fact that for some subjects pre-focus syllables seem to have lower
f0 maxima than when there is no focus in the same sentence, as can be seen in Figure 7.
Insert Figure 10 about here
Another question that needs to be answered is whether post-focus words are totally accentless after a
narrow focus. Figure 10 shows percentage of discernable post-focus f0 peaks and the rise size of these
peaks in semitone in sentences with initial focus (in sentence group 1 shown in 2.1.1.). A peak is
defined as discernable if there is an f0 point between the onset and offset of the words “know” and
“niece” that is higher than both the starting and ending f0 of the word. The graph on the left
indicates that there is a greater number of discernable peaks when there is no focus than when either
word 1 or word 3 is focused. A three-factor repeated measures ANOVA with focus, rate and position
as independent variables finds the effect of focus to be highly significant (p < 0.0001), but the effect
of position non-significant. Nevertheless, the lowest percentage of peak occurrence in post-focus
condition is still nearly 60%. The graph on the right in Figure 10 shows that there is again a
difference in rise size between the focus and no focus condition. However, a three-factor repeated
measures ANOVA finds the effect of focus to be non-significant, but the effect of position significant
(p = 0.0088). There is also no significant interaction between focus and position. Note that although
the mean rise size is quite small overall, the rise occurs in a declining f0 contour. So the size of the
accent is actually larger than the rise size seems to suggest directly.
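The peak criterion defined above (an f0 point inside the word that is higher than both the starting and the ending f0 of the word) is simple to state in code. The sketch below is illustrative; the function names are hypothetical, and each word is assumed to be given as a list of f0 values ordered from word onset to word offset.

```python
def has_discernable_peak(f0):
    """True if some interior f0 point exceeds both the first and the last point."""
    if len(f0) < 3:
        return False
    return max(f0[1:-1]) > max(f0[0], f0[-1])


def percent_with_peak(curves):
    """Percentage of word-level f0 curves that contain a discernable peak."""
    if not curves:
        return 0.0
    hits = sum(1 for curve in curves if has_discernable_peak(curve))
    return 100.0 * hits / len(curves)
```

Because the criterion compares interior points only with the two word edges, a small rise embedded in an overall declining contour still counts as a peak, which is consistent with the remark above about rises occurring in a declining f0 contour.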
Finally, previous studies have found that sentence-final pitch accents do not differ in f0 due to focus.
To test whether this is the case in the present data, a four-factor repeated measures ANOVA was
performed on maximum f0 of word 5. The effect of focus turns out to be significant, with maximum
f0 being higher under final focus (9.3 st) than when there is no narrow focus (6.9 st), F(1,6) = 14.793,
p = 0.0085.
3.2.2. f0 events associated with accented syllables
As discussed earlier, although f0 peaks and valleys are not necessarily the critical acoustic correlates
of a linguistic tonal unit such as lexical tone, pitch accent or focus, analysis of their alignment
relative to segmental/syllabic units may help us identify the underlying pitch targets associated with
the accented syllables. Table 4 displays different kinds of pitch targets and their f0 alignment with the
associated syllables according to previous studies (Xu, 1998, 1999, 2001a).
Insert Table 4 about here
3.2.2.1. Alignment of f0 peaks
Five factors that may potentially contribute to f0 peak alignment are controlled: speaking rate (rate),
focus, position in sentence (position), location of accented syllable within word (accent location),
and length of accented syllable (accent length — phonological length of accented syllable: long —
"Lee", "Nina", "Lamar", "Ramona", "lure", "niece", "nanny"; short — "Emily", "mimic",
"minimize", "mummy"). The last two factors, however, are not independent of each other in the
data set. Their effects therefore have to be examined separately. Also, because stress cannot be on a
word-final open syllable with short vowel, we excluded words with short accented syllables ("Emily",
"mummy") when examining the effect of location of accented syllable within word, and excluded
words with final stress when examining the effect of length of accented syllable. And, because accent
location and accent length fully coincide at the word 3 position, this position is not included in the
alignment analysis reported next. The alignment patterns in those words, nevertheless, did conform
to the same pattern as in the other two positions. Two separate sets of four-factor repeated measures
ANOVAs were performed and the probability values together with the means are displayed in the
upper and lower halves of Table 5, respectively. Also shown in Table 5 are mean values of maxf0-to-
C2 and peak location broken down by focus, rate, position, accent location and accent length.
From Table 5 we can see that the effects of focus and rate are significant when maxf0-to-C2 and
peak location are broken down by accent length, but not when they are broken down by accent
location. Interestingly, however, it is in the no-focus condition that the delay is greater, whether or not the
difference is statistically significant. This suggests that the underlying target under focus is more like
a [fall] than a [high].
In sharp contrast to those of focus and rate, the effects of accent location, accent length and
position are highly significant on both dependent variables. Figure 11 shows the mean values of the
two dependent variables broken down further by position, accent location and accent length. In the
figure, we can see that the tendency for f0 peaks to occur later is related to three conditions: when
accent location is not word-final, when accent length is short, and when the accented word is not
sentence final. This is true in both the upper and lower panels in the figure.
Interestingly, looking at Table 2 again, we notice that duration of the accented syllable increases
with position in an orderly manner: word 1 < word 3 < word 5. This agrees with the trend in the right
panel of Figure 11 quite well. It is possible that all these situations are related to shorter duration of
the accented syllable, and that it is the shortened duration of the accented syllable that pushes the f0
peak location rightward. To verify this possibility, we recomputed mean duration of the accented
syllable in word 1 and word 5 according to focus, rate, accent location and accent length. They are
displayed in Figure 12. The graph on the left excludes data from words with short accented syllables;
and the graph on the right excludes data from words with word-final accented syllables. In both
graphs, a general trend can be seen: the longer the duration of the accented syllable, the earlier the
location of the f0 peak. In general, it is when the duration of the accented syllable is shorter than 200
ms that the f0 peak occurs in the following syllable, with the exception of the sentence final
position. It seems that f0 peaks tend to occur earlier in the sentence final position, especially when
the accented syllable is sentence-final. Further examinations need to be done.
Insert Table 5 about here
Insert Figure 11 about here
Insert Figure 12 about here
3.2.2.2. More detailed analyses / f0 peak alignment in relation to duration of accented syllables
The analyses so far have revealed certain gross patterns related to focus and accent realization. The
sources of these patterns and their variations, however, are still not clear. As discussed in regard to
Figure 12 and Table 5, we are still not certain what determines the exact location of the f0 peak
associated with an accented syllable, i.e., whether it occurs before or after the offset of the accented
syllable and how far away the peak is from the syllable offset. We have seen, however, that peak location
is related to whether the accented syllable is word-final (accent location) and whether the vowel in
the accented syllable is phonologically short or long (vowel length). To determine which of the two
factors is dominant, a set of regression analyses were performed using accent duration as predictor
and maxf0-to-C2 as dependent variable. Again, to separate the effect of accent location, we excluded
short-vowel syllables, since there were no short-vowel word-final accented syllables in the data. And,
to examine the effect of vowel length, we excluded word-final accents. Figure 13 displays the
regression results. Because accent location and vowel length fully coincide in word 3, this position is
not included in the figure. The upper panel of Figure 13 shows the values of r2, and the lower panel
shows the slope of the regression line. In the left-hand graphs, the results are broken down by focus
and accent location, and in the right-hand graphs the results are broken down by focus and length of
accented vowel. As can be seen, when the accent is word final (Lee, Lamar, niece) and on-focus, the
r2 values are quite large: 0.356 and 0.381 for word 1 and word 5, respectively. The corresponding
slopes of regression lines are large and positive (0.351, 0.364), indicating that the f0 peak occurs
increasingly earlier in the accented syllable as the syllable becomes longer. As was seen in Figure 11,
the relative location of the f0 peak is 80% and 56% of the syllable duration in word-final accents
when word 1 and word 5 are under focus. The difference in relative peak location between word 1 and
word 5 does not seem to have much to do with the slightly longer duration of the accented syllable in
word 5 than in word 1. This is because word 5 has much earlier peaks than word 1 even when the
durations of the accented syllables are comparable. For example, according to the regression
equations, when accent duration is 200 ms, maxf0-to-C2 is 19.2 and 37.0 ms for no-focus and focus
conditions in word 1, respectively, but 99.9 and 84.0 ms for no-focus and focus conditions in word 5,
respectively. This indicates that f0 peaks have a stronger tendency to occur early in sentence final
position than in sentence-initial position. When the sentences have no narrow focus, the r2 values
are all very small except if the accent is sentence final (niece): 0.455. In the latter case, the slope of
the regression line is also positive, indicating that the peak location moved earlier as the syllable
duration became longer. The right-hand graphs of Figure 13 show the regression results broken down
by focus and vowel length. Note that none of the r2 values is greater than 0.2. This indicates that the
location of f0 peaks is not closely related to the duration of the accented syllable with either vowel
length.
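The arithmetic behind these comparisons can be sketched in a few lines of Python. The sketch assumes that maxf0-to-C2 is the interval from the f0 peak to the offset of the accented syllable (positive when the peak precedes the offset, as the sign convention in this section implies) and that relative peak location is the distance of the peak from syllable onset as a percentage of syllable duration. The function name is hypothetical, and the inputs are the regression-based values quoted above for a 200 ms accented syllable.

```python
def peak_location_percent(accent_dur_ms, maxf0_to_c2_ms):
    """Relative peak location: 0 = syllable onset, 100 = syllable offset.

    Assumes maxf0_to_c2_ms > 0 means the peak precedes the offset;
    values above 100 mean the peak falls in the following syllable.
    """
    return 100.0 * (accent_dur_ms - maxf0_to_c2_ms) / accent_dur_ms


# maxf0-to-C2 (ms) predicted at a 200 ms accent duration, as quoted in the text.
predicted_at_200ms = {
    ("word 1", "no focus"): 19.2,
    ("word 1", "focus"): 37.0,
    ("word 5", "no focus"): 99.9,
    ("word 5", "focus"): 84.0,
}

for (word, condition), lead_ms in predicted_at_200ms.items():
    loc = peak_location_percent(200.0, lead_ms)
    print(f"{word}, {condition}: peak at {loc:.1f}% of the syllable")
```

Under these assumptions the word 5 peaks land near the middle of a 200 ms syllable, whereas the word 1 peaks land in its last fifth.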
Insert Figure 13 about here
Insert Figure 14 about here
To verify if what we have seen in sentence initial and sentence final positions also occurs in the
sentence medial position, Figure 14 displays regression results for word 3. The only sizeable r2 value
for word 3 is for the word-final/long-vowel syllable: r2 = 0.477. This indicates that it is only when the
accented vowel is long and/or word-final and when it is under focus that the f0 peak is affected by
duration of the accented syllable. When the accented vowel is short and non-word-final, the mean
values of maxf0-to-C2 are negative whether on focus or not: –29 ms and –18 ms, indicating that the
peaks mostly occur after the offset of the syllable. This is in contrast to word 5 as shown in
Figure 11, where maxf0-to-C2 is mostly positive both when the accent is short and when it is non-
word-final.
To summarize, (a) when neither under focus nor sentence final, the f0 peak associated with an
accented syllable occurs close to and before the syllable offset if it is not followed by an unstressed
syllable, but close to and after the syllable offset if it is followed by an unstressed syllable; in neither
case does the peak location vary systematically with syllable duration. (b) If the accented syllable is
sentence final or if it is both word final and under focus, the f0 peak occurs well before the offset of
the accented syllable and its location becomes increasingly earlier relative to the syllable offset when
the duration of the accented syllable increases. Implications of these patterns will be discussed in the
General Discussion.
3.2.2.3. Alignment of f0 valleys
As shown in Table 4, determination of the basic form of the pitch target associated with an accent
requires not only information about alignment of f0 peaks, but also that of f0 valleys. In particular,
knowing the f0 valley alignment can help us distinguish [rise] from [high] and [fall], and [rise] from
[low]. The analysis of f0 peaks has already indicated that the pitch target in the accent in word 1, 3,
or 5 is unlikely to be [low], because it is never the case that f0 maximum occurs at the beginning of
the accented syllable. The preceding analyses also suggest that the target is either [high] or [fall]
depending on a number of factors. However, there is another possibility that has not been fully tested,
i.e., an f0 rise during a syllable could also be due to a [rise]. There has already been some evidence that
this is not a highly likely target, because f0 peaks mostly occur within the accented syllable unless the
syllable duration is very short. This is in contrast to the R tone in Mandarin, which presumably
carries a [rise], where the peak always occurs after the end of the syllable, regardless of syllable
duration. Further verification of this understanding may be obtained by analysis of f0 valley
alignment. Table 6 compares the effects of several factors on two measurements of f0 valley
alignment, namely, C1-to-minf0 and valley location (100 x C1-to-minf0 / accent-dur). In the table
we can see that the largest mean value of C1-to-minf0 is 19.2 ms in word-final accents. But even this
value corresponds to only 7.0% of the duration of the accented syllable. There are only two
marginally significant differences: between normal and fast speaking rates and between word-final
accent and word-nonfinal accent. However, the differences in the means are so small (7% vs. 9%) that
we can still regard them as not very different. In general, therefore, the pitch target is unlikely to
be [rise] in the accented syllable in the sentences examined in the present study.
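The valley location measure defined above, 100 x C1-to-minf0 / accent-dur, can be written out as a minimal sketch. The function names are hypothetical, and the second helper simply inverts the formula. Because the reported 19.2 ms and 7.0% are means, the duration it returns is only a rough implied value, not a figure reported in the study.

```python
def valley_location_percent(c1_to_minf0_ms, accent_dur_ms):
    """Valley location as a percentage of accented-syllable duration."""
    return 100.0 * c1_to_minf0_ms / accent_dur_ms


def implied_duration_ms(c1_to_minf0_ms, valley_percent):
    """Invert the formula: duration consistent with a given delay and percentage."""
    return 100.0 * c1_to_minf0_ms / valley_percent


# Reported means for word-final accents: 19.2 ms delay, 7.0% of the syllable.
dur = implied_duration_ms(19.2, 7.0)
print(f"implied accent duration: {dur:.0f} ms")
print(f"valley location: {valley_location_percent(19.2, dur):.1f}%")
```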
Insert Table 6 about here
3.2.3. f0 events during unaccented syllables
The f0 contours of an utterance consist of not only prominent peaks, but also curves in between the
peaks. As discussed in the Introduction, there are different theories about how contours between
peaks are formed. They can be divided into three major groups: spreading, interpolation and target
implementation. Spreading is mostly from left to right. Interpolation involves tones both before and
after the f0 contours at issue. All three hypotheses can be verified by examining the influence of the
f0 events upon each other. An interpolated curve should equally reflect the influence of the preceding
and following pitch targets, whether the interpolation is linear or "sagging" (Pierrehumbert, 1980,
1981). Spreading, on the other hand, implies that there should be 100% influence from the tone on
the left. Target implementation implies influence mainly from the left, which would diminish over
time as the target is being approached. To examine the relative influence of the preceding and
following accents on a non-accented syllable, we performed several sets of regression analyses on the
f0 height at different locations in the unaccented syllable immediately after the accented syllables.
Figure 15 displays the results of the regression analyses.
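The contrasting predictions of the three accounts can be made concrete with a toy sketch. It expresses each account as a pair of weights: how much the f0 at a given time after the accent depends on the preceding accent and on the following accent. The linear interpolation and the exponential approach to a target, along with the parameters span_ms and tau_ms, are illustrative assumptions and are not taken from the analyses reported here.

```python
import math


def influence_weights(account, t_ms, span_ms=300.0, tau_ms=80.0):
    """Return (weight of preceding accent, weight of following accent)
    on f0 at t_ms into the unaccented stretch between two accents.

    span_ms: assumed distance between the two accents (interpolation only).
    tau_ms:  assumed time constant of the approach to the local target.
    """
    if account == "spreading":
        # The tone on the left fully determines f0 throughout.
        return 1.0, 0.0
    if account == "interpolation":
        # Linear interpolation: both accents contribute, and the
        # contribution of the following accent grows with time.
        frac = t_ms / span_ms
        return 1.0 - frac, frac
    if account == "target":
        # Approach to the syllable's own target: the carry-over from the
        # left decays, and the following accent plays no role.
        return math.exp(-t_ms / tau_ms), 0.0
    raise ValueError(f"unknown account: {account}")


for account in ("spreading", "interpolation", "target"):
    row = [influence_weights(account, t) for t in (50, 100, 150, 200)]
    print(account, [(round(p, 2), round(f, 2)) for p, f in row])
```

In this sketch only the target implementation account shows an influence of the preceding accent that fades without a matching rise in the influence of the following accent.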
Insert Figure 15 about here
In Figure 15, the regressor is rise-size (in semitones relative to the minimum f0 of the word), and the
dependent variables are post-pitch at 50, 100, 150 and 200 ms after the accented syllable. Post-pitch
is computed by subtracting minimum f0 of the word from the f0 values at the four locations in the
post-accent syllable. The graphs on the left show r2, which indicates how much of the variation in
post-pitch can be accounted for by the height of the preceding accent as represented by rise size. The
graphs on the right display the slopes of the regression lines, indicating whether post-pitch varies in
the same or opposite (i.e., when the slope is negative) directions as rise size. As can be seen, overall,
post-pitch at 50 ms after the accented syllable can be well predicted by rise-size in word 1 and
word 3 positions. The prediction is not as good in word 5, although it can still account for 25.6% and
34.2% of the variation for the no-focus and post-focus conditions, respectively. The predictability
reduces over time. But the rate of reduction is faster when the post-accent syllable is stressed ("may"
after "Lee" and "Lamar") than when it is unstressed (in "Nina", "Ramona" and "Emily"). The slope
of the regression line also changes over time, but it remains positive in both word 1 and word 3
positions while becoming negative at 100 ms post-accent in word 5. The negative slope of the
regression line at sentence final position seems to indicate an extra effort to implement something
that is quite independent of the preceding accent.
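A minimal sketch of how such a regression could be set up is given below. It assumes that f0 is converted to semitones relative to the minimum f0 of the word, that rise-size is the f0 peak of the accented syllable on that scale, and that one ordinary least-squares line is fitted per time point. The function names and the token layout are hypothetical, and the tokens are made-up values, not data from the study.

```python
import math


def semitones(f_hz, ref_hz):
    """Distance of f_hz above ref_hz in semitones."""
    return 12.0 * math.log2(f_hz / ref_hz)


def ols(x, y):
    """Least-squares line y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r2


def post_pitch_regressions(tokens, offsets_ms=(50, 100, 150, 200)):
    """Regress post-pitch on rise-size separately at each time offset.

    tokens: list of dicts with 'min_f0' and 'max_f0' in Hz and 'post_f0',
            a dict mapping an offset in ms to f0 in Hz (hypothetical layout).
    Returns {offset_ms: (slope, r2)}.
    """
    rise = [semitones(t["max_f0"], t["min_f0"]) for t in tokens]
    out = {}
    for off in offsets_ms:
        post = [semitones(t["post_f0"][off], t["min_f0"]) for t in tokens]
        slope, _, r2 = ols(rise, post)
        out[off] = (slope, r2)
    return out


# Hypothetical tokens: the post-accent f0 keeps the full rise at 50 ms
# and half of it (in semitones) at 100 ms.
tokens = [
    {"min_f0": 100.0,
     "max_f0": 100.0 * 2 ** (k / 12),
     "post_f0": {50: 100.0 * 2 ** (k / 12), 100: 100.0 * 2 ** (k / 24)}}
    for k in (2, 4, 6)
]
results = post_pitch_regressions(tokens, offsets_ms=(50, 100))
```

With these made-up tokens the slope falls from about 1 at 50 ms to about 0.5 at 100 ms, which is the kind of decline over time that the slope graphs are meant to capture.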
Insert Figure 16 about here
In Figure 16, the regressor is again rise-size, but the dependent variables are pre-pitch at 50 ms and
100 ms prior to the onset of the accented syllable. Similar to post-pitch, pre-pitch is computed by
subtracting minimum f0 of the word from the f0 values at three locations in the pre-accent syllable.
The graphs on the left again show r2, while the graphs on the right show the slope of the regression
line. In these graphs, pre-pitch is overall poorly predicted by rise-size. Only in word 2 are there r2
values over 0.2, and those are at locations farthest away from the accented syllable. Since they
occurred only in two conditions in word 2, it is difficult to determine if these higher r2 values reflect a
real anticipatory influence or are merely accidental. Thus there appears to be little consistent
influence of the accented syllables on the f0 of the pre-accent syllables.
The results of the regression analyses in this section do not seem to favor either the spreading or
interpolation account for the f0 of the unaccented syllables. Rather, they seem to agree with the
prediction of the target implementation account, i.e., substantial influence from the preceding accent
which reduces over time and little influence from the following accent. Also consistent with the
target implementation account is the finding that a preceding accent exerts more influence on the f0
of an unstressed syllable than on a stressed syllable. An additional finding is that the influence of the
preceding accent is very quickly overcome if the unaccented syllable is sentence final, and the slope
of the regression line actually becomes negative 100 ms after the preceding accent. This seems to
suggest an extra effort to implement something with an independent identity.
3.2.4. The case of subject 2
Most of the analyses so far have excluded data from subject 2 because of her extensive inter-trial
inconsistency in terms of the basic f0 patterns. As can be seen in Figure 6, where all the other subjects
would have a high f0 value, subject 2 sometimes has a low f0 value, and vice versa. Informal listening
to her sentences suggested that she may have used different tonal patterns for accented syllables as
well as unaccented ones. Upon closer inspection, we noticed that such alternate f0 patterns in terms
of location of peaks and valleys occurred in other sentences as well. When using peak location
patterns as reference, we can see what is happening when there is no narrow focus in the sentence:
the first f0 peak often occurs much later, mostly in the middle or later in the word "may" (59 out of
98 trials). In contrast, the first f0 peak always occurs well before or around the onset of "may" for
other speakers. More consistently, the f0 contour in the accented syllable in word 3 usually assumes a
sharp fall toward a valley near the syllable offset (69 out of 98 trials), indicating that this speaker
actually tried to implement a [low] pitch target for this syllable.
For whatever reason, subject 2 seems to have assigned a [low] pitch target to the second accented word
in these sentences, and a high pitch target to the last accented word. This is apparently different from
the other 7 subjects examined in the present study. Such free alternation of high and low tonal targets
has been suggested before by Goldsmith (1999). Since it occurred in only one subject in the present
study, no definitive conclusions can be drawn about it. Also interestingly, the alternating peak
locations in this subject occur only in sentences without a narrow focus on word 1 or word 3.
Whenever there is a narrow focus on word 1 or word 3, the location of the f0 peak becomes quite
consistent, and it is not different from the general peak location patterns of other speakers as
shown in Figure 7: the first f0 peak never occurs later than the middle of the word "may", and there is
never a sharp f0 fall in the accented syllable in word 3 toward a valley near the syllable offset. This
indicates that there is no [high] or [fall] pitch target for this syllable when focus is on word 1 or 3.
4. General Discussion
The present study is motivated by a number of recent advances in the understanding of tone and
intonation. First, there has been converging evidence that surface f0 contours do not directly
resemble underlying pitch components that function in speech intonation. Articulatorily, not only is
the larynx unable to change f0 instantaneously, but also it is unable to change f0 fast enough to render
the transitional movements negligible (Xu & Sun, 2002). In fact, at a normal speaking rate, the
observed f0 contours are likely to be mostly transitions toward various ideal targets rather than being
the targets themselves (Xu, 1997, 1999). Biomechanically, there is a strong constraint to fully
synchronize related movements (Kelso et al., 1979; Kelso, 1984; Kelso et al., 1981; Schmidt et al.,
1990), and there has been evidence that in the case of tone, a tonal target is synchronously
implemented with a syllable (Xu, 1998, 2001a, 2002; Xu & Wang, 2001). Such synchronization
implies that in each syllable, the earlier portion of the f0 contour is mostly transitional, whereas the
later portion is closer to the ideal target. The first task of the present study is therefore to search for
evidence of the underlying forms of the local pitch targets, assuming that observed f0 movements are
mostly transitions toward these targets. The present study is also motivated by the recent finding
that different layers of information can be conveyed simultaneously by f0 in a tone language (Xu,
1999). The second task of the present study is therefore to examine whether and how such
simultaneous conveyance of intonational information by f0 is also done in a non-tone language like
English. Finally, recent research on Mandarin has suggested that syllables previously thought to be
"toneless" are likely produced with local pitch targets just as syllables with full tones, and that their
highly variable f0 contours are probably due to the weak effort used in implementing the targets
(Chen & Xu, 2002). The third task of the present study is therefore to investigate if this mechanism
also underlies the generation of f0 contours of the "accentless" syllables in English.
4.1. Local pitch targets
As mentioned above, our search for the underlying targets of pitch accents in English is guided by two
newly-gained understandings: (a) that an f0 transition toward any target takes time and it often spans
much of the duration of a syllable, and (b) that the implementation of a pitch target is synchronized
with the syllable associated with the target. The following summarizes the local pitch targets of
syllables together with specific evidence reported in section 3.
1. No [rise] found.
When a syllable carries a [rise], (a) f0 falls first and then rises in the later portion of the
syllable, and the rise accelerates towards the end of the syllable; (b) an f0 minimum occurs
well before syllable offset and its location varies systematically with syllable duration, and (c)
an f0 maximum consistently occurs immediately after the accented syllable, regardless of the
vowel length. Our analysis, however, did not find these characteristics in any of the syllables
we examined. Instead, the results of the analyses in 3.2.2.3 show that f0 minima consistently
occur very early in the syllable (7-9% into the syllable). Thus there is no evidence
for the existence of [rise] in the pitch accents of the short declarative sentences we
examined.
2. [high] in non-focused accented syllables that are not sentence-final, and in focused accented
syllables that are not word-final.
Evidence: (a) f0 contours in these syllables mostly rise throughout their duration but slow
down toward their offset, especially when the duration is relatively long (i.e., when the vowel
is phonologically long) (Figure 7). (b) f0 peaks occur around the end of these syllables, but
their exact locations vary: before the syllable offset when followed by a stressed syllable, but
after the syllable offset when followed by an unstressed syllable (Figure 11). However, the
location of the f0 peaks does not vary systematically with the duration of the syllable
(Figures 13 & 14). (c) f0 valleys consistently occur immediately after the syllable onset (Table
6).
3. [fall] in focused word-final accent and in non-focused sentence-final accent.
Evidence: (a) f0 contours in these syllables rise first and then fall in the later portion of the
syllable (Figure 7); the speed of the rise is faster than that in non-final accents (Table 2); and
the speed of the fall accelerates toward the end of the syllable (Figure 7). (b) f0 peaks occur
well before the syllable offset, and their locations vary quite systematically with syllable
duration (Figure 11). (c) f0 valleys consistently occur immediately after the syllable onset
(Table 6).
4. [mid] in all unaccented syllables. The evidence for this will be discussed later in 4.3.
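The assignment of targets summarized in this list can be restated as a small decision rule. The sketch below is only one reading of the summary: the function and argument names are hypothetical, and it assumes that "sentence-final" describes the accented syllable itself (i.e., the last syllable of the sentence), so that a sentence-final accented syllable is also word-final.

```python
def local_pitch_target(accented, focused=False, word_final=False,
                       sentence_final=False):
    """Return the hypothesized local pitch target of a syllable.

    accented:       the syllable carries a pitch accent
    focused:        the word containing the syllable is under narrow focus
    word_final:     the syllable is the last syllable of its word
    sentence_final: the syllable is the last syllable of the sentence
    """
    if not accented:
        return "mid"    # item 4: all unaccented syllables
    if sentence_final or (focused and word_final):
        return "fall"   # item 3
    return "high"       # item 2; item 1: no [rise] was found


# Examples using words from the test sentences (roles are illustrative).
print(local_pitch_target(True, focused=False, word_final=True))   # "Lee", no focus
print(local_pitch_target(True, focused=True, word_final=True))    # "Lee", focused
print(local_pitch_target(True, focused=True, word_final=False))   # "Nina", focused
print(local_pitch_target(True, word_final=True, sentence_final=True))  # "niece"
print(local_pitch_target(False))                                   # unaccented "may"
```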
It is well known that the distinction between H* and LH* in the Pierrehumbert model of
intonation is difficult to make (Ladd & Schepman, 2003). The solution proposed by Ladd and
Schepman (2003) was to merge the two accents into LH* because the f0 minima consistently
occurred at the beginning of the accented syllable in both of the alleged accents. As we have discussed
above, based on the new understanding, that an f0 minimum consistently occurs at the beginning of
an accented syllable is actually evidence, together with other patterns listed in Table 1, that the
underlying pitch target is either [high] (when neither under focus nor sentence final) or [fall] (when
word-final and under focus or sentence final).
Silverman and Pierrehumbert (1990) also investigated f0 peak alignment in American English. Unlike
in the present study, they focused mostly on one alignment measurement, namely, the distance
between onset of accented vowel and the location of f0 peak, which they referred to as peak delay.
They compared peak delay in words with 0, 1, or 3 unaccented syllables after the accented syllable.
Their grouping is not exactly the same as the word-final vs. word-nonfinal grouping in the present
study. However, when viewing their data using the word-final/word-nonfinal grouping, we can see that
the peak location is almost always around or after the end of the accented syllable in word-nonfinal
accents even as rhyme duration increases. In contrast, in word-final accents, peak location remains
around the middle of the rhyme as rhyme duration increases. Thus their data provide further support
for our interpretation of the underlying pitch targets of the accented syllables in English.
4.2. Manifestation of focus
Our analysis found that, under focus, the accented syllable becomes longer, the maximum f0
associated with the accented syllable becomes higher, the size of the pitch rise becomes larger, and
the speed of the pitch rise becomes faster. In addition to that, however, the analyses have also shown
that, similar to Mandarin (Xu, 1999), the realization of focus in English is not only in terms of f0
changes in the syllable directly under focus, but also in terms of f0 changes in syllables after focus.
Overall, the maximum f0 of words following the focused word is significantly lower than the same
words in the no focus condition whether focus is on word 1 or word 3. Furthermore, as shown in
Figure 7, the high maximum f0 under focus is immediately followed by a sharp fall toward a much
lower f0. And this fall seems to be due to speakers' attempt to lower f0 immediately after the accented
syllable under focus, whether or not the following syllable is part of the focused word. Such
post-focus f0 drop was reported by Cooper et al. (1985), Eady & Cooper (1986) and Eady et al.
(1986) but has not generally been accepted as part of the manifestation of focus. There has been
evidence in studies of focus perception that low f0 after focus is critical for the correct identification
of focus by listeners (Rump & Collier, 1996; Hasegawa & Hata, 1992; Xu, Xu & Sun, 2003; Xu & Xu,
2003). So, the evidence seems compelling that post-focus pitch range compression is part of the
manifestation of focus itself rather than of something else.
Our quantitative analysis further shows that post-accent f0 drop is faster when the accented syllable is
focused than when it is not focused, whether or not the post-focus syllable is stressed. This finding
has two implications. First, post-focus f0 drop is likely to be done with an active articulatory force.
Second, the scope of post-focus f0 drop includes not only all the post-focus words but also the post-
accent syllable(s) in the focused word.
Research on focus in Mandarin has found that the f0 change related to focus is not simply a matter of raising the
f0 of the focused words and lowering the f0 of the post-focus words; rather, focus modifies the pitch
range of the sentence: expanding it for the focused word (raising its maximum f0 and lowering
its minimum f0), suppressing (i.e., lowering and narrowing) it for post-focus words, and leaving it intact
for pre-focus words. The design of the present study, which limits the current investigation to
declarative sentences only, does not allow us to verify this bi-directional pitch range modification for
the accented syllable, because, as has been found in Mandarin, pitch targets such as [high] and [fall]
expand only upward. Only the L tone and the R tone, which presumably have [low] and [rise] pitch
targets, were found to expand downward in Mandarin. The data reported by Eady & Cooper (1986),
however, suggest that the pitch targets associated with stressed syllables in a Yes-No question are
probably either [low] or [rise] and the minimum f0 of those syllables did become lower under focus.
Our analysis also found that the underlying pitch target of an accented syllable is not always the same
under focus and not under focus. In a word-final stress, the accented syllable tends to have a [fall]
target, whereas in a non-word-final stress the accented syllable tends to have a [high] target. This is
different from Mandarin, in which the pitch targets are assigned lexically and are not changed by
focus.
In Mandarin, post-focus pitch range suppression, though quite extensive, does not totally eliminate f0
contours related to lexical tones, as can be seen in Figure 3b. It has been an open question, however,
whether there are stress-related f0 movements after focus in English. In the Pierrehumbert model of
intonation, no pitch accents occur after focus, which, also known as the nuclear accent, is by
definition the last pitch accent in an intonation phrase. But a recent study by Di Cristo and Jankowski
(1999) has found that, at least in French, post-focus accents still retain their identity in the form of
certain f0 contour patterns. In the present study, analyses of both peak occurrence and size of f0 rise
in post-focus words demonstrate that the percentage of peak occurrence is still nearly 60% or higher
and the size of the f0 rise is not significantly different when post-focus and when the sentence has no
narrow focus (cf. Figure 10). It is particularly worth noting that the f0 rises occur against a declining
f0 contour which is possibly related to post-focus suppression. So the size of the intended f0 rises is
likely larger than the observed size. Thus the present data indicate that in English, too, pitch accents
in post-focus words still remain, albeit suppressed severely by the focus.
Data in Table 3 also show that maximum f0 of an accented word is not significantly different whether
or not it precedes a focus. This indicates that pre-focus accents generally remain intact. This also
agrees with the findings of Cooper et al. (1985) for English and Xu (1999) for Mandarin.
Finally, unlike what has been found by Xu (1999) for Mandarin and Cooper et al. (1985) for English, the present
data show that a sentence-final focus significantly raises f0 of the last word in the sentences examined
in the present study as compared to the same sentences without a narrow focus. This seems to agree
with the finding of Rump and Collier (1996) that it is possible to find patterns of f0 configuration
which can be perceived as having either a single sentence-final focus or the so-called broad focus.
However, their data also show that the distinction between focus vs. non-focus is smaller at the
sentence-final position than in earlier positions. Even in our data, a sentence-final accent not under
focus has a [fall] target just as does a non-sentence-final but word-final accent under focus. This
could indicate that a declarative sentence with no narrow focus carries a default final focus, as
assumed by some phonological analyses of intonation (Ladd, 1996). However, there is also a
possibility that it is the low boundary tone attached to the end of a declarative sentence
(Pierrehumbert, 1980) that is partially responsible for the final fall in f0. So, the question regarding
the exact nature of sentence-final pitch accent remains open.
4.3. f0 of unaccented syllables
The analyses described in 3.2.3 indicate that f0 values of unaccented syllables are influenced much
more by the preceding accent than by the following accent. This influence, however, reduces over
time and the reduction is not accompanied by increase in influence from the following accent. This
pattern suggests that f0 of an unaccented syllable does not come from "tone spreading", i.e., the
spreading of tonal feature from one tone-bearing unit to the next (Goldsmith, 1990; Hyman &
Schuh, 1974; Pierrehumbert, 1980), which would predict sustained full influence of the preceding
accent. Neither can the pattern be explained as resulting from interpolation between flanking pitch
accents, whether the interpolation is straight-lined or curved (Kochanski & Shih, 2003;
Pierrehumbert, 1980), because interpolation would generate equal amounts of influence from both
the preceding and following accents. More importantly, the pattern in fact calls into question the
general consensus that unaccented syllables do not have any underlying pitch targets of their own. If
the influence of the preceding accent decreases over time with little increase in the influence of the
following accent, there must be a third source for the f0 movements during the unaccented syllables.
In a study of Mandarin neutral tone, which is also generally believed to be toneless, Chen and Xu
(2002) found that the f0 of the neutral tone is best understood if we assume that (a) this tone has its
own pitch target, which is probably a static [mid], and (b) this target is implemented with
categorically less articulatory effort than those of the full tones. The English sentences used in the
present study did not provide exactly the same manipulations as those in Chen and Xu (2002).
Nevertheless, we can see from Figure 7 that pre-accent f0 minimum is usually higher than the lowest
f0 of the speaker, which can be found at the end of the sentence or in the post-focus region. Thus it
is possible that unaccented syllables are implemented with a static [mid] target. Furthermore, the
regression analyses in 3.2.3 show that an unstressed syllable is more susceptible than a stressed but
unaccented syllable to the influence of the preceding pitch accent. Thus even when not accented, the
relative strength of a syllable can be further differentially manifested through articulatory effort.
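The relation between effort and carryover can be illustrated with a minimal sketch. It assumes, purely for illustration, that f0 approaches a target exponentially and that a rate constant stands in for articulatory effort. The function name, the rate values and the 150 ms syllable duration are hypothetical and are not estimates from the present data.

```python
import math


def residual_carryover(rate_per_s: float, elapsed_s: float) -> float:
    """Fraction of the initial offset from the target that remains after elapsed_s.

    Assumes a first-order (exponential) approach to the target; rate_per_s is a
    hypothetical stand-in for articulatory effort.
    """
    return math.exp(-rate_per_s * elapsed_s)


# Hypothetical rates: weaker effort means a slower approach to the [mid] target.
EFFORT_RATES = {"stressed_unaccented": 12.0, "unstressed": 6.0}

for label, rate in EFFORT_RATES.items():
    # Influence of the preceding accent left at the end of one and two syllables.
    after_one = residual_carryover(rate, 0.15)
    after_two = residual_carryover(rate, 0.30)
    print(f"{label}: {after_one:.2f} after one syllable, {after_two:.2f} after two")
```

Under these assumptions the weaker-effort syllable retains more of the preceding accent's influence at every point, and the influence declines over time without any contribution from the following accent.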
In the Pierrehumbert model of English intonation, stressed syllables in words carrying pitch accents
are always specified with tones such as H and L. Unstressed syllables of accented words sometimes are
each specified with a tone but sometimes are not. Words that do not carry pitch accents are not
specified with tones at all. In that framework, f0 of syllables without a tone comes from linear
or nonlinear interpolation between neighboring accentual tones or boundary tones. Such phonetic
interpolation, however, is inconsistent with the principle proposed by Pierrehumbert (1980) and
emphasized in her later work (Pierrehumbert, 2000) that there is no long-distance phonetic
look-ahead. Mathematically, interpolation, especially a linear one, is rather simple to implement.
Articulatorily, however, interpolation would be a rather laborious mechanism if not an impossible
one, because it requires the articulatory system to store both the preceding and following tones as
references, and to continuously calculate the current state based on the exact elapsed time at every
moment during articulation. Assigning a pitch target to the unstressed syllable and approaching it
with a weak effort makes the articulation task much simpler, because the articulatory system does
not need to anticipate the f0 value of the upcoming accent, and it does not need to refer to the
precise proportional time between accents for determining the moment-to-moment f0 value. The
only on-line assessment of current state needed is that of the distance from the targeted state.
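The difference in what the two mechanisms need to know can be sketched as follows. This is a simplified illustration, not the authors' implementation: the exponential approach, the function names and all numerical values (semitones relative to a speaker's [mid], rate in 1/s) are assumptions made for the example.

```python
import math


def interpolate_f0(prev_f0, next_f0, t, interval):
    """Linear interpolation: requires the upcoming accent's f0 and the
    proportional time elapsed between the two accents."""
    return prev_f0 + (next_f0 - prev_f0) * (t / interval)


def approach_target(current_f0, target_f0, rate_per_s, dt):
    """One step of target approximation: requires only the current distance
    from the target (assumed exponential approach)."""
    return target_f0 + (current_f0 - target_f0) * math.exp(-rate_per_s * dt)


def simulate_unaccented(start_f0, target_f0, rate_per_s, duration, dt=0.005):
    """Apply approach_target repeatedly; no later accent is ever consulted."""
    f0 = start_f0
    track = [f0]
    for _ in range(round(duration / dt)):
        f0 = approach_target(f0, target_f0, rate_per_s, dt)
        track.append(f0)
    return track


# Hypothetical case: f0 starts 6 semitones above [mid] after a preceding accent.
track = simulate_unaccented(start_f0=6.0, target_f0=0.0, rate_per_s=8.0, duration=0.3)
print(f"f0 falls from {track[0]:.2f} to {track[-1]:.2f} semitones above [mid]")
```

Note that `interpolate_f0` takes `next_f0` and `interval` as arguments, whereas `approach_target` takes neither; this is the simplification of the articulatory task described above.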
Identifying the sources of f0 for the unaccented syllables is not only important for understanding the
f0 contours of these syllables, but also critical for understanding the f0 contour of accented syllables.
Assuming that the implementation of a pitch accent is synchronized with the syllable, f0 at the
beginning of the syllable has to start from the level achieved in the preceding unaccented syllable. If
the accent has the pitch target of [high] or [fall], as discussed earlier, f0 has to rise during the initial
portion of the syllable, thus generating the appearance that both the beginning "low" and the initial
"rise" are part of the pitch accent. In a series of recent studies, Ladd and his colleagues found
consistent alignment of f0 valleys with the onset of accented syllables (Arvaniti et al., 1998; Ladd et
al., 1999; Ladd et al., 2000). They interpret such alignment as an indication that there is an L
associated with the accent which is aligned to the syllable onset. With the understanding that
unaccented syllables also have their own pitch targets, which are also implemented asymptotically,
the pitch targets of the accented syllables can be simplified by excluding the portion due to the
influence of the preceding unaccented syllables. Thus the f0 valley found to consistently align with
the onset of an accented syllable should not be interpreted as an L, but rather as evidence that the
implementation of the pitch target associated with the pitch accent starts from the onset of the
syllable.
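This interpretation can be illustrated with a small simulation in which no L target is specified anywhere. It is a sketch under stated assumptions: targets are approached exponentially, each approach starts exactly at the syllable onset, and the target heights, rates and durations are hypothetical values chosen only for illustration.

```python
import math


def simulate(syllables, start_f0, dt=0.005):
    """Return (times, f0) for a sequence of (target, rate_per_s, duration_s) syllables.

    Each target is approached exponentially from the f0 reached at the end of
    the previous syllable, beginning exactly at the syllable onset.
    """
    times, contour = [0.0], [start_f0]
    f0, n = start_f0, 0
    for target, rate, duration in syllables:
        decay = math.exp(-rate * dt)
        for _ in range(round(duration / dt)):
            f0 = target + (f0 - target) * decay
            n += 1
            times.append(n * dt)
            contour.append(f0)
    return times, contour


HIGH, MID = 6.0, 0.0  # hypothetical target heights in semitones
sequence = [
    (HIGH, 20.0, 0.20),  # accented syllable, strong effort
    (MID, 8.0, 0.15),    # unaccented syllable, weak effort
    (MID, 8.0, 0.15),    # unaccented syllable, weak effort
    (HIGH, 20.0, 0.20),  # accented syllable, strong effort
]
times, contour = simulate(sequence, start_f0=3.0)
second_accent_onset = 0.20 + 0.15 + 0.15
valley_time = times[contour.index(min(contour))]
print(f"f0 valley at {valley_time:.3f} s; second accent begins at {second_accent_onset:.3f} s")
```

In this sketch the valley falls at the onset of the second accented syllable simply because that is where the approach to [mid] ends and the approach to [high] begins, and the valley stays above the [mid] level because the weak approach is incomplete.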
4.4. Implications for intonation theories
The findings of the present study have implications not only for our understanding of English
intonation, but also for theoretical understanding of intonation in general. As discussed in the
Introduction, most existing theories can be identified as either linear or superpositional (Fujisaki,
1988; Gårding, 1979; Grønnum, 1995; Ladd, 1996; Pierrehumbert, 1980; Pierrehumbert & Beckman,
1988; 't Hart et al., 1990). The findings of the present study suggest that both of these general
approaches have merits and yet neither seems adequate for the new data, because intonation
appears to involve both linearity and superposition. The linearity is seen in
the finding that every syllable is likely to have a pitch target and the generation of local f0 contours
does not involve either long-distance or bidirectional interpolation. The superposition is seen in the
finding that functions like focus may involve adjustment of pitch ranges for a number of consecutive
syllables. These findings seem to call for a new theoretical model of intonation. Such a model should
be based on the recognition of individual components of intonation (Xu, 2001b). The model should,
first of all, make a distinction between communicative functions that convey meanings through
intonation and articulatory mechanisms that implement these functions. This distinction is necessary
because intonation is produced by an articulatory system whose physical properties introduce extra f0
variations not intended by the speakers. These variations therefore should not be viewed as part of
the communicative meanings. This is especially true if f0 variations due to the articulatory system
are non-trivial, as is the case with inertia and synchronization. Due to inertia, no intended pitch
changes can be made instantaneously. In fact, speakers often have to make f0 movements as fast as
they can when changing pitch in their speech (Xu & Sun, 2002). Due to the need to synchronize
related motor movements (Kelso, 1984; Kelso et al., 1979), transitions toward underlying tonal
targets have to start from the beginning of the syllable and stop by the end of the syllable (Xu, 1997,
1999; present data), despite the fact that this would often mean insufficient time to complete the
transition. As a result, the closest resemblance to the ideal f0 contour of a lexical tone is often found
only near the end of the syllable. Furthermore, the same synchronization of tone and syllable seems
to take place regardless of the segmental composition of a syllable (Xu, 1998; Xu & Xu, in press).
The articulatory pitch-production system, however, is only an instrument that generates surface f0
contours, and it has to be controlled by the input that carries communicative information. There are
many different communicative functions that need to be conveyed through melody in speech,
including (but certainly not limited to) the Lexical, Syntactic/Semantic, Sentential, Focal and Topical
functions. These functions often need to be transmitted simultaneously. The present data as well as
those from previous studies (Chen & Xu, 2002; Xu, 1999) demonstrate that simultaneous
transmission of multiple communicative functions can be done by each function manifesting itself
through a unique way of controlling one or more of three parameters: pitch target, pitch range and
articulatory effort. Data from Xu (1999) demonstrate that lexical tone and focus can be realized
concurrently in Mandarin, and the present data demonstrate that pitch accent and focus can also be
realized concurrently in English. In both cases, pitch accent and tone are manifested as local pitch
targets, while focus is expressed mostly through manipulation of pitch ranges. The present data also
demonstrate that lexical stress in English can be realized in parallel with both pitch accent and focus.
In both cases the weak stress seems to be achieved by deliberately reduced articulatory effort. Data
from other studies also suggest that there are separate controls for sentence type, i.e., statement vs.
question (Eady & Cooper, 1986), and for new topic, i.e., the introduction of a new subject into the
conversation or monologue (Lehiste, 1975; Umeda, 1982).
Based on these understandings, a new model of intonation is sketched in Figure 17. The model
assumes a major division between communicative functions that convey meanings through
intonation and articulatory mechanisms that implement these functions. Intonation related
communicative functions manifest themselves in parallel by separately specifying (a) local pitch
target, (b) pitch range and (c) articulatory effort. Taking these specifications as input, the
articulatory module applies physical forces to successively approach local targets at the specified
pitch ranges with the specified amounts of effort. The timing control in the articulatory module
synchronizes the local targets with the associated syllables. The resulting f0 contours thus continually
approach successive local targets within different pitch ranges at varying speeds (column 4). Due to
space limitations, a full elaboration of the model will be presented in a separate paper.
Insert Figure 17 about here
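The division of labor in the model can be sketched in code. The following is an illustrative approximation only, not the model of Figure 17 itself: the exponential approach, the treatment of [fall] as a linearly moving target, and every number in the parameter tables are assumptions introduced for the example.

```python
import math
from dataclasses import dataclass

# Hypothetical parameter tables; none of these numbers come from the present data.
TARGET_LEVEL = {"high": 1.0, "mid": 0.5}                    # position within the pitch range
EFFORT_RATE = {"strong": 20.0, "weak": 8.0, "weaker": 4.0}  # approach rate in 1/s
PITCH_RANGE = {                                             # (floor, span) in semitones
    "neutral": (0.0, 6.0),
    "pre": (0.0, 6.0),    # pre-focus: left largely intact
    "on": (0.0, 10.0),    # on-focus: expanded
    "post": (-2.0, 3.0),  # post-focus: lowered and compressed
}


@dataclass
class Syllable:
    target: str      # key of TARGET_LEVEL, or "fall" for a dynamic target
    effort: str      # key of EFFORT_RATE
    focus: str       # key of PITCH_RANGE
    duration: float  # seconds


def target_at(syl, frac):
    """Target f0 (semitones) at proportion frac of the way through the syllable."""
    floor, span = PITCH_RANGE[syl.focus]
    # A dynamic [fall] is modeled here as a target moving from high to low.
    level = 1.0 - frac if syl.target == "fall" else TARGET_LEVEL[syl.target]
    return floor + span * level


def generate_f0(syllables, start_f0, dt=0.005):
    """Articulatory module: approach each syllable's target, synchronized with
    the syllable, within its pitch range and at a rate set by its effort."""
    f0 = start_f0
    contour = [f0]
    for syl in syllables:
        steps = round(syl.duration / dt)
        decay = math.exp(-EFFORT_RATE[syl.effort] * dt)
        for i in range(1, steps + 1):
            goal = target_at(syl, i / steps)
            f0 = goal + (f0 - goal) * decay
            contour.append(f0)
    return contour


def sentence(focus_labels):
    """Five-syllable example: accent, unaccented, accent, unstressed, final accent."""
    specs = [("high", "strong"), ("mid", "weak"), ("high", "strong"),
             ("mid", "weaker"), ("fall", "strong")]
    return [Syllable(t, e, f, 0.2) for (t, e), f in zip(specs, focus_labels)]


neutral = generate_f0(sentence(["neutral"] * 5), start_f0=3.0)
focused = generate_f0(sentence(["pre", "pre", "on", "post", "post"]), start_f0=3.0)
print(f"peak of third syllable: neutral {max(neutral[81:121]):.2f}, focused {max(focused[81:121]):.2f}")
```

In this sketch the three specifications are set independently for each syllable, so changing the focus labels alters only the pitch ranges while the local targets and effort levels stay the same.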
4.5. Caveats
Due to the limited scope of the present study, many issues are left unaddressed, including both the
communicative functions and the articulatory mechanisms. Regarding the communicative functions,
the roles of many pragmatic, attitudinal and emotional functions are not addressed. Regarding the
articulatory module, the present study did not definitively prove the target-syllable synchronization
in English, although consistent alignment of f0 minimum around syllable onset was found,
corroborating similar findings in previous research (Ladd et al. 1999). A recent study has found
consistent gradient language-specific alignment patterns (Atterer & Ladd, forthcoming). However,
alignment differences across languages do not necessarily mean asynchrony between syllable and
pitch target. They may be reflections of the cross-language gradient differences in the underlying
pitch targets themselves rather than gradient differences in the degrees of synchrony across the
languages. A more definitive way of verifying the synchronization hypothesis would be to
manipulate the tonal context, especially the preceding tonal context of an accent. This can be easily
done in a tone language like Mandarin (as was done in Xu, 1999), but is rather difficult in languages like
English. The closest experimental manipulation we have seen is that of Ladd and Schepman (2003), where
the number of unstressed syllables intervening between two accents is manipulated. They found that
the duration of the f0 rise in the second accent remains constant despite variation in the height of
the f0 minimum between the two accents. However, they did not examine whether the alignment of
the f0 minimum remains constant relative to the onset of the second accented syllable. So, whether
pitch targets are fully synchronized with the syllable in English still needs to be further investigated.
The nature of boundary tones is also not specifically examined in the present study, although some
evidence of it is seen in the analysis of unaccented syllables at sentence final versus earlier positions
as discussed in 3.2.3. The understanding of boundary tone is also important for understanding other
local pitch targets. For example, the final syllable in a sentence would contain a boundary tone. If so,
shouldn't the overall falling pattern observed in the sentence-final syllable be interpreted as
consisting of a [high] followed by a [low] boundary tone? This question needs to be answered by
future studies specifically designed to look into the nature of boundary tones.
5. Conclusions
It has long been debated whether pitch registers or pitch contours should be considered as the
primary components of intonation in English as well as in many other languages. Recent findings
about articulatory constraints on the production of pitch movements suggest that neither
understanding is likely to be adequate because they both implicitly assume that observed f0 contours
directly resemble the underlying functional components. In the present study, we take it as given that
any intended pitch change needs a substantial amount of time to complete and hence the real
components of intonation, especially the local ones, can only be partially reflected in the f0
contours. We therefore treated f0 events such as turning points, slopes, and their alignment with
segmental units only as evidence for the possible underlying intonation components. Recent studies
on Mandarin have found that both static and dynamic targets may underlie f0 contours corresponding
to lexical tones in the language, and that these targets are likely to be implemented quite
synchronously with their associated syllables. One of the tasks of the present study was therefore to
use cues similar to those used in Mandarin to reveal the nature of the underlying local pitch targets in
English.
It has also long been debated whether intonation components are linearly sequenced or
superposed on top of each other. In the present study we recognize that for any communicative
function to be effectively conveyed in intonation, it must have its own unique manner of
manifestation. And for different communicative functions to be conveyed concurrently, they must
not wipe out each other's cues; otherwise each would reach the listener only at the expense of the
others. Thus there may be a multitude of different means to manifest different intonation meanings,
both linear and superpositional. The second task of the present study was therefore to examine how
each of the possible components of intonation is manifested and how they can coexist with other
components. In particular, we examined whether and how lexical stress, pitch accent and focus can be
conveyed at the same time.
Our analysis of f0 contours of short declarative sentences in American English provided data relevant
to both tasks. In terms of local pitch targets, first, non-focused and non-word-final accents seem to
be associated with a static [high], and word-final accent under focus and sentence-final accents,
whether or not under narrow focus, seem to be associated with a dynamic [fall]. Second, unaccented
syllables, whether stressed or unstressed, seem to be associated with a static [mid] rather than being
completely targetless, and their f0 contours come from implementation of this static target rather
than from interpolation between surrounding accents. Third, unstressed syllables seem to be
associated with a weak articulatory effort that distinguishes them further from unaccented but
stressed syllables. In terms of concurrent realization of different functions, focus is found to raise the
pitch range of the on-focus, stressed syllables, suppress the pitch range of all post-focus syllables, and
leave the pitch range of pre-focus words largely intact. In other words, neither are lexically related
pitch targets directly under focus fully replaced by focus itself, nor are post-focus pitch accents
completely eliminated.
Finally, we considered implications of the present data on the theoretical understanding of intonation
in general by contemplating a new model of intonation. The model assumes a major division between
a functional module and an articulatory module. The functional module consists of communicative
functions that are parallel to each other, and the articulatory module is composed of various physical
forces. In this model, separate communicative functions determine in parallel the local pitch targets,
the pitch ranges for the pitch targets and the amount of effort given to each pitch target. The
articulatory module then implements the pitch targets at the specified pitch ranges with the specified
amounts of effort. The resulting f0 contours therefore exhibit not only manifestation of each and
every individual communicative function, but also the effects of various physical properties of the
articulatory system.
ACKNOWLEDGEMENT
This work is supported in part by NIH Grant DC03902.
6. References
Abramson, A. S. (1962) The Vowels and Tones of Standard Thai: Acoustical Measurements and Experiments. Bloomington: Indiana University Research Center in Anthropology, Folklore, and Linguistics, Pub. 20.
Abramson, A. S. (1976) Thai tones as a reference system. In Thai linguistics in honor of Fang-Kuei Li (T. W. Gething, J. G. Harris, & P. Kullavanijaya, editors), pp. 1-12. Bangkok: Chulalongkorn University Press.
Abramson, A. S. (1978) The phonetic plausibility of the segmentation of tones in Thai phonology. In Proceedings of The Twelfth International Congress of Linguistics, Vienna, pp. 760-763.
Anderson, S. R. (1978) Tone features. In Tone: A linguistic survey (V. A. Fromkin, editor), pp. 133-175. New York: Academic Press.
Arvaniti, A., Ladd, D. R., & Mennen, I. (1998) Stability of tonal alignment: the case of Greek prenuclear accents. Journal of Phonetics, 26, 3-25.
Atterer, M. & Ladd, D. R. (forthcoming) On the phonetics and phonology of “segmental anchoring” of f0: evidence from German. Submitted to Journal of Phonetics.
Bai, D. (1934) Guanzhong shengdiao shiyan lu [Experiments with tones of Guanzhong dialects]. In Shiyusuo Jikan [A Collection by Shiyusuo], pp. 355-361.
Bolinger, D. L. (1951) Intonation: levels versus configuration. Word, 7, 199-210.
Bolinger, D. (1986) Intonation and its parts: melody in spoken English. Stanford University Press, Palo Alto.
Bruce, G. (1977) Swedish word accents in sentence perspective. In Travaux de l'Institut de Linguistique de Lund XII (B. Malmberg & K. Hadding, editors). Lund: Gleerup.
Bruce, G. & Touati, P. (1992) On the analysis of prosody in spontaneous speech with exemplification from Swedish and French. Speech Communication, 11, 453-458.
Caspers, J. & van Heuven, V. J. (1993) Effects of time pressure on the phonetic realization of the Dutch accent-lending pitch rise and fall. Phonetica, 50, 161-171.
Chao, Y. R. (1956) Tone, intonation, singsong, chanting, recitative, tonal composition, and atonal composition in Chinese. In For Roman Jakobson (M. Halle, editor), pp. 52-59. The Hague: Mouton.
Chao, Y. R. (1968) A Grammar of Spoken Chinese. Berkeley, CA: University of California Press.
Chen, Y. & Xu, Y. (2002) Pitch Target of Mandarin Neutral Tone. Presented at LabPhon 8, New Haven, CT.
Cohen, A. & 't Hart, J. (1967) On the anatomy of intonation. Lingua, 19, 177-192.
Cooper, W. E., Eady, S. J., & Mueller, P. R. (1985) Acoustical aspects of contrastive stress in question-answer contexts. Journal of the Acoustical Society of America, 77, 2142-2156.
Crystal, D. (1969) Prosodic Systems and Intonation in English. London: Cambridge University Press.
Di Cristo, A. & Jankowski, J. (1999) Prosodic organisation and phrasing after focus in French. In Proceedings of The 14th International Congress of Phonetic Sciences, San Francisco, 2, pp. 1565-1568.
D'Imperio, M. (2001) Focus and tonal structure in Neapolitan Italian. Speech Communication, 33, 339-356.
D'Imperio, M. (2002) Language-Specific and Universal Constraints on Tonal Alignment: The Nature of Targets and “Anchors”. In Proceedings of The 1st International Conference on Speech Prosody, Aix-en-Provence, France, pp. 101-106.
Duanmu, S. (1994) Against contour tone units. Linguistic Inquiry, 25, 555-608.
Eady, S. J. & Cooper, W. E. (1986) Speech intonation and focus location in matched statements and questions. Journal of the Acoustical Society of America, 80, 402-416.
Eady, S. J., Cooper, W. E., Klouda, G. V., Mueller, P. R., & Lotts, D. W. (1986) Acoustic characteristics of sentential focus: Narrow vs. broad and single vs. dual focus environments. Language and Speech, 29, 233-251.
Fujisaki, H. (1983) Dynamic characteristics of voice fundamental frequency in speech and singing. In The Production of Speech (P. F. MacNeilage, editor), pp. 39-55. New York: Springer-Verlag.
Fujisaki, H. (1988) A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour. In Vocal Physiology: Voice Production (O. Fujimura, editor), pp. 347-355. New York: Raven Press, Ltd.
Fujisaki, H. (1992) Modeling the process of fundamental frequency contour generation. In Speech Perception, Production and Linguistic Structure (Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka, editors), pp. 313-326. Amsterdam: IOS Press.
Gandour, J. (1974) On the representation of tone in Siamese. UCLA Working Papers in Phonetics, 27, 118-146.
Gandour, J., Potisuk, S., & Dechongkit, S. (1994) Tonal coarticulation in Thai. Journal of Phonetics, 22, 477-492.
Gandour, J., Potisuk, S., Dechongkit, S., & Ponglorpisit, S. (1992) Anticipatory tonal coarticulation in Thai noun compounds. Linguistics of the Tibeto-Burman Area, 15, 111-124.
Gårding, E. (1979) Sentence intonation in Swedish. Phonetica, 36, 207-215.
Gårding, E. (1987) Speech act and tonal pattern in Standard Chinese. Phonetica, 44, 13-29.
Goldsmith, J. A. (1990) Autosegmental and Metrical Phonology. Oxford: Blackwell Publishers.
Goldsmith, J. A. (1999) Dealing with prosody in a text-to-speech system. International Journal of Speech Technology, 3, 51-63.
Grønnum, N. (1995) Superposition and subordination in intonation: a non-linear approach. In Proceedings of The 13th International Congress of Phonetic Sciences, Stockholm, 2, pp. 124-131.
Han, M. S. & Kim, K.-O. (1974) Phonetic variation of Vietnamese tones in disyllabic utterances. Journal of Phonetics, 2, 223-232.
Hasegawa, Y. & Hata, K. (1992) Fundamental frequency as an acoustic cue to accent perception. Language and Speech, 35, 87-98.
Hirschberg, J. (1993) Pitch accent in context: Predicting prominence from text. Artificial Intelligence, 63, 305-340.
Hollien, H. (1960) Vocal pitch variation related to changes in vocal fold length. Journal of Speech and Hearing Research, 3, 150-156.
Hollien, H. & Moore, G. P. (1960) Measurements of the vocal folds during changes in pitch. Journal of Speech and Hearing Research, 3, 157-165.
Hombert, J.-M. (1974) Universals of downdrift: their phonetic basis and significance for a theory of tone. Studies in African Linguistics, Supplement 5, 169-183.
Howie, J. M. (1976) Acoustical Studies of Mandarin Vowels and Tones. London: Cambridge University Press.
Hyman, L. M. (1973) The role of consonant types in natural tonal assimilations. In Consonant Types and Tone (L. M. Hyman, editor), pp. 151-179. Los Angeles, CA: Department of Linguistics, University of Southern California.
Hyman, L. M. (1993) Register tones and tonal geometry. In The Phonology of Tone (H. v. d. Hulst & K. Snider, editors), pp. 75-108. New York: Mouton de Gruyter.
Hyman, L. & Schuh, R. (1974) Universals of tone rules. Linguistic Inquiry, 5, 81-115.
Jin, S. (1996) An Acoustic Study of Sentence Stress in Mandarin Chinese. Ph.D. dissertation, The Ohio State University.
Kelso, J. A. S. (1984) Phase transitions and critical behavior in human bimanual coordination. American Journal of Physiology: Regulatory, Integrative and Comparative, 246, R1000-R1004.
Kelso, J. A. S., Holt, K. G., Rubin, P., & Kugler, P. N. (1981) Patterns of human interlimb coordination emerge from the properties of non-linear, limit cycle oscillatory processes: Theory and data. Journal of Motor Behavior, 13, 226-261.
Kelso, J. A. S., Southard, D. L., & Goodman, D. (1979) On the nature of human interlimb coordination. Science, 203, 1029-1031.
Kim, S.-A. (1999) Positional effect on tonal alternation in Chichewa: Phonological rule vs. phonetic timing. In Proceedings of Annual Meeting of Chicago Linguistic Society, Chicago, 34, pp. 245-257.
Kochanski, G. & Shih, C. (2003) Prosody modeling with soft templates. Speech Communication, 39, 311-352.
Ladd, D. R. (1996) Intonational phonology. Cambridge: Cambridge University Press.
Ladd, D. R., Faulkner, D., Faulkner, H., & Schepman, A. (1999) Constant "segmental anchoring" of f0 movements under changes in speech rate. Journal of the Acoustical Society of America, 106, 1543-1554.
Ladd, D. R., Mennen, I., & Schepman, A. (2000) Phonological conditioning of peak alignment in rising pitch accents in Dutch. Journal of the Acoustical Society of America, 107, 2685-2696.
Ladd, D. R. & Schepman, A. (2003) "Sagging transitions" between high pitch accents in English: experimental evidence. Journal of Phonetics, 31, 81-112.
Laniran, Y. (1992) Intonation in Tone Languages: The phonetic Implementation of Tones in Yorùbá. Unpublished Ph.D. dissertation, Cornell University.
Laniran, Y. & Gerfen, C. (1997) High raising, downstep and downdrift in Igbo. In Proceedings of The 71st Annual Meeting of the Linguistic Society of America, Chicago, p. 59.
Leben, W. R. (1973) Suprasegmental Phonology. Unpublished Ph.D. dissertation, Massachusetts Institute of Technology.
Lehiste, I. (1975) The phonetic structure of paragraphs. In Structure and process in speech perception (A. Cohen & S. E. G. Nooteboom, editors), pp. 195-206. New York: Springer-Verlag.
Li, Y. J. & Lee, T. (2002) Acoustical f0 analysis of continuous Cantonese speech. In Proceedings of International Symposium on Chinese Spoken Language Processing 2002, Taipei, Taiwan, pp. 127-130.
Liberman, M. & Pierrehumbert, J. (1984) Intonational invariance under changes in pitch range and length. In Language Sound Structure (M. Aronoff & R. Oehrle, editors), pp. 157-233. Cambridge, Massachusetts: M.I.T. Press.
Lieberman, P. & Tseng, C. Y. (1980) On the fall of the declination theory: breath-group versus "declination" as the base form for intonation. Journal of the Acoustical Society of America, 67, S63.
Lin, M.-C. (1965) Yingao xianshiqi yu Putonghua shengdiao yingao texing [The pitch indicator and the pitch characteristics of tones in Standard Chinese]. Acta Acustica Sinica, 2, 8-15.
Lin, M.-C. (1988) Putonghua shengdiao de shengxue texing he zhijue zhengzhao [The acoustic characteristics and perceptual cues of tones in Standard Chinese]. Zhongguo Yuwen [Chinese Linguistics], 204, 182-193.
Lin, M.-C. & Yan, J. (1991) Tonal coarticulation patterns in quadrisyllabic words and phrases of Mandarin. In Proceedings of The 12th International Congress of Phonetic Sciences, 3, pp. 242-245.
Liu, F. & Xu, Y. (in press) Underlying targets of initial glides: Evidence from focus-related f0 alignments in English. To appear in Proceedings of The 15th International Congress of Phonetic Sciences, Barcelona.
Meeussen, A. E. (1970) Tone typologies for West African Languages. African Language Studies, 11, 266-71.
Nishizawa, N., Sawashima, M., & Yonemoto, K. (1988) Vocal fold length in vocal pitch change. In Vocal Physiology: Voice Production (O. Fujimura, editor), pp. 75-83. New York: Raven Press, Ltd.
Pierrehumbert, J. (1980) The Phonology and Phonetics of English Intonation. Ph.D. dissertation, Massachusetts Institute of Technology.
Pierrehumbert, J. (1981) Synthesizing intonation. Journal of the Acoustical Society of America, 70, 985-995.
Pierrehumbert, J. (2000) Tonal elements and their alignment. In Prosody: Theory and Experiment (M. Horne, editor), pp. 11-36. London: Kluwer Academic Publishers.
Pierrehumbert, J. & Beckman, M. (1988) Japanese Tone Structure. Cambridge, MA: The MIT Press.
Pike, K. L. (1945) The Intonation of American English. Ann Arbor: University of Michigan Press.
Pike, K. L. (1948) Tone Languages. Ann Arbor: University of Michigan Press.
Poser, W. (1984) The phonetics and phonology of tone and intonation in Japanese. Ph.D. dissertation, MIT, Cambridge, MA.
Potisuk, S., Harper, M. P., & Gandour, J. (1999) The classification of Thai tone sequences in syllable-segmented speech using the analysis-by-synthesis method. IEEE Transactions on Speech and Audio Processing, 7, 95-102.
Prieto, P., Santen, J. v., & Hirschberg, J. (1995) Tonal alignment patterns in Spanish. Journal of Phonetics, 23, 429-451.
Prieto, P., Shih, C., & Nibert, H. (1996) Pitch downtrend in Spanish. Journal of Phonetics, 24, 445-473.
Rump, H. H. & Collier, R. (1996) Focus conditions and the prominence of pitch-accented syllables. Language and Speech, 39, 1-17.
Schmidt, R. C., Carello, C., & Turvey, M. T. (1990) Phase transitions and critical fluctuations in the visual coordination of rhythmic movements between people. Journal of Experimental Psychology: Human Perception and Performance, 16, 227-247.
Shih, C.-L. (1988) Tone and intonation in Mandarin. Working Papers, Cornell Phonetics Laboratory, No. 3, 83-109.
Silverman, K. E. A. & Pierrehumbert, J. B. (1990) The timing of prenuclear high accents in English. In Papers in Laboratory Phonology 1: Between the Grammar and Physics of Speech (J. Kingston & M. E. Beckman, editors), pp. 72-106. Cambridge: Cambridge University Press.
Stevens, K. N. (2002) Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111, 1872-1891.
Stewart, J. M. (1965) The typology of the Twi tone system. Legon, Ghana: Institute of African Studies, University of Ghana.
Stewart, J. M. (1983) Key lowering (downstep/downglide) in Dschang. Journal of African Languages and Linguistics, 3, 113-138.
't Hart, J., Collier, R., & Cohen, A. (1990) A Perceptual Study of Intonation: An experimental-phonetic approach to speech melody. Cambridge: Cambridge University Press.
Taylor, P. A. (1994) A Phonetic Model of Intonation in English. Bloomington, Indiana: Indiana University Linguistics Club Publications.
Umeda, N. (1982) “f0 declination” is situation dependent. Journal of Phonetics, 10, 279-290.
Wang, C., Yue, W., Hirose, K., & Fujisaki, H. (1994) A scheme for Chinese speech synthesis by rule based on pitch-synchronous multi-pulse excitation LP method. In Proceedings of International Conference on Spoken Language Processing, Yokohama, pp. 1679-1682.
Woo, N. (1969) Prosody and phonology. Ph.D. dissertation, Massachusetts Institute of Technology.
Wu, Z. (1982) Putonghua yuju zhong de shengdiao bianhua [Tonal variations in Mandarin sentences]. Zhongguo Yuwen [Chinese Linguistics], 439-450.
Wu, Z. (1984) Putonghua sanzizu biandiao guilü [Rules of tone sandhi in trisyllabic words in Standard Chinese]. Zhongguo Yuyan Xuebao [Bulletin of Chinese Linguistics], 2, 70-92.
Wu, Z. (1988) Tone-sandhi patterns of quadro-syllabic combinations in Standard Chinese. Report of Phonetic Research, Institute of Linguistics (CASS), Beijing, China, PL-ARPR/1988, 1-13.
Wu, Z. (1990) Can poly-syllabic tone-sandhi patterns be the invariant units of intonation in spoken Standard Chinese? In Proceedings of ICSLP 90, pp. 12.10.1-4.
Xu, C. X. & Xu, Y. (2003) Recognizing focus in noise filled sentences. Journal of the Acoustical Society of America, 113, Pt. 2, 2327.
Xu, C. X. & Xu, Y. (in press) Effects of Consonant Aspiration on Mandarin Tones. Journal of the International Phonetic Association.
Xu, Y. (1993) Contextual Tonal Variation in Mandarin Chinese. Ph.D. dissertation, The University of Connecticut.
Xu, Y. (1994) Production and perception of coarticulated tones. Journal of the Acoustical Society of America, 95, 2240-2253.
Xu, Y. (1997) Contextual tonal variations in Mandarin. Journal of Phonetics, 25, 61-83.
Xu, Y. (1998) Consistency of tone-syllable alignment across different syllable structures and speaking rates. Phonetica, 55, 179-203.
Xu, Y. (1999) Effects of tone and focus on the formation and alignment of f0 contours. Journal of Phonetics, 27, 55-105.
Xu, Y. (2001a) Fundamental frequency peak delay in Mandarin. Phonetica, 58, 26-52.
Xu, Y. (2001b) Sources of tonal variations in connected speech. Journal of Chinese Linguistics, monograph series #17, 1-31.
Xu, Y. (2002) Articulatory constraints and tonal alignment. In Proceedings of The 1st International Conference on Speech Prosody, Aix-en-Provence, France, pp. 91-100.
Xu, Y. & Liu, F. (2002) Segmentation of glides with tonal alignment as reference. In Proceedings of 7th International Conference On Spoken Language Processing, Denver, Colorado, pp. 1093-1096.
Xu, Y. & Sun, X. (2002) Maximum speed of pitch change and how it may relate to speech. Journal of the Acoustical Society of America, 111, 1399-1413.
Xu, Y. & Wang, Q. E. (2001) Pitch targets and their realization: Evidence from Mandarin Chinese. Speech Communication, 33, 319-337.
Xu, Y., Xu, C. X., & Sun, X. (2003) Identifying intrinsic constituents of focus through "imitation via restoration". Journal of the Acoustical Society of America, 113, Pt. 2, 2327.
Yip, M. (1990) The Tonal Phonology of Chinese. New York: Garland Publishing.
Yip, M. (2002) Tone. Cambridge: Cambridge University Press.
Footnotes
1 Also the vocal fold length does not monotonically increase with frequency (Nishizawa, Sawashima, &
Yonemoto, 1988).
2 In its recent development, the Fujisaki model has incorporated negative accent commands that lower f0 rather than
raising it as do the positive commands (Potisuk, Harper & Gandour, 1999; Wang et al. 1994). The negative
commands are introduced to account for tones such as L (Low) and R (Rising) in Mandarin and Thai, in which f0
drops sometimes are too fast to be accounted for by elasticity. However, the mechanism of the automatic return to
the central level after the cessation of a negative command is even less clear than that of a return after a positive
command.
3 Here and throughout the paper, we are using the most conventional understanding of a syllable as its working
definition. That is, a syllable consists of all the segments that are commonly considered to belong to it, including
the onset and coda consonants and the vowel. Acoustically, certain apparent acoustic landmarks (Stevens, 2002),
such as the onset of stop and nasal closure, are treated as marking the syllable boundaries. We do not claim or even
believe this to be the ultimate definition of the syllable, as some of our own studies are already suggesting
otherwise (Xu & Liu, 2002; Liu & Xu, in press). For the purpose of the present paper, nevertheless, we found the
conventional definition of the syllable to be convenient for our discussion.
4 It is often claimed that a sentence bears a broad focus even when there is no special emphasis on any part of the
sentence (see Ladd, 1996, for detailed arguments).
5 Figure 5 also shows that the neutral tone after the L tone seems to behave somewhat differently than when it follows
other tones. The L tone seems to have the power to raise the f0 of the following neutral tone, which is maximally
manifested in the second neutral-tone syllable. This raising effect of the L tone seems to be independent of the
overall behavior of the neutral tone.
6 The "–7" is because unfocused “Lee. . . niece” is used to contrast both with focused “Lee” and focused “niece,” as
shown below.
Tables
Table 1. Specific f0 events that can help to answer the above questions.
a) What are the pitch targets associated with local prominences in a declarative sentence: static
[high], or dynamic [rise] or [fall]?
  f0 event: location of maximum
    [high]: around the end of the accented syllable, but the exact location varies: before syllable offset when followed by a stressed syllable, but after syllable offset when followed by an unstressed syllable
    [rise]: always immediately after the accented syllable
    [fall]: always well before the offset of the accented syllable
  f0 event: location of initial minimum
    [high]: consistently around the onset of the accented syllable
    [rise]: well after the onset of the accented syllable
    [fall]: consistently around the onset of the accented syllable
b) Is focus realized with pitch specification only for the accented/stressed syllable or with pitch
specifications both for the accented/stressed syllable and for post-focus syllables?
  f0 event: pitch range
    Specification for focused syllable only: expanded for focused syllable; unchanged for post-focus syllables
    Specification for both focused and post-focus syllables: expanded for focused syllable; lowered and narrowed for post-focus syllables
c) Do post-focus syllables have no pitch targets of their own and are implemented with only a flat
low f0 contour, or do they still have their own pitch targets, which are implemented with reduced
pitch range?
  f0 event: contour
    No target: totally flat
    Targets implemented with reduced pitch range: similar to those on and before focus but much reduced in pitch range
d) Does focus change the pitch target of a stressed syllable?
  f0 event: contour
    Same pitch target: similar f0 alignment
    Changed pitch target: different f0 alignment
e) Do syllables between pitch accents carry their own pitch targets, or is f0 only interpolated through
these syllables?
  f0 event: height of f0 max, min, mean
    No target, only interpolation: fully dependent on both preceding and following accents
    With their own pitch targets, possibly [mid]: only partially dependent on preceding accents; little dependence on following accent
  f0 event: contour shape
    No target, only interpolation: straight or curved interpolation between preceding and following accents
    With their own pitch targets, possibly [mid]: becomes lower and less dependent on preceding accent over time; the lowest point in the last unstressed syllable becomes the starting point of the f0 contour of the following accented syllable
Table 2. Mean values of various measurements in the accented syllable under the effects of focus, rate, accent location and position, together with probability values from four-factor repeated-measures ANOVAs. p values smaller than 0.05 are printed in boldface.

                      Focus               Rate                Accent location     Position
                      yes       no        normal    fast      final     non-final word1     word3     word5
Maxf0 (st)            11.0      8.2       9.6       9.7       9.6       9.7       10.9      9.9       8.1
                      p = 0.0086          p = 0.3839          p = 0.1455          p < 0.0001
Minf0 (st)            6.6       6.8       6.5       6.9       6.5       6.9       7.7       6.8       5.6
                      p = 0.2845          p = 0.2845          p = 0.0001          p < 0.0001
Rise size (st)        4.4       1.4       3.0       2.7       3.0       2.8       3.2       3.0       2.4
                      p = 0.0092          p = 0.0035          p = 0.0594          p = 0.1217
Rise speed (st/s)     23.4      9.5       15.6      17.4      17.8      15.1      18.2      16.6      14.5
                      p = 0.0041          p = 0.0511          p < 0.0001          p = 0.0356
Accent-dur (ms)       222.6     195.4     232.2     185.8     242.3     175.7     188.5     208.0     230.5
                      p = 0.0001          p < 0.0001          p < 0.0001          p < 0.0001
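The semitone rows of Table 2 can be illustrated with a short Python sketch. In the Focus columns, rise size equals Maxf0 minus Minf0 (11.0 - 6.6 = 4.4 st with focus, 8.2 - 6.8 = 1.4 st without). The function names, the 1 Hz reference frequency, and the definition of rise speed as rise size divided by the time from the f0 minimum to the f0 maximum are assumptions made for illustration; this section does not state them.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Convert f0 in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def rise_size_st(min_f0_hz, max_f0_hz):
    """Rise size in semitones; the reference frequency cancels out here."""
    return hz_to_semitones(max_f0_hz) - hz_to_semitones(min_f0_hz)


def rise_speed_st_per_s(min_f0_hz, max_f0_hz, rise_time_s):
    """Assumed definition: rise size divided by the min-to-max interval."""
    return rise_size_st(min_f0_hz, max_f0_hz) / rise_time_s


# A rise of one octave (100 Hz to 200 Hz) is 12 semitones.
octave = rise_size_st(100.0, 200.0)
# The same rise completed in half a second moves at 24 st/s.
speed = rise_speed_st_per_s(100.0, 200.0, 0.5)
```

Because rise size is a difference of two semitone values, the choice of reference frequency affects the Maxf0 and Minf0 rows but not the rise size or rise speed rows.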
Table 3. Mean values of various measurements in the post- and pre-accent syllables under the effects of focus, adjacency and position, together with probability values from three-factor repeated-measures ANOVAs. p values smaller than 0.05 are printed in boldface.

                        Focus           Adjacency       Position
                        yes     no      close   far     word2   word3   word4   word5
Maxf0-post-word1 (st)   6.2     7.4     7.1     6.6     8.9     6.5     5.9     6.0
                        p = 0.0016      p = 0.0002      p < 0.0001
Maxf0-post-word3 (st)   6.5     6.8     7.3     6.0     -       -       7.2     6.1
                        p = 0.0205      p = 0.0028      p = 0.0309

                        Focus           Position
                        yes     no      word1   word2   word3   word4
Maxf0-pre-word3 (st)    9.3     9.9     9.9     9.3     -       -
                        p = 0.1256      p < 0.0042
Maxf0-pre-word5 (st)    8.6     8.6     10.0    9.5     7.8     7.2
                        p = 0.8775      p < 0.0001
Table 4. Potential pitch targets and their associated alignment patterns.

              Pitch target
f0 event      [high]                     [rise]                     [fall]                     [low]
f0 maximum    after syllable offset      after syllable offset      around middle of syllable  around syllable onset
f0 minimum    around syllable onset      around middle of syllable  around syllable onset      around syllable offset
Table 5. Mean values of maxf0-to-C2 and peak location (= 100 × C1-to-max / accent-dur) under the effects of focus, rate, accent location (upper half), accent length (lower half) and position, together with probability values from four-factor repeated-measures ANOVAs. p values smaller than 0.05 are in boldface.

                      Focus               Rate                Accent location     Position
                      yes       no        normal    fast      final     non-final word1     word5
Maxf0-to-C2 (ms)      52.6      42.0      50.5      44.2      71.9      22.7      8.1       86.6
                      p = 0.1033          p = 0.0902          p < 0.0001          p = 0.0004
Peak location (%)     78.5      81.2      80.9      78.8      68.9      90.8      97.2      62.5
                      p = 0.4467          p < 0.2118          p < 0.0001          p = 0.0011

                      Focus               Rate                Accent length       Position
                      yes       no        normal    fast      long      short     word1     word5
Maxf0-to-C2 (ms)      12.5      -16.4     -5.7      1.8       30.2      -34.2     -52.8     48.9
                      p = 0.0036          p = 0.0302          p < 0.0001          p < 0.0001
Peak location (%)     83.6      124.1     107.9     99.9      87.6      120.1     130.6     77.1
                      p = 0.0051          p = 0.0431          p = 0.0165          p = 0.0136
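The two alignment measures in Table 5 can be sketched in Python as follows. The peak location formula is the one given in the caption. The reading of maxf0-to-C2 as the accented-syllable duration minus C1-to-max (positive when the peak precedes C2, negative when it follows C2) is an assumption; it matches the sign pattern in the table, where negative maxf0-to-C2 means go with mean peak locations above 100%. The function names are hypothetical.

```python
def peak_location_percent(c1_to_max_ms, accent_dur_ms):
    """Peak location as defined in the caption: 100 x C1-to-max / accent-dur."""
    return 100.0 * c1_to_max_ms / accent_dur_ms


def maxf0_to_c2_ms(c1_to_max_ms, accent_dur_ms):
    """Assumed reading: time from the f0 peak to C2, taking accent-dur as
    the C1-to-C2 interval. Positive values mean the peak precedes C2."""
    return accent_dur_ms - c1_to_max_ms


# Peak inside the accented syllable: 150 ms into a 200 ms syllable.
inside = (peak_location_percent(150.0, 200.0), maxf0_to_c2_ms(150.0, 200.0))
# Peak after the syllable offset: 250 ms after C1 in a 200 ms syllable.
after = (peak_location_percent(250.0, 200.0), maxf0_to_c2_ms(250.0, 200.0))
```

Valley location in Table 6 has the same form, with C1-to-minf0 in place of C1-to-max.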
Table 6. Mean values of C1-to-minf0 and valley location (= 100 × C1-to-minf0 / accent-dur) under the effects of focus, rate, accent location, and position, together with probability values from four-factor repeated-measures ANOVAs. p values smaller than 0.05 are printed in boldface.

                      Focus               Rate                Accent location     Position
                      yes       no        normal    fast      final     non-final word1     word3     word5
C1-to-minf0 (ms)      3.9       16.9      12.2      8.6       19.2      1.6       14.3      7.9       8.9
                      p = 0.1977          p = 0.0416          p < 0.3204          p = 0.6520
Valley location (%)   1.2       6.7       4.2       3.7       7.0       0.9       5.8       3.4       2.7
                      p = 0.1879          p < 0.6842          p < 0.0365          p = 0.6342
Figure Captions
Figure 1. In (a)-(c) Mandarin F, H and R tones (in syllable 3) are preceded by four different tones and
followed by H tone. In (d) the R tone in syllable 3 is followed by L tone. Vertical lines indicate
syllable boundaries. Adapted from Xu (1999).
Figure 2. Illustration of the pitch target implementation model. The vertical lines represent syllable
boundaries. The dashed lines represent underlying pitch targets. The thick curve represents the f0
contour that results from asymptotic approximation of the pitch targets. Adapted from Xu & Wang
(2001).
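The idea of asymptotic approximation described in the Figure 2 caption can be illustrated with a minimal Python sketch. It assumes static targets only, a first-order exponential approach toward each target, and a single rate constant; none of these specifics comes from the caption, and the function name and parameter values are hypothetical.

```python
import math


def approximate_targets(targets_hz, syllable_dur_s, start_f0_hz,
                        rate_per_s=20.0, step_s=0.01):
    """Return an f0 contour that approaches each syllable's static target.

    Within a syllable, f0 decays exponentially from its value at syllable
    onset toward the target (an assumed functional form). The value reached
    at syllable offset is the starting point for the next syllable.
    """
    contour = []
    f0 = start_f0_hz
    n_steps = round(syllable_dur_s / step_s)
    for target in targets_hz:
        onset_f0 = f0
        for i in range(1, n_steps + 1):
            t = i * step_s
            f0 = target + (onset_f0 - target) * math.exp(-rate_per_s * t)
            contour.append(f0)
    return contour


# Two 200 ms syllables with targets of 120 Hz and 90 Hz, starting at 100 Hz.
contour = approximate_targets([120.0, 90.0], 0.2, 100.0)
```

In this sketch f0 never jumps at a syllable boundary: each syllable starts from wherever the previous one ended, so the early part of a syllable still reflects the preceding target.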
Figure 3. (a) Interaction of tone and focus in Mandarin in H H H H H (left) and H L H L H (right)
sequences. The locations of focus are indicated by the labels around the curves. (b) Suppression of
post-focus tones when syllable 2 carries four different tones. In all cases, the focus is on the first two
syllables. Adapted from Xu (1999).
Figure 4. f0 downtrend introduced by downstep (a), and by both initial focus and downstep (b).
Adapted from Xu (1999). The thin curves are f0 tracings of the tone sequence H H H H H, whereas
the thick curves are those of H L H L H. The f0 interruptions at the beginning of syllable 5 in the H
L H L H sequences are due to a voiceless stop [t].
Figure 5. Mean f0 contours of Mandarin sentences containing 0 or 3 neutral tone (N) syllables. In
both graphs, the tone of syllable 1 alternates across H, R, L and F. In (a) the tone following syllable 1
is F. In (b), there are 3 neutral-tone syllables following syllable 1. Vertical lines in the graphs indicate
syllable boundaries. Data from Chen and Xu (2002).
Figure 6. Time-normalized f0 curves of seven repetitions of “Nina may know my niece” said by eight
subjects at “normal” rate with no narrow focus.
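Time normalization of the kind mentioned in the Figure 6 caption can be sketched in Python as follows. The sketch assumes that each syllable is resampled at a fixed number of equally spaced points by linear interpolation; the caption does not specify the procedure, and the function names and the default of 10 points are hypothetical.

```python
def interpolate(times_s, f0_hz, t):
    """Linearly interpolate f0 at time t from a sampled f0 track."""
    for i in range(len(times_s) - 1):
        t0, t1 = times_s[i], times_s[i + 1]
        if t0 <= t <= t1:
            frac = (t - t0) / (t1 - t0)
            return f0_hz[i] + frac * (f0_hz[i + 1] - f0_hz[i])
    raise ValueError("time lies outside the sampled f0 track")


def time_normalize(times_s, f0_hz, onset_s, offset_s, n_points=10):
    """Sample f0 at n_points (at least 2) equally spaced times between
    syllable onset and offset, so syllables of different durations can be
    overlaid or averaged point by point."""
    step = (offset_s - onset_s) / (n_points - 1)
    return [
        interpolate(times_s, f0_hz, min(onset_s + i * step, offset_s))
        for i in range(n_points)
    ]


# A 200 ms syllable with a linear rise from 100 Hz to 120 Hz, at 5 points.
points = time_normalize([0.0, 0.1, 0.2], [100.0, 110.0, 120.0], 0.0, 0.2, 5)
```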
Figure 7. Mean f0 contours of all sentences produced at normal rate by 7 subjects. In each graph, the
ordinate is the mean f0 in Hz averaged over 49 repetitions by 7 subjects, and the abscissa is time in
ms. The duration of each syllable in an f0 curve is the grand average of 49 repetitions by 7 subjects.
The thicker curves have narrow focus on one of the words as indicated by the underscore in the
sentence printed in each graph. The open squares and circles indicate syllable boundaries, located at
the first vocal pulse of the initial consonants. In the sentences containing the words “mimic” and
“minimize,” the gaps in f0 curves correspond to the closure or frication of the final consonants.
Figure 8. Post-focus maximum f0 broken down by adjacency to preceding focus and position in
sentence when focus is on word 1 (left) and word 3 (right).
Figure 9. Mean f0 in semitones at different locations in the post-accent syllable broken down by focus
and post-accent stress.
Figure 10. Percentage of discernable post-focus f0 peaks (left) and size of the peaks (right) in the
post-focus stressed syllables. A peak is discernable if there is an f0 point between the onset and offset
of the words “know” and “niece” that is higher than both the starting and ending f0 of the word.
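The peak criterion stated in the Figure 10 caption can be written as a short Python check. The sketch assumes that each token is a time-ordered list of f0 samples running from word onset to word offset; the function names are hypothetical.

```python
def has_discernible_peak(f0_points):
    """True if some f0 point between onset and offset is higher than both
    the starting and the ending f0 of the word."""
    if len(f0_points) < 3:
        return False
    start, end = f0_points[0], f0_points[-1]
    return any(v > start and v > end for v in f0_points[1:-1])


def percent_with_peak(tokens):
    """Percentage of tokens that contain a discernible peak."""
    return 100.0 * sum(has_discernible_peak(t) for t in tokens) / len(tokens)


tokens = [
    [5.0, 6.0, 5.5],  # rises, then falls: discernible peak
    [6.0, 5.5, 5.0],  # falls throughout: no peak
]
share = percent_with_peak(tokens)
```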
Figure 11. Mean values of maxf0-to-C2 (top) and peak location (bottom) broken down by focus,
position, accent location (left) and accent length (right).
Figure 12. Mean duration of the accented syllable in word 1 and word 5 according to focus and accent
location (left), and focus and accent length (right).
Figure 13. Results of regression analyses with accent duration as predictor and maxf0-to-C2 as
dependent variable. Upper panels: r2; Lower panels: slope of the regression line. Left: results broken
down by focus and accent location. Right: results broken down by focus and length of accented vowel.
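The regression statistics reported in Figures 13 and 14 (r2 and slope, with accent duration as predictor and maxf0-to-C2 as dependent variable) can be computed as in the Python sketch below. It uses ordinary least squares, which is an assumption since the caption does not name the fitting method; the function name and the data are made up for illustration.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r2


# Made-up data: accent duration (ms) and maxf0-to-C2 (ms) for four tokens.
accent_dur = [180.0, 200.0, 220.0, 240.0]
maxf0_to_c2 = [10.0, 20.0, 30.0, 40.0]
slope, intercept, r2 = linear_regression(accent_dur, maxf0_to_c2)
```

If maxf0-to-C2 is read as accent duration minus C1-to-max (an assumption), a slope near 1 would mean that the peak keeps a fixed distance from syllable onset as the syllable lengthens, and a slope near 0 would mean that it keeps a fixed distance from C2.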
Figure 14. Results of regression analyses on word 3 with accent duration as predictor and maxf0-to-C2
as dependent variable.
Figure 15. Results of regression analyses on f0 height at different locations in the unaccented syllable
immediately after accented syllables.
Figure 16. Results of regression analyses on f0 height at different locations in the unaccented syllable
immediately preceding accented syllables.
Figure 17. A brief sketch of a dual-module model of intonation. The model assumes a major division
between communicative functions that convey meanings through intonation and articulatory
mechanisms that implement these functions. Intonation-related communicative functions manifest
themselves in parallel (column 1 from left) by separately specifying (a) local pitch targets, (b) pitch
ranges and (c) articulatory effort (column 2). Taking these specifications as input, the articulatory
module (column 3) applies physical forces to successively approach local targets at the specified
pitch ranges with the specified amounts of effort. The timing control in the articulatory module
synchronizes the local targets with the associated syllables. The resulting f0 thus continually approaches
successive local targets within different pitch ranges at varying speeds (column 4). See Figure 2 for
what the surface local f0 contours generated by this model may look like.
Figure 1.
[Four panels, (a) to (d). Each panel plots four f0 curves labelled H, L, R and F. Vertical axes run from 60 to 160; horizontal axes run from 0 to 85.]
Figure 2.
Figure 3.
[Panel (a): two plots with curves labelled by focus location (None, Word 1, Word 2, Word 3). Panel (b): four plots. Vertical axes run from 60 to 160; horizontal axes run from 0 to 85.]
Figure 4.
[Two panels: (a) "No Narrow Focus" and (b) "Focus". Vertical axes run from 60 to 160; horizontal axes run from 0 to 80.]
Figure 5.
[Two panels, (a) and (b). Horizontal axis: Time (ms), 0 to 800. Vertical axes run from 50 to 250. Curves are labelled H, L, R and F; in (b) the neutral-tone syllables are labelled N.]
Figure 6.
[Eight panels, each plotting f0 against normalized time over the syllables "Ni na may know my niece". Vertical axes run from 0 to 400 in four panels and from 0 to 200 in the other four.]
Figure 7.
[Ten panels, one per sentence. Horizontal axis: Mean time (ms), 0 to 1200. Vertical axes run from 100 to 200. Sentences: "Lee may know my mummy", "Ramona may know my niece", "Lee may know my niece", "Emily may know my niece", "Lamar may know my niece", "Nina may know my niece", "Lee may lure my niece", "Lee may know my nanny", "Lee may minimize my niece", "Lee may mimic my niece".]
Figure 8.
[Two bar charts of post-focus maxf0 (st), by position, adjacency and focus.]

Focus on word 1 (left)
Position  Adjacency  Focus: no  Focus: yes
word2     close      9.1        10.7
word2     far        8.8        7.2
word3     close      7.4        5.9
word3     far        7.2        5.5
word4     close      6.9        4.9
word4     far        6.8        5.0
word5     close      6.7        5.2
word5     far        6.7        5.2

Focus on word 3 (right)
Position  Adjacency  Focus: no  Focus: yes
word4     close      7.3        9.4
word4     far        6.6        5.6
word5     close      6.8        5.8
word5     far        6.6        5.3
Figure 9.
[Line plot. Horizontal axis: Location in post accent syllable (%), 0 to 100. Vertical axis: F0 (st), 4 to 12. Curves: strong_no-focus, weak_no-focus, strong_post-focus, weak_post-focus.]
Figure 10.
[Two bar charts by position (word3, word5) and focus (no, yes).]

No. of peaks (%) (left)
Focus  word3  word5
no     83.4   89.2
yes    71.6   59.7

Peak rise size (st) (right)
Focus  word3  word5
no     0.37   1.21
yes    0.26   0.77
Figure 11.
[Four bar charts by position (word1, word5) and focus (no, yes).]

Maxf0-to-C2 (ms), by accent location (top left)
             word1               word5
             no        yes       no        yes
non-final    -21       11        72        45
word final   24        54        121       106

Maxf0-to-C2 (ms), by accent length (top right)
             word1               word5
             no        yes       no        yes
short        -121      -47       32        17
long         -21       11        72        75

Peak location (%), by accent location (bottom left)
             word1               word5
             no        yes       no        yes
non-final    110       95        67        82
word final   87        79        45        56

Peak location (%), by accent length (bottom right)
             word1               word5
             no        yes       no        yes
short        292       168       80        91
long         110       95        67        69
Figure 12.
[Two bar charts of accent duration (ms) by position (word1, word5) and focus (no, yes).]

Accent duration (ms), by accent location (left)
             word1               word5
             no        yes       no        yes
non-final    182       204       227       252
word final   207       244       241       259

Accent duration (ms), by accent length (right)
             word1               word5
             no        yes       no        yes
short        64        73        177       191
long         182       204       227       256
Figure 13.
maxf0-to-C2 regressed over accent-dur: r2, by focus and accent location
          non-final               word-final
          no focus    on-focus    no focus    on-focus
word 1    0.025       0.018       0.023       0.356
word 5    0.046       0.006       0.455       0.381

maxf0-to-C2 regressed over accent-dur: r2, by focus and vowel length
          short vowel             long vowel
          no focus    on-focus    no focus    on-focus
word 1    0.074       0.161       0.025       0.018
word 5    0.027       0.069       0.046       0.006

maxf0-to-C2 regressed over accent-dur: slope of regression line, by focus and accent location
          non-final               word-final
          no focus    on-focus    no focus    on-focus
word 1    -0.202      0.11        0.11        0.351
word 5    0.431       0.107       0.587       0.364

maxf0-to-C2 regressed over accent-dur: slope of regression line, by focus and vowel length
          short vowel             long vowel
          no focus    on-focus    no focus    on-focus
word 1    -0.1386     0.677       -0.202      0.11
word 5    -0.428      0.3         -0.431      0.107
Figure 14.
maxf0-to-C2 regressed over accent-dur in word 3
          non-final               word-final
          no focus    on-focus    no focus    on-focus
r2        0.005       0.011       0.002       0.477
slope     -0.161      0.108       -0.046      0.413
Figure 15.
post-pitch regressed over rise-size in word 1: r2
          no-focus                post-focus
          stressed    unstressed  stressed    unstressed
50 ms     0.684       0.734       0.473       0.838
100 ms    0.33        0.593       0.206       0.601
150 ms    0.122       0.529       0.039       0.308
200 ms    0.064       0.445       0.013       0.094

post-pitch regressed over rise-size in word 1: slope of regression line
          no-focus                post-focus
          stressed    unstressed  stressed    unstressed
50 ms     0.847       0.987       0.568       1.12
100 ms    0.489       0.856       0.282       0.618
150 ms    0.369       0.259       0.746       0.098
200 ms    0.193       0.558       0.049       0.163

post-pitch regressed over rise-size in word 3: r2
          no-focus                post-focus
          stressed    unstressed  stressed    unstressed
50 ms     0.186       0.622       0.561       0.912
100 ms    0.053       0.223       0.347       0.622
150 ms    0.022       0.059       0.175       0.29
200 ms    0.005       0.018       0.05        0.203

post-pitch regressed over rise-size in word 3: slope of regression line
          no-focus                post-focus
          stressed    unstressed  stressed    unstressed
50 ms     0.557       0.834       0.555       1.067
100 ms    0.295       0.658       0.257       0.936
150 ms    0.215       0.397       0.162       0.446
200 ms    0.12        0.24        0.099       0.345

post-pitch regressed over rise-size in word 5: r2
          no-focus    post-focus
50 ms     0.256       0.342
100 ms    0.055       0.106
150 ms    0.12        0.143
200 ms    0.22        0.001

post-pitch regressed over rise-size in word 5: slope of regression line
          no-focus    post-focus
50 ms     0.556       0.549
100 ms    -0.413      -0.225
150 ms    -0.512      -0.5
200 ms    -0.453      -0.036
Figure 16.
pre-pitch regressed over maxf0 in word 1: r2
          no-focus                pre-focus
          stressed    unstressed  stressed    unstressed
50ms      0.013       0.03        0.044       0.013
100ms     0.038       0.034       0.022       0.081
start     0.015       0.038       0.00900     0.002

pre-pitch regressed over maxf0 in word 1: slope of regression line
          no-focus                pre-focus
          stressed    unstressed  stressed    unstressed
50ms      -0.03       -0.044      -0.243      -0.049
100ms     0.361       0.295       0.052       0.135
start     0.295       0.367       0.061       0.044

pre-pitch regressed over maxf0 in word 3: r2
          no-focus                pre-focus
          stressed    unstressed  stressed    unstressed
50ms      0.0005      0.035       0.124       0.04
100ms     0.042       0.02        0.143       0.032
start     0.348       0.285       0.00002     0.001

pre-pitch regressed over maxf0 in word 3: slope of regression line
          no-focus                pre-focus
          stressed    unstressed  stressed    unstressed
50ms      -0.012      -0.113      -0.046      -0.033
100ms     0.167       0.138       -0.089      -0.057
start     0.952       0.899       -0.002      0.014

pre-pitch regressed over maxf0 in word 5: r2
          no-focus                pre-focus
          stressed    unstressed  stressed    unstressed
50ms      0.003       0.165       0.037       0.00001
100ms     0.03        0.002       0.018       0.087
start     0.086       0.065       0.040       0.005

pre-pitch regressed over maxf0 in word 5: slope of regression line
          no-focus                pre-focus
          stressed    unstressed  stressed    unstressed
50ms      -0.006      -0.048      -0.06       0.003
100ms     -0.088      0.021       -0.028      -0.062
start     0.244       0.186       0.074       -0.023
Figure 17.