Supporting Information Appendix for
Functional Flexibility of Infant Vocalization
and the Emergence of Language
D. Kimbrough Oller, Eugene H. Buder, Heather L. Ramsdell,
Anne S. Warlaumont, Lesya Chorna, Roger Bakeman
*Correspondence to [email protected]
TABLE OF CONTENTS
SUPPORTING BACKGROUND
Categories of vocalizations in humans and other primates
Affect and context in the judgment of function in infant vocalizations
Characteristics of the protophones of interest in the present work
Evidence of the existence of precanonical protophone categories and their systematic production in human infancy
Acoustic analysis studies and early vocal categories
Repetition and its role in recognition of vocal categories
Recurrence quantification analysis to illustrate temporal clumping of vocal categories
Parent-infant interaction in vocal category formation and parent recognition of categories
On the possible role of gesture in the origins of language
SUPPORTING METHODS
Infants and recordings
Selection of data for the present study
Coding software
Utterance location for coding
Coding training and coding procedures for both vocal type and facial affect
Definitions of vocal types and facial affect types in the study
Positive, neutral and negative affect as a proxy for function
Illocutionary force coding
Perlocutionary effect coding
Observer agreement levels for both vocal type and facial affect
SUPPORTING RESULTS
Audio-Video examples (Movies) illustrating the protophones and their variability in facial affect expression
Odds Ratio analyses to illustrate the distinction between protophones and stereotyped species-specific signals
Consistency across infants for the six patterns showing functional flexibility in the protophones but not in the stereotyped species-specific signals
Functional flexibility of vocalization even at the youngest ages of the infants in the study
Observer agreement on six patterns of functional flexibility in protophones
Additional contingency table analyses illustrating individual differences on protophone expression of affect, but not on stereotyped species-specific signal expression of affect
Log-linear analyses: Individual differences in the expression of affect in infant vocalization
The role of affect expression in the functional interpretation of infant protophones
Facial affect and illocutionary force of infant protophones
Facial affect and perlocutionary effects of infant protophones
Robustness of functional flexibility of protophones across contexts
Facial affect codes based on the master coding for the audio-video examples
SUPPORTING REFERENCES
SUPPORTING BACKGROUND
Categories of vocalizations in humans and other primates
There are two general categories of vocalizations in nonhuman primates and apparently
in virtually all mammals: 1) vegetative sounds and 2) stereotyped species-specific calls that have
been termed “fixed signals” by the classical ethologists (1, 2)—see comments below on more
recent interpretations of, and research on, such calls. Vegetative sounds such as coughs, sneezes,
and burps are the product of bodily functions and show little, if any, sign of having been naturally
selected as communicative vehicles, although they can be exploited by listeners as information
about the producer (e.g., a cough can betray the location of prey to a predator). Because of the
lack of obvious selection for communication, vegetative sounds are usually viewed as relatively
unimportant in theories of the evolution and development of language.
Species-specific calls, on the other hand, are assumed to have been selected for their
social communicative value, and they are often a focus in the search for origins of language (3–
5). Examples of species-specific vocalizations in human infants are cry and laughter. For non-
human primates, they include distress cries, contact calls, warning calls and so on, often
characterized as relatively stereotyped sound types, each category with a specific social function
(4). The functions of animal signals in the context of this view have been thought to be universal
within species, present very early in life regardless of hearing status or other experiential factors,
and not fundamentally modifiable in the form of their production. This viewpoint was essentially
expressed by Darwin (6) and has been reiterated in numerous empirical research studies in the
past century—see a current review of the relevant primate literature by Owren and colleagues
(7), but as will be seen below, it appears the view will need to be softened in light of recent
findings.
In accord with this traditional view, primate species-specific calls are often seen as
relatively inflexible in the following sense: The function that each call serves is thought to
remain essentially the same on each occasion of usage and is not modified to take on that of
another call by natural experience or by specific training. Thus, for example, a warning call is
thought always to be a warning call and has not been shown to be transformable into a contact
call. Conditioning can play a role in vocal production of nonhuman primates, but the literature
has reported substantial limits on the kinds of shifts in usage of primate vocalizations that occur
in conditioning studies; for example, a change in the rate of production of a food call may be
conditionable, but this is not the same as conditioning the food call to be used as a warning call
(8). The prevailing view is that subcortical and limbic structures control nonhuman primate
sounds and that such sounds are largely involuntary.
Human infants in the first months of life, like all primates, show both vegetative sounds
and at least two vocal types that are analogous to the species-specific calls of other primates.
These are cry and laughter, both of which have been extensively studied (9–14). However,
human infants also show sounds that are neither vegetative nor stereotyped species-specific calls,
and importantly they are also not speech. We call these sounds protophones, and they occur in all
normal infants (15, 16) at frequencies that appear to substantially exceed those of the stereotyped
species-specific calls (17, 18). In the past these sounds have often been labeled “babbling”, but
some authors (even very recent ones) have limited the term “babbling” to canonical babbling
only (19, 20), the protophone category that includes well-formed syllables as in [baba] or
[mama], for example. Canonical babbling begins typically by 7 or 8 months of life. Precanonical
protophones of the first months of life include categories formed out of vocal exploration
primarily at the laryngeal level—thus the well-timed articulations that are required in canonical
syllables are not involved. Among the most prominent of the precanonical protophones are
vocants, squeals and growls—the literature on vocal development repeatedly notes the existence
of these types (although the terminology sometimes varies) (21–24).
Human speech, in contrast with the species-specific vocalizations produced by primates
generally, is produced with substantial cortical as well as subcortical and limbic involvement and
is supremely voluntary and modifiable. Owren et al. (7, p. 541) summarize:
"… the vocal flexibility and volitional control that is so often sought in primates is
largely absent while being strikingly clear in humans."
This difference may also apply to protophones, given their close relationship to speech.
It is important to emphasize, however, that while no one seems to doubt that humans
have greater vocal flexibility than non-human primates, the degree and developmental course of
flexibility of vocalization in non-human primates remains an open question. Cheney and
Seyfarth have summarized research on nonhuman primate calls (3), with two key points:
1. Even production of calls is flexible to some extent, since, for example, alarm calls are
not necessarily obligatory, but depend on social context, and a variety of changes
occur in calls across development.
2. Comprehension of calls is much more subject to modification by experience than
production of calls.
In light of recent research, these points should be supplemented by the observation that
some sounds in non-human primates may be more flexible in usage than those that appear to
have provided the primary basis for the assumed limited flexibility, as proposed in classical
ethology. Loud calls are easy to hear or record at the sorts of distances from which many
observations are made in primate research (alarm calls fit this mold), but sounds that tend to
occur at lower amplitudes (sometimes referred to as close-calls) are harder to observe and may
indeed be more flexible in usage. Notable attention has been paid to lipsmacking (25, 26),
“girneys” (27), and other calls that are thought to be potentially more flexible in terms of social
usage (28, 29). Lipsmacking may be of particular interest, because although it involves no
phonation, it does include quiet sounds that are often produced during grooming, a circumstance
that has been hypothesized to have been a possible key setting for natural selection of language-
like behavior (30, 31). It is clear that more research is needed on low-intensity vocalizations in
non-human primates because they appear to occur with little or no obvious affective biasing and
with low arousal. Consequently they may be particularly good candidates for evaluation as
possible homologues with human infant protophones. In chimpanzees it appears that grunts may
be particularly good candidates for such exploration—recent research has already shown
interesting changes in utilization across contexts and ages of individuals (32).
Affect and context in the judgment of function in infant vocalizations
To put our discussion of functional flexibility in perspective, let us consider how
functions of vocalization are assessed in research on both human infants and related species. A
standard approach for attempting to interpret functions of communications in non-humans is to
evaluate the physical and social “context” in which the communications occur. For example,
since different alarm calls occur in some species when a particular predator is spotted—vervet
monkeys providing the paradigm example (33, 34)—it makes sense to treat the context (in this
case the specific kind of impending possible predation) as strongly suggestive of function. The
call produced in response to spotting the predator can be said to have the immediate function of
“alarm”, “warning”, or “danger” signaling. This sort of function is the “illocutionary force”—in
accord with our extended interpretation of Austin’s (35) term—and it denotes the social act
produced by the signaler in the act of signaling. Our group has proposed to treat signals that are
inherently communicative—i.e., have been naturally selected to be signals—as having
illocutionary force whether the signals are produced intentionally or not (142,143), even though
Austin may have restricted his usage of the term illocution to intentional social acts. Austin’s
work focused on mature linguistic communication and to our knowledge never addressed a
possible application of the speech act distinctions to human infancy or animal communication.
Our extension of the Austinian term “illocution” to include naturally selected signals that may be
produced with little or no explicit social intention is important in our view because it facilitates
the explication of the origins of communication in human infancy and in animal communication.
The naturally selected warning call (where the illocutionary force in accord with our
usage is “warning”) in vervet monkeys produces an increased probability that after hearing the
warning, an individual receiver or the group will execute appropriate escape behaviors or at least
look for the source of possible danger. This sort of function is the “perlocutionary force” (again
Austin’s coinage), encompassing the effects that occur as a result of listeners having interpreted
or reacted to the signal (and presumably its illocutionary force). If a species has different types of
alarm calls that occur in response to different types of predators (as has been reported for vervet
monkeys and a variety of other species), it makes sense also to interpret the calls in terms of a
referential function, although this referentiality may be a perlocutionary effect executed by
listeners rather than being an intentional illocutionary act of the producer.
Our interpretation emphasizes the primary role of the sender in illocutionary force,
although of course the receiver in successful linguistic communication also interprets the
illocutionary force of the sender. Even in infant vocal communication, parent receivers and
laboratory coders interpret illocutionary force (e.g., they may say “he’s complaining” in response
to infant fussy sounds or cry). Recognizing the additional role of the receiver in illocutionary
force is important because there are clear cases where illocutionary force of the sender is
misinterpreted. For example, Person 1: “I thought you were making a request to turn up the heat
when you said it was cool in our house”. Person 2: “No, I like it cool. I was complimenting you
on your choice of temperature”. Similarly perlocutionary effect is determined by the reaction of
the receiver, and we thus view perlocution as being primarily in the receiver’s domain, although
the sender may intend by speaking to produce a particular perlocutionary effect or class of
effects. In spite of the sender’s purpose, the actual perlocutionary effect may differ from that
intended. For example, Person 1: “I command you to turn the heat down.” Person 2: “And I
refuse to do so because your imperious attitude insults me”.
It appears that many signals of prelinguistic human infants and animals are naturally
selected to transmit particular illocutionary forces (or classes of them), and that selection is
dependent upon perlocutionary outcomes that make those illocutions, on balance, advantageous
for both sender and receiver, in accord with the reasoning of Maynard Smith and Harper (citation
20 in the Main text). So for the naturally selected communications of human infancy and
presumably for animal communication, matching between the sender’s intended perlocutionary
effect and the one that actually occurs in the receiver must, we reason, occur more frequently
than mismatching.
While Austin’s terms have not usually been invoked in the context of literature on animal
communication, the importance of creating a bridge to literature in human infancy motivates a
recognition of the special role for the sender in illocution and for the receiver in perlocution.
Returning then to vervet alarm calls, the understanding of how they work has been emphasized
as involving an important distinction between roles of sender and receiver(s). Given that vervet
receivers may interpretively derive referentiality to specific predators even though that reference
is not directly intended by the sender, this kind of communication has been said to involve
“functional referentiality” (36), to suggest a distinction from the more abstract and flexible
referentiality that usually occurs in human language. In the typical case of mature human
language, referentiality is not just “functional”, but is instead clearly intended by the sender. We
invoke Austin’s term “perlocutionary effect” in this context, then, to help provide the desirable
bridge to human infancy literature where the term is used to emphasize the key role of the
receiver in interpreting actions in ways that may suggest reference, even though the sender
(human infant or monkey) may have intended no reference.
Regardless of how we view the source of referentiality, context provides useful
information about function of alarm calls, because according to published reports, context can be
uniquely associated with a sequence of predator spotting, alarm calling, and escape behaviors by
listeners. The alarm call and its sequelae can therefore be interpreted to form a “triadic” event
involving 1) signaler(s), 2) receiver(s), and 3) predator, with the predator (which supplies the
key element of external “context”) being the direct or indirect focus of the signaler(s) and
receiver(s) in the course of the event.
But external context often does not provide such clear determination of function. This
appears to be especially true when communication is not triadic, as is the case in much social
communication. In the case of the human infant in the first half year of life, there is no sign of
the sort of triadic communication suggested by vervet alarm calls. Human infants have no alarm
calls, and very early in development there are none of the signs of joint attention that occur later
in human development—no pointing, no sharing of gaze jointly in a triadic fashion, and no clear
way that vocalizations or gestures designate objects or entities outside the self (37–39). Instead,
in the first half year, communication is typically monadic, e.g., the infant cries, with no obvious
communicative intent (although we can, in accord with our extended usage of Austin’s term,
refer to the cry as having an illocutionary force of “distress expression” or “plea”), but the parent
responds, yielding a potentially beneficial perlocutionary effect; or communication is dyadic,
e.g., the parent and infant interact vocally, sharing affect and taking turns, with the parent
optionally responding to needs of the infant if such needs are perceived in the dyadic interaction,
but the exchange remains purely social. In these cases infants signal only about
themselves and/or the interactor, but not about external entities or events, and any response of
caregivers appears to be based on their interpretation of infant state and the conditions of the
environment. The key point is that the vocalizations of the infant in these cases provide little if
any direct information about the environment.
In such monadic or dyadic communicative events, the illocutionary force of the infant is
usually not determinable by external context, since by definition the illocutionary force of the
signal pertains to internal states and/or intentions. A hungry infant may cry or produce
protophone sounds with negative facial affect, but the same is true of an infant with a belly ache,
an infant that needs sleep, or an infant who has been stuck with a hypodermic needle. Thus
external context can only be a partial indicator of illocutionary function. Similarly, a happy
infant may produce positively valenced protophones, but external context often does not reveal
or determine the illocutionary function of the sounds (although tickling provides a case where
illocution may be more easily inferred from context). Often it appears to be the perlocutionary
effect revealed in the response of the caregiver (resulting actions or states of mind of the
caregiver revealed by things the caregiver says based on the infant signal) that provides the best
indication of how the function of young infant vocal signals should be interpreted. While we
have not observed it to be so, we grant that it is possible that individual infants may have
propensities to produce certain types of protophones or certain protophones with particular types
of facial affect more commonly in some situations than in others. For example, one can imagine
an infant having a particular tendency to produce positive squeals as part of a vocal routine at the
changing table. However, even if we were to find such a pattern, it still would not be easily
determined what aspect of the context might be producing the pattern, as the candidate aspects
are large in number (is it that the infant is being made more comfortable, or that the infant likes
the proximity of the person doing the changing, or is it something about the lights on the ceiling,
etc.?). Our approach has been to use affect as a key determiner of function, in part because it can
be observed directly, and in part because infant affect is known to play a key role in parental
caregiving and thus in the functional outcomes of infant signaling (40–44).
It stands to reason that the same sorts of difficulties in determining illocutionary
functions of signals apply in the case of non-human primate infants. It also appears that with
more mature human and non-human primates (as opposed to infants), the difficulties of
determining illocutionary functions based on external context may become even more severe
given the apparently increasing variety of contexts within which vocalizations tend to occur, as
indicated in recent studies in chimpanzees and bonobos (32, 45–49).
Perlocutionary effects can also be difficult to determine whenever the external context is
complex in and around the time and place the signals are produced. It is worthy of note that the
basis for interpretation of functions in human and non-human cases is importantly different
because mature humans (caregivers or adult observers) can be used as informants about their
internal states, i.e., about how and why they produce signals and about how and why they react
as they do to the signals of others, and parents often provide such information spontaneously
during interaction with their infants (“I think he’s sleepy”, “what a happy sound!”, etc.). Thus the
study of signal functions can be assessed with additional tools in the human case, to supplement
the analysis of external context, and in many cases the spontaneous vocalizations of parents
during interaction constitute a key element to help interpret the external context. Thus, in the
case of the human infant, we reason that caregivers can and should be used as important
informants.
We also reason that infant vocalizations (in both humans and non-human primates) must
have evolved and developed to be interpretable in terms of adaptive functions, and consequently
caregivers must have (evolved to have) the capability to recognize infant vocalizations
functionally. In particular, illocutionary forces of very young human infant vocalizations can at
least be classified broadly in ways that correspond with judgments of affect that can be elicited
from mature human observers (potential caregivers).
We have proposed, consistent with this line of reasoning, that (at least in very young
human infants) facial affect should provide relatively stable evidence from which to determine
mutually exclusive classes of possible illocutionary functions and perlocutionary effects that can
be judged reliably by mature observers. If a particular vocal type can be utilized with the full
range of affect from positive through neutral to negative, we contend there must be
accompanying variation in function. Vocalizations occurring with positive affect correspond to
one class of possible functions and those occurring with negative affect correspond to a different
class of possible functions.
Consider an example. We have observed that adult caregivers respond to a squeal given
with negative facial affect in a way similar to their response to cry, by seeking to understand if
something is wrong with the infant (is the infant uncomfortable, hungry…?) and often by taking
action to correct the problem. We might say the function (illocutionary force sensu Austin) of the
negative squeal is “complaint” or “expression of distress” (and our coders of illocutionary force
easily adapt to coding in those terms), and one real world effect (perlocutionary effect sensu
Austin) is an increased probability that the discomfort will be alleviated through actions taken by
caregivers. In contrast, we have observed that adult caregivers do not respond to a squeal
produced with positive facial affect as a complaint, but rather they respond in a way that is
similar to how they respond to infant laughter, treating the vocalization as a social act expressing
joy and/or affiliation, and encouraging further positive interaction. The positive squeal might
thus be characterized as an “expression of joy or fun”, an “exultation” or an “encouragement for
positive interaction and bonding” (illocutionary force), and the effect of the positive squeal
(perlocutionary effect) might be characterized as an increased probability of continued positive
interaction and social support from caregivers.
The key point here is that while we may not be sure how to uniquely portray the
functions (either illocutionary or perlocutionary forces) involved in these acts, we can be sure
that the functions are not the same for the positive and negative affective versions of the same
sounds. Thus, the determination that there exists any category of infant vocalizations that can be
utilized with positive, neutral, and negative facial affect illustrates that human infants possess
functional flexibility of vocalization very early in life, and given that this kind of flexibility is a
foundational requirement for language, it provides a quantifiable very early indicator that should
be possible to compare usefully across humans and other species where vocal type and affect can
be judged. The recent development of a facial affect coding scheme for chimpanzee modeled
after the Ekman method for human facial affect judgment suggests that this possibility is within
reach (50, 51).
The preceding argument is not, however, intended to imply that external context should
be abandoned as a possible source of information for interpretation of human infant
vocalizations. On the contrary, we recently have evaluated several empirically observable factors
concurrently with vocalization and facial affect. These include gaze direction of the infant and a
variety of contextual circumstances (interaction on a high chair, interaction on the changing
table, etc.) including one where the infant is not interacting at all (separated). Furthermore, in
response to a review we have evaluated perlocutionary effects in the form of responses of parents
to infant vocalizations with varying facial affect. Results of analyses using these “contexts” are
reported under Supporting Results: The role of affect expression in the functional
interpretation of infant protophones, and Supporting Results: Robustness of functional
flexibility of protophones across contexts.
Characteristics of the protophones of interest in the present work
In this paragraph we summarize definitions that have previously been provided for
vocants, squeals and growls (52), the protophones considered in the present research. Vocants
are vowel-like sounds produced with normal phonation (sometimes called “voicing” or “vocal
fold vibration”), the type of phonation that is used typically in speech. In vocants the
fundamental frequency or pitch of the voice is within the typical range for the speaker (or infant
producer of the sound). Squeals are high-pitched sounds, produced above the typical range for
the speaker, often in falsetto register. Growls are sounds produced with harsh voice, typically
perceived as low in pitch for the speaker. Further definition of these protophones and how they
are coded can be found below in Supporting Methods: Coding training and coding procedures for
both vocal type and facial affect. For examples of these protophones go to Supporting Results:
Audio-video examples (Movies).
In addition to protophones, infants have stereotyped species-specific signals. These
(when they are produced reflexively) have similar properties to those of stereotyped species-
specific signals in other primates. Cry and laughter are the most prominent members of this
class, and they provide a solid reference point in human infants against which to compare the
protophones and how they function.
The study of functions of protophones has yielded somewhat chaotic conclusions (17, 53)
in part because protophones function so differently from human species-specific calls, and
perhaps also from non-human primate calls. There exists a temptation to seek consistent
“meanings” or illocutionary forces (35) for each protophone category (perhaps because the
stereotyped species-specific signals seem to have consistent functions, and because words have
relatively stable “meanings”), yet it is the freedom of protophones to assume differing forces and
functions that, in our view, highlights the significance of these sounds as precursors to and
foundations for the speech capacity, a capacity that requires all words and sentences to have such
freedom (52).
It might seem surprising at first blush that squeals and growls can be affectively variable
given the common impression that we squeal with delight and growl in anger. In the early years
of our research in vocal development, we registered some surprise on this point as we were
beginning to notice that early in life, affect variability is common for squeals and growls. Is it
possible that squealing and growling are more affectively biased in adults than in infants?
Perhaps so, but no one to our knowledge has quantified the degree of the bias (these kinds of
sounds do not occur terribly often in adults), and it is obvious that the biases can be violated by
any adult (or older child) who chooses to do so. For example, a growl can clearly be produced in
the midst of gustatory or other pleasures, and shrieking squeals (with very negative facial affect)
sometimes occur in fear or when expressing horror or repugnance.
Evidence of the existence of precanonical protophone categories and their systematic
production in human infancy
Acoustic analysis studies and early vocal categories. Buder and colleagues have
addressed acoustic analysis of infant vocalizations (54–59), characterization of phonatory signals
(60, 61), and a variety of additional topics on acoustics related to infant vocal categories such as
vocant, growl, and squeal (62, 63).
Supporting Figure 1: Acoustic results of fundamental frequency analysis for infant utterances from the sample judged to be squeals (yellow), vocants (blue), or growls (violet). See text above for explanation. The three dimensions displayed are the log of the mean, the log of the standard deviation, and the log of the highest fundamental frequency for each utterance.
Supporting Figure 1 provides sample data from a preliminary classification of vocant,
growl, and squeal using polynomial logistic regression (64) to evaluate candidate F0 statistics
(mean, highest, and SD) from 1352 auditorily coded utterances of 3 infants (at 3, 7, and 11
months, from 2 recording sessions of 20 min at each age for each infant). The figure above
illustrates longitudinal data from one of these infants. The three F0 variables (mean, highest, and
SD) yielded higher than 70% correct classification. The figure displays the data in a 3D log F0-
measure space (vocants plotted in blue with circles, growls in violet with + signs, and squeals in
yellow with triangles), with projections of group sample ellipses encompassing 0.68 probability
of data inclusion on the 2D facets. While each measure contributed to the model, mean F0 was
paramount, and the growl/squeal distinction was strongest in this space.
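The kind of three-class classification from F0 summary statistics described above can be sketched as follows. The data here are synthetic (the class F0 distributions are invented for illustration, not drawn from the authors' recordings), and a plain softmax regression fit by gradient descent stands in for the cited analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: three protophone classes separated mainly by log mean F0,
# mimicking the three features used above (log mean, log SD, log highest F0).
# These distributions are invented for illustration only.
def make_class(n, f0):
    mean = np.log(f0) + 0.1 * rng.standard_normal(n)
    sd = np.log(0.15 * f0) + 0.2 * rng.standard_normal(n)
    high = mean + 0.3 + 0.1 * rng.standard_normal(n)
    return np.column_stack([mean, sd, high])

X = np.vstack([make_class(200, 300),    # vocants: typical pitch range
               make_class(200, 150),    # growls: low-pitched/harsh
               make_class(200, 800)])   # squeals: high-pitched
y = np.repeat([0, 1, 2], 200)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize for stable gradient descent

def fit_softmax(X, y, classes=3, steps=2000, lr=0.5):
    """Multinomial (softmax) logistic regression fit by batch gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    W = np.zeros((Xb.shape[1], classes))
    Y = np.eye(classes)[y]                       # one-hot targets
    for _ in range(steps):
        Z = Xb @ W
        Z -= Z.max(axis=1, keepdims=True)        # numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * Xb.T @ (P - Y) / len(Xb)       # cross-entropy gradient step
    return W

def predict(W, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return (Xb @ W).argmax(axis=1)

W = fit_softmax(X, y)
accuracy = (predict(W, X) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

With well-separated synthetic classes the fit is nearly perfect; on real infant utterances, where categories overlap acoustically, accuracy in the 70% range (as reported above) is a more realistic outcome.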
Log-linear modeling (65) also determined that variables for child, age, and session are
needed to fit the observed frequencies of the categories. The significant session variable
illustrates temporal clumping (see below): systematic change in likelihood that particular sound
types occur across recording sessions even after other factors have been controlled, arguably
providing further evidence of active vocal exploration and suggesting protophone category
formation by the infant.
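A minimal stand-in for the session effect described above is a test of independence between session and vocal type in a two-way table. The counts below are invented for illustration, and a Pearson chi-square test is a simplification of the full log-linear models (which also include child and age terms):

```python
import numpy as np

# Invented counts of three vocal types across two recording sessions; a
# session-by-type association of this kind is what "temporal clumping" implies.
counts = np.array([[30, 10, 10],   # session 1: squeal, vocant, growl
                   [10, 30, 15]])  # session 2

def pearson_chi2(obs):
    """Pearson chi-square statistic for a two-way table under independence."""
    expected = obs.sum(1, keepdims=True) @ obs.sum(0, keepdims=True) / obs.sum()
    return ((obs - expected) ** 2 / expected).sum()

stat = pearson_chi2(counts)
df = (counts.shape[0] - 1) * (counts.shape[1] - 1)   # = 2
print(f"chi-square = {stat:.1f} on {df} df (critical value at .05 is 5.99)")
```

A statistic exceeding the critical value indicates that vocal-type frequencies differ reliably by session, the signature the log-linear analysis picks up.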
Complementary work from our group (66) has used automated methods (neural
network classifiers) to both classify and visualize infant vocalizations as squeals, vocants and
growls. When the base rate of occurrence of the three vocalization types was normalized, leave-one-
out cross-validation classification of these protophones was 55% correct, which was significantly
higher than the chance rate of 33.3%. Visualizations supported the idea that squeals tend to be
higher pitch and/or produced with greater harshness than vocants. It seems likely that in the
future automated classification will provide a key objective supplement to human coding.
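The evaluation scheme just described can be sketched in a few lines. This is not the neural-network system of (66): a k-nearest-neighbors classifier stands in for it, the two acoustic features are hypothetical, and equal class sizes play the role of base-rate normalization (so chance is 1/3).

```python
# Sketch of leave-one-out cross-validation for a three-way protophone
# classifier. Equal class sizes substitute for base-rate normalization,
# so chance accuracy is 1/3. Features and classifier are placeholders.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
n_per_class = 30
# Two hypothetical acoustic features (e.g., log mean F0, harshness index).
X = np.vstack([
    rng.normal([5.2, 0.2], 0.3, (n_per_class, 2)),  # vocant
    rng.normal([5.0, 0.8], 0.3, (n_per_class, 2)),  # growl
    rng.normal([6.2, 0.5], 0.3, (n_per_class, 2)),  # squeal
])
y = np.repeat(["vocant", "growl", "squeal"], n_per_class)

# Each fold trains on all utterances but one and tests on the held-out one.
scores = cross_val_score(KNeighborsClassifier(5), X, y, cv=LeaveOneOut())
acc = scores.mean()
print(round(acc, 2))
```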
Repetition and its role in recognition of vocal categories. Research on infant vocal
categories has often focused on auditory/acoustic characteristics (62, 67), collapsed to means and
SDs. This approach obliterates sequential information, though there are clear sequential
dependencies in infant sounds that provide evidence of systematic production of sounds such as
squeals, vocants and growls even in the first year of life. To demonstrate repetitiveness in usage
of squeals, vocants and growls, 50 infant utterance sequences were extracted from each of 28
20-min recording sessions (1400 utterances in all) from 12 infants in the first year. Converting
time-based data to simple events, we used Lag Sequential Analysis (68, 69) to examine lag 1
within-bout sequences to assess category repetitiveness.
The contingency table in Supporting Figure 2 below presents sample data from a
recording session with one 5-month-old. Rows are antecedent events, columns consequent
events, and cells contain two lag 1 statistics: raw frequencies and adjusted residuals, which are
differences between observed and expected frequencies adjusted for base occurrence rates to an
SD scale (z-scores of tendencies for vocal types to follow one another). Repetitiveness is
indicated by larger positive residual values on the diagonal and smaller or negative residuals off-
diagonal. Accumulated across the tables in this study, the diagonal residuals demonstrated
category repetitiveness.
Supporting Figure 2: Lag sequential analysis example from one infant, showing notable tendencies for repetition of particular vocal types. See text for explanation.
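The adjusted residuals in Supporting Figure 2 can be reproduced from its observed frequencies with the standard adjusted-residual formula (observed minus expected, scaled by the estimated standard deviation given the row and column margins). The observed counts below are taken directly from the figure's table.

```python
# Computing lag-1 adjusted residuals for the observed transition
# frequencies reported in Supporting Figure 2 (rows = antecedent type,
# columns = consequent type: V, GR, SQ).
import numpy as np

obs = np.array([
    [18, 5,  7],   # V  -> V, GR, SQ
    [ 5, 2,  2],   # GR -> V, GR, SQ
    [ 6, 1, 12],   # SQ -> V, GR, SQ
], dtype=float)

n = obs.sum()
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
expected = row @ col / n
# Adjusted residual: (obs - exp) / sqrt(exp * (1 - row_p) * (1 - col_p))
adj = (obs - expected) / np.sqrt(expected * (1 - row / n) * (1 - col / n))
print(np.round(adj, 2))
```

Rounded to two decimals, the result matches the residuals in the figure, e.g., 1.58 for V followed by V and 2.98 for SQ followed by SQ on the diagonal, and –2.11 for V followed by SQ off the diagonal.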
The data for Supporting Figure 2 (rows = antecedent vocal type, columns = consequent vocal type; each cell shows raw frequency/adjusted residual):

        V        GR       SQ        Totals
V       18/1.58  5/0.66   7/–2.11   30
GR      5/0.36   2/0.80   2/–0.95   9
SQ      6/–1.96  1/–1.31  12/2.98   19
Totals  29       8        21

Temporal clumping is the tendency for the relative frequency of any vocal category to differ
from one recording session to another. Data from longitudinal studies (70, 71) in the first year
have revealed many examples of temporal clumping, with huge session-to-session variation in
frequency of occurrence of squeal, vocant, and growl, as well as variation in frequency of occurrence
of canonical babbling. Even acoustically measured parameters such as final and non-final
syllable durations show strong recording-session effects, indicating temporal clumping (72).
Infant repetitiveness and temporal clumping in vocal play appear to form a basis for
parental recognition of infant systematic production and control over vocal categories such as
squeal, growl and vocant (52). It can be argued that repetitiveness of particular sound types in
"bouts" shows parents that infants vocalize non-randomly and that, consequently,
negotiation is possible over functional roles for infant sounds. The fact that sequential
dependencies of infant categories often occur during apparent vocal play (as when the infant
appears to be vocalizing to herself or himself rather than in interaction with a caregiver) provides
evidence that infant exploration is endogenously motivated and contextually flexible, both of
which are implications of the very idea of play.
Recurrence quantification analysis to indicate temporal clumping of vocal categories.
Recent methodological developments have provided additional possibilities for statistically
evaluating repetitiveness and temporal clumping. Recurrence quantification analysis (RQA) (73)
addresses the tendency of events (e.g., vocalizations) to occur and re-occur at intervals.
Supporting Figure 3 provides an example of the advantages of RQA. While Lag Sequential
Analysis is well suited to local temporal analyses at a small number of lags, RQA treats
recurrence patterns across all lags simultaneously, with visualization of both repetitiveness and
temporal clumping at varying time scales. RQA begins with a recurrence plot, as in Supporting
Figure 3 (supplied by our collaborator, Rick Dale, of the University of California, Merced), where the x and y
axes each represent 500 sec of infant vocalization. Points reflect coordinates (x, y) at which the
infant produced the same vocal category, color coded for category. Points on the diagonal
represent the occurrence of vocants, growls and squeals, providing lag 0 reference points, and all
other points represent instances of recurrence at varying lags with respect to those lag 0 points,
displayed symmetrically off the main diagonal. The plot shows that most vocalizations in this
session were vocants, but late in the session, there were separate clumps (indicated by the red
and blue square-like structures at the upper right) of growls and squeals, interleaved with
vocants. This visualization is an example of the strong temporal clustering that we see routinely
in infant protophones in the first year of life, a further indication of systematic infant production
of the categories.
Supporting Figure 3: The figure represents a recurrence plot revealing a pattern of temporal clumping of vocal types in time by a single infant in one recording. The blue block at the upper right shows squeals clumping late in the session.
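The construction of a categorical recurrence plot like the one in Supporting Figure 3 is simple: cell (i, j) is marked whenever the vocal category of utterance i matches that of utterance j. The short sequence below is fabricated for illustration; same-category runs late in the session produce the square blocks described above.

```python
# Sketch of categorical recurrence-plot construction: mark (i, j)
# whenever utterances i and j share a vocal category. Clumps of one
# category appear as solid square blocks, as in Supporting Figure 3.
# The sequence is fabricated for illustration.
import numpy as np

seq = np.array(list("VVVVGVVVVSGGGGSSSS"))       # V=vocant, G=growl, S=squeal
R = (seq[:, None] == seq[None, :]).astype(int)   # recurrence matrix

assert np.all(np.diag(R) == 1)   # every event recurs with itself (lag 0)
assert np.array_equal(R, R.T)    # recurrence is symmetric about the diagonal
# The squeal clump at the end forms a solid 4x4 block:
print(R[-4:, -4:].sum())
```

Plotting `R` as an image (color coded by category) yields the kind of display shown in the figure; RQA statistics such as recurrence rate and determinism are then computed from this matrix.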
Parent-infant interaction in vocal category formation and parent recognition of
categories. Interviews with parents reveal awareness of infant vocal categories, and interaction
data confirm that parents attempt to elicit and assign communicative functions (especially
affective states) to them (74); that is, they tend to interpret infant sounds as expressions of state or
communicative intent. Parents in many cultures actively engage infants vocally (75–77) in
patterns suggesting the dyad is a dynamical system shaping the acoustic content of infant vocal
categories and their functions (78). Since vocal category formation and flexible usage are
required for speech, caregiver-infant interaction may play a key formative role in the speech
capacity (79–82), with the three salient categories of vocalization that are investigated here
serving as anchors for much of that interaction.
Much investigation has addressed dyadic interaction during precanonical stages (83–89),
yet the work has usually not taken account of the very categories of infant vocalization that
appear to be the focus of communication, e.g., squeals, growls, and vocants. Much of the work
distinguishes only between distress sounds (a category that collapses cry or cry-like sounds with
negatively charged protophones) and non-distress sounds (a category that collapses laughter or
laughter-like sounds with positively charged protophones). This approach also often groups
affectively neutral sounds (which constitute the great bulk of the protophones and the great bulk
of all sounds produced in infancy) along with happy sounds into a single category called
“positive”, and thus provides no basis for recognition of parental focus on the protophone
categories as distinct from laughter. The traditional research has focused on turn-taking (90),
rhythms of interaction deemed predictive of language (88), or disturbance of interaction (86, 91,
92), usually with no attention to the infant vocal categories that are primary anchors of adult
vocal communication with infants.
Additional research, with more direct focus on protophones, has examined parents’
selective responsivity to more speech-like sounds, as well as infant tendencies to produce
speech-like sounds based on parent vocal actions that encourage such sounds (80, 93, 94). From
our own laboratory, a case study was conducted using Cross Recurrence Quantification Analysis,
a variation of RQA, along with rhythmic spectral analysis. The research identified events of
infant-caregiver F0 and amplitude matching during bidirectional dyadic interaction (as opposed
to events when either infants or mothers were not responding to the other's directed
communications), further supporting the notion that F0-related protophones, such as squeals,
growls, and vocants, shape and/or are shaped by parent-infant interaction (95).
On the possible role of gesture in the origins of language
Given our focus on vocalization, a comment is in order regarding the widespread opinion
that the origin of human language is based significantly upon the early evolution of gestural as
opposed to vocal capabilities (96–99). This idea receives support in part from the fact that apes
learn sign language from humans much more easily than spoken language (100), from research
suggesting that (even without training) the gestural communication of apes appears to be
complex (101), and from research suggesting that a key factor in very early human development
of language is pointing and the joint attention that it undergirds (99).
We take all these points seriously, in addition to considering the possibility suggested by
a reviewer of this work that even in the realm of affective communication, apes may have more
flexibility in the gestural than the vocal modality. Yet our view emphasizes that both gesture and
vocal communication are involved in ape communication, and more important for our eclectic
view, both gesture and vocal communication emerge in early human development. Especially
notable for us is the apparent (but not very well-documented) tendency for human
communication in the first half year to be predominantly vocal and facial, rather than gestural.
Pointing, which we agree plays a very important role in establishing foundations for symbolic
(and inherently triadic) communication, does not emerge in human infants until the second half
year, and its appearance may be in part dependent upon prior development of extensive dyadic
communication in early vocal and facial interaction (102). Notably also, pointing is not itself
symbolism—it does not, in the absence of additional gesture, represent ideas or concepts, but
instead merely draws attention to entities in the here and now. Spoken words in human languages
play a primary role in genuine symbolism (which requires abstract representation), and although
it is not clear how early this predominance of vocal symbolism is established in development, it
is clear that only in cases of deafness or other disorders of communication does signed
symbolism play a predominant role in human communication.
Thus, as has been noted by many skeptics of the gestural origins idea, it is necessary
under any hypothesis of a role for gesture in the early evolution of language-like capabilities to
explain how and why the vocal modality eventually became predominant. We remain convinced
that one issue that should not be ignored in these speculations is the order of events in modern
human development. And here, it appears that the earliest phases of development do not support
a primarily gestural origin.
Still, we favor a view that emphasizes a role for various modalities in communication
throughout human life (103). Gesture and vocalization together are the rule rather than the
exception in human communication (104), and the face also plays a critical role. We think it is
preferable to emphasize multimodality not only in theories of current human communication but
also in speculations about likely evolutionary scenarios. The present paper focuses upon vocal
and facial communication (in part because our research has been keyed on the first months of
life), but we advocate developmental descriptions that incorporate information across additional
modalities. As coding technologies continue to improve, this should become increasingly
possible.
SUPPORTING METHODS
Infants and recordings
Parents of infants 2–3 months of age were recruited through word of mouth and child-
birth education classes. A consent form and questionnaire were provided to interested
individuals. Families returning the questionnaire and meeting inclusion criteria were contacted
for an interview. All procedures were approved by The University of Memphis Institutional
Review Board for the Protection of Human Subjects. Nine parents and their infants participated
in our longitudinal study based on this recruitment (Supporting Table 1).
Recordings in our laboratory occurred regularly throughout the first year of the infant’s
life. The laboratory consisted of two rooms: a recording/play room, and a control/equipment
room. The recording room was equipped with furniture and toys as in a child’s play room. Four
digital cameras, remote controlled, were mounted in 4 corners of the recording room. Two
cameras were chosen for recording at any point in time by switches in the control room. The
microphone-in-vest designed by Buder and Stoel-Gammon (105, 106) for prior research provides
a constant mouth-to-microphone distance of about 7 cm for wireless transmission of the child
voice. A similar microphone was worn during recordings by the caregiver at the lapel,
transmitting to the control room on a separate channel. Two channels of video/audio signals were
split and fed to 3 computer systems in the control room allowing AVI and compressed storage in
high fidelity audio (digitization at 48 kHz except for some of the earliest recordings which were
digitized at 44.1 kHz). The audio signals acquired of the infant voice in this circumstance were
of very high quality, presumably primarily due to the small microphone to mouth distance. The
availability of a separate synchronized channel of audio based on the parent microphone made
the differentiation of child and adult voices workable from spectrographic displays even in cases
of overlap. The assistant in the control room monitored video and audio, and assisted the parent
as necessary.
Infant       3-5 months            6-7 months            10-12 months            Total Utts.
             Age (mo.)   Utts.     Age (mo.)   Utts.     Age (mo.)     Utts.

1            3.4         334       6.5         227       10.0          242       803
2            3.1, 5.6    304, 330  -           -         10.4          180       814
3            5.4         563       7.5         299       11.3          272       1134
4            3.8         284       6.7         99        10.4          147       530
5            -           -         7.4         199       10.3, 12.9    330, 208  737
6            3.3         289       7.3         250       12.8          178       717
7            4.7         370       6.4         223       10.8          274       867
8            3.4         406       7.4         182       11.6          148       736
9            4.2         282       6.7         254       11.8          121       654

MEAN/sess.   4.1         350.9     7.0         216.6     11.2          210.1
Total Utts.              3162                  1733                    2100      6995
Supporting Table 1: Data from 9 infants were utilized. Each was recorded at an early, a middle, and a late age in the first year. In accord with our age-range criteria, the first and second recordings of participant 2 were assigned to the first age group, and participant 5’s first recording was assigned to the second age group, while both the second and third recordings were assigned to the third age group.
For calibration, a small tone generator affixed to a foot-long rod was placed with its
speaker directly adjacent to the infant’s mouth at each recording. A sound pressure level meter
affixed to the other end allowed calibration at each session for amplitude (62) based on a
protocol for adult recordings developed by Winholtz & Titze (107).
Each dyad’s recording day usually yielded 60 min of recording, from three 20-min
sessions. In one type of session parents were present and were instructed to interact with their
infants vocally and otherwise in a normal fashion, as at home. In another type of session, the
parents and an experimenter were both present and conversed regarding a questionnaire during
most of the session, while the infant was playing in the same room, often interacting with the
parent, but the goal was to allow the infant to vocalize independently. The third session type was
intended to consist of infants alone. In practice the outcome was mixed. Sometimes infants were
alone in the recording room part of the time, but most of the time they protested the intended
alone condition, and we ended up having the parent or an experimenter interact with the infant to
pacify him/her.
Selection of data for the present study
This longitudinal effort produced many recordings from which a selection was made
based on developmental level of the infants and availability of “good” recordings, i.e., ones
where infants produced a typical amount of vocalization according to the parents and where no
untoward technical recording events substantially limited the sound or video quality on either
channel. The selection included both interactive and questionnaire sessions with typical amounts
of vocalization at each of the three ages.
For the present analysis, utterances were selected if they were relatively salient, not too
low in intensity for reliable judgment and not too short in duration. This decision was based on
the theoretical assumption that utterances that are not salient presumably have little influence in
the interaction between infant and caregiver, nor do they seem likely to be noticed in parent
judgments of infant state or fitness. A second reason was that tests of coder reliability showed
clearly that agreement across coders was sharply reduced for utterances that had been judged
very short or very low in amplitude. A single judge went through the entire sample and
intuitively identified utterances that were too short or too low in amplitude; these, representing 22% of the
original total, were excluded from the analysis. In addition, in order for the cross-classification of
vocal type and facial affect to be possible, the face of the infant had to be seen. Utterances where
the infant’s face was not visible during the utterance on either of the two cameras were coded as
CantSee and were not included in the analysis. After the exclusions, there remained 6995
utterances for analysis.
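The selection rules just described can be sketched as a filter. The thresholds and records below are hypothetical; in the study these judgments were perceptual, made by a single judge, not computed from fixed cutoffs.

```python
# Sketch of the utterance-selection rules described above: exclude
# utterances judged too short or too low in amplitude, and those coded
# CantSee. Thresholds and records are hypothetical; in the study the
# short/low judgments were intuitive, not numeric.
from dataclasses import dataclass

@dataclass
class Utterance:
    dur_s: float       # duration in seconds
    level_db: float    # relative amplitude
    affect: str        # "Pos", "Neg", "Neutral", or "CantSee"

def keep(u, min_dur=0.1, min_level=-30.0):
    return u.dur_s >= min_dur and u.level_db >= min_level and u.affect != "CantSee"

utts = [
    Utterance(0.45, -12.0, "Neutral"),
    Utterance(0.05, -10.0, "Pos"),      # too short -> excluded
    Utterance(0.60, -45.0, "Neg"),      # too low   -> excluded
    Utterance(0.80, -15.0, "CantSee"),  # face not visible -> excluded
]
selected = [u for u in utts if keep(u)]
print(len(selected))  # -> 1
```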
Cries occurred in the samples in a natural way, but it should be acknowledged that if
infants cried persistently, sessions were sometimes terminated in order to allow feeding, naptime,
or consoling. As a result, the number of cry utterances in our samples may have been more
limited than would occur in more naturalistic sampling, for example, in cases of all-day
recording (108–110).
Coding software
The coding was conducted in software (AACT, Action Analysis, Coding, and Training),
developed by Intelligent Hearing Systems of Miami, FL in collaboration with the Memphis
research team. AACT is interfaced with TF32 acoustic analysis software (111) so both a
spectrographic/waveform display and the video signal from either of the two recording channels
can be simultaneously played (see Supporting Figure 4 for a screenshot of AACT).
The audio and video signals are synchronized to frame accuracy and are displayed in
TF32 with a scrolling cursor that allows the user to see the temporal relation between audio and
video at all times. Audio signals from both the infant and the parent microphone are separately
recorded and synchronized with the video channels. Audio signals can be localized with
accuracy limited only by sampling rate (i.e., with substantially better than ms precision). The
system facilitates locating utterances in audio, then using the determined onset and offset times
for utterances as references in any field of coding by simply clicking on the utterance label
presented with the time information in chronological order on the coding screen. Thus a
particular utterance can be played in audio or video or both.
Supporting Figure 4: The figure displays a screen shot from the AACT software (Action Analysis Coding and Training) from Intelligent Hearing Systems (IHS) of Miami, FL, as implemented in our laboratories. The spectrographic display is in TF32 by Paul Milenkovic (111), with adaptations for the AACT environment by Milenkovic and Rafael Delgado of IHS. Video can be displayed (Windows Media Player is invoked for this purpose) for either of two channels of recording, and the audio cursor follows the video with frame accuracy when the recording is played. The cursor can also be dragged on the TF32 screen, and the video will follow frame by frame. When the left and right cursors are both placed, a code can be selected by menu on the coding screen (at the right side of the image), and it will then appear in the coding stream on the right and as a label between the cursors on the TF32 screen on the left. Once a code has been established, its location can be played repeatedly in audio/video in a looping function (Do Loop). The two channels of audio (note the two waveforms at the top of the screen) correspond to a wireless microphone worn by the infant in the vest at chest level as can be seen in the image, and another microphone worn by the parent at the lapel. The coding screen on the right allows two fields (dimensions) of coding to be displayed simultaneously.
The primary fields (or coding dimensions) of interest here (user determinable—40 such
fields are available) are facial affect and vocal type, which for the primary coding in the present
study were conducted in video only and audio only respectively, during separate coding sessions
for each coder. The current research would have been extremely difficult to conduct without
AACT. Our collaboration with IHS (especially Rafael Delgado) and Paul Milenkovic has
produced these innovations; it is the only coding system we know of that allows high-quality
spectrographic display synchronized to frame accuracy with video from multiple channels.
Additional channels of information (e.g., carrying physiological parameters such as respiration or
heart rate) can also be synchronized and displayed as additional panels in TF32 during AACT
coding.
Utterance location for coding
Utterance location was accomplished in a first step in our coding where cursors in TF32
(Supporting Figure 5) were placed around each voiced “breath-group” in accord with criteria
first defined by Lynch et al. (112) in the Oller laboratory in Miami, and refined more recently in
collaboration with Buder in the Memphis laboratories. We sought thus to focus on a
physiologically-based unit that can be referenced across time and that has a clear relation with
the notion of utterance in adult speech. For two vocal events to be deemed two separate
utterances, no breath need actually be heard between them, but the perceiver must judge that
there is time enough between them for a breath to occur and that no glottal hold or other
consonantal interruption spans the period separating them. If there is a
perceived glottal hold or consonantal closure, the vocal events are treated as a single utterance in
our coding approach. Thus utterances in this definition can contain multiple prominences or
syllable-like units, and there can be silences within these utterances (corresponding to perceived
consonant-like elements), though the silences are rarely longer than 300 ms. Ingressive segments
between syllable-like energy prominences within protophones or within cry or laugh were, under
this definition of breath-group, treated as part of inter-utterance intervals.
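The breath-group grouping rule can be sketched as follows. The 300 ms threshold is suggested by the observation above that within-utterance silences rarely exceed 300 ms; in the study the judgment was perceptual (was there time enough for a breath, and was the gap bridged by a glottal or consonantal closure?), not a fixed numeric cutoff.

```python
# Sketch of the breath-group segmentation rule described above: two
# voiced segments belong to one utterance when the silence between them
# is too short to accommodate a breath (treated here as a fixed 300 ms
# threshold; in the study this was a perceptual judgment).

def group_utterances(segments, breath_gap=0.300):
    """segments: ordered list of (onset_s, offset_s) voiced intervals.
    Returns a list of (onset_s, offset_s) utterances (breath groups)."""
    utterances = []
    for on, off in segments:
        if utterances and on - utterances[-1][1] < breath_gap:
            # Gap too short for a breath: merge into the current utterance.
            utterances[-1] = (utterances[-1][0], off)
        else:
            utterances.append((on, off))
    return utterances

segs = [(0.00, 0.40), (0.55, 0.90), (1.60, 2.10)]  # gaps: 150 ms, 700 ms
print(group_utterances(segs))  # -> [(0.0, 0.9), (1.6, 2.1)]
```

The first two voiced segments merge into one utterance (their 150 ms gap could be a consonantal closure), while the 700 ms gap starts a new breath group.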
Coding training and coding procedure for both vocal type and facial affect
The primary or “master” coding occurred in stages where the first was often conducted
by a relatively novice laboratory assistant. In such cases at least a second stage was always
conducted by a senior coder with years of experience in our laboratories, and in this second
stage, many of the codes of the novices were changed. Thereafter, many utterances (where
discrepancies between codes had been observed) were checked multiply with senior coders, until
a final consensus was reached. Facial affect coding and vocal type coding were conducted in
separate sessions, facial affect with video only, vocal type with audio only.
Supporting Figure 5: The image illustrates our approach to “utterance” selection. Cursors surround one of the five utterances in this segment presented in TF32. The utterances were all perceived to constitute separate “breath groups”, which is to say that for each utterance, a breath was perceived to have been taken (an ingress occurred) after the utterance concluded or at least it was perceived that there was time enough for a breath to have been taken. The time periods associated with each of the utterances can be gauged by the segmentation marks and labels at the bottom of the screen, and precise temporal information is displayed at the top of the screen.
Relatively little training was required to achieve reasonable levels of interjudge
agreement on coding in this infraphonological domain. The reason would appear to be that the
vocal and facial affect categories are biologically significant units to which all normal humans
respond in similar ways (as Darwin’s principle of variability predicts). If our theoretical
assumptions are on target, the reactions of various coders should be similar, and the training of
coding on these infant actions should be relatively easy, because these vocal/facial events are
anchors for communication between caregiver and infant regarding infant well-being and state in
the first months of life. The training of coding in these domains requires, we surmise, primarily
activating latent awareness of infant communicative actions and ensuring that the labels used in
the coding software are understood by the coders.
Definitions of vocal types and facial affect types in the study
For Vocal Type coding, no definition was given for cry or laugh, since it was assumed
that these terms would be applied appropriately without training. However, coders were given a
“reflexivity” instruction—cries and laughs were to be coded only if the coder perceived
(intuitively of course) the infant to have produced the sound reflexively. The reliability results
suggest that the master coding (which was finalized by consensus as indicated above) was
conducted with a relatively high threshold for coding vocalizations as reflexive. Reliability
coding on the other hand, appears to have involved lower thresholds (yielding more utterances
judged cries and/or laughs) for some of the listeners, perhaps because reliability coding often
occurred with even less training than in the case of preparations to participate in the master
coding. Also, no consultation occurred among coders or trainers once any reliability coding
session had begun. Four individuals coded vocal type for hundreds of utterances from the
recordings, independently of the master coding, and seven individuals coded facial affect
independently of the master coding. For data on intercoder agreement, see below.
For vocal type, coders were instructed to listen with audio only (video screens were not
available at all), to click on each utterance in sequence, and to:
a. Code an utterance as Cry or Laugh if you judge the utterance to be a reflexively produced version of one of these.
b. Code an utterance as FullV (here called “Vocant”) if it is predominantly and most saliently produced in modal phonation, in the mid pitch range of the infant.
c. Code an utterance as Squeal if it is notably higher in pitch than the normal range of the infant.
d. Code an utterance as Growl if one of two conditions is met: either the most salient pitch is notably lower than the normal range, or the pitch is in the normal range but the utterance is produced with very high tension (for example, pressed voice) yielding considerable dysphonation.
e. If an utterance seems to show a combination of features, make your choice based on what you think is most salient. An utterance that is strongly both squeal-like and growl-like must be judged as one or the other.
It is worthy of note that the vocal categories, being applied always at the utterance level,
should be thought of as prominent individual “features” of utterances. Thus an individual
utterance may be mixed, having multiple features as indicated in (e) above. Of similar
importance is the recognition that even advanced utterances with very speech-like characteristics,
such as multisyllabic canonical babbling, can be characterized as squeal, vocant or growl, on the
basis of vocal quality features.
For facial affect coding, coders were instructed to:
a. Code Pos if you see smiling or grinning any time during the utterance.
b. Code Neg if you see frowning or grimacing any time during the utterance.
c. Code Neutral if you see neither frowning nor smiling during the utterance.
d. Code CantSee only if you cannot see the baby's facial affect during the utterance.
The facial affect categories were originally planned to be trained to an Ekman standard
with many categories and considerable attention to particular musculo-facial features (113). But
empirical results and our theoretical goals inclined us to simplify. For both facial affect and vocal
type, the coders were encouraged thus to act intuitively rather than to struggle with technicalities.
Positivity and negativity were assumed to be determinable on the basis of any video evidence
that the observer took to indicate clear divergence from neutrality. To encourage intuitive
judgment the coders were discouraged from listening or viewing more than three times. For
facial affect, there were two channels of video available, so coders were allowed to first look to
see if the channel selected gave a clear view of the child’s face, and if not to switch to the other
channel. If a better view was obtained, three viewings were then allowed on the second channel.
If neither view produced a workable image of the child’s face, the code CantSee was prescribed.
Positive, neutral and negative affect as a proxy for function
To simplify the current study on functional flexibility, we instructed coders to categorize
infant vocalizations for affect in a three-way forced-choice task in which the options were positive,
neutral and negative. The following is a justification for this simplified coding of affect,
supplementing the discussion above under Affect and context in the judgment of function in
infant vocalizations.
Note that we make no attempt here to draw an operational distinction between affect and
emotion, although we recognize affect as an expression that presumably reveals emotional states.
The relation between them however is a matter of debate in the literature on emotion—see e.g.,
Damasio’s discussion of this issue (144)—and the complexity of the relation between them and
additional concepts such as feelings or consciousness surely cannot be resolved here. Our results
indicate that there is a consistent, if not perfect, relation between positive, neutral and negative
affect and both interpreted illocutionary forces and perlocutionary effects (see Main text Figure 3 and
Supporting Results: The role of affect expression in the functional interpretation of infant
protophones). When we imply or say that positive, neutral or negative emotional states
correspond to the three affect conditions, we are merely extrapolating from the assumption that
the entire chain of states and acts—emotions, which are related to affect expressions, which are
related to illocutionary acts, which are related to perlocutionary effects—must be consistently
maintained in order for the natural selection of affect expressions to occur.
Nonhuman primate calls often include positive valence (e.g., celebratory vocalizations
akin to human laughter (114–116)) or negative valence (as in threats, warnings and distress
calls (4, 33, 117)). While it might be presumed that positive and negative valences for such
vocalizations are stable in nonhuman primates, in fact there has been little
investigation of facial affect during such vocalizations (118).
While it might be reasonable to seek to characterize infant sounds (cry and laugh as well
as protophones) in terms of a large set of illocutionary categories (including for example
expressions of exultation, sadness, disgust, relief, comfort, anger, distress, etc.), there are good
reasons to limit our categorization for the purposes of this paper to just positive, neutral and
negative. First, in spite of considerable speculation and evidence about possible innateness of a
large number of emotional expressions (113), prominent emotion theorists are unpersuaded that a
large set of emotional expression categories is well organized in the young infant. Instead they
tend to support the view that emotion expressions begin in relatively undifferentiated form (12,
14), with a very small number of clear affect types. These emotional expression types are
presumed to become elaborated and more neatly tied to circumstances as the infant matures.
Also, consistent with our reasoning in the section on Affect and context in the judgment of infant
vocalizations, it is sensible to assume that mature human observers can make reliable judgments
about functions of infant vocalizations, since evolution surely provided a system of vocal
communication that is adaptive.
Consequently we resolved to adopt a simplified categorization of affect types,
implemented by mature human observers, for the present work. Positive, negative and neutral
facial affect can be categorized with relatively good agreement across observers even in early
infancy on the basis of intuitive judgments with very little training. These judgments can be
made under instruction in ways that presumably mimic the kinds of judgments caregivers make
about infant expressions (is the infant comfortable, in distress, joyful?). It has been reasoned that
such judgments are made constantly by parents, even if unconsciously, and that they form a basis
for fitness judgments that have played a major role in human evolution of vocal communication
(119). While our simplified affect coding system does not itself specify “functions” of
vocalizations illocutionarily nor does it indicate specifically which of several possible emotional
types identifiable in adults may be involved in a particular vocalization (e.g., a negative
communication might be deemed an expression of anger, sadness, disgust or distress), it does
limit the field of possible emotional expressions to classes of functions related to positivity,
negativity or neutrality. Furthermore it allows direct comparison of flexibility of communicative
vehicles across protophones and stereotyped species-specific signals in infants.
Illocutionary force coding
In response to an anonymous critique, we resolved to provide direct evidence that facial
affect accompanying protophones predicts illocutionary force. Before this article was originally
submitted for publication, an illocutionary coding scheme had already been developed and a
subset of the data had already been coded. Data from this prior work are reported in Figures 3a–
3b of the Main Text.
Coding for illocutionary force was conducted with simultaneous audio and video, and
without access to the facial affect or vocal type codes. The codes available in the illocutionary
field can be characterized in three groupings A, B and C, as follows:
A) Converse: codes corresponding to exultation or to vocalizations that had the apparent goal of initiating or continuing a comfortable protoconversation:
   i. Continue: Continuation of protoconversation or of vocalization interchange in a game (such as peekaboo).
   ii. Elicit Turn: Initiation of protoconversation with the caregiver or elicitation of a turn from the caregiver.
   iii. Exultation.
   iv. Imitation of the caregiver voice.
   v. Show, Offer, Accept: Vocalizations during offers, acceptances or showing of objects in play.
B) Complain: Codes corresponding to social negativity:
   i. Complaint.
   ii. Plea for help.
   iii. Refusal, especially of objects in play.
C) Indeterminate: Codes corresponding to neither A) nor B):
   i. Object-directed.
   ii. Vocal play.
   iii. No force: no discernible illocutionary force.
Perlocutionary effect coding
In response to the same anonymous critique, we resolved to provide direct evidence that
facial affect accompanying protophones predicts perlocutionary effects as indicated by caregiver
responses. For this effort, we engaged in two rounds of new coding. In the first round, nine
recording sessions (representing all ages and all infants) were reviewed, with both audio and
video signals available for coding, but perlocutionary effects were coded only if the infant
utterances had been deemed to have either negative or positive affect during facial affect coding
(neutral utterances were not considered). The data from this first round are reported in
Supporting Results: The role of affect expression in the functional interpretation of infant
protophones, Facial affect and perlocutionary effects of infant protophones.
Reasoning thereafter that utterances with neutral facial affect might also be important in
interpretation, we conducted a second round of perlocutionary coding, in which six sessions (six
infants, all ages) were coded, but this time we considered utterances with all three types of facial
affect. Facial affect, vocal type and illocutionary codes were not available during the second
round of perlocutionary coding. Simultaneous audio and video were available to the coders to
make their judgments.
The recording sessions for both the first and second rounds of perlocutionary coding were
selected in a semi-random fashion where an infant could be selected only once within a round
and all three ages were represented equally. In total, data from 14 different sessions were thus
coded (one session appeared by chance in both rounds), resulting in data on perlocutionary
effects from more than a quarter of the total sample.
The coding of perlocutionary effect involved observation of complex events, and the
primary coding in the two rounds is subject to the concern that the observers may have been
biased to categorize parent reactions by virtue of also hearing and seeing the infant actions. As a
check against this possible bias, we engaged in an additional coding evaluation taking advantage
of the fact that parents often spoke during the interactions about their reactions to the infants’
vocalizations. Three of the sessions for the second round of perlocutionary coding reported in
Figures 3c–3d of the Main text had been coded by one observer and three by another. To
prepare for the coding check on possible bias, the first observer extracted (from the three
sessions he coded) parent utterances in audio alone from the up-to-four-second period of
perlocutionary observation, taking the precaution of eliminating (not extracting) any such
utterance where the child’s voice could be heard. All the parent utterances (N = 157) meeting
this criterion for the three sessions were extracted. The second “blind” coder then, having not
coded these sessions, was presented with the parent utterances in audio only and was asked to
judge perlocutionary force based on these utterances, a circumstance where the infant utterance
(and the visual setting) could not have played a role in the judgments.
The lack of visual information could have placed a notable limitation on the ability of the
blind coder to judge perlocution. Even so, as can be seen below in Supporting Results: The
role of affect expression in the functional interpretation of infant protophones, Facial affect
and perlocutionary effects of infant protophones, the results show that both the blind coder and
the original coder categorized parent reactions in strong and highly reliable accord with the
threefold groupings of perlocutionary force as predicted by infant facial affect associated with
infant protophones.
In all perlocutionary coding we focused on caregiver reaction during the short period of
time (up to 4 sec) following an infant utterance, with the perlocutionary judgment focused on
events ending before the onset of the next infant utterance. We limited the perlocutionary coding
in time based in part on the temporal relation among the infant utterances and in part based on
the need for there to be some minimum amount of time for coders to evaluate parental reactions,
which often included sentences expressing the parents’ opinions about infant state. Thus if infant
utterances were produced in a rapid series, no perlocutionary judgment was allowed until the end
of the sequence; the judgment was focused on events beginning after the last utterance in that
sequence had been initiated, and was treated as being associated with the facial
affect of that last utterance in the series (although it often seemed clear that the perlocution was
influenced by multiple utterances). We set a criterion where a rapid series would be deemed
ended at any gap > 450 ms without infant vocalization; thus the shortest possible time frame for
perlocutionary judgment was 450 ms plus the duration of the infant utterance, and the longest
time frame was 4 sec plus the duration of the infant utterance.
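The windowing rule just described can be sketched in code. The sketch below is purely illustrative (the study's coding was done by human observers, not software), and the utterance times and function names are hypothetical:

```python
# Sketch of the perlocutionary judgment window rule (hypothetical data;
# the study's actual coding was performed by human observers).
# An utterance is (onset_sec, offset_sec). A rapid series ends at any
# gap > 450 ms; the judgment window then runs from the end of the last
# utterance in the series for up to 4 s, ending early at the next onset.

GAP_S = 0.450      # gap that terminates a rapid series
WINDOW_S = 4.0     # maximum observation window after the series

def judgment_windows(utterances):
    """Group utterances into rapid series and return, for each series,
    (index of last utterance, window_start, window_end)."""
    windows = []
    i = 0
    while i < len(utterances):
        j = i
        # extend the series while inter-utterance gaps are <= 450 ms
        while (j + 1 < len(utterances)
               and utterances[j + 1][0] - utterances[j][1] <= GAP_S):
            j += 1
        start = utterances[j][1]                 # end of last utterance
        end = start + WINDOW_S
        if j + 1 < len(utterances):              # stop before next onset
            end = min(end, utterances[j + 1][0])
        windows.append((j, start, end))
        i = j + 1
    return windows

# Example: the first two utterances form a rapid series (gap 0.3 s);
# the third stands alone.
utts = [(0.0, 1.0), (1.3, 2.0), (5.0, 5.5)]
print(judgment_windows(utts))  # → [(1, 2.0, 5.0), (2, 5.5, 9.5)]
```

Note that the judgment is associated with the facial affect of the last utterance in each series, as described above.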
The codes available in the perlocutionary field to code the parental reactions can be
characterized in three groupings A, B and C, as follows:
A) Encourage: Codes corresponding to initiation or continuation of a comfortable protoconversation:
   i. Elicit turn: Initiation of protoconversation with the infant or elicitation of a turn.
   ii. Continue: Continuation of a protoconversation or game (such as peekaboo).
   iii. Imitation of an infant vocalization.
   iv. Praise.
   v. Smiling at the infant.
   vi. Patient waiting for the end of an infant series of sounds with silences exceeding the 450 ms criterion, after which a parent response indicated she had been waiting to praise, exult, or otherwise encourage continuation of the interaction.
   vii. Exultation by the caregiver over the infant utterance.
   viii. Offer, accept or show objects in play.
B) Change: Codes corresponding to evaluations of possible change in the situation for the infant, actions involving change, or attempts to change the infant state through vocalization:
   i. Evaluation of the infant state clearly indicating that the caregiver is considering taking action to make the infant more comfortable (statements such as “I think she’s wet”, “Are you hurting, honey?”, “Do we need to change the situation now?”, etc.) or expressions of alarm (e.g., “Oh no!”).
   ii. Change Situation: Physical actions to change the situation (picking the infant up and patting her back, moving the infant to a new location for play, taking the infant to the changing table, etc.).
   iii. Soothe, Scold, or Negative Command: Vocal soothing (e.g., “oh you poor thing”), scolding (e.g., “that’s not nice”) or negative commands about the infant’s vocalization (e.g., “stop that”).
   iv. Distract: Attempts to distract the infant (e.g., with a new toy).
   v. Frown at the infant.
C) Unclear: Codes corresponding to neither A) nor B):
   i. Other directed: Utterances not directed to the infant and not related to the infant state (e.g., talking to someone else in the room about unrelated matters, e.g., “I need to make a phone call”).
   ii. Unobservable: The parent’s reaction cannot be discerned because the parent cannot be seen in the video image and/or cannot be heard saying anything.
   iii. Irrelevant: State irrelevant (the parent may say something to the infant, but it is not part of the protoconversation and reveals nothing about infant emotional state—e.g., “let’s not put our fingers in our mouths”).
Observer agreement levels for both vocal type and facial affect
Of course infant actions become more complicated as the infant matures (both in the
protophones and in the stereotyped species-specific signals), but regarding observer agreement
for vocal types present in the first months, the data show that adequate training can be conducted
in a few sessions and that observer agreement can thus reach levels we consider reasonable—
agreement is very high for reflexively produced cry vs. laughter (stereotyped species-specific
signals), and moderate for differentiation among the three protophones considered here (see
Main text). The protophones, of course, have been shown to include considerable overlap
among their acoustic features (see Supporting Figure 1 above). Differentiation of cry from
laughter was presumably high because nature appears to require these sounds to be maximally
separable, as seems to occur with homologous species-specific signal categories of non-human
primates and other animals (1, 2, 120). Stereotyped species-specific signals appear to have this
distinctiveness precisely because there is high survival value in leaving little room for
mistake about their illocutionary functions and how to respond to them (121).
Observer agreement on differentiations within the stereotyped species-specific signals
and within the protophones should not be taken to predict the level of differentiation of
stereotyped species-specific signals from protophones, as these are actually separate matters.
Consider how the cry category was defined. We instructed coders to designate utterances as cry
if and only if they seemed reflexive. The coders were thus expected not to code clearly volitional
utterances as cry, and would instead be expected to call them “vocant”, “squeal”, or “growl”. A
category “fuss” was not included, and in general fussy utterances would be coded as one of the
protophones, most often “vocant”. For cries identified in the master coding, the four reliability
coders working independently coded 88% of these as cries, but at the same time compared to the
master coding, the reliability coders designated more than twice as many utterances as cries (most
often designating utterances as cries that were called vocants in the master coding). This pattern
suggests a judgment criterion difference—the master coding appears to have more consistently
reflected a strict (high threshold) application of the reflexivity instruction, excluding many fussy
utterances from the cry category and including them instead in a protophone category, while the
reliability coders appear to have more liberally included fussy utterances in the cry category. For
laughter, one of the reliability coders appears to have set a very high criterion for coding of
laughter, such that only 5 utterances were so designated, while the primary coding had 54 laughs
for that reliability set. The other three reliability observers coded 61% of the master coded laughs
as laugh but coded just about twice as many items as laugh as had occurred in the master coding,
suggesting again a criterion difference, where the master coding had observed the reflexivity
criterion more rigidly. It seems clear that laugh was harder to differentiate from protophones than
cry was.
This pattern of reliability (observer agreement) results suggests that there was
considerable mixture of cry and laugh features with the protophones—infants appear not only to
acquire the ability to produce vocalizations free of emotional presetting (the protophones), but
they also acquire the ability to utilize cry and laughter characteristics relatively freely in
combination with protophones. Later in life, of course, humans have enormous vocal freedom
and can produce cry and laughter-like sounds on command, with some actors being able to do so
in a way that is essentially indistinguishable from the reflexive forms of the stereotyped species-
specific signals. We interpret this capability as a reflection of the remarkable human vocal
freedom that makes language possible. At the same time, even in adulthood, there are
circumstances of high emotional arousal or instantaneous urgent events where stereotyped
species-specific signals (e.g., cry, shrieking, laughter, moaning) appear to be elicited reflexively
or near reflexively, presumably reflecting our primate heritage (116) (e.g., when
laughing uncontrollably while being tickled). Our results presented in this paper suggest that
human infants begin to break away from the rigid mold of vocal expression in the first months of
life, both by developing protophones that have very high flexibility from the onset and by
acquiring across time the capacity to voluntarily manipulate features of the stereotyped species-
specific signals and begin to mix them with protophones.
SUPPORTING RESULTS
Audio-Video examples (Movies) illustrating the protophones and their variability in facial
affect expression
Examples of infant vocalizations illustrating the protophones of interest in the present
work and their flexibility to express various facial affect conditions are supplied with the
Supporting materials. These examples have been selected to illustrate that it is possible for
squeals, vocants and growls to be used with flexible facial affect expression in human infants.
We have selected the examples in “clusters” representing a particular protophone from a single
infant within a single recording. The first cluster of movies (CL1) consists of four examples of
growls drawn from a recording of a 6-month-old girl who produced these four growls within a
three-minute period, displaying a full range of facial affect (positive, neutral and negative).
To observe this pattern (and for all the audio-video examples) we advise the reader to
listen first to all examples within each cluster without watching, attempting to determine affect
by audio. Then we recommend watching the video without audio for all four examples, again
judging affect. As a last step we recommend watching and listening simultaneously. To
maximize independence as a listener we also advise waiting until you have listened to and
judged all the examples in all the clusters before looking at the list where we have recorded the
facial affect judgments made by staff in our laboratory (the master coding of facial affect) at the
end of the Supporting Results, page 69.
Multiple observers in our laboratory agreed that all the examples in CL1 (Movie S1
through Movie S4) were growls, and they all agreed based on video judgments alone that one of
these growls had clear facial affect positivity while one had clear facial affect negativity. At the
same time, audio-based judgments of affect were variable across the observers. Audio judgments
generally show better than chance agreement across observers for facial affect on protophones,
but video judgments show much higher agreement, illustrating that infants are free to adapt the
protophones to various expressions of affect independent of particular sound characteristics of
the individual protophone utterance.
CL2 consists of three squeals (Movie S5 through Movie S7) with varying facial affect
from the same infant on a different day, also at 6 months. CL3 gives two examples of vocants
from the same infant at the same age, showing alternately positive and negative facial affect
(Movie S8 and Movie S9).
The remaining examples come from a different infant. First, at three months we present
CL4 with five examples of vocants with variable facial affect (Movie S10 through Movie S14).
Next is CL5 with two examples of squeals (Movie S15 and Movie S16) also at three months,
and last are three examples of squeals in CL6 from the same infant at 10 months (Movie S17
through Movie S19).
In all cases the examples within these clusters were drawn from a single infant within a
single recording, a pattern of selection intended to highlight the fact that facial affect can be very
flexibly associated with any of the protophones in normally developing infants. At the same time
it is important to emphasize that audio characteristics within protophones often do transmit affect
information in accord with the affect information transmitted facially. The key point of the
present paper is not that protophones as judged from audio are always uninformative regarding
affect (that is not true), but that they are often uninformative regarding affect, and sometimes
even yield contradictory judgments with respect to video-based facial affect judgments in forced
choice circumstances. Thus research in our laboratory has shown that a facially positive
utterance (based on video judgment) can often be judged negative based on audio or vice versa,
and it is often the case that judges express strong doubts with regard to any audio judgment of
affect for an utterance while being quite certain of their affect judgments for the same utterance
based on video. Judgments of affect based on video alone are much more likely to conform to
video plus audio judgments than are judgments based on audio alone, suggesting that for affect
judgment the facial configuration plays the predominant role.
Odds Ratio Analyses to illustrate the distinction between protophones and stereotyped
species-specific signals
Statistical reliability of the six patterns—protophones showing 1) far more positivity than
cry but 2) less than laugh, 3) far more neutrality than either cry or 4) laugh, 5) far less negativity
than cry and 6) more than laugh—can be shown with Odds Ratio Analyses. The procedure
requires a 2 × 2 table for each of the 18 comparisons (see the example in Supporting Table 2).
            Neg    Not (Pos + Neut)    Total    Prop. Neg
Cry         278    12                  290      0.96
Squeal      221    735                 956      0.23
Total       499    747                 1246

Odds Ratio = 77.0
Lower 99% CI = 35.1
Upper 99% CI = 169.1
Supporting Table 2: Example 2 × 2 table for Odds Ratio Analysis. Here squeals and cries are compared for negativity. Total numbers across the sample of cries and squeals judged negative by facial affect were entered in the Neg column, and the sums of cries and squeals judged either positive or neutral were entered in the Not column. The Odds Ratio (OR) of 77 is the ratio of the two odds—(278 ÷ 12)/(221 ÷ 735). An OR of 1 would indicate no tendency for either cries or squeals to be more negative. The data indicate that cries were 77 times more likely to be negative than squeals. The 99% confidence intervals around the OR do not include 1, indicating that with at least p < 0.01, this OR is statistically reliable.
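For readers who wish to reproduce the computation, the OR and its 99% confidence interval can be obtained from the four cell counts of Supporting Table 2 with the standard log-odds-ratio method. This is a sketch, not the authors' actual analysis code:

```python
import math

def odds_ratio_ci(a, b, c, d, z=2.576):
    """Odds ratio for a 2 x 2 table [[a, b], [c, d]] with a Wald
    confidence interval computed on the log scale
    (z = 2.576 gives a 99% CI)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Cell counts from Supporting Table 2 (cry/squeal by negative/not)
or_, lo, hi = odds_ratio_ci(278, 12, 221, 735)
# Close to the reported OR = 77.0, 99% CI 35.1–169.1 (up to rounding)
print(or_, lo, hi)
```

The confidence interval is symmetric around ln(OR), which is why the reported bounds (35.1 and 169.1) have a geometric mean near 77.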
Tables modeled after Supporting Table 2 were created to compute ORs for the 6
patterns of distinction between affect expression in protophones and in cry and laugh. Thus 18
tests (6 patterns × 3 protophones) of ORs were conducted (Supporting Table 3).
Prediction                 Squeal                  Vocant                  Growl
                           OR      99% CI          OR      99% CI          OR      99% CI
Prot > pos than cry        98.09   15.62–615.85    35.32   5.65–220.88     48.76   7.75–306.63
Prot < pos than laugh      37.5    16.4–85.6       58.9    26.2–132.4      42.7    18.8–97.2
Prot > neut than cry       27.0    11.6–62.9       57.6    25.0–132.7      44.0    18.9–102.5
Prot > neut than laugh     15.4    6.5–36.3        32.9    14.1–76.6       25.2    10.7–59.2
Prot < neg than cry        77.0    35.1–169.1      176.0   81.5–380.2      147.5   66.8–325.5
Prot > neg than laugh      50.81   3.79–680.87     22.25   1.7–296.67      26.55   1.98–356.28
Supporting Table 3: Odds Ratio Table for the whole dataset. The OR of 77 is seen near the lower left of the Table representing the test of squeals versus cries on negativity (the example in Supporting Table 2). Notice that for all 18 comparisons (six comparisons represented in the rows of the table for each of the three protophones), the OR was very large (smallest = 15.4, meaning protophones were 15.4 times more likely to be neutral than laughs) and that the 99% Confidence Intervals (CIs) never included 1, indicating that all 18 ORs had p < 0.01. The results thus indicate that protophones were vastly more flexible in affect expression than the stereotyped species-specific signals.
Consistency across infants for the six patterns showing functional flexibility in the
protophones but not in the stereotyped species-specific signals
The six patterns considered in the Main text and confirmed in Figure 2 applied to all the
infants (Supporting Figure 6). These individual infant patterns were also subjected to OR
analysis. One infant produced no cries, so the 18 comparisons as in Supporting Table 3 were
developed for 8 infants, and 9 were developed for the infant with no cries in her samples.
Supporting Figure 6: The consistency of the results obtained in the present work is supported by the fact that all the infants showed the six patterns. Compare the results to Figure 2 of the Main text. In almost every case, protophones showed 1) more positivity than cry but 2) less than laugh, 3) more neutrality than either cry or 4) laugh, and 5) less negativity than cry but 6) more than laugh. All these trends were strongly supported by odds ratio analysis with 152 of 153 ORs > 1. Infant 6 produced no cries in her sessions, and also produced very little negativity in vocal expression, including no negative growls; in this one case laughs were found slightly more negative than growls.
This yielded a total of 153 OR comparisons. In 152 of those cases the OR was > 1, indicating
very strong support for the idea that infants showed much greater flexibility of affect expression
with protophones than with the stereotyped species-specific signals.
Functional flexibility of vocalization even at the youngest ages of the infants in the study
Figure 2 in the Main text illustrates that at all three ages the infants showed the
flexibility pattern strongly. The patterns also applied in infants at the youngest ages, as indicated
in Supporting Figure 7 below, where data only from infants at 3 and 4 months are displayed.
[Figure: bar chart showing proportion with designated affect (Positivity, Neutrality, Negativity) for Cry, Laugh, Squeal, Vocant and Growl; infants at 3 and 4 months only.]
Supporting Figure 7: Even at the youngest ages, the infants showed the six patterns of differentiation of protophones from the stereotyped species-specific signals, cry and laugh, in flexibility of expression. Compare the results to Figure 2 of the Main text. Protophones in the infants at the youngest ages showed 1) more positivity than cry but 2) less than laugh, 3) more neutrality than either cry or 4) laugh, and 5) less negativity than cry but 6) more than laugh. All these trends were strongly supported by Odds Ratio Analysis. As seen in Figure 2 of the Main text, infants at all the ages showed the same strong pattern.
Observer agreement on six patterns of functional flexibility in protophones
Robust coder agreement was found when the Master coding used for all the analyses
above and in the Main text was compared with an independently coded subsample. New coders,
after minimal training, coded 9 randomly selected sessions (1 for each infant, 3 for each age
group), for both vocal type and facial affect according to the same protocol (see Supporting
Methods) used with the Master coding, except that there was no review or changing of codes by
expert coder-supervisors as had occurred in the last phase of the Master coding. In Supporting
Figure 8 all 3 panels represent coding of the same subset of the data (21% of the total).
Supporting Figure 8: The reliability of the results obtained in the present work is notably supported by the fact that three completely independent codings of 21% of the dataset for both facial affect and vocal type all showed the same key results differentiating the stereotyped species-specific signals and the protophones. Compare the results to Figure 2 of the Main text. The Master coding in Supporting Figure 8 represents judgments for the same 21% of the dataset that was judged by the two reliability coders. In each case, protophones showed 1) more positivity than cry but 2) less than laugh, 3) more neutrality than either cry or 4) laugh, and 5) less negativity than cry but 6) more than laugh. All these trends were strongly supported by Odds Ratio Analysis for all three codings. The average OR (18 comparisons for each of the 3 coders) exceeded 7 for all three coders, and the lowest OR in any of the 54 comparisons was 1.47, indicating that all comparisons for all coders conformed to the same pattern
as in Figure 2 of the Main Text, where the ORs were also all > 1. On at least half the 18 ORs in Supporting Figure 8 for each coder the value of 1 was not included within the 95% CI, indicating strong statistical reliability of the findings across coders.
Additional Contingency Table Analyses illustrating individual differences on protophone
expression of affect, but not on stereotyped species-specific signal expression of affect
The results of Contingency Table Analysis revealing individual patterns for the
protophones are displayed in Figure 5 in the Main text. To amplify the findings reported there,
we provide data in Supporting Table 4, in which the notable individual variability among
infants on protophones for facial affect positivity is contrasted with the obvious similarity among
infants with regard to facial affect positivity for cries and laughs. We began by constructing two
contingency tables of raw data for utterances from each infant: a 2 × 2 table for each infant
represented cry/laugh by positive/negative, and a 3 × 2 table represented the protophones,
squeal/vocant/growl by positive/negative. Utterances judged neutral in facial affect were not
considered in these comparisons (in contrast with Figure 5 of the Main text), given that both cry
and laugh showed too few cases judged neutral to make quantitative comparison interesting.
Adjusted residuals were computed for each table. One of the nine infants produced one laugh
(positive facial affect) but no cries, so her data were not considered in this analysis. The goal was
to determine the distribution of infants on adjusted residual values for each vocal type in order to
compare individual differences across stereotyped species-specific signals as opposed to
protophones in facial affect expression. In this analysis positivity and negativity provide mirror
images of the same pattern—we display results for positivity only, but results for negativity can
be imagined by simply changing the top title to “Negative Facial Affect”, reversing the order of
the column labels (changing to this order: > 1.96, 1.96 to 0, 0 to – 1.96, < – 1.96) and leaving the
cell entries as they are.
Vocal type    Positive Facial Affect
              < –1.96    –1.96 to 0    0 to 1.96    > 1.96
Laugh             0           0            0           8
Cry               8           0            0           0
Squeal            5           1            2           1
Vocant            1           3            2           3
Growl             0           3            5           1
Supporting Table 4: All 8 infants who both cried and laughed in the samples showed very high adjusted residuals on facial affect positivity for contingency tables on stereotyped species-specific signals (cry/laugh) vs. positive/negative facial affect— that is, all infants showed more than 1.96 SD above the expected value for positivity of laughs and all showed more than 1.96 below the expected value for positivity of cries. In contrast, there was substantial variation among the infants for expression of positivity in protophones—vocants, for example, were expressed in 3 infants with positive facial affect at more than 1.96 SD above the expected value, but in one infant at more than 1.96 SD below the expected value; the other 5 infants fell in between on vocants, 2 showing weak tendencies above the expected value for positivity and 3 showing weak values below. In this analysis positivity and negativity provide mirror images of the same pattern—we display results for positivity only, but results for negativity can be imagined by simply changing the top title to “Negative Facial Affect”, reversing the order of the column labels (changing to this order: > 1.96, 1.96 to 0, 0 to – 1.96, < – 1.96) and leaving the cell entries as they are.
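The adjusted residuals underlying Supporting Table 4 follow the standard formula for two-way contingency tables: observed minus expected, divided by its estimated standard error. The sketch below illustrates the computation; the cell counts are invented for illustration, not taken from the study:

```python
import math

def adjusted_residuals(table):
    """Adjusted (standardized) residuals for a two-way contingency
    table: (obs - exp) / sqrt(exp * (1 - row_tot/n) * (1 - col_tot/n)).
    Values beyond +/-1.96 indicate cells deviating from independence
    at roughly the .05 level."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    out = []
    for i, row in enumerate(table):
        out.append([])
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n
            se = math.sqrt(exp * (1 - rows[i] / n) * (1 - cols[j] / n))
            out[i].append((obs - exp) / se)
    return out

# Invented cry/laugh-by-positive/negative table for one infant
res = adjusted_residuals([[20, 5], [3, 22]])
print(round(res[0][0], 2))  # → 4.82
```

In a 2 × 2 table all four adjusted residuals have the same absolute value, which is why positivity and negativity give mirror-image patterns as noted above.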
All these data show that the sharp contrast between flexibility of stereotyped species-
specific signals and protophones illustrated in Figure 2 of the Main text and in Supporting
Table 3 as well as Supporting Figures 6 and 7 also applied to individual differences—there
were scarce individual differences for facial affect expression with stereotyped species-specific
signals, cry and laugh, but notable individual differences for protophones.
Log-linear analyses: Individual differences in the expression of affect in infant vocalization
A typical log-linear analysis begins by defining a series of hierarchical models. The goal
of log-linear analysis is to identify the simplest model that still provides an acceptable fit to the
data. Each model in the series is less complex than the one before it: It has fewer terms—i.e., is
less constrained and so has more degrees of freedom—and consequently generates data that fit
the observed counts less well.
Goodness-of-fit—really, badness-of-fit—is assessed with the likelihood ratio chi-square, or G²: the bigger G² is, the worse the model fits. Values for G² are essentially the same as for the more familiar Pearson χ², but for technical reasons G² is preferred for log-linear analysis; see Bakeman & Robinson (65). The first model in the series—the saturated model—constrains
expected frequencies to match the observed ones exactly and, for that reason, fits the data
perfectly: its G2 = 0 with 0 degrees of freedom. The question then becomes whether a more
parsimonious model will still fit acceptably.
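As a concrete illustration of the statistic described above, a minimal Python sketch (ours, not the ILOG software used for the analyses; the counts are hypothetical):

```python
import math

def g_squared(observed, expected):
    """Likelihood-ratio chi-square: G^2 = 2 * sum(O * ln(O / E)).
    Cells with O = 0 contribute nothing to the sum."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Saturated model: expected frequencies match the observed exactly, so G^2 = 0 on 0 df
obs = [30, 10, 20, 40]
g2_saturated = g_squared(obs, obs)

# A more parsimonious model (here, independence in a flattened 2x2 table) fits worse
g2_indep = g_squared(obs, [20.0, 20.0, 30.0, 30.0])  # larger G^2 means worse fit
```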
Acceptable fit can be assessed in two ways. A common criterion is the significance of G² for the model: if G² is not significant, p > .05, the discrepancies between the observed cell counts and those generated by the model are relatively small, and so we conclude that the model fits acceptably. However, given large counts, this criterion may be too strict, because even relatively small deviations from expected values will result in a G² significantly different from zero.
A second criterion is the magnitude of Q², a comparative-fit or reduction-in-error index analogous to the R² of multiple regression. Knoke and Burke (145) suggested that any model whose Q² is greater than .90 provides satisfactory fit, even if its G² differs significantly from zero. Q² assesses the proportion of a baseline model's badness-of-fit accounted for by the model in question and is defined as:

Q² = (G²base − G²model) / G²base.

When the terms in a model account for over 90% of the baseline badness-of-fit, we conclude that the model fit is acceptable and that the terms deleted to form the model are not consequential.
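The formula can be checked directly against the G² values reported in Supporting Tables 5–7; a one-function Python sketch (ours, reproducing the reported Q² values from the tabled G² values):

```python
def q_squared(g2_base, g2_model):
    """Proportionate reduction in badness-of-fit relative to a base model
    (an R^2 analog): Q^2 = (G2_base - G2_model) / G2_base."""
    return (g2_base - g2_model) / g2_base

# Base-model vs. fitted-model G^2 values from Supporting Tables 5-7
q2_cry = q_squared(1267.8, 55.6)          # cry analysis: .96
q2_laugh = q_squared(574.4, 17.6)         # laugh analysis: .97
q2_protophone = q_squared(811.0, 135.3)   # protophone analysis: .83
```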
Cries. There was little individual variability in the tendency for cries to be coded as negative in facial affect. Of the 27 samples, 19 contained at least one cry. For 13 of the 19, all cries were coded negative; for three of the 19, all but one cry was coded negative; and for the remaining three samples, 3 of 9, 1 of 2, and 0 of 2 cries were coded negative.
A 2×2 table—cry versus protophone by negative versus non-negative affect—was formed for each of the 19 samples. A log-linear analysis of the 2×2×19 table suggested that the association between cry [C] and negative affect [N] was not moderated by sample [S]—that a common pattern characterized all samples.
To test whether only the saturated model fit the data, we deleted the three-way term [NCS], which left a model with three two-way terms: [NC][NS][CS]. This model constrains expected frequencies to match the cross-classifications implied by the three two-way terms, but not the three-way classification implied by the saturated term. The chi-square for this model differed significantly from zero, G²(18, N = 5117) = 55.6, p < .001, which is not surprising given the large N, but its Q²—an R² analog—was .96, suggesting that deleting the saturated term had little effect. Conservatively, the base model used for the Q² computation was [N][CS]: other less-restricted but still plausible base models—e.g., [N][C][S]—would have produced even higher values for Q².
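Models of this kind, with all two-way margins fitted but no three-way term, are conventionally estimated by iterative proportional fitting. A minimal Python sketch of that procedure (our illustration, not the ILOG implementation; the counts are hypothetical, all positive to avoid zero margins):

```python
def ipf_two_way_margins(obs, iters=100):
    """Fit a 3-D table under the all-two-way-margins model (e.g., [NC][NS][CS])
    by iterative proportional fitting: repeatedly rescale the fitted table so
    each of the three two-way margins matches the observed margins."""
    I, J, K = len(obs), len(obs[0]), len(obs[0][0])
    fit = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(iters):
        for i in range(I):                      # match the (i, j) margin
            for j in range(J):
                ratio = sum(obs[i][j]) / sum(fit[i][j])
                for k in range(K):
                    fit[i][j][k] *= ratio
        for i in range(I):                      # match the (i, k) margin
            for k in range(K):
                ratio = (sum(obs[i][j][k] for j in range(J))
                         / sum(fit[i][j][k] for j in range(J)))
                for j in range(J):
                    fit[i][j][k] *= ratio
        for j in range(J):                      # match the (j, k) margin
            for k in range(K):
                ratio = (sum(obs[i][j][k] for i in range(I))
                         / sum(fit[i][j][k] for i in range(I)))
                for i in range(I):
                    fit[i][j][k] *= ratio
    return fit

# Hypothetical 2x2x2 counts: vocal type x affect x sample
obs = [[[10, 5], [3, 12]], [[8, 9], [7, 6]]]
fit = ipf_two_way_margins(obs)
```

The fitted table then supplies the expected frequencies from which G² for the [NC][NS][CS] model is computed.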
An alternative would be to analyze the 2×2×8, negative by cry by infant table, pooling samples over the eight infants who cried. The [NC][NS][CS] model would fit this table somewhat better, but it is more convincing that the [NC][NS][CS] not-moderated-by-sample model—not moderated because the [NCS] term was not required for an acceptably fitting model—still fits the larger 2×2×19 table. This result, that the three-way interaction term is not needed for good fit, indicates that the association between negativity and cry does not vary by sample.
Laughs. There was even less individual variability in the tendency for laughs to be coded positive. Of the 27 samples, 15 contained at least one laugh. All laughs were coded positive for nine of the 15, and for the remaining six, more laughs were coded positive than not. A 2×2 table—laugh versus protophone by positive versus non-positive affect—was formed for each of the 15 samples. A log-linear analysis of the 2×2×15 table suggested that the association between laugh [L] and positive affect [P] was not moderated by sample [S]—that a common pattern characterized all samples.
To test whether only the saturated model fit, we deleted the three-way term [PLS], which left a model with three two-way terms: [PL][PS][LS]. This model constrains expected frequencies to match the cross-classifications implied by the three two-way terms, but not the three-way classification implied by the saturated term. The chi-square for this model did not differ significantly from zero, G²(14, N = 3645) = 17.6, p = .23, and its Q² was .97, suggesting that deleting the saturated term had little effect. Conservatively, the base model used for the Q² computation was [P][LS]: other less-restricted but still plausible base models—e.g., [P][L][S]—would have produced even higher values for Q².
An alternative would be to analyze the 2×2×9, positive by laugh by infant table, pooling samples over infant. The [PL][PS][LS] model would fit this table somewhat better, but it is more convincing that the [PL][PS][LS] not-moderated-by-sample model—not moderated because the [PLS] term was not required for an acceptably fitting model—still fits the larger 2×2×15 table. Again, the result that the three-way interaction term is not needed for good fit indicates that the association between positivity and laugh does not vary by sample.
Protophones. There was considerable individual variability in the facial affect coded for the protophones. For this analysis, a 3×3 table—protophone (squeal, vocant, growl) by facial affect (positive, neutral, negative)—was formed for each of the nine infants. Then, for each infant, expected frequencies were computed for each of the nine protophone-affect pairs. Only two of the nine associations showed commonality across the nine tables: the observed frequency for neutral given squeal was less than expected for all nine infants, and the observed frequency for neutral given vocant was greater than expected for all nine. These results are displayed in Figure 5 of the Main text.
The pattern of individual variability among infants in expression of affect by protophones was supported by log-linear analysis. For the protophone [V] by facial affect [A] by infant [B] table, only the saturated model fit, indicating that the association between protophone and facial affect was moderated by infant. The model with the saturated term deleted did not fit acceptably—its G²(32, N = 6535) = 135.3, p < .001 and Q² = .83—indicating that only the saturated model provided acceptable fit. Unlike with cry or laugh, the three-way interaction term included in the saturated model was needed for acceptable fit to the data.

An alternative would be to analyze the 3×3×27, affect by protophone by sample table. But the model with the saturated term removed would fit this table even worse.
Log-linear Results (computed by ILOG (65))
Supporting Table 5. Negative by Cry by Sample.
[Model]        G²      df  ~p     Delete  ΔG²    Δdf  ~p     Q²    ΔQ²
[NCS]          0.0     0   1.000  —                          1.00  —
[NC][NS][CS]   55.6    18  <.001  NCS     55.6   18   <.001  .96   .04
[NS][CS]       624.8   19  <.001  NC      569.2  1    <.001  .51   .45
[CS][N]        1267.8  37  <.001  NS      643.1  18   <.001  .00   .51
Supporting Table 6. Positive by Laugh by Sample
[Model]        G²     df  ~p     Delete  ΔG²    Δdf  ~p     Q²    ΔQ²
[PLS]          0.0    0   1.000  —                          1.00  —
[PL][PS][LS]   17.6   14  .227   PLS     17.7   14   .227   .97   .03
[PS][LS]       280.6  15  <.001  PL      263.7  1    <.001  .51   .46
[LS][P]        574.4  29  <.001  PS      293.8  14   <.001  .00   .51
Supporting Table 7. Affect by Vocal Type (Protophone) by Baby (Infant)
[Model]        G²     df  ~p     Delete  ΔG²    Δdf  ~p     Q²    ΔQ²
[AVB]          0.0    0   1.000  —                          1.00  —
[AV][AB][VB]   135.3  32  <.001  AVB     135.3  32   <.001  .83   .17
[AB][VB]       275.0  36  <.001  AV      139.7  4    <.001  .66   .17
[VB][A]        811.0  52  <.001  AB      536.0  16   <.001  .00   .66
The role of affect expression in the functional interpretation of infant protophones
Facial affect and illocutionary force of infant utterances. The results reported in the
Main text, Figures 3a–3b, provide evidence that facial affect associated with protophones
strongly predicted the illocutionary forces attributed to infant utterances. To understand how
these data were derived and to review the coding categories in detail, see Supporting Methods:
Illocutionary force coding. Over 80% of the utterances corresponding to the illocutionary
grouping Converse in Figures 3a–3b had been coded illocutionarily as “Continue: Continuation
of protoconversation or vocalization in a game (such as peekaboo)”; 99% of the utterances
corresponding to the Complain grouping in Figures 3a–3b had been coded as Complain or Plea
for help; and over 75% of the utterances corresponding to the Indeterminate grouping had been
coded as Object Directed.
Facial affect and perlocutionary effect of infant utterances. We had not coded perlocutionary effects prior to the review of our paper. In response to a reviewer criticism, we initiated a series of efforts to evaluate the expectation of systematic caregiving responses to infant communications of varying affect, consistent with the extensive literature on parent-infant interaction cited in the Main text. Specifically, we anticipated that caregivers would respond to infant affect expression in ways indicating systematic attempts to maintain or elicit comfortable, happy protoconversation, while attempting to change the situation in cases where the infant’s affect was negative. Codes used in perlocutionary coding are found in Supporting Methods: Perlocutionary effect coding.
We began by reviewing one session of recording from each of the nine infants
representing all three ages, and over 400 caregiver responses. In this effort, perlocutionary
effects were observed only for protophones that had been coded as positive or negative for facial
affect (neutral utterances were not included). As can be seen in Supporting Figure 9, the
caregivers responded in very different ways to protophones produced with positive vs. negative
facial affect. Positive facial affect in protophones yielded parental encouragement to continue the
interaction in the positive vein. Typically the parents praised the infant or attempted to elicit
another turn in conversation by the infant. In contrast, negative affect in vocalization produced
spoken evaluations of the infant’s possible discomfort, soothing, scolding, attempts to distract
the infant, and changes effected by the parent in the infant situation (such as picking the infant up
or initiating a diaper change). The intonations of the parents in these responses often provided strong cues to support perlocutionary coding, e.g., in differentiating soothing (to negative protophones) from eliciting turns (to positive protophones). Such
differentiating intonations in parent vocalizations have been shown to be very similar cross-
linguistically in infant-directed speech (122–124).
[Supporting Figure 9 bar graph: y-axis, Proportion of Caregiver Responses (.00–.60); bars for Pos and Neg infant affect across caregiver-response categories.]
Supporting Figure 9: Perlocutionary effects of infant protophones with positive or negative facial affect. Responses of caregivers to infant protophones that had positive or negative facial affect were coded in terms of the categories indicated on the abscissa (and see Supporting Methods: Perlocutionary effect coding). The positive protophones (as determined by facial affect) yielded the following predominant responses all in the grouping “Encourage”: 1) Encourage: Continue (vocal, facial and postural actions by the parent that encouraged continuation of the positive interaction); 2) Encourage: Elicit Turn (actions designed to elicit a conversational turn from the infant); 3) Encourage: Praise (including phrases such as “that’s such a nice sound”, “oh that’s pretty”, and so on); 4) Encourage: Exultation by the parent (including smiling, laughing, saying “wow” or “hooray”, etc.); and 5) Encourage: Imitation of the infant utterance. The negative protophones yielded the following reactions all in the grouping “Change”: 1) Change: Evaluate change, that is verbal evaluations of possible need for change in the situation (utterances that manifest the parents’ interest in figuring out what was causing the infants’ negativity, for example questions such as “what’s the matter?” or statements such as “I think you may be wet”, etc.) 2) Change: Distract (attempts by the parent to distract the infant,
and thus change the focus of the infant to something positive, such as by holding a new toy up for the infant to see); 3) Change: Scold (including saying “no”, “stop that” or “shh”); 4) Change: Soothe (often manifest in soothingly intoned “oh” or “poor baby”, etc.); and 5) Change: Change situation, physical actions on the part of the parent to change the infant situation (such as picking the infant up, taking the infant to the changing table, etc.). Finally about 10% of the infant protophones, both positive and negative, resulted in no observable response on the part of the caregiver (the Unclear grouping).
After the first round of perlocutionary coding, it was decided that protophones with
neutral affect might also be revealing in terms of impact on perlocutionary effects. Consequently
a second round of coding was conducted as reported in the Main text, Figures 3c–3d. The
results in those figures confirmed the findings of Supporting Figure 9, and also indicated that
infant protophones with neutral affect showed intermediate outcomes between those for
protophones with positive or negative facial affect—neutral protophones produced less
Encouragement to continue protoconversation in responses of parents than positive ones, but
more than negative ones. Also, Figures 3c–3d show that protophones with neutral affect produced
fewer spoken parental evaluations of possible change in the situation or attempts to change the
situation than protophones with negative affect but more than protophones with positive affect.
An important possible concern about the data on perlocutionary effects is that they could
have been influenced by coder biases, since the coders could both see and hear the infant as well
as the parent during the coding of perlocutionary effects. A third round of perlocutionary coding was conducted to check for such potential bias. It involved judgments by a “blind” coder (see Supporting Methods: Perlocutionary effect coding), who was presented with audio of
parent utterances only—in particular the parent utterances that occurred right after infant
protophones and that provided the basis for judgment of perlocutionary effect—to be compared
with those of an original coder, who had judged perlocution based on both audio and video, and
who had been able to see and hear the infant as well as the parent.
The results displayed in Supporting Figure 10 show that the original coder and the blind
coder judged parent reactions in very much the same way. In both cases the tendency for infant
facial affect in protophones to predict the perlocutionary effect as indicated by parent reaction
(and in accord with the extensive literature in parent-infant interaction) was strong and supported
by highly significant odds ratios.
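The odds ratios referred to here come from 2×2 cross-classifications of infant affect by caregiver response; a minimal Python sketch with a Wald confidence interval (our illustration; the counts are hypothetical):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for the 2x2 table [[a, b], [c, d]], with a 95% Wald CI
    computed on the log scale. An OR whose CI excludes 1 is significant."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical: Encourage vs. Change responses following positive- vs.
# negative-affect protophones
or_, lo, hi = odds_ratio_ci(40, 10, 5, 15)
```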
Supporting Figure 10: A check against possible bias in perlocutionary coding: The figure shows responses of caregivers to infant protophones as determined by an original coder and a “blind” coder. 157 caregiver utterances that had been involved in the original coder’s perlocutionary judgments—drawn from three sessions, from three infants at three different ages—were extracted from the recordings and presented to the blind coder for perlocutionary judgment based on audio of the parent only (any utterance with infant voice was excluded from this dataset). Both coders showed strong predictability between the facial affect of the infant utterance (77 positive, 66 neutral, 20 negative) that had preceded the caregiver utterance and the caregiver’s perlocutionary reaction occurring after the infant communication.
The results of these analyses make clear that facial affect accompanying infant
protophones transmits to parents useful cues regarding infant state and well-being. Parents need
to recognize their infants’ states in order to care for them. Natural selection thus appears to
sponsor the evolution and development of reliable signals by infants and reliable responses to
those signals by parents. The special reason for interest in affect expression accompanying the
protophones is that they show flexible association with affect, unlike cry and laughter, which are
much more stable in affect expression during infancy.
Of course there is information in the protophones themselves about infant affect, but our
research shows that affect information in protophones is much more reliably identified through
the accompanying facial expression than through vocal features of the protophones. Thus, intercoder agreement for nine samples judged by two independent observers was much higher on facial affect of protophones, judged from video only (kappa = .77), than on vocal affect, judged from audio only (kappa = .38). Moreover, logistic regression was employed to predict illocutionary
force by affect for the two judges on four sessions that had been coded both for illocutionary
force and separately for facial affect, vocal affect, or affect judged with both facial and vocal
cues. Here the data suggested that in all six comparisons (three for each of the two coders), facial
affect or facial plus vocal affect played a strong and significant role independent of vocal affect
alone in predicting illocutionary force. So, we argue, the protophones manifest the capacity of
the human infant to utilize vocalization very flexibly in expression of emotion through facial
affect or a combination of facial and vocal affect. This flexibility is required in all aspects of
speech, a capacity without which spoken language would be impossible. A key challenge is to
determine whether this capacity for vocal flexibility is new in the hominin line, or whether it
may be rooted in our primate lineage.
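The kappa statistic cited above corrects raw percent agreement for agreement expected by chance; a minimal Python sketch of Cohen's kappa (our illustration; the labels are hypothetical):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders
    labeling the same items, (p_observed - p_chance) / (1 - p_chance)."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Hypothetical affect labels from two coders on the same six utterances
a = ["pos", "pos", "neu", "neg", "neu", "pos"]
b = ["pos", "neu", "neu", "neg", "neu", "pos"]
k = cohens_kappa(a, b)
```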
Robustness of functional flexibility of protophones across contexts
Context of occurrence of infant vocalizations and non-human primate vocalizations.
Research in both infant vocalizations and non-human primate vocalizations profits from
evaluation of the contexts of usage of sounds. Here we use the term “context” broadly to
encompass aspects of the physical or social environment at or near the time of the occurrence of
the vocalization in question. In chimpanzee and bonobo research, for example, the term has been
used to refer to circumstances including whether the group is engaged in travel, preparing for
travel, hunting, whether individuals are engaged in conflict within the group or across groups,
the presence in the environment of an interesting food source and its quality, the presence of
danger, of allies, of higher ranked individuals, of sexually interesting individuals, and so on (45–
47, 49, 125–131). Recent research has addressed context of usage also in grunts of chimpanzees
across the lifespan (32), illustrating that grunts begin in development as presumable effort sounds
associated with physical movement or straining, but later are utilized in contexts that imply (at
least) greeting and come to be differentiated at least in terms of social status of the
communicators. The value of assessing contexts of usage includes (but is not limited to) the
possibility of determining unifying functions (either illocutionary or perlocutionary) that apply
across a variety of contexts (132) as well as perhaps illustrating greater flexibility of usage of
calls than was envisioned by the classical ethologists.
The issue of determining function is tricky of course because contexts can overlap and a
single vocalization may be used in several of the contexts invoked in this kind of research at the
same time. Thus a chimpanzee bark might occur during travel, but several other circumstances of
potential relevance might also apply to the bark, such as the nearby presence of an ally, the
occurrence of audible calls produced by another group of chimpanzees, the nearness of a
potential food source, and so on. How might we know, then, which aspects of the circumstance
are relevant to determining the function of the signal, if there is a unitary function? The question
could be formulated empirically in terms of 1) any aspect of circumstances (endogenous or
exogenous) that regularly correlates with a chimpanzee bark (along with perhaps any effect the
producer might appear to desire or anticipate) and 2) how and with what likelihood any listener
might respond in particular ways to a bark.
In humans these judgments can also be difficult, and importantly the kinds of contexts
that are relevant in non-human primate research are quite different in many instances from those
relevant for humans (especially nowadays). Fortunately we have the advantage of being able to
listen to the things people say as they respond to vocalizations, and to question receivers about
their interpretations. With research in human infancy, the opinions and observed reactions of
adult interactors can be very useful, we think, as illustrated above in the perlocution findings.
However, raw observations of “context” that may help provide insight about the functions of the
vocalizations were also available in our data, with some sessions having been coded prior to the
review of the original submission of this paper. These special codings included observations on
gaze direction and on physical/social contexts. The observations provide perspective on the
robustness of the vocal flexibility we have observed with regard to facial affect.
Gaze direction. We coded gaze direction of the infant during each vocalization where at
least one of the two video angles allowed such judgment (more than 99% of the total dataset).
The coding indicated whether infants vocalized while looking toward another person (most often
the parent) or while not looking at another person. 45% of the vocalizations occurred in the
former circumstance, as can be seen in the tabulated data in Supporting Figure 11. Vocalization
with gaze not directed to another person occurred in both social interaction (e.g., when infant and
parent were looking at a toy together) and in non-social circumstances (e.g., solitary infant play).
The bar graphs in Supporting Figure 11 can be usefully compared with Figure 1 of the
Main text where the total dataset is also represented in a single panel. It can be seen that cries
were overwhelmingly deemed negative, while laughs were overwhelmingly deemed positive in
both gaze circumstances. Also similarly to Figure 1 of the Main text, all the protophones for
both gaze directions showed considerable numbers of utterances with positive affect as well as
considerable numbers with negative affect. The patterns thus indicate in both gaze circumstances
the robustness of the flexibility of protophones with respect to facial affect by comparison with
the inflexibility of cries and laughs.
Supporting Figure 11: All the data were coded for gaze direction. 55% of the vocalizations occurred in circumstances not involving gaze toward a person, as can be seen in the table at the top of the figure. Also the table shows that laughter occurred much more frequently while looking at a person than while not. For cries and the three protophones, there were no major differences in frequency of occurrence between directed and non-directed contexts. The bar graphs indicate that both when infants looked at another person and when they did not, their vocalizations showed patterns of facial affect that included the primary patterns of the data from Figure 1 of the Main text. While cries and laughs showed overwhelmingly negative and positive facial affect respectively in both gaze circumstances, all the protophones showed considerable numbers of instances of all three types of facial affect. Thus while circumstances of
gaze did affect the pattern of data (laughter occurred much more frequently when looking at a person than while not, and positive affect in protophones also occurred considerably more frequently when looking at a person), the data strongly support the robust flexibility of the protophones in terms of facial affect compared with the relative inflexibility of cries and laughs.
While neutral vocalizations were most common for both gaze directions in the
protophones, there were more positively valenced protophones when the infant was looking at a
person, and for squeals, these outnumbered even the instances of neutral affect. The tendency for
positive infant facial expressions to occur preferentially in circumstances of eye contact has been
widely reported (133–135), and blind infants, perhaps predictably based on this tendency, smile
infrequently (136). The tendency for positive affect of protophones to occur most commonly
with gaze toward a person also corresponds to the strong tendency for laughter to occur most
commonly in the same circumstance. These data offer support for the speculation that laughter as
well as smiling may provide key foundations for human sociality (116, 137–139). The data also
suggest that the observation of gaze direction during the protophones supports the suggestion
that positive affect may be a key factor in human communication between parents and infants, a
factor that allows parents considerable opportunity to evaluate infant well-being and to foster it
in the context of face-to-face emotional regulation (44, 140, 141).
Five contexts. To provide further perspective on the robustness of the patterns of facial
affect accompanying the infant vocalizations, we analyzed the coding of each twenty-minute
recording in terms of one of five circumstances. 98% of the infant utterances had been coded as
pertaining to one of the following: 1) Separated: The infant was not engaged in interaction with
anyone—the parent and an experimenter were usually talking to each other, while the infant was
on the other side of the room playing, or on other occasions the infant and parent might be close
together, but their vocalizations were not part of an interaction or bid for interaction with each
other; 2) Interacting on lap: The infant was on an adult’s lap and the adult and infant were
engaged in vocal interaction, or the infant was vocally bidding for the adult to interact;
3) Interacting on changing table: The infant was lying on the changing table with the parent
changing the diaper while the parent was engaged in vocal interaction with the infant, or while
the infant was vocally bidding for interaction with the parent; 4) Interacting from high chair: The
infant was seated in the high chair and was engaged in vocal interaction with an adult or was
vocally bidding for interaction with an adult; and 5) Interacting on floor: The infant was on the
floor and was engaged in vocal interaction with an adult or was vocally bidding for interaction
with an adult. Each entire 20-min segment was coded in these terms, with codes being changed
at any point where the existing code was no longer valid for at least 10 sec. Gaze direction was
not an explicit aspect of the coding of the five contexts. Still, in some instances the judgment of
vocal interaction or bidding for it could have been influenced by gaze direction.
The tabulated data in Supporting Figure 12 show that more than ¼ of all vocalizations
occurred in the separated condition, and this was true of both the protophones and the cries. On
the other hand, as can be seen in the stacked bar graph at the upper right, both cries and laughs
were distributed proportionally very differently from the protophones. All three protophones
were distributed similarly across the five contexts. In sharp contrast only 10% of the laughter
occurred in the separated condition (a pattern that conforms to the inherent sociality of laughter),
though more than ¼ of the protophones occurred in that circumstance. Also, 66% of the cries
occurred during the separated condition (the infant manifesting distress) or during interaction on
the changing table (presumably from discomfort associated with being wet), even though only
35% of the protophones occurred in those circumstances. In general the pattern suggests that the
protophones occurred in ways that were largely unaffected by the context (all three types
occurring in roughly the same proportions across the contexts), while the laughs and cries
were more predictable by context, and much more differentiable from each other, as should be
expected of signals naturally selected for particular functions or selected to express particular
emotions.
Supporting Figure 12: All the vocalizations were coded in terms of five contexts described in the text. Tabulated data at the upper left show that more than ¼ of the vocalizations occurred when the infant was separated. Nearly half occurred during interaction on the floor. The stacked bar graph at the upper right shows that laughs and cries distributed differently from each other with regard to the contexts, while the three protophones showed a consistent pattern unlike either laugh or cry. Providing another indication of robustness in the flexibility of facial affect expression for the protophones, the standard bar graphs (formatted as in Figure 1 of the Main text) show that while cries and laughs were accompanied by overwhelmingly negative and positive facial affect respectively in all five contexts, all the protophones showed considerable numbers of instances of all three types of facial affect in all five contexts. Thus irrespective of these contexts, the protophones appear to manifest the human infant capability to utilize vocalization and facial affect with substantial freedom of association.
The standard (not stacked) bar graphs in Supporting Figure 12 can be usefully
compared with Figure 1 of the Main text in the same way the gaze direction data could be
compared. It can be seen that for all five contexts, cries were overwhelmingly deemed negative,
while laughs were overwhelmingly deemed positive. Also similarly to Figure 1 of the Main
text, all the protophones in all five contexts showed considerable numbers of utterances with
positive affect as well as considerable numbers with negative affect. The patterns thus indicate,
in five contexts, the robustness of the flexibility of protophones with respect to facial affect by
comparison with the inflexibility of cries and laughs. We take this flexibility of affective
expression in vocalizations to be a critical foundation for language. Given that we have
quantified the extent of the flexibility in this article, we hope in the future to see evidence of the
extent to which this kind of flexibility may be manifest in vocalizations of non-human primates,
and perhaps to collaborate in the requisite research.
Facial Affect codes based on the master coding for the audio-video examples
CL1 (Growls)
Movie S1   Neutral
Movie S2   Neutral
Movie S3   Positive
Movie S4   Negative

CL2 (Squeals)
Movie S5   Positive
Movie S6   Neutral
Movie S7   Negative

CL3 (Vocants)
Movie S8   Negative
Movie S9   Positive

CL4 (Vocants)
Movie S10  Neutral
Movie S11  Positive
Movie S12  Negative
Movie S13  Positive
Movie S14  Negative

CL5 (Squeals)
Movie S15  Negative
Movie S16  Positive

CL6 (Squeals)
Movie S17  Neutral
Movie S18  Positive
Movie S19  Negative
SUPPORTING REFERENCES
1. Lorenz K (1951) Ausdrucksbewegungen höherer Tiere. Naturwissenschaften 38:113–116.2. Tinbergen N (1951) The study of instinct (Oxford University Press, Oxford).3. Cheney DL & Seyfarth RM (1999) Mechanisms underlying vocalizations of primates. The design
of animal communication, eds Hauser MD & Konishi M (MIT Press, Cambridge, MA), pp 629–644.
4. Hauser M (1996) The evolution of communication (MIT, Cambridge, MA).5. Jürgens U (1982) A neuroethological approach to the classification of vocalization in the squirrel
monkey. Primate Communication, eds Snowdon CT, Brown CH, & Petersen MR (Cambridge University Press, Cambridge, UK), pp 50–62.
6. Darwin C (1872) The Expression of Emotions in Man and Animals (University of Chicago Press, Chicago).
7. Owren MJ, Amoss RT, & Rendall D (2011) Two organizing principles of vocal production: Implications for nonhuman and human primates. American Journal of Primatology 73:530–544.
8. Sutton D (1979) Mechanisms underlying learned vocal control in primates. Neurobiology of social communication in primates: an evolutionary perspective, eds Steklis HD & Raleigh MJ (Academic Press, New York), pp 45–67.
9. Acebo C & Thoman EB (1995) Role of infant crying in the early mother infant dialogue. Physiology & Behavior 57(3):541–547.
10. Green JA, Jones LE, & Gustafson GE (1987) Perception of cries by parents and nonparents: relation to cry acoustics. Developmental Psychology 23(3):370–382.
11. Gustafson GE & Green JA (1991) Developmental coordination of cry sounds with visual regard and gestures. Infant Behavior and Development 14:51–57.
12. Lafreniere PJ (2000) Emotional Development: A Biosocial Perspective (Wadsworth Press, Belmont, CA).
13. Lester BM & Boukydis CFZ (1992) No language but a cry. Nonverbal vocal communication, eds Papoušek H, Jürgens U, & Papoušek M (Cambridge University Press, New York), pp 145–173.
14. Sroufe LA (1996) Emotional Development: The Organization of Emotional Life in the Early Years (Cambridge University Press, New York).
15. Oller DK (1981) Infant vocalizations: Exploration and reflexivity. Language behavior in infancy and early childhood, ed Stark RE (Elsevier North Holland, New York), pp 85–104.
16. Stark RE (1981) Infant vocalization: A comprehensive view. Infant Mental Health Journal 2(2):118–128.
17. Scheiner E, Hammerschmidt K, Jürgens U, & Zwirner P (2002) Acoustic analyses of developmental changes and emotional expression in the preverbal vocalizations of infants. Journal of Voice 16:509–529.
18. Scheiner E, Hammerschmidt K, Jürgens U, & Zwirner P (2006) Vocal expression of emotions in normally hearing and hearing-impaired infants. Journal of Voice 20(4):585–604.
19. Molemans I (2011) Sounds like babbling: A longitudinal investigation of aspects of the prelexical speech repertoire in young children acquiring Dutch: Normally hearing children and hearing-impaired children with a cochlear implant. Ph.D. dissertation (University of Antwerp, Antwerp, Belgium).
20. Molemans I, Van den Berg R, Van Severen L, & Gillis S (2011) How to measure the onset of babbling reliably. Journal of Child Language:1–30.
21. Konopczynski G (1985) Acquisition du langage. La période charnière et sa structuration mélodique. Bulletin d’audiophonologie. Annales scientifiques de l’Université de Franche-Comté. 11:63–92.
22. Koopmans-van Beinum FJ & van der Stelt JM (1986) Early stages in the development of speech movements. Precursors of early speech, eds Lindblom B & Zetterstrom R (Stockton Press, New York), pp 37–50.
23. Stark RE (1980) Stages of speech development in the first year of life. Child Phonology, vol. 1, eds Yeni-Komshian G, Kavanagh J, & Ferguson C (Academic Press, New York), pp 73–90.
24. Zlatin-Laufer MA & Horii Y (1977) Fundamental frequency characteristics of infant non-distress vocalization during the first 24 weeks. Journal of Child Language 4:171–184.
25. Locke JL (2008) Lipsmacking and babbling: Syllables, Sociality, and Survival. The Syllable in Speech Production, eds Davis BL & Zajdo K (Erlbaum, New York), pp 111–132.
26. MacNeilage PF, Davis BL, Kinney A, & Matyear CL (2000) The motor core of speech: A comparison of serial organization patterns in infants and languages. Child Development 71(1):153–163.
27. Blount BG (1985) "Girney" vocalizations among Japanese macaque females: Context and function. Primates 26(4):424–435.
28. Becker M, Buder EH, & Ward J (1999) Description of the growl vocalization in small-eared bushbaby mothers, Otolemur garnetti. American Journal of Primatology 49:32.
29. Becker ML, Buder EH, & Ward JP (1998) Vocalizations associated with mother-infant interactions in the small-eared bushbaby. American Journal of Primatology 45(2):166–167 (abstract).
30. Dunbar RIM (1996) Grooming, gossip and the evolution of language (Harvard University Press, Cambridge, MA).
31. Morris D (1967) The naked ape (Dell, New York).
32. Laporte MNC & Zuberbühler K (2011) The development of a greeting signal in wild chimpanzees. Developmental Science 14(5):1220–1234.
33. Seyfarth RM, Cheney DL, & Marler P (1980) Vervet monkey alarm calls: Semantic communication in a free-ranging primate. Animal Behaviour 28:1070–1094.
34. Struhsaker TT (1967) Auditory communication among vervet monkeys (Cercopithecus aethiops). Social communication among primates, ed Altmann SA (Chicago Univ. Press, Chicago, IL), pp 281–324.
35. Austin JL (1962) How to do things with words (Oxford Univ. Press, London).
36. Marler P, Evans CS, & Hauser M (1992) Animal signals: reference, motivation or both? Nonverbal vocal communication, eds Papoušek H, Jürgens U, & Papoušek M (Cambridge University Press, New York), pp 66–86.
37. Tomasello M, Carpenter M, Call J, Behne T, & Moll H (2005) Understanding and sharing intentions: The origins of cultural cognition. Behavioral & Brain Sciences 28:675–735.
38. Trevarthen C (1974) Conversations with a two-month-old. New Scientist 2:230–235.
39. Trevarthen C (1979) Communication and cooperation in early infancy: A description of primary intersubjectivity. Before speech: The beginnings of human communication, ed Bullowa M (Cambridge University Press, London), pp 321–347.
40. Ainsworth MDS (1969) Object relations, dependency, and attachment: A theoretical review of the infant-mother relationship. Child Development 40:969–1025.
41. Bowlby J (1969) Attachment and Loss (Basic Books, New York).
42. Feldman R, Greenbaum CW, & Yirmiya N (1999) Mother-infant synchrony as an antecedent of the emergence of self-control. Developmental Psychology 35(5):223–231.
43. Forbes EE, Cohn JF, Allen NB, & Lewinsohn PM (2004) Infant affect during parent-infant interaction at 3 and 6 months: Differences between mothers and fathers and influence of parent history of depression. Infancy 5(1):61–84.
44. Schore AN (2001) Effects of a secure attachment relationship on right brain development, affect regulation, and infant mental health. Infant Mental Health Journal 22(1–2):7–66.
45. Clay Z & Zuberbühler K (2009) Food-associated calling sequences in bonobos. Animal Behaviour 77:1387–1396.
46. Laporte MNC & Zuberbühler K (2010) Vocal greeting behaviour in wild chimpanzee females. Animal Behaviour 80:467–473.
47. Slocombe KE, et al. (2009) Production of food-associated calls in wild male chimpanzees is dependent on the composition of the audience. Behavioral Ecology and Sociobiology 64(12):1959–1966.
48. Crockford C & Boesch C (2003) Context-specific calls in wild chimpanzees, Pan troglodytes verus: Analysis of barks. Animal Behaviour 66:115–125.
49. Crockford C, Herbinger I, Vigilant L, & Boesch C (2004) Wild chimpanzees produce group-specific calls: A case for vocal learning? Ethology 110:221–243.
50. Parr L, Waller BM, & Heintz M (2008) Facial expression categorization by chimpanzees using standardized stimuli. Emotion 8(2):216–231.
51. Parr L, Waller BM, & Vick S (2007) New developments in understanding emotional facial signals in chimpanzees. Current Directions In Psychological Science 16(3):117–122.
52. Oller DK (2000) The Emergence of the Speech Capacity (Lawrence Erlbaum Associates, Mahwah, NJ) p 428.
53. Stark RE, Bernstein LE, & Demorest ME (1993) Vocal communication in the first 18 months of life. Journal of Speech & Hearing Research 36:548–558.
54. Buder EH (1996) Experimental phonology with acoustic phonetic methods: Formant measures from child speech. Proceedings of the UBC International Conference on Phonological Acquisition, eds Bernhardt B, Gilbert J, & Ingram D, pp 254–265.
55. Buder EH & Stoel-Gammon C (1998) Acquisition of language-specific word-initial unvoiced stops: VOT, intensity, and spectral shape in American English and Swedish. Proceedings of the 16th International Congress on Acoustics and 135th meeting of The Acoustical Society of America, pp 2987–2988.
56. Kehoe M, Stoel-Gammon C, & Buder EH (1995) Acoustic correlates of stress in young children's speech. Journal of Speech and Hearing Research 38:338–350.
57. Stoel-Gammon C & Buder EH (1998) The effects of postvocalic voicing on the duration of high front vowels in Swedish and American English: Developmental data. Proceedings of the 16th International Congress on Acoustics and 135th Meeting of the Acoustical Society of America, eds Kuhl PK & Crum LA (Acoustical Society of America, Woodbury, NY), Vol 4, pp 2989–2990.
58. Stoel-Gammon C, Buder EH, & Kehoe MM (1995) Acquisition of vowel duration: A comparison of Swedish and English. Proceedings of the XIIIth International Congress of Phonetic Sciences, eds Elenius K & Branderud P (KTH and Stockholm University, Stockholm), Vol 4, pp 30–37.
59. Stoel-Gammon C, Williams K, & Buder EH (1994) Cross-language differences in phonological acquisition: Swedish and American /t/. Phonetica 51:146–158.
60. Buder EH, Strand EA, & Iddings S (March) A quantitative and graphic acoustic analysis of phonatory instability in ALS dysarthria. Motor Speech Conference.
61. Hartelius L, Nord L, & Buder EH (1995) Acoustic analysis of dysarthria associated with multiple sclerosis. Clinical Linguistics & Phonetics 9(2):95–120.
62. Buder EH, Chorna L, Oller DK, & Robinson R (2008) Vibratory Regime Classification of Infant Phonation. Journal of Voice 22:553–564.
63. Kwon K, Buder EH, Oller DK, & Chorna LB (2007) Classifying Infant Vocalizations Based on Fundamental Frequency (f0). (International Child Phonology Conference, Seattle, WA).
64. Hosmer D & Lemeshow S (2000) Applied Logistic Regression (John Wiley & Sons, New York) 2 Ed.
65. Bakeman R & Robinson BF (1994) Understanding log-linear analysis with ILOG: An interactive approach (Lawrence Erlbaum Associates, Hillsdale, NJ).
66. Warlaumont AS, Oller DK, Buder EH, Dale R, & Kozma R (2010) Data-driven automated acoustic analysis of human infant vocalizations using neural network tools. Journal of the Acoustical Society of America 127:2563–2577.
67. Kent RD & Murray A (1982) Acoustic features of infant vocal utterances at 3, 6, and 9 months. Journal of the Acoustical Society of America 72:353–365.
68. Bakeman R, Adamson LB, & Strisik P (1989) Lags and logs: Statistical approaches to interaction. Interaction in Human Development, eds Bornstein MH & Bruner J (Erlbaum, Hillsdale, NJ), pp 241–260.
69. Bakeman R & Gottman JM (1997) Observing interaction: An introduction to sequential analysis (Cambridge University Press, Cambridge) 2 Ed.
70. Lewedag VL (1995) Patterns of onset of canonical babbling among typically developing infants. Doctoral Dissertation (University of Miami, Coral Gables, FL).
71. Oller DK & Bull D (1984) Vocalizations of deaf infants. (International Conference on Infant Studies, New York).
72. Oller DK, et al. (2007) Diversity and contrastivity in prosodic and syllabic development. Proceedings of the International Congress of Phonetic Sciences, eds Trouvain J & Barry W (International Phonetics Society, Saarbrucken, Germany), pp 303–308.
73. Webber CL & Zbilut JP (2005) Recurrence quantification analysis of nonlinear dynamical systems. Tutorials in Contemporary Nonlinear Methods for the Behavioral Sciences, eds Riley MA & Van Orden GC (National Science Foundation Program in Perception, Action, and Cognition, Web Book).
74. Papoušek M & Papoušek H (1989) Forms and functions of vocal matching in interactions between mothers and their precanonical infants. First Language 9:137–158.
75. Fernald A (1992) Human maternal vocalizations to infants as biologically relevant signals: An evolutionary perspective. The Adapted Mind: Evolutionary Psychology and the Generation of Culture, eds Barkow JH, Cosmides L, & Tooby J (Oxford University Press, Oxford), pp 345–382.
76. Fernald A (1992) Meaningful melodies in mothers' speech to infants. Nonverbal vocal communication: Comparative and developmental approaches. Studies in emotion and social interaction., eds Papoušek H, Jürgens U, & Papoušek M (Cambridge University Press, New York), pp 262–282.
77. Fernald A & O'Neill DK (1993) Peekaboo across cultures: How mothers and infants play with voices, faces and expectations. Parent-Child Play: Descriptions and Implications, ed MacDonald K, pp 259–286.
78. Owren MJ & Goldstein MH (2008) Scaffolds for babbling: Innateness and learning in the emergence of contextually flexible vocal production in human infants. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 169–192.
79. Bornstein MH & Lamb ME (1992) Development in infancy: An introduction (McGraw-Hill, New York) 3rd Ed.
80. Goldstein MH, King AP, & West MJ (2003) Social interaction shapes babbling: Testing parallels between birdsong and speech. Proceedings of the National Academy of Sciences 100(13):8030–8035.
81. Hsu HC & Fogel A (2003) Social regulatory effects of infant non-distress vocalization on maternal behavior. Developmental Psychology 39(6):976–991.
82. van der Stelt JM (1993) Finally a word: a sensori-motor approach of the mother-infant system in its development towards speech (Uitgave IFOTT, Amsterdam, the Netherlands) p 226.
83. Bakeman R & Adamson LB (1984) Coordinating attention to people and objects in mother-infant and peer-infant interaction. Child Development 55:1278–1289.
84. Beebe B, Alson D, Jaffe J, Feldstein S, & Crown C (1988) Vocal congruence in mother-infant play. Journal of Psycholinguistic Research 17:245–259.
85. Cohn JF & Tronick EZ (1987) Mother-infant face-to-face interaction: The sequence of dyadic states at 3, 6, and 9 months. Developmental Psychology 23:68–77.
86. Field T, Healy B, Goldstein S, & Guthertz M (1990) Behavior-state matching and synchrony in mother-infant interactions of nondepressed versus depressed dyads. Developmental Psychology 26:7–14.
87. Hsu HC & Fogel A (2001) Infant vocal development in a dynamic mother-infant communication system. Infancy 2(1):87–109.
88. Jaffe J, Beebe B, Feldstein S, Crown CL, & Jasnow MD (2001) Rhythms of dialogue in infancy: Coordinated timing in development (Univ of Chicago Press, Chicago).
89. Stern DN (1974) Mother and infant at play: The dyadic interaction involving facial, vocal, and gaze behaviors. The effect of the infant on its caregiver, eds Lewis M & Rosenblum LA (Wiley, New York), pp 187–213.
90. Ginsburg GP & Kilbourne BK (1988) Emergence of vocal alternation in mother-infant interchanges. Journal of Child Language 15:221–235.
91. Cohn JF, Campbell SB, Matias R, & Hopkins J (1990) Face-to-face interactions of postpartum depressed and nondepressed mother-infant pairs at 2 months. Developmental Psychology 26(1):15–23.
92. Zlochower AJ & Cohn JF (1996) Vocal timing in face-to-face interaction of clinically depressed and nondepressed mothers and their 4-month-old infants. Infant Behavior & Development 19:371–374.
93. Bloom K (1988) Quality of adult vocalizations affects the quality of infant vocalizations. Journal of Child Language 15:469–480.
94. Goldstein MH & Schwade JA (2008) Social feedback to infants’ babbling facilitates rapid phonological learning. Psychological Science 19:515–522.
95. Buder EH, Warlaumont AS, Oller DK, & Chorna LB (2010) Dynamic indicators of mother-infant prosodic and illocutionary coordination. in Proceedings of Speech Prosody (Speech Prosody, Chicago).
96. Arbib MA, Liebal K, & Pika S (2008) Primate vocalization, gesture, and the evolution of human language. Current Anthropology 49(6):1053–1076.
97. Corballis MC (2002) From Hand to Mouth: The Origins of Language (Princeton University Press, Princeton, NJ).
98. Hewes GW (1973) Primate communication and the gestural origin of language. Current Anthropology 14(1–2):5–24.
99. Tomasello M (1996) The gestural communication of chimpanzees and human children. (Waseda University International Conference Center, Tokyo).
100. Gardner RA, Gardner BT, & Van Cantfort TE eds (1989) Teaching sign language to chimpanzees (SUNY Press, Albany, NY).
101. Call J (2008) How apes use gestures: The issue of flexibility. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 235–252.
102. Trevarthen C (2001) Infant intersubjectivity: Research, theory, and clinical applications. Journal of Child Psychology and Psychiatry 42:3–48.
103. McNeill D, Bertenthal B, Cole J, & Gallagher S (2005) Gesture-first, but no gestures? Behavioral & Brain Sciences 28:138–139.
104. McNeill D (1992) Hand and mind: What gestures reveal about thought (University of Chicago Press, Chicago).
105. Buder EH & Stoel-Gammon C (2002) Young children's acquisition of vowel duration as influenced by language: Tense/lax and final stop consonant voicing effects. Journal of the Acoustical Society of America 111:1854–1864.
106. Buder EH & Stoel-Gammon C (1993) Obtaining valid and reliable acoustic measures of children's vowel productions. American Speech-Language-Hearing Association.
107. Winholtz WS & Titze IR (1997) Conversion of a head-mounted microphone signal into calibrated SPL units. Journal of Voice 11:417–421.
108. Oller DK (2010) All-day recordings to investigate vocabulary development: A case study of a trilingual toddler. Communication Disorders Quarterly 31(4):213–222.
109. Oller DK, et al. (2010) Automated Vocal Analysis of Naturalistic Recordings from Children with Autism, Language Delay and Typical Development. Proceedings of the National Academy of Sciences 107(30):13354–13359.
110. Zimmerman F, et al. (2009) Teaching By Listening: The Importance of Adult-Child Conversations to Language Development. Pediatrics 124:342–349.
111. Milenkovic P (2001) TF32 (University of Wisconsin-Madison, Madison, WI).
112. Lynch MP, Oller DK, Steffens ML, & Buder EH (1995) Phrasing in prelinguistic vocalizations. Developmental Psychobiology 28:3–23.
113. Ekman P & Friesen W (1978) The Facial Action Coding System (Consulting Psychologists Press, Palo Alto, CA).
114. Kojima S & Nagumo S (1996) Early vocal development in a chimpanzee infant. Primate Institute, Inuyama.
115. Panksepp J (2000) The riddle of laughter: Neuronal and psychoevolutionary underpinnings of joy. Current Directions in Psychological Science 9:183–186.
116. Provine RR (1996) Laughter. American Scientist 84:38–45.
117. Cheney DL & Seyfarth RM (1996) Function and intention in the calls of non-human primates. Proceedings of the British Academy 88:59–76.
118. Slocombe KE, Waller B, & Liebal K (2011) The language void: The need for multimodality in primate communication research. Animal Behaviour 81(5):919–924.
119. Locke JL (2006) Parental selection of vocal behavior: Crying, cooing, babbling, and the evolution of language. Human Nature 17:155–168.
120. Hauser MD (1996) The evolution of communication (MIT, Cambridge, MA).
121. Griebel U & Oller DK (2008) Evolutionary forces favoring contextual flexibility. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 9–40.
122. Fernald A (1989) Intonation and communicative intent in mothers' speech to infants: Is the melody the message? Child Development 60:1497–1510.
123. Papoušek M (1994) Vom ersten Schrei zum ersten Wort: Anfänge der Sprachentwickelung in der vorsprachlichen Kommunikation (Verlag Hans Huber, Bern).
124. Papoušek M, Bornstein MH, Nuzzo C, Papoušek H, & Symmes D (1990) Infant responses to prototypical melodic contours in parental speech. Infant Behavior and Development 13:539–545.
125. Crockford C, Wittig RM, Mundry R, & Zuberbühler K (2012) Wild chimpanzees inform ignorant group members of danger. Current Biology 22:142–146.
126. Crockford C & Boesch C (2005) Call combinations in wild chimpanzees. Behaviour 142(4):397–421.
127. Clay Z & Zuberbühler K (2011) Bonobos Extract Meaning from Call Sequences. PLoS One 6(4):1–10.
128. Goodall J (1986) The chimpanzees of Gombe (The Belknap Press of Harvard University Press, Cambridge, MA).
129. Marler P (1976) Social organization, communication and graded signals: The chimpanzee and the gorilla. Growing points in ethology, eds Bateson PPG & Hinde RA (Cambridge University Press, Cambridge, UK), pp 239–280.
130. Slocombe KE & Zuberbühler K (2005) Agonistic screams in wild chimpanzees (Pan troglodytes schweinfurthii) vary as a function of social role. Journal of Comparative Psychology 119(1):67–77.
131. Zuberbühler K, Ouattara K, Bitty A, Lemasson A, & Noë R (2009) The primate roots of human language: Primate vocal behaviour and cognition in the wild. Becoming eloquent: Advances in the emergence of language, human cognition, and modern cultures, eds d'Errico F & Hombert J-M (John Benjamins, Amsterdam), pp 235–266.
132. Notman H & Rendall D (2005) Contextual variation in chimpanzee pant hoots and its implications for referential communication. Animal Behaviour 70:177–190.
133. Messinger D, Fogel A, & Dickson KL (2001) All smiles are positive, but some smiles are more positive than others. Developmental Psychology 37(5):642–653.
134. Sroufe L & Waters E (1976) The ontogenesis of smiling and laughter: A perspective on the organization of development in infancy. Psychological Review 83:173–189.
135. Sroufe LA (1995) Emotional Development: The Organization of Emotional Life in the Early Years (Cambridge University Press, Cambridge).
136. Fraiberg S (1979) Blind infants and their mothers: An examination of the sign system. Before Speech: The beginning of interpersonal communication, ed Bullowa M (Cambridge University Press, Cambridge, UK), pp 147–169.
137. Davila Ross M, Owren MJ, & Zimmermann E (2010) The evolution of laughter in great apes and humans. Communicative & Integrative Biology 3(2):191–194.
138. Dunbar RIM (2004) Language, music and laughter in evolutionary perspective. The Evolution of Communication Systems: A Comparative Approach, eds Oller DK & Griebel U (MIT Press), pp 257–274.
139. Sroufe LA & Wunsch J (1972) The development of laughter in the first year of life. Child Development 43:1326–1344.
140. Feldman R (2007) Parent–Infant Synchrony: Biological Foundations and Developmental Outcomes. Current directions in psychological science 16(6):340–345.
141. Pipp S & Harmon RJ (1987) Attachment as regulation: a commentary. Child Development 58:648–652.
142. Oller DK & Griebel U (2008) The origins of syllabification in human infancy and in human evolution. Syllable Development: The Frame/Content Theory and Beyond, eds Davis B & Zajdo K (Lawrence Erlbaum and Associates, Mahwah, NJ), pp 368–386.
143. Oller DK & Griebel U (2008) Complexity and flexibility in infant vocal development and the earliest steps in the evolution of language. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 141–168.
144. Damasio A (1999) The feeling of what happens: Body and emotion in the making of consciousness (Harcourt Brace and Co., New York).
145. Knoke D & Burke PJ (1980) Log-linear Models (Sage, London, UK).