Supporting Information Appendix for
Functional Flexibility of Infant Vocalization
and the Emergence of Language
D. Kimbrough Oller, Eugene H. Buder, Heather L. Ramsdell,
Anne S. Warlaumont, Lesya Chorna, Roger Bakeman
*Correspondence to [email protected]
TABLE OF CONTENTS
SUPPORTING BACKGROUND
Categories of vocalizations in humans and other primates
Affect and context in the judgment of function in infant vocalizations
Characteristics of the protophones of interest in the present work
Evidence of the existence of precanonical protophone categories and their systematic production in human infancy
Acoustic analysis studies and early vocal categories
Repetition and its role in recognition of vocal categories
Recurrence quantification analysis to illustrate temporal clumping of vocal categories
Parent-infant interaction in vocal category formation and parent recognition of categories
On the possible role of gesture in the origins of language
SUPPORTING METHODS
Infants and recordings
Selection of data for the present study
Coding software
Utterance location for coding
Coding training and coding procedures for both vocal type and facial affect
Definitions of vocal types and facial affect types in the study
Positive, neutral and negative affect as a proxy for function
Illocutionary force coding
Perlocutionary effect coding
Observer agreement levels for both vocal type and facial affect
SUPPORTING RESULTS
Audio-Video examples (Movies) illustrating the protophones and their variability in facial affect expression
Odds Ratio analyses to illustrate the distinction between protophones and stereotyped species-specific signals
Consistency across infants for the six patterns showing functional flexibility in the protophones but not in the stereotyped species-specific signals
Functional flexibility of vocalization even at the youngest ages of the infants in the study
Observer agreement on six patterns of functional flexibility in protophones
Additional contingency table analyses illustrating individual differences on protophone expression of affect, but not on stereotyped species-specific signal expression of affect
Log-linear analyses: Individual differences in the expression of affect in infant vocalization
The role of affect expression in the functional interpretation of infant protophones
Facial affect and illocutionary force of infant protophones
Facial affect and perlocutionary effects of infant protophones
Robustness of functional flexibility of protophones across contexts
Facial affect codes based on the master coding for the audio-video examples
SUPPORTING REFERENCES
SUPPORTING BACKGROUND
Categories of vocalizations in humans and other primates
There are two general categories of vocalizations in nonhuman primates and apparently
in virtually all mammals: 1) vegetative sounds and 2) stereotyped species-specific calls that have
been termed “fixed signals” by the classical ethologists (1, 2)—see comments below on more
recent interpretations of, and research on, such calls. Vegetative sounds such as coughs, sneezes,
and burps are the product of bodily functions and show little, if any, sign of having been naturally
selected as communicative vehicles, although they can be exploited by listeners as information
about the producer (e.g., a cough can betray the location of prey to a predator). Because of the
lack of obvious selection for communication, vegetative sounds are usually viewed as relatively
unimportant in theories of the evolution and development of language.
Species-specific calls, on the other hand, are assumed to have been selected for their
social communicative value, and they are often a focus in the search for origins of language (3–
5). Examples of species-specific vocalizations in human infants are cry and laughter. For non-
human primates, they include distress cries, contact calls, warning calls and so on, often
characterized as relatively stereotyped sound types, each category with a specific social function
(4). The functions of animal signals in the context of this view have been thought to be universal
within species, present very early in life regardless of hearing status or other experiential factors,
and not fundamentally modifiable in the form of their production. This viewpoint was essentially
expressed by Darwin (6) and has been reiterated in numerous empirical research studies in the
past century—see a current review of the relevant primate literature by Owren and colleagues
(7), but as will be seen below, it appears the view will need to be softened in light of recent
findings.
In accord with this traditional view, primate species-specific calls are often seen as
relatively inflexible in the following sense: The function that each call serves is thought to
remain essentially the same on each occasion of usage and is not modified to take on that of
another call by natural experience or by specific training. Thus, for example, a warning call is
thought always to be a warning call and has not been shown to be transformable into a contact
call. Conditioning can play a role in vocal production of nonhuman primates, but the literature
has reported substantial limits on the kinds of shifts in usage of primate vocalizations that occur
in conditioning studies; for example, a change in the rate of production of a food call may be
conditionable, but this is not the same as conditioning the food call to be used as a warning call
(8). The prevailing view is that subcortical and limbic structures control nonhuman primate
sounds and that such sounds are largely involuntary.
Human infants in the first months of life, like all primates, show both vegetative sounds
and at least two vocal types that are analogous to the species-specific calls of other primates.
These are cry and laughter, both of which have been extensively studied (9–14). However,
human infants also show sounds that are neither vegetative nor stereotyped species-specific calls,
and importantly they are also not speech. We call these sounds protophones, and they occur in all
normal infants (15, 16) at frequencies that appear to substantially exceed those of the stereotyped
species-specific calls (17, 18). In the past these sounds have often been labeled “babbling”, but
some authors (even very recent ones) have limited the term “babbling” to canonical babbling
only (19, 20), the protophone category that includes well-formed syllables as in [baba] or
[mama], for example. Canonical babbling begins typically by 7 or 8 months of life. Precanonical
protophones of the first months of life include categories formed out of vocal exploration
primarily at the laryngeal level—thus the well-timed articulations that are required in canonical
syllables are not involved. Among the most prominent of the precanonical protophones are
vocants, squeals and growls—the literature on vocal development repeatedly notes the existence
of these types (although the terminology sometimes varies) (21–24).
Human speech, in contrast with the species-specific vocalizations produced by primates
generally, is produced with substantial cortical as well as subcortical and limbic involvement and
is supremely voluntary and modifiable. Owren et al. (7, p. 541) summarize:
"… the vocal flexibility and volitional control that is so often sought in primates is
largely absent while being strikingly clear in humans."
This difference may also apply to protophones, given their close relationship to speech.
It is important to emphasize, however, that while no one seems to doubt that humans
have greater vocal flexibility than non-human primates, the degree and developmental course of
flexibility of vocalization in non-human primates remains an open question. Cheney and
Seyfarth have summarized research on nonhuman primate calls (3), with two key points:
1. Even production of calls is flexible to some extent, since, for example, alarm calls are
not necessarily obligatory, but depend on social context, and a variety of changes
occur in calls across development.
2. Comprehension of calls is much more subject to modification by experience than
production of calls.
In light of recent research, these points should be supplemented by the observation that
some sounds in non-human primates may be more flexible in usage than those that appear to
have provided the primary basis for the assumed limited flexibility, as proposed in classical
ethology. Loud calls are easy to hear or record at the sorts of distances from which many
observations are made in primate research (alarm calls fit this mold), but sounds that tend to
occur at lower amplitudes (sometimes referred to as close-calls) are harder to observe and may
indeed be more flexible in usage. Notable attention has been paid to lipsmacking (25, 26),
“girneys” (27), and other calls that are thought to be potentially more flexible in terms of social
usage (28, 29). Lipsmacking may be of particular interest, because although it involves no
phonation, it does include quiet sounds that are often produced during grooming, a circumstance
that has been hypothesized to have been a possible key setting for natural selection of language-
like behavior (30, 31). It is clear that more research is needed on low-intensity vocalizations in
non-human primates because they appear to occur with little or no obvious affective biasing and
with low arousal. Consequently they may be particularly good candidates for evaluation as
possible homologues with human infant protophones. In chimpanzees it appears that grunts may
be particularly good candidates for such exploration—recent research has already shown
interesting changes in utilization across contexts and ages of individuals (32).
Affect and context in the judgment of function in infant vocalizations
To put our discussion of functional flexibility in perspective, let us consider how
functions of vocalization are assessed in research on both human infants and related species. A
standard approach for attempting to interpret functions of communications in non-humans is to
evaluate the physical and social “context” in which the communications occur. For example,
since different alarm calls occur in some species when a particular predator is spotted—vervet
monkeys providing the paradigm example (33, 34)—it makes sense to treat the context (in this
case the specific kind of impending possible predation) as strongly suggestive of function. The
call produced in response to spotting the predator can be said to have the immediate function of
“alarm”, “warning”, or “danger” signaling. This sort of function is the “illocutionary force”—in
accord with our extended interpretation of Austin’s (35) term—and it denotes the social act
produced by the signaler in the act of signaling. Our group has proposed to treat signals that are
inherently communicative—i.e., have been naturally selected to be signals—as having
illocutionary force whether the signals are produced intentionally or not (142,143), even though
Austin may have restricted his usage of the term illocution to intentional social acts. Austin’s
work focused on mature linguistic communication and to our knowledge never addressed a
possible application of the speech act distinctions to human infancy or animal communication.
Our extension of the Austinian term “illocution” to include naturally selected signals that may be
produced with little or no explicit social intention is important in our view because it facilitates
the explication of the origins of communication in human infancy and in animal communication.
The naturally selected warning call (where the illocutionary force in accord with our
usage is “warning”) in vervet monkeys produces an increased probability that after hearing the
warning, an individual receiver or the group will execute appropriate escape behaviors or at least
look for the source of possible danger. This sort of function is the “perlocutionary force” (again
Austin’s coinage), encompassing the effects that occur as a result of listeners having interpreted
or reacted to the signal (and presumably its illocutionary force). If a species has different types of
alarm calls that occur in response to different types of predators (as has been reported for vervet
monkeys and a variety of other species), it makes sense also to interpret the calls in terms of a
referential function, although this referentiality may be a perlocutionary effect executed by
listeners rather than being an intentional illocutionary act of the producer.
Our interpretation emphasizes the primary role of the sender in illocutionary force,
although of course the receiver in successful linguistic communication also interprets the
illocutionary force of the sender. Even in infant vocal communication, parent receivers and
laboratory coders interpret illocutionary force (e.g., they may say “he’s complaining” in response
to infant fussy sounds or cry). Recognizing the additional role of the receiver in illocutionary
force is important because there are clear cases where illocutionary force of the sender is
misinterpreted. For example, Person 1: “I thought you were making a request to turn up the heat
when you said it was cool in our house”. Person 2: “No, I like it cool. I was complimenting you
on your choice of temperature”. Similarly perlocutionary effect is determined by the reaction of
the receiver, and we thus view perlocution as being primarily in the receiver’s domain, although
the sender may intend by speaking to produce a particular perlocutionary effect or class of
effects. In spite of the sender’s purpose, the actual perlocutionary effect may differ from that
intended. For example, Person 1: “I command you to turn the heat down.” Person 2: “And I
refuse to do so because your imperious attitude insults me”.
It appears that many signals of prelinguistic human infants and animals are naturally
selected to transmit particular illocutionary forces (or classes of them), and that selection is
dependent upon perlocutionary outcomes that make those illocutions, on balance, advantageous
for both sender and receiver, in accord with the reasoning of Maynard Smith and Harper (citation
20 in the Main text). So for the naturally selected communications of human infancy and
presumably for animal communication, matching between the sender’s intended perlocutionary
effect and the one that actually occurs in the receiver must, we reason, occur more frequently
than mismatching.
While Austin’s terms have not usually been invoked in the context of literature on animal
communication, the importance of creating a bridge to literature in human infancy motivates a
recognition of the special role for the sender in illocution and for the receiver in perlocution.
Returning then to vervet alarm calls, the understanding of how they work has been emphasized
as involving an important distinction between roles of sender and receiver(s). Given that vervet
receivers may interpretively derive referentiality to specific predators even though that reference
is not directly intended by the sender, this kind of communication has been said to involve
“functional referentiality” (36), to suggest a distinction from the more abstract and flexible
referentiality that usually occurs in human language. In the typical case of mature human
language, referentiality is not just “functional”, but is instead clearly intended by the sender. We
invoke Austin’s term “perlocutionary effect” in this context, then, to help provide the desirable
bridge to human infancy literature where the term is used to emphasize the key role of the
receiver in interpreting actions in ways that may suggest reference, even though the sender
(human infant or monkey) may have intended no reference.
Regardless of how we view the source of referentiality, context provides useful
information about function of alarm calls, because according to published reports, context can be
uniquely associated with a sequence of predator spotting, alarm calling, and escape behaviors by
listeners. The alarm call and its sequelae can therefore be interpreted to form a “triadic” event
involving 1) signaler(s), 2) receiver(s), and 3) predator, with the predator (which supplies the
key element of external “context”) being the direct or indirect focus of the signaler(s) and
receiver(s) in the course of the event.
But external context often does not provide such clear determination of function. This
appears to be especially true when communication is not triadic, as is the case in much social
communication. In the case of the human infant in the first half year of life, there is no sign of
the sort of triadic communication suggested by vervet alarm calls. Human infants have no alarm
calls, and very early in development there are none of the signs of joint attention that occur later
in human development—no pointing, no sharing of gaze jointly in a triadic fashion, and no clear
way that vocalizations or gestures designate objects or entities outside the self (37–39). Instead,
in the first half year, communication is typically monadic, e.g., the infant cries, with no obvious
communicative intent (although we can, in accord with our extended usage of Austin’s term,
refer to the cry as having an illocutionary force of “distress expression” or “plea”), but the parent
responds, yielding a potentially beneficial perlocutionary effect; or communication is dyadic,
e.g., the parent and infant interact vocally, sharing affect and taking turns, with the parent
optionally responding to needs of the infant if such needs are perceived in the dyadic interaction,
but the exchange remains purely social. In these cases infants signal only about
themselves and/or the interactor, but not about external entities or events, and any response of
caregivers appears to be based on their interpretation of infant state and the conditions of the
environment. The key point is that the vocalizations of the infant in these cases provide little if
any direct information about the environment.
In such monadic or dyadic communicative events, the illocutionary force of the infant is
usually not determinable by external context, since by definition the illocutionary force of the
signal pertains to internal states and/or intentions. A hungry infant may cry or produce
protophone sounds with negative facial affect, but the same is true of an infant with a belly ache,
an infant that needs sleep, or an infant who has been stuck with a hypodermic needle. Thus
external context can only be a partial indicator of illocutionary function. Similarly, a happy
infant may produce positively valenced protophones, but external context often does not reveal
or determine the illocutionary function of the sounds (although tickling provides a case where
illocution may be more easily inferred from context). Often it appears to be the perlocutionary
effect revealed in the response of the caregiver (resulting actions or states of mind of the
caregiver revealed by things the caregiver says based on the infant signal) that provides the best
indication of how the function of young infant vocal signals should be interpreted. While we
have not observed it to be so, we grant that it is possible that individual infants may have
propensities to produce certain types of protophones or certain protophones with particular types
of facial affect more commonly in some situations than in others. For example, one can imagine
an infant having a particular tendency to produce positive squeals as part of a vocal routine at the
changing table. However, even if we were to find such a pattern, it still would not be easily
determined what aspect of the context might be producing the pattern, as the candidate aspects
are large in number (is it that the infant is being made more comfortable, or that the infant likes
the proximity of the person doing the changing, or is it something about the lights on the ceiling,
etc.?). Our approach has been to use affect as a key determiner of function, in part because it can
be observed directly, and in part because infant affect is known to play a key role in parental
caregiving and thus in the functional outcomes of infant signaling (40–44).
It stands to reason that the same sorts of difficulties in determining illocutionary
functions of signals apply in the case of non-human primate infants. It also appears that with
more mature human and non-human primates (as opposed to infants), the difficulties of
determining illocutionary functions based on external context may become even more severe
given the apparently increasing variety of contexts within which vocalizations tend to occur, as
indicated in recent studies in chimpanzees and bonobos (32, 45–49).
Perlocutionary effects can also be difficult to determine whenever the external context is
complex in and around the time and place the signals are produced. It is worthy of note that the
basis for interpretation of functions in human and non-human cases is importantly different
because mature humans (caregivers or adult observers) can be used as informants about their
internal states, i.e., about how and why they produce signals and about how and why they react
as they do to the signals of others, and parents often provide such information spontaneously
during interaction with their infants (“I think he’s sleepy”, “what a happy sound!”, etc.). Thus the
study of signal functions can be assessed with additional tools in the human case, to supplement
the analysis of external context, and in many cases the spontaneous vocalizations of parents
during interaction constitute a key element to help interpret the external context. Thus, in the
case of the human infant, we reason that caregivers can and should be used as important
informants.
We also reason that infant vocalizations (in both humans and non-human primates) must
have evolved and developed to be interpretable in terms of adaptive functions, and consequently
caregivers must have (evolved to have) the capability to recognize infant vocalizations
functionally. In particular, illocutionary forces of very young human infant vocalizations can at
least be classified broadly in ways that correspond with judgments of affect that can be elicited
from mature human observers (potential caregivers).
We have proposed, consistent with this line of reasoning, that (at least in very young
human infants) facial affect should provide relatively stable evidence from which to determine
mutually exclusive classes of possible illocutionary functions and perlocutionary effects that can
be judged reliably by mature observers. If a particular vocal type can be utilized with the full
range of affect from positive through neutral to negative, we contend there must be
accompanying variation in function. Vocalizations occurring with positive affect correspond to
one class of possible functions and those occurring with negative affect correspond to a different
class of possible functions.
Consider an example. We have observed that adult caregivers respond to a squeal given
with negative facial affect in a way similar to their response to cry, by seeking to understand if
something is wrong with the infant (is the infant uncomfortable, hungry…?) and often by taking
action to correct the problem. We might say the function (illocutionary force sensu Austin) of the
negative squeal is “complaint” or “expression of distress” (and our coders of illocutionary force
easily adapt to coding in those terms), and one real world effect (perlocutionary effect sensu
Austin) is an increased probability that the discomfort will be alleviated through actions taken by
caregivers. In contrast, we have observed that adult caregivers do not respond to a squeal
produced with positive facial affect as a complaint, but rather they respond in a way that is
similar to how they respond to infant laughter, treating the vocalization as a social act expressing
joy and/or affiliation, and encouraging further positive interaction. The positive squeal might
thus be characterized as an “expression of joy or fun”, an “exultation” or an “encouragement for
positive interaction and bonding” (illocutionary force), and the effect of the positive squeal
(perlocutionary effect) might be characterized as an increased probability of continued positive
interaction and social support from caregivers.
The key point here is that while we may not be sure how to uniquely portray the
functions (either illocutionary or perlocutionary forces) involved in these acts, we can be sure
that the functions are not the same for the positive and negative affective versions of the same
sounds. Thus, the determination that there exists any category of infant vocalizations that can be
utilized with positive, neutral, and negative facial affect illustrates that human infants possess
functional flexibility of vocalization very early in life, and given that this kind of flexibility is a
foundational requirement for language, it provides a quantifiable very early indicator that should
be possible to compare usefully across humans and other species where vocal type and affect can
be judged. The recent development of a facial affect coding scheme for chimpanzee modeled
after the Ekman method for human facial affect judgment suggests that this possibility is within
reach (50, 51).
The preceding argument is not, however, intended to imply that external context should
be abandoned as a possible source of information for interpretation of human infant
vocalizations. On the contrary, we recently have evaluated several empirically observable factors
concurrently with vocalization and facial affect. These include gaze direction of the infant and a
variety of contextual circumstances (interaction on a high chair, interaction on the changing
table, etc.) including one where the infant is not interacting at all (separated). Furthermore, in
response to a review we have evaluated perlocutionary effects in the form of responses of parents
to infant vocalizations with varying facial affect. Results of analyses using these “contexts” are
reported under Supporting Results: The role of affect expression in the functional
interpretation of infant protophones, and Supporting Results: Robustness of functional
flexibility of protophones across contexts.
Characteristics of the protophones of interest in the present work
In this paragraph we summarize definitions that have previously been provided for
vocants, squeals and growls (52), the protophones considered in the present research. Vocants
are vowel-like sounds produced with normal phonation (sometimes called “voicing” or “vocal
fold vibration”), the type of phonation that is used typically in speech. In vocants the
fundamental frequency or pitch of the voice is within the typical range for the speaker (or infant
producer of the sound). Squeals are high-pitched sounds, produced above the typical range for
the speaker, often in falsetto register. Growls are sounds produced with harsh voice, typically
perceived as low in pitch for the speaker. Further definition of these protophones and how they
are coded can be found below in Supporting Methods: Coding training and coding procedures for
both vocal type and facial affect. For examples of these protophones go to Supporting Results:
Audio-video examples (Movies).
In addition to protophones, infants have stereotyped species-specific signals. These
(when they are produced reflexively) have similar properties to those of stereotyped species-
specific signals in other primates. Cry and laughter are the most prominent members of this
class, and they provide a solid reference point in human infants against which to compare the
protophones and how they function.
The study of functions of protophones has yielded somewhat chaotic conclusions (17, 53)
in part because protophones function so differently from human species-specific calls, and
perhaps also from non-human primate calls. There exists a temptation to seek consistent
“meanings” or illocutionary forces (35) for each protophone category (perhaps because the
stereotyped species-specific signals seem to have consistent functions, and because words have
relatively stable “meanings”), yet it is the freedom of protophones to assume differing forces and
functions that, in our view, highlights the significance of these sounds as precursors to and
foundations for the speech capacity, a capacity that requires all words and sentences to have such
freedom (52).
It might seem surprising at first blush that squeals and growls can be affectively variable
given the common impression that we squeal with delight and growl in anger. In the early years
of our research in vocal development, we registered some surprise on this point as we were
beginning to notice that early in life, affect variability is common for squeals and growls. Is it
possible that squealing and growling are more affectively biased in adults than in infants?
Perhaps so, but no one to our knowledge has quantified the degree of the bias (these kinds of
sounds do not occur terribly often in adults), and it is obvious that the biases can be violated by
any adult (or older child) who chooses to do so. For example, a growl can clearly be produced in
the midst of gustatory or other pleasures, and shrieking squeals (with very negative facial affect)
sometimes occur in fear or when expressing horror or repugnance.
Evidence of the existence of precanonical protophone categories and their systematic
production in human infancy
Acoustic analysis studies and early vocal categories. Buder and colleagues have
addressed acoustic analysis of infant vocalizations (54–59), characterization of phonatory signals
(60, 61), and a variety of additional topics on acoustics related to infant vocal categories such as
vocant, growl, and squeal (62, 63).
Supporting Figure 1: Acoustic results of fundamental frequency analysis for infant utterances from the sample judged to be squeals (yellow), vocants (blue), or growls (violet). See text above for explanation. The three dimensions displayed are the log of the mean, the log of the standard deviation, and the log of the highest fundamental frequency for each utterance.
Supporting Figure 1 provides sample data from a preliminary classification of vocant,
growl, and squeal using polynomial logistic regression (64) to evaluate candidate F0 statistics
(mean, highest, and SD) from 1352 auditorily coded utterances of 3 infants (at 3, 7, and 11
months, from 2 recording sessions of 20 min at each age for each infant). The figure above
illustrates longitudinal data from one of these infants. The three F0 variables (mean, highest, and
SD) yielded higher than 70% correct classification. The figure displays the data in a 3D log F0-
measure space (vocants plotted in blue with circles, growls in violet with + signs, and squeals in
yellow with triangles), with projections of group sample ellipses encompassing 0.68 probability
of data inclusion on the 2D facets. While each measure contributed to the model, mean F0 was
paramount, and the growl/squeal distinction was strongest in this space.
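The kind of three-class classification from F0 summary statistics described above can be sketched as follows. The data here are synthetic (the class F0 distributions are invented for illustration, not drawn from the authors' recordings), and a plain softmax regression fit by gradient descent stands in for the cited analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: three protophone classes separated mainly by log mean F0,
# mimicking the three features used above (log mean, log SD, log highest F0).
# These distributions are invented for illustration only.
def make_class(n, f0):
    mean = np.log(f0) + 0.1 * rng.standard_normal(n)
    sd = np.log(0.15 * f0) + 0.2 * rng.standard_normal(n)
    high = mean + 0.3 + 0.1 * rng.standard_normal(n)
    return np.column_stack([mean, sd, high])

X = np.vstack([make_class(200, 300),    # vocants: typical pitch range
               make_class(200, 150),    # growls: low-pitched/harsh
               make_class(200, 800)])   # squeals: high-pitched
y = np.repeat([0, 1, 2], 200)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize for stable gradient descent

def fit_softmax(X, y, classes=3, steps=2000, lr=0.5):
    """Multinomial (softmax) logistic regression fit by batch gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    W = np.zeros((Xb.shape[1], classes))
    Y = np.eye(classes)[y]                       # one-hot targets
    for _ in range(steps):
        Z = Xb @ W
        Z -= Z.max(axis=1, keepdims=True)        # numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * Xb.T @ (P - Y) / len(Xb)       # cross-entropy gradient step
    return W

def predict(W, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return (Xb @ W).argmax(axis=1)

W = fit_softmax(X, y)
accuracy = (predict(W, X) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

With well-separated synthetic classes the fit is nearly perfect; on real infant utterances, where categories overlap acoustically, accuracy in the 70% range (as reported above) is a more realistic outcome.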
Log-linear modeling (65) also determined that variables for child, age, and session are
needed to fit the observed frequencies of the categories. The significant session variable
illustrates temporal clumping (see below): systematic change in likelihood that particular sound
types occur across recording sessions even after other factors have been controlled, arguably
providing further evidence of active vocal exploration and suggesting protophone category
formation by the infant.
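A minimal stand-in for the session effect described above is a test of independence between session and vocal type in a two-way table. The counts below are invented for illustration, and a Pearson chi-square test is a simplification of the full log-linear models (which also include child and age terms):

```python
import numpy as np

# Invented counts of three vocal types across two recording sessions; a
# session-by-type association of this kind is what "temporal clumping" implies.
counts = np.array([[30, 10, 10],   # session 1: squeal, vocant, growl
                   [10, 30, 15]])  # session 2

def pearson_chi2(obs):
    """Pearson chi-square statistic for a two-way table under independence."""
    expected = obs.sum(1, keepdims=True) @ obs.sum(0, keepdims=True) / obs.sum()
    return ((obs - expected) ** 2 / expected).sum()

stat = pearson_chi2(counts)
df = (counts.shape[0] - 1) * (counts.shape[1] - 1)   # = 2
print(f"chi-square = {stat:.1f} on {df} df (critical value at .05 is 5.99)")
```

A statistic exceeding the critical value indicates that vocal-type frequencies differ reliably by session, the signature the log-linear analysis picks up.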
Complementary work from our group (66) has used automated methods (neural
network classifiers) to both classify and visualize infant vocalizations as squeals, vocants and
growls. When the base rate of occurrence of the three vocalization types was normalized, leave-one-
out cross-validation classification of these protophones was 55% correct, which was significantly
higher than the chance rate of 33.3%. Visualizations supported the idea that squeals tend to be
higher pitch and/or produced with greater harshness than vocants. It seems likely that in the
future automated classification will provide a key objective supplement to human coding.
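The evaluation scheme just described can be sketched in a few lines. This is not the neural-network system of (66): a k-nearest-neighbors classifier stands in for it, the two acoustic features are hypothetical, and equal class sizes play the role of base-rate normalization (so chance is 1/3).

```python
# Sketch of leave-one-out cross-validation for a three-way protophone
# classifier. Equal class sizes substitute for base-rate normalization,
# so chance accuracy is 1/3. Features and classifier are placeholders.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
n_per_class = 30
# Two hypothetical acoustic features (e.g., log mean F0, harshness index).
X = np.vstack([
    rng.normal([5.2, 0.2], 0.3, (n_per_class, 2)),  # vocant
    rng.normal([5.0, 0.8], 0.3, (n_per_class, 2)),  # growl
    rng.normal([6.2, 0.5], 0.3, (n_per_class, 2)),  # squeal
])
y = np.repeat(["vocant", "growl", "squeal"], n_per_class)

# Each fold trains on all utterances but one and tests on the held-out one.
scores = cross_val_score(KNeighborsClassifier(5), X, y, cv=LeaveOneOut())
acc = scores.mean()
print(round(acc, 2))
```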
Repetition and its role in recognition of vocal categories. Research on infant vocal
categories has often focused on auditory/acoustic characteristics (62, 67), collapsed to means and
SDs. This approach obliterates sequential information, though there are clear sequential
dependencies in infant sounds that provide evidence of systematic production of sounds such as
squeals, vocants and growls even in the first year of life. To demonstrate repetitiveness in usage
of squeals, vocants and growls, 50 infant utterance sequences were extracted from each of 28
20-min recording sessions (1400 utterances in all) from 12 infants in the first year. Converting
time-based data to simple events, we used Lag Sequential Analysis (68, 69) to examine lag 1
within-bout sequences to assess category repetitiveness.
The contingency table in Supporting Figure 2 below presents sample data from a
recording session with one 5-month-old. Rows are antecedent events, columns consequent
events, and cells contain two lag 1 statistics: raw frequencies and adjusted residuals, which are
differences between observed and expected frequencies adjusted for base occurrence rates to an
SD scale (z-scores of tendencies for vocal types to follow one another). Repetitiveness is
indicated by larger positive residual values on the diagonal and smaller or negative residuals off-
diagonal. Accumulated across the tables in this study, the diagonal residuals demonstrated
category repetitiveness.
Supporting Figure 2: Lag sequential analysis example from one infant, showing notable tendencies for repetition of particular vocal types. See text for explanation.
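The adjusted residuals in Supporting Figure 2 can be reproduced from its observed frequencies with the standard adjusted-residual formula (observed minus expected, scaled by the estimated standard deviation given the row and column margins). The observed counts below are taken directly from the figure's table.

```python
# Computing lag-1 adjusted residuals for the observed transition
# frequencies reported in Supporting Figure 2 (rows = antecedent type,
# columns = consequent type: V, GR, SQ).
import numpy as np

obs = np.array([
    [18, 5,  7],   # V  -> V, GR, SQ
    [ 5, 2,  2],   # GR -> V, GR, SQ
    [ 6, 1, 12],   # SQ -> V, GR, SQ
], dtype=float)

n = obs.sum()
row = obs.sum(axis=1, keepdims=True)
col = obs.sum(axis=0, keepdims=True)
expected = row @ col / n
# Adjusted residual: (obs - exp) / sqrt(exp * (1 - row_p) * (1 - col_p))
adj = (obs - expected) / np.sqrt(expected * (1 - row / n) * (1 - col / n))
print(np.round(adj, 2))
```

Rounded to two decimals, the result matches the residuals in the figure, e.g., 1.58 for V followed by V and 2.98 for SQ followed by SQ on the diagonal, and –2.11 for V followed by SQ off the diagonal.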
The data for Supporting Figure 2 (rows = antecedent vocal type, columns = consequent vocal type; each cell shows raw frequency/adjusted residual):

        V        GR       SQ        Totals
V       18/1.58  5/0.66   7/–2.11   30
GR      5/0.36   2/0.80   2/–0.95   9
SQ      6/–1.96  1/–1.31  12/2.98   19
Totals  29       8        21

Temporal clumping is the tendency for the relative frequency of any vocal category to differ
from one recording session to another. Data from longitudinal studies (70, 71) in the first year
have revealed many examples of temporal clumping, with huge session-to-session variation in
frequency of occurrence of squeal, vocant, and growl, as well as variation in frequency of occurrence
of canonical babbling. Even acoustically measured parameters such as final and non-final
syllable durations show strong recording-session effects, indicating temporal clumping (72).
Infant repetitiveness and temporal clumping in vocal play appear to form a basis for
parental recognition of infant systematic production and control over vocal categories such as
squeal, growl and vocant (52). It can be argued that repetitiveness of particular sound types in
"bouts" shows parents that infants vocalize non-randomly and that, consequently,
negotiation is possible over functional roles for infant sounds. The fact that sequential
dependencies of infant categories often occur during apparent vocal play (as when the infant
appears to be vocalizing to herself or himself rather than in interaction with a caregiver) provides
evidence that infant exploration is endogenously motivated and contextually flexible, both of
which are implications of the very idea of play.
Recurrence quantification analysis to indicate temporal clumping of vocal categories.
Recent methodological developments have provided additional possibilities for statistically
evaluating repetitiveness and temporal clumping. Recurrence quantification analysis (RQA) (73)
addresses the tendency of events (e.g., vocalizations) to occur and re-occur at intervals.
Supporting Figure 3 provides an example of the advantages of RQA. While Lag Sequential
Analysis is well suited to local temporal analyses at a small number of lags, RQA treats
recurrence patterns across all lags simultaneously, with visualization of both repetitiveness and
temporal clumping at varying time scales. RQA begins with a recurrence plot, as in Supporting
Figure 3 (supplied by our collaborator, Rick Dale, of the University of California, Merced), where the x and y
axes each represent 500 sec of infant vocalization. Points reflect coordinates (x, y) at which the
infant produced the same vocal category, color coded for category. Points on the diagonal
represent the occurrence of vocants, growls and squeals, providing lag 0 reference points, and all
other points represent instances of recurrence at varying lags with respect to those lag 0 points,
displayed symmetrically off the main diagonal. The plot shows that most vocalizations in this
session were vocants, but late in the session, there were separate clumps (indicated by the red
and blue square-like structures at the upper right) of growls and squeals, interleaved with
vocants. This visualization is an example of the strong temporal clustering that we see routinely
in infant protophones in the first year of life, a further indication of systematic infant production
of the categories.
Supporting Figure 3: The figure represents a recurrence plot revealing a pattern of temporal clumping of vocal types in time by a single infant in one recording. The blue block at the upper right shows squeals clumping late in the session.
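The construction of a categorical recurrence plot like the one in Supporting Figure 3 is simple: cell (i, j) is marked whenever the vocal category of utterance i matches that of utterance j. The short sequence below is fabricated for illustration; same-category runs late in the session produce the square blocks described above.

```python
# Sketch of categorical recurrence-plot construction: mark (i, j)
# whenever utterances i and j share a vocal category. Clumps of one
# category appear as solid square blocks, as in Supporting Figure 3.
# The sequence is fabricated for illustration.
import numpy as np

seq = np.array(list("VVVVGVVVVSGGGGSSSS"))       # V=vocant, G=growl, S=squeal
R = (seq[:, None] == seq[None, :]).astype(int)   # recurrence matrix

assert np.all(np.diag(R) == 1)   # every event recurs with itself (lag 0)
assert np.array_equal(R, R.T)    # recurrence is symmetric about the diagonal
# The squeal clump at the end forms a solid 4x4 block:
print(R[-4:, -4:].sum())
```

Plotting `R` as an image (color coded by category) yields the kind of display shown in the figure; RQA statistics such as recurrence rate and determinism are then computed from this matrix.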
Parent-infant interaction in vocal category formation and parent recognition of
categories. Interviews with parents reveal awareness of infant vocal categories, and interaction
data confirm that parents attempt to elicit and assign communicative functions (especially
affective states) to them (74); that is, they tend to interpret infant sounds as expressions of state or
communicative intent. Parents in many cultures actively engage infants vocally (75–77) in
patterns suggesting the dyad is a dynamical system shaping the acoustic content of infant vocal
categories and their functions (78). Since vocal category formation and flexible usage are
required for speech, caregiver-infant interaction may play a key formative role in the speech
capacity (79–82), with the three salient categories of vocalization that are investigated here
serving as anchors for much of that interaction.
Much investigation has addressed dyadic interaction during precanonical stages (83–89),
yet the work has usually not taken account of the very categories of infant vocalization that
appear to be the focus of communication, e.g., squeals, growls, and vocants. Much of the work
distinguishes only between distress sounds (a category that collapses cry or cry-like sounds with
negatively charged protophones) and non-distress sounds (a category that collapses laughter or
laughter-like sounds with positively charged protophones). This approach also often groups
affectively neutral sounds (which constitute the great bulk of the protophones and the great bulk
of all sounds produced in infancy) along with happy sounds into a single category called
“positive”, and thus provides no basis for recognition of parental focus on the protophone
categories as distinct from laughter. The traditional research has focused on turn-taking (90),
rhythms of interaction deemed predictive of language (88), or disturbance of interaction (86, 91,
92), usually with no attention to the infant vocal categories that are primary anchors of adult
vocal communication with infants.
Additional research, with more direct focus on protophones, has examined parents’
selective responsivity to more speech-like sounds, as well as infant tendencies to produce
speech-like sounds based on parent vocal actions that encourage such sounds (80, 93, 94). From
our own laboratory, a case study was conducted using Cross Recurrence Quantification Analysis,
a variation of RQA, along with rhythmic spectral analysis. The research identified events of
infant-caregiver F0 and amplitude matching during bidirectional dyadic interaction (as opposed
to events when either infants or mothers were not responding to the other's directed
communications), further supporting the notion that F0-related protophones, such as squeals,
growls, and vocants, shape and/or are shaped by parent-infant interaction (95).
On the possible role of gesture in the origins of language
Given our focus on vocalization, a comment is in order regarding the widespread opinion
that the origin of human language is based significantly upon the early evolution of gestural as
opposed to vocal capabilities (96–99). This idea receives support in part from the fact that apes
learn sign language from humans much more easily than spoken language (100), from research
suggesting that (even without training) the gestural communication of apes appears to be
complex (101), and from research suggesting that a key factor in very early human development
of language is pointing and the joint attention that it undergirds (99).
We take all these points seriously, in addition to considering the possibility suggested by
a reviewer of this work that even in the realm of affective communication, apes may have more
flexibility in the gestural than the vocal modality. Yet our view emphasizes that both gesture and
vocal communication are involved in ape communication, and more important for our eclectic
view, both gesture and vocal communication emerge in early human development. Especially
notable for us is the apparent (but not very well-documented) tendency for human
communication in the first half year to be predominantly vocal and facial, rather than gestural.
Pointing, which we agree plays a very important role in establishing foundations for symbolic
(and inherently triadic) communication, does not emerge in human infants until the second half
year, and its appearance may be in part dependent upon prior development of extensive dyadic
communication in early vocal and facial interaction (102). Notably also, pointing is not itself
symbolism—it does not, in the absence of additional gesture, represent ideas or concepts, but
instead merely draws attention to entities in the here and now. Spoken words in human languages
play a primary role in genuine symbolism (which requires abstract representation), and although
it is not clear how early this predominance of vocal symbolism is established in development, it
is clear that only in cases of deafness or other disorders of communication does signed
symbolism play a predominant role in human communication.
Thus, as has been noted by many skeptics of the gestural origins idea, it is necessary
under any hypothesis of a role for gesture in the early evolution of language-like capabilities to
explain how and why the vocal modality eventually became predominant. We remain convinced
that one issue that should not be ignored in these speculations is the order of events in modern
human development. And here, it appears that the earliest phases of development do not support
a primarily gestural origin.
Still, we favor a view that emphasizes a role for various modalities in communication
throughout human life (103). Gesture and vocalization together are the rule rather than the
exception in human communication (104), and the face also plays a critical role. We think it is
preferable to emphasize multimodality not only in theories of current human communication but
also in speculations about likely evolutionary scenarios. The present paper focuses upon vocal
and facial communication (in part because our research has been keyed on the first months of
life), but we advocate developmental descriptions that incorporate information across additional
modalities. As coding technologies continue to improve, this should become increasingly
possible.
SUPPORTING METHODS
Infants and recordings
Parents of infants 2–3 months of age were recruited through word of mouth and child-
birth education classes. A consent form and questionnaire were provided to interested
individuals. Families returning the questionnaire and meeting inclusion criteria were contacted
for an interview. All procedures were approved by The University of Memphis Institutional
Review Board for the Protection of Human Subjects. Nine parents and their infants participated
in our longitudinal study based on this recruitment (Supporting Table 1).
Recordings in our laboratory occurred regularly throughout the first year of the infant’s
life. The laboratory consisted of two rooms: a recording/play room, and a control/equipment
room. The recording room was equipped with furniture and toys as in a child’s play room. Four
digital cameras, remote controlled, were mounted in 4 corners of the recording room. Two
cameras were chosen for recording at any point in time by switches in the control room. The
microphone-in-vest designed by Buder and Stoel-Gammon (105, 106) for prior research provides
a constant mouth-to-microphone distance of about 7 cm for wireless transmission of the child
voice. A similar microphone was worn during recordings by the caregiver at the lapel,
transmitting to the control room on a separate channel. Two channels of video/audio signals were
split and fed to 3 computer systems in the control room allowing AVI and compressed storage in
high fidelity audio (digitization at 48 kHz except for some of the earliest recordings which were
digitized at 44.1 kHz). The audio signals acquired of the infant voice in this circumstance were
of very high quality, presumably primarily due to the small microphone to mouth distance. The
availability of a separate synchronized channel of audio based on the parent microphone made
the differentiation of child and adult voices workable from spectrographic displays even in cases
of overlap. The assistant in the control room monitored video and audio, and assisted the parent
as necessary.
Infant       3-5 months            6-7 months            10-12 months            Total Utts.
             Age (mo.)   Utts.     Age (mo.)   Utts.     Age (mo.)     Utts.

1            3.4         334       6.5         227       10.0          242       803
2            3.1, 5.6    304, 330  -           -         10.4          180       814
3            5.4         563       7.5         299       11.3          272       1134
4            3.8         284       6.7         99        10.4          147       530
5            -           -         7.4         199       10.3, 12.9    330, 208  737
6            3.3         289       7.3         250       12.8          178       717
7            4.7         370       6.4         223       10.8          274       867
8            3.4         406       7.4         182       11.6          148       736
9            4.2         282       6.7         254       11.8          121       654

MEAN/sess.   4.1         350.9     7.0         216.6     11.2          210.1
Total Utts.              3162                  1733                    2100      6995
Supporting Table 1: Data from 9 infants were utilized. Each was recorded at an early, a middle, and a late age in the first year. In accord with our age-range criteria, the first and second recordings of participant 2 were assigned to the first age group, and participant 5’s first recording was assigned to the second age group, while both the second and third recordings were assigned to the third age group.
For calibration, a small tone generator affixed to a foot-long rod was placed with its
speaker directly adjacent to the infant’s mouth at each recording. A sound pressure level meter
affixed to the other end allowed calibration at each session for amplitude (62) based on a
protocol for adult recordings developed by Winholtz & Titze (107).
Each dyad’s recording day usually yielded 60 min of recording, from three 20-min
sessions. In one type of session parents were present and were instructed to interact with their
infants vocally and otherwise in a normal fashion, as at home. In another type of session, the
parents and an experimenter were both present and conversed regarding a questionnaire during
most of the session, while the infant was playing in the same room, often interacting with the
parent, but the goal was to allow the infant to vocalize independently. The third session type was
intended to consist of infants alone. In practice the outcome was mixed. Sometimes infants were
alone in the recording room part of the time, but most of the time they protested the intended
alone condition, and we ended up having the parent or an experimenter interact with the infant to
pacify him/her.
Selection of data for the present study
This longitudinal effort produced many recordings from which a selection was made
based on developmental level of the infants and availability of “good” recordings, i.e., ones
where infants produced a typical amount of vocalization according to the parents and where no
untoward technical recording events substantially limited the sound or video quality on either
channel. The selection included both interactive and questionnaire sessions with typical amounts
of vocalization at each of the three ages.
For the present analysis, utterances were selected if they were relatively salient, not too
low in intensity for reliable judgment and not too short in duration. This decision was based on
the theoretical assumption that utterances that are not salient presumably have little influence in
the interaction between infant and caregiver, nor do they seem likely to be noticed in parent
judgments of infant state or fitness. A second reason was that tests of coder reliability showed
clearly that agreement across coders was sharply reduced for utterances that had been judged
very short or very low in amplitude. A single judge went through the entire sample and
intuitively identified utterances that were too short or too low in amplitude; these, representing 22% of the
original total, were excluded from the analysis. In addition, in order for the cross-classification of
vocal type and facial affect to be possible, the face of the infant had to be seen. Utterances where
the infant’s face was not visible during the utterance on either of the two cameras were coded as
CantSee and were not included in the analysis. After the exclusions, there remained 6995
utterances for analysis.
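The selection rules just described can be sketched as a filter. The thresholds and records below are hypothetical; in the study these judgments were perceptual, made by a single judge, not computed from fixed cutoffs.

```python
# Sketch of the utterance-selection rules described above: exclude
# utterances judged too short or too low in amplitude, and those coded
# CantSee. Thresholds and records are hypothetical; in the study the
# short/low judgments were intuitive, not numeric.
from dataclasses import dataclass

@dataclass
class Utterance:
    dur_s: float       # duration in seconds
    level_db: float    # relative amplitude
    affect: str        # "Pos", "Neg", "Neutral", or "CantSee"

def keep(u, min_dur=0.1, min_level=-30.0):
    return u.dur_s >= min_dur and u.level_db >= min_level and u.affect != "CantSee"

utts = [
    Utterance(0.45, -12.0, "Neutral"),
    Utterance(0.05, -10.0, "Pos"),      # too short -> excluded
    Utterance(0.60, -45.0, "Neg"),      # too low   -> excluded
    Utterance(0.80, -15.0, "CantSee"),  # face not visible -> excluded
]
selected = [u for u in utts if keep(u)]
print(len(selected))  # -> 1
```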
Cries occurred in the samples in a natural way, but it should be acknowledged that if
infants cried persistently, sessions were sometimes terminated in order to allow feeding, naptime,
or consoling. As a result, the number of cry utterances in our samples may have been more
limited than would occur in more naturalistic sampling, for example, in cases of all-day
recording (108–110).
Coding software
The coding was conducted in software (AACT, Action Analysis, Coding, and Training),
developed by Intelligent Hearing Systems of Miami, FL in collaboration with the Memphis
research team. AACT is interfaced with TF32 acoustic analysis software (111) so both a
spectrographic/waveform display and the video signal from either of the two recording channels
can be simultaneously played (see Supporting Figure 4 for a screenshot of AACT).
The audio and video signals are synchronized to frame accuracy and are displayed in
TF32 with a scrolling cursor that allows the user to see the temporal relation between audio and
video at all times. Audio signals from both the infant and the parent microphone are separately
recorded and synchronized with the video channels. Audio signals can be localized with
accuracy limited only by sampling rate (i.e., with substantially better than ms precision). The
system facilitates locating utterances in audio, then using the determined onset and offset times
for utterances as references in any field of coding by simply clicking on the utterance label
presented with the time information in chronological order on the coding screen. Thus a
particular utterance can be played in audio or video or both.
Supporting Figure 4: The figure displays a screen shot from the AACT software (Action Analysis Coding and Training) from Intelligent Hearing Systems (IHS) of Miami, FL, as implemented in our laboratories. The spectrographic display is in TF32 by Paul Milenkovic (111), with adaptations for the AACT environment by Milenkovic and Rafael Delgado of IHS. Video can be displayed (Windows Media Player is invoked for this purpose) for either of two channels of recording, and the audio cursor follows the video with frame accuracy when the recording is played. The cursor can also be dragged on the TF32 screen, and the video will follow frame by frame. When the left and right cursors are both placed, a code can be selected by menu on the coding screen (at the right side of the image), and it will then appear in the coding stream on the right and as a label between the cursors on the TF32 screen on the left. Once a code has been established, its location can be played repeatedly in audio/video in a looping function (Do Loop). The two channels of audio (note the two waveforms at the top of the screen) correspond to a wireless microphone worn by the infant in the vest at chest level as can be seen in the image, and another microphone worn by the parent at the lapel. The coding screen on the right allows two fields (dimensions) of coding to be displayed simultaneously.
The primary fields (or coding dimensions) of interest here (user determinable—40 such
fields are available) are facial affect and vocal type, which for the primary coding in the present
study were conducted in video only and audio only respectively, during separate coding sessions
for each coder. The current research would have been extremely difficult to conduct without
AACT. Our collaboration with IHS (especially Rafael Delgado) and Paul Milenkovic has
produced these innovations; it is the only coding system we know of that allows high-quality
spectrographic display synchronized to frame accuracy with video from multiple channels.
Additional channels of information (e.g., carrying physiological parameters such as respiration or
heart rate) can also be synchronized and displayed as additional panels in TF32 during AACT
coding.
Utterance location for coding
Utterance location was accomplished in a first step in our coding where cursors in TF32
(Supporting Figure 5) were placed around each voiced “breath-group” in accord with criteria
first defined by Lynch et al. (112) in the Oller laboratory in Miami, and refined more recently in
collaboration with Buder in the Memphis laboratories. We sought thus to focus on a
physiologically-based unit that can be referenced across time and that has a clear relation with
the notion of utterance in adult speech. For two vocal events to be deemed two separate
utterances, no breath need actually be heard between them, but the perceiver must judge that
there is time enough between them for a breath to occur and that no glottal hold or other
consonantal interruption spans the period separating them. If there is a
perceived glottal hold or consonantal closure, the vocal events are treated as a single utterance in
our coding approach. Thus utterances in this definition can contain multiple prominences or
syllable-like units, and there can be silences within these utterances (corresponding to perceived
consonant-like elements), though the silences are rarely longer than 300 ms. Ingressive segments
between syllable-like energy prominences within protophones or within cry or laugh were, under
this definition of breath-group, treated as part of inter-utterance intervals.
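The breath-group grouping rule can be sketched as follows. The 300 ms threshold is suggested by the observation above that within-utterance silences rarely exceed 300 ms; in the study the judgment was perceptual (was there time enough for a breath, and was the gap bridged by a glottal or consonantal closure?), not a fixed numeric cutoff.

```python
# Sketch of the breath-group segmentation rule described above: two
# voiced segments belong to one utterance when the silence between them
# is too short to accommodate a breath (treated here as a fixed 300 ms
# threshold; in the study this was a perceptual judgment).

def group_utterances(segments, breath_gap=0.300):
    """segments: ordered list of (onset_s, offset_s) voiced intervals.
    Returns a list of (onset_s, offset_s) utterances (breath groups)."""
    utterances = []
    for on, off in segments:
        if utterances and on - utterances[-1][1] < breath_gap:
            # Gap too short for a breath: merge into the current utterance.
            utterances[-1] = (utterances[-1][0], off)
        else:
            utterances.append((on, off))
    return utterances

segs = [(0.00, 0.40), (0.55, 0.90), (1.60, 2.10)]  # gaps: 150 ms, 700 ms
print(group_utterances(segs))  # -> [(0.0, 0.9), (1.6, 2.1)]
```

The first two voiced segments merge into one utterance (their 150 ms gap could be a consonantal closure), while the 700 ms gap starts a new breath group.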
Coding training and coding procedure for both vocal type and facial affect
The primary or “master” coding occurred in stages where the first was often conducted
by a relatively novice laboratory assistant. In such cases at least a second stage was always
conducted by a senior coder with years of experience in our laboratories, and in this second
stage, many of the codes of the novices were changed. Thereafter, many utterances (where
discrepancies between codes had been observed) were checked multiply with senior coders, until
a final consensus was reached. Facial affect coding and vocal type coding were conducted in
separate sessions, facial affect with video only, vocal type with audio only.
Supporting Figure 5: The image illustrates our approach to “utterance” selection. Cursors surround one of the five utterances in this segment presented in TF32. The utterances were all perceived to constitute separate “breath groups”, which is to say that for each utterance, a breath was perceived to have been taken (an ingress occurred) after the utterance concluded or at least it was perceived that there was time enough for a breath to have been taken. The time periods associated with each of the utterances can be gauged by the segmentation marks and labels at the bottom of the screen, and precise temporal information is displayed at the top of the screen.
Relatively little training was required to achieve reasonable levels of interjudge
agreement on coding in this infraphonological domain. The reason would appear to be that the
vocal and facial affect categories are biologically significant units to which all normal humans
respond in similar ways (as Darwin’s principle of variability predicts). If our theoretical
assumptions are on target, the reactions of various coders should be similar, and the training of
coding on these infant actions should be relatively easy, because these vocal/facial events are
anchors for communication between caregiver and infant regarding infant well-being and state in
the first months of life. The training of coding in these domains requires, we surmise, primarily
activating latent awareness of infant communicative actions and ensuring that the labels used in
the coding software are understood by the coders.
Definitions of vocal types and facial affect types in the study
For Vocal Type coding, no definition was given for cry or laugh, since it was assumed
that these terms would be applied appropriately without training. However, coders were given a
“reflexivity” instruction—cries and laughs were to be coded only if the coder perceived
(intuitively of course) the infant to have produced the sound reflexively. The reliability results
suggest that the master coding (which was finalized by consensus as indicated above) was
conducted with a relatively high threshold for coding vocalizations as reflexive. Reliability
coding on the other hand, appears to have involved lower thresholds (yielding more utterances
judged cries and/or laughs) for some of the listeners, perhaps because reliability coding often
occurred with even less training than in the case of preparations to participate in the master
coding. Also, no consultation occurred among coders or trainers once any reliability coding
session had begun. Four individuals coded vocal type for hundreds of utterances from the
recordings, independently of the master coding, and seven individuals coded facial affect
independently of the master coding. For data on intercoder agreement, see below.
For vocal type, coders were instructed to listen with audio only (video screens were not
available at all), to click on each utterance in sequence, and to:
a. Code an utterance as Cry or Laugh if you judge the utterance to be a reflexively produced version of one of these.
b. Code an utterance as FullV (here called “Vocant”) if it is predominantly and most saliently produced in modal phonation, in the mid pitch range of the infant.
c. Code an utterance as Squeal if it is notably higher in pitch than the normal range of the infant.
d. Code an utterance as Growl if one of two conditions is met: either the most salient pitch is notably lower than the normal range, or the pitch is in the normal range but the utterance is produced with very high tension (for example, pressed voice) yielding considerable dysphonation.
e. If an utterance seems to show a combination of features, make your choice based on what you think is most salient. An utterance that is strongly both squeal-like and growl-like must be judged as one or the other.
It is worthy of note that the vocal categories, being applied always at the utterance level,
should be thought of as prominent individual “features” of utterances. Thus an individual
utterance may be mixed, having multiple features as indicated in (e) above. Of similar
importance is the recognition that even advanced utterances with very speech-like characteristics,
such as multisyllabic canonical babbling, can be characterized as squeal, vocant or growl, on the
basis of vocal quality features.
For facial affect coding, coders were instructed to:
a. Code Pos if you see smiling or grinning any time during the utterance.
b. Code Neg if you see frowning or grimacing any time during the utterance.
c. Code Neutral if you see neither frowning nor smiling during the utterance.
d. Code CantSee only if you cannot see the baby's facial affect during the utterance.
The facial affect categories were originally planned to be trained to an Ekman standard
with many categories and considerable attention to particular musculo-facial features (113). But
empirical results and our theoretical goals inclined us to simplify. For both facial affect and vocal
type, the coders were encouraged thus to act intuitively rather than to struggle with technicalities.
Positivity and negativity were assumed to be determinable on the basis of any video evidence
that the observer took to indicate clear divergence from neutrality. To encourage intuitive
judgment the coders were discouraged from listening or viewing more than three times. For
facial affect, there were two channels of video available, so coders were allowed to first look to
see if the channel selected gave a clear view of the child’s face, and if not to switch to the other
channel. If a better view was obtained, three viewings were then allowed on the second channel.
If neither view produced a workable image of the child’s face, the code CantSee was prescribed.
Positive, neutral and negative affect as a proxy for function
To simplify the current study on functional flexibility, we instructed coders to categorize
infant vocalizations for affect in a three-way forced-choice task in which the options were positive,
neutral and negative. The following is a justification for this simplified coding of affect,
supplementing the discussion above under Affect and context in the judgment of function in
infant vocalizations.
Note that we make no attempt here to draw an operational distinction between affect and
emotion, although we recognize affect as an expression that presumably reveals emotional states.
The relation between them however is a matter of debate in the literature on emotion—see e.g.,
Damasio’s discussion of this issue (144)—and the complexity of the relation between them and
additional concepts such as feelings or consciousness surely cannot be resolved here. Our results
indicate that there is a consistent, if not perfect, relation between positive, neutral and negative
affect and both interpreted illocutionary forces and perlocutionary effects (see Main text Figure 3 and
Supporting Results: The role of affect expression in the functional interpretation of infant
protophones). When we imply or say that positive, neutral or negative emotional states
correspond to the three affect conditions, we are merely extrapolating from the assumption that
the entire chain of states and acts—emotions, which are related to affect expressions, which are
related to illocutionary acts, which are related to perlocutionary effects—must be consistently
maintained in order for the natural selection of affect expressions to occur.
Nonhuman primate calls often include positive valence (e.g., celebratory vocalizations
akin to human laughter (114–116)) or negative valence (as in threats, warnings and distress
calls (4, 33, 117)). While it might be presumed that positive and negative valences for such
vocalizations are stable in nonhuman primates, in fact there has been little
investigation of facial affect during such vocalizations (118).
While it might be reasonable to seek to characterize infant sounds (cry and laugh as well
as protophones) in terms of a large set of illocutionary categories (including for example
expressions of exultation, sadness, disgust, relief, comfort, anger, distress, etc.), there are good
reasons to limit our categorization for the purposes of this paper to just positive, neutral and
negative. First, in spite of considerable speculation and evidence about possible innateness of a
large number of emotional expressions (113), prominent emotion theorists are unpersuaded that a
large set of emotional expression categories is well organized in the young infant. Instead they
tend to support the view that emotion expressions begin in relatively undifferentiated form (12,
14), with a very small number of clear affect types. These emotional expression types are
presumed to become elaborated and more neatly tied to circumstances as the infant matures.
Also, consistent with our reasoning in the section on Affect and context in the judgment of infant
vocalizations, it is sensible to assume that mature human observers can make reliable judgments
about functions of infant vocalizations, since evolution surely provided a system of vocal
communication that is adaptive.
Consequently we resolved to adopt a simplified categorization of affect types,
implemented by mature human observers, for the present work. Positive, negative and neutral
facial affect can be categorized with relatively good agreement across observers even in early
infancy on the basis of intuitive judgments with very little training. These judgments can be
made under instruction in ways that presumably mimic the kinds of judgments caregivers make
about infant expressions (is the infant comfortable, in distress, joyful?). It has been reasoned that
such judgments are made constantly by parents, even if unconsciously, and that they form a basis
for fitness judgments that have played a major role in human evolution of vocal communication
(119). While our simplified affect coding system does not itself specify “functions” of
vocalizations illocutionarily nor does it indicate specifically which of several possible emotional
types identifiable in adults may be involved in a particular vocalization (e.g., a negative
communication might be deemed an expression of anger, sadness, disgust or distress), it does
limit the field of possible emotional expressions to classes of functions related to positivity,
negativity or neutrality. Furthermore it allows direct comparison of flexibility of communicative
vehicles across protophones and stereotyped species-specific signals in infants.
Illocutionary force coding
In response to an anonymous critique, we resolved to provide direct evidence that facial
affect accompanying protophones predicts illocutionary force. Before this article was originally
submitted for publication, an illocutionary coding scheme had already been developed and a
subset of the data had already been coded. Data from this prior work are reported in Figures 3a–
3b of the Main Text.
Coding for illocutionary force was conducted with simultaneous audio and video, and
without access to the facial affect or vocal type codes. The codes available in the illocutionary
field can be characterized in three groupings A, B and C, as follows:
A) Converse: codes corresponding to exultation or to vocalizations that had the apparent goal of initiating or continuing a comfortable protoconversation:
   i. Continue: Continuation of protoconversation or of vocalization interchange in a game (such as peekaboo).
   ii. Elicit Turn: Initiation of protoconversation with the caregiver or elicitation of a turn from the caregiver.
   iii. Exultation.
   iv. Imitation of the caregiver voice.
   v. Show, Offer, Accept: Vocalizations during offers, acceptances or showing of objects in play.
B) Complain: Codes corresponding to social negativity:
   i. Complaint.
   ii. Plea for help.
   iii. Refusal, especially of objects in play.
C) Indeterminate: Codes corresponding to neither A) nor B):
   i. Object-directed.
   ii. Vocal play.
   iii. No force: no discernible illocutionary force.
Perlocutionary effect coding
In response to the same anonymous critique, we resolved to provide direct evidence that
facial affect accompanying protophones predicts perlocutionary effects as indicated by caregiver
responses. For this effort, we engaged in two rounds of new coding. In the first round, nine
recording sessions (representing all ages and all infants) were reviewed, with both audio and
video signals available for coding, but perlocutionary effects were coded only if the infant
utterances had been deemed to have either negative or positive affect during facial affect coding
(neutral utterances were not considered). The data from this first round are reported in
Supporting Results: The role of affect expression in the functional interpretation of infant
protophones, Facial affect and perlocutionary effects of infant protophones.
Reasoning thereafter that utterances with neutral facial affect might also be important in
interpretation, we conducted a second round of perlocutionary coding, in which six sessions (six
infants, all ages) were coded, but this time we considered utterances with all three types of facial
affect. Facial affect, vocal type and illocutionary codes were not available during the second
round of perlocutionary coding. Simultaneous audio and video were available to the coders to
make their judgments.
The recording sessions for both the first and second rounds of perlocutionary coding were
selected in a semi-random fashion where an infant could be selected only once within a round
and all three ages were represented equally. In total, data from 14 different sessions were thus
coded (one session appeared by chance in both rounds), resulting in data on perlocutionary
effects from more than a quarter of the total sample.
The coding of perlocutionary effect involved observation of complex events, and the
primary coding in the two rounds is subject to the concern that the observers may have been
biased to categorize parent reactions by virtue of also hearing and seeing the infant actions. As a
check against this possible bias, we engaged in an additional coding evaluation taking advantage
of the fact that parents often spoke during the interactions about their reactions to the infants’
vocalizations. Three of the sessions for the second round of perlocutionary coding reported in
Figures 3c–3d of the Main text had been coded by one observer and three by another. To
prepare for the coding check on possible bias, the first observer extracted (from the three
sessions he coded) parent utterances in audio alone from the up-to-four-second period of
perlocutionary observation, taking the precaution of eliminating (not extracting) any such
utterance where the child’s voice could be heard. All the parent utterances (N = 157) meeting
this criterion for the three sessions were extracted. The second “blind” coder then, having not
coded these sessions, was presented with the parent utterances in audio only and was asked to
judge perlocutionary force based on these utterances, a circumstance where the infant utterance
(and the visual setting) could not have played a role in the judgments.
The lack of visual information could have placed a notable limitation on the ability of the
blind coder to judge perlocution. Even so, as can be seen below in Supporting Results: The
role of affect expression in the functional interpretation of infant protophones, Facial affect
and perlocutionary effects of infant protophones, the results show that both the blind coder and
the original coder categorized parent reactions in strong and highly reliable accord with the
threefold groupings of perlocutionary force as predicted by infant facial affect associated with
infant protophones.
In all perlocutionary coding we focused on caregiver reaction during the short period of
time (up to 4 sec) following an infant utterance, with the perlocutionary judgment focused on
events ending before the onset of the next infant utterance. We limited the perlocutionary coding
in time based in part on the temporal relation among the infant utterances and in part based on
the need for there to be some minimum amount of time for coders to evaluate parental reactions,
which often included sentences expressing the parents’ opinions about infant state. Thus if infant
utterances were produced in a rapid series, no perlocutionary judgment was allowed until the end
of the sequence; the judgment was focused on events beginning after the last utterance in that
sequence had been initiated, and was treated as being associated with the facial
affect of that last utterance in the series (although it often seemed clear that the perlocution was
influenced by multiple utterances). We set a criterion where a rapid series would be deemed
ended at any gap > 450 ms without infant vocalization; thus the shortest possible time frame for
perlocutionary judgment was 450 ms plus the duration of the infant utterance, and the longest
time frame was 4 sec plus the duration of the infant utterance.
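The windowing rule just described can be sketched in code. The sketch below is purely illustrative (the study's coding was done by human observers, not software), and the utterance times and function names are hypothetical:

```python
# Sketch of the perlocutionary judgment window rule (hypothetical data;
# the study's actual coding was performed by human observers).
# An utterance is (onset_sec, offset_sec). A rapid series ends at any
# gap > 450 ms; the judgment window then runs from the end of the last
# utterance in the series for up to 4 s, ending early at the next onset.

GAP_S = 0.450      # gap that terminates a rapid series
WINDOW_S = 4.0     # maximum observation window after the series

def judgment_windows(utterances):
    """Group utterances into rapid series and return, for each series,
    (index of last utterance, window_start, window_end)."""
    windows = []
    i = 0
    while i < len(utterances):
        j = i
        # extend the series while inter-utterance gaps are <= 450 ms
        while (j + 1 < len(utterances)
               and utterances[j + 1][0] - utterances[j][1] <= GAP_S):
            j += 1
        start = utterances[j][1]                 # end of last utterance
        end = start + WINDOW_S
        if j + 1 < len(utterances):              # stop before next onset
            end = min(end, utterances[j + 1][0])
        windows.append((j, start, end))
        i = j + 1
    return windows

# Example: the first two utterances form a rapid series (gap 0.3 s);
# the third stands alone.
utts = [(0.0, 1.0), (1.3, 2.0), (5.0, 5.5)]
print(judgment_windows(utts))  # → [(1, 2.0, 5.0), (2, 5.5, 9.5)]
```

Note that the judgment is associated with the facial affect of the last utterance in each series, as described above.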
The codes available in the perlocutionary field to code the parental reactions can be
characterized in three groupings A, B and C, as follows:
A) Encourage: Codes corresponding to initiation or continuation of a comfortable protoconversation:
   i. Elicit turn: Initiation of protoconversation with the infant or elicitation of a turn.
   ii. Continue: Continuation of a protoconversation or game (such as peekaboo).
   iii. Imitation of an infant vocalization.
   iv. Praise.
   v. Smiling at the infant.
   vi. Patient waiting for the end of an infant series of sounds with silences exceeding the 450 ms criterion, after which a parent response indicated she had been waiting to praise, exult, or otherwise encourage continuation of the interaction.
   vii. Exultation by the caregiver over the infant utterance.
   viii. Offer, accept or show objects in play.
B) Change: Codes corresponding to evaluations of possible change in the situation for the infant, actions involving change, or attempts to change the infant state through vocalization:
   i. Evaluation of the infant state clearly indicating that the caregiver is considering taking action to make the infant more comfortable (statements such as “I think she’s wet”, “Are you hurting, honey?”, “Do we need to change the situation now?”, etc.) or expressions of alarm (e.g., “Oh no!”).
   ii. Change Situation: Physical actions to change the situation (picking the infant up and patting her back, moving the infant to a new location for play, taking the infant to the changing table, etc.).
   iii. Soothe, Scold, or Negative Command: Vocal soothing (e.g., “oh you poor thing”), scolding (e.g., “that’s not nice”) or negative commands about the infant’s vocalization (e.g., “stop that”).
   iv. Distract: Attempts to distract the infant (e.g., with a new toy).
   v. Frown at the infant.
C) Unclear: Codes corresponding to neither A) nor B):
   i. Other directed: Utterances not directed to the infant and not related to the infant state (e.g., talking to someone else in the room about unrelated matters, e.g., “I need to make a phone call”).
   ii. Unobservable: The parent’s reaction cannot be discerned because the parent cannot be seen in the video image and/or cannot be heard saying anything.
   iii. Irrelevant: State irrelevant (the parent may say something to the infant, but it is not part of the protoconversation and reveals nothing about infant emotional state—e.g., “let’s not put our fingers in our mouths”).
Observer agreement levels for both vocal type and facial affect
Of course infant actions become more complicated as the infant matures (both in the
protophones and in the stereotyped species-specific signals), but regarding observer agreement
for vocal types present in the first months, the data show that adequate training can be conducted
in a few sessions and that observer agreement can thus reach levels we consider reasonable—
agreement is very high for reflexively produced cry vs. laughter (stereotyped species-specific
signals), and moderate for differentiation among the three protophones considered here (see
Main text). The protophones, of course, have been shown to include considerable overlap
among their acoustic features (see Supporting Figure 1 above). Differentiation of cry from
laughter was presumably high because nature appears to require these sounds to be maximally
separable, as seems to occur with homologous species-specific signal categories of non-human
primates and other animals (1, 2, 120). Stereotyped species-specific signals appear to have this
distinctiveness precisely because there is high survival value in leaving little room for
mistake about their illocutionary functions and how to respond to them (121).
Observer agreement on differentiations within the stereotyped species-specific signals
and within the protophones should not be taken to predict the level of differentiation of
stereotyped species-specific signals from protophones, as these are actually separate matters.
Consider how the cry category was defined. We instructed coders to designate utterances as cry
if and only if they seemed reflexive. The coders were thus expected not to code clearly volitional
utterances as cry, and would instead be expected to call them “vocant”, “squeal”, or “growl”. A
category “fuss” was not included, and in general fussy utterances would be coded as one of the
protophones, most often “vocant”. For cries identified in the master coding, the four reliability
coders working independently coded 88% of these as cries, but at the same time compared to the
master coding, the reliability coders designated more than twice as many utterances as cries (most
often designating utterances as cries that were called vocants in the master coding). This pattern
suggests a judgment criterion difference—the master coding appears to have more consistently
reflected a strict (high threshold) application of the reflexivity instruction, excluding many fussy
utterances from the cry category and including them instead in a protophone category, while the
reliability coders appear to have more liberally included fussy utterances in the cry category. For
laughter, one of the reliability coders appears to have set a very high criterion for coding of
laughter, such that only 5 utterances were so designated, while the primary coding had 54 laughs
for that reliability set. The other three reliability observers coded 61% of the master coded laughs
as laugh but coded just about twice as many items as laugh as had occurred in the master coding,
suggesting again a criterion difference, where the master coding had observed the reflexivity
criterion more rigidly. It seems clear that laugh was harder to differentiate from protophones than
cry was.
This pattern of reliability (observer agreement) results suggests that there was
considerable mixture of cry and laugh features with the protophones—infants appear not only to
acquire the ability to produce vocalizations free of emotional presetting (the protophones), but
they also acquire the ability to utilize cry and laughter characteristics relatively freely in
combination with protophones. Later in life, of course, humans have enormous vocal freedom
and can produce cry and laughter-like sounds on command, with some actors being able to do so
in a way that is essentially indistinguishable from the reflexive forms of the stereotyped species-
specific signals. We interpret this capability as a reflection of the remarkable human vocal
freedom that makes language possible. At the same time, even in adulthood, there are
circumstances of high emotional arousal or instantaneous urgent events where stereotyped
species-specific signals (e.g., cry, shrieking, laughter, moaning) appear to be elicited reflexively
or near reflexively, presumably reflecting our primate heritage (116) (e.g., when
laughing uncontrollably while being tickled). Our results presented in this paper suggest that
human infants begin to break away from the rigid mold of vocal expression in the first months of
life, both by developing protophones that have very high flexibility from the onset and by
acquiring across time the capacity to voluntarily manipulate features of the stereotyped species-
specific signals and begin to mix them with protophones.
SUPPORTING RESULTS
Audio-Video examples (Movies) illustrating the protophones and their variability in facial
affect expression
Examples of infant vocalizations illustrating the protophones of interest in the present
work and their flexibility to express various facial affect conditions are supplied with the
Supporting materials. These examples have been selected to illustrate that it is possible for
squeals, vocants and growls to be used with flexible facial affect expression in human infants.
We have selected the examples in “clusters” representing a particular protophone from a single
infant within a single recording. The first cluster of movies (CL1) consists of four examples of
growls drawn from a recording of a 6-month-old girl who produced these four growls within a
three-minute period, displaying a full range of facial affect (positive, neutral and negative).
To observe this pattern (and for all the audio-video examples) we advise the reader to
listen first to all examples within each cluster without watching, attempting to determine affect
by audio. Then we recommend watching the video without audio for all four examples, again
judging affect. As a last step we recommend watching and listening simultaneously. To
maximize independence as a listener we also advise waiting until you have listened to and
judged all the examples in all the clusters before looking at the list where we have recorded the
facial affect judgments made by staff in our laboratory (the master coding of facial affect) at the
end of the Supporting Results, page 69.
Multiple observers in our laboratory agreed that all the examples in CL1 (Movie S1
through Movie S4) were growls, and they all agreed based on video judgments alone that one of
these growls had clear facial affect positivity while one had clear facial affect negativity. At the
same time, audio-based judgments of affect were variable across the observers. Audio judgments
generally show better than chance agreement across observers for facial affect on protophones,
but video judgments show much higher agreement, illustrating that infants are free to adapt the
protophones to various expressions of affect independent of particular sound characteristics of
the individual protophone utterance.
CL2 consists of three squeals (Movie S5 through Movie S7) with varying facial affect
from the same infant on a different day, also at 6 months. CL3 gives two examples of vocants
from the same infant at the same age, showing alternately positive and negative facial affect
(Movie S8 and Movie S9).
The remaining examples come from a different infant. First, at three months we present
CL4 with five examples of vocants with variable facial affect (Movie S10 through Movie S14).
Next is CL5 with two examples of squeals (Movie S15 and Movie S16) also at three months,
and last are three examples of squeals in CL6 from the same infant at 10 months (Movie S17
through Movie S19).
In all cases the examples within these clusters were drawn from a single infant within a
single recording, a pattern of selection intended to highlight the fact that facial affect can be very
flexibly associated with any of the protophones in normally developing infants. At the same time
it is important to emphasize that audio characteristics within protophones often do transmit affect
information in accord with the affect information transmitted facially. The key point of the
present paper is not that protophones as judged from audio are always uninformative regarding
affect (that is not true), but that they are often uninformative regarding affect, and sometimes
even yield contradictory judgments with respect to video-based facial affect judgments in forced
choice circumstances. Thus research in our laboratory has shown that a facially positive
utterance (based on video judgment) can often be judged negative based on audio or vice versa,
and it is often the case that judges express strong doubts with regard to any audio judgment of
affect for an utterance while being quite certain of their affect judgments for the same utterance
based on video. Judgments of affect based on video alone are much more likely to conform to
video plus audio judgments than are judgments based on audio alone, suggesting that for affect
judgment the facial configuration plays the predominant role.
Odds Ratio Analyses to illustrate the distinction between protophones and stereotyped
species-specific signals
Statistical reliability of the six patterns—protophones showing 1) far more positivity than
cry but 2) less than laugh, 3) far more neutrality than either cry or 4) laugh, 5) far less negativity
than cry and 6) more than laugh—can be shown with Odds Ratio Analyses. The procedure
requires a 2 × 2 table for each of the 18 comparisons (see the example in Supporting Table 2).
            Neg    Not (Pos + Neut)    Total    Prop. Neg
Cry         278    12                  290      0.96
Squeal      221    735                 956      0.23
Total       499    747                 1246

Odds Ratio = 77.0
Lower 99% CI = 35.1
Upper 99% CI = 169.1
Supporting Table 2: Example 2 × 2 table for Odds Ratio Analysis. Here squeals and cries are compared for negativity. Total numbers across the sample of cries and squeals judged negative by facial affect were entered in the Neg column, and the sums of cries and squeals judged either positive or neutral were entered in the Not column. The Odds Ratio (OR) of 77 is the ratio of the two odds—(278 ÷ 12)/(221 ÷ 735). An OR of 1 would indicate no tendency for either cries or squeals to be more negative. The data indicate that cries were 77 times more likely to be negative than squeals. The 99% confidence intervals around the OR do not include 1, indicating that with at least p < 0.01, this OR is statistically reliable.
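For readers who wish to reproduce the computation, the OR and its 99% confidence interval can be obtained from the four cell counts of Supporting Table 2 with the standard log-odds-ratio method. This is a sketch, not the authors' actual analysis code:

```python
import math

def odds_ratio_ci(a, b, c, d, z=2.576):
    """Odds ratio for a 2 x 2 table [[a, b], [c, d]] with a Wald
    confidence interval computed on the log scale
    (z = 2.576 gives a 99% CI)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Cell counts from Supporting Table 2 (cry/squeal by negative/not)
or_, lo, hi = odds_ratio_ci(278, 12, 221, 735)
# Close to the reported OR = 77.0, 99% CI 35.1–169.1 (up to rounding)
print(or_, lo, hi)
```

The confidence interval is symmetric around ln(OR), which is why the reported bounds (35.1 and 169.1) have a geometric mean near 77.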
Tables modeled after Supporting Table 2 were created to compute ORs for the 6
patterns of distinction between affect expression in protophones and in cry and laugh. Thus 18
tests (6 patterns × 3 protophones) of ORs were conducted (Supporting Table 3).
Prediction                 Squeal                  Vocant                  Growl
                           OR      99% CI          OR      99% CI          OR      99% CI
Prot > pos than cry        98.09   15.62–615.85    35.32   5.65–220.88     48.76   7.75–306.63
Prot < pos than laugh      37.5    16.4–85.6       58.9    26.2–132.4      42.7    18.8–97.2
Prot > neut than cry       27.0    11.6–62.9       57.6    25.0–132.7      44.0    18.9–102.5
Prot > neut than laugh     15.4    6.5–36.3        32.9    14.1–76.6       25.2    10.7–59.2
Prot < neg than cry        77.0    35.1–169.1      176.0   81.5–380.2      147.5   66.8–325.5
Prot > neg than laugh      50.81   3.79–680.87     22.25   1.7–296.67      26.55   1.98–356.28
Supporting Table 3: Odds Ratio Table for the whole dataset. The OR of 77 is seen near the lower left of the Table representing the test of squeals versus cries on negativity (the example in Supporting Table 2). Notice that for all 18 comparisons (six comparisons represented in the rows of the table for each of the three protophones), the OR was very large (smallest = 15.4, meaning protophones were 15.4 times more likely to be neutral than laughs) and that the 99% Confidence Intervals (CIs) never included 1, indicating that all 18 ORs had p < 0.01. The results thus indicate that protophones were vastly more flexible in affect expression than the stereotyped species-specific signals.
Consistency across infants for the six patterns showing functional flexibility in the
protophones but not in the stereotyped species-specific signals
The six patterns considered in the Main text and confirmed in Figure 2 applied to all the
infants (Supporting Figure 6). These individual infant patterns were also subjected to OR
analysis. One infant produced no cries, so the 18 comparisons as in Supporting Table 3 were
developed for 8 infants, and 9 were developed for the infant with no cries in her samples.
Supporting Figure 6: The consistency of the results obtained in the present work is supported by the fact that all the infants showed the six patterns. Compare the results to Figure 2 of the Main text. In almost every case, protophones showed 1) more positivity than cry but 2) less than laugh, 3) more neutrality than either cry or 4) laugh, and 5) less negativity than cry but 6) more than laugh. All these trends were strongly supported by odds ratio analysis with 152 of 153 ORs > 1. Infant 6 produced no cries in her sessions, and also produced very little negativity in vocal expression, including no negative growls; in this one case laughs were found slightly more negative than growls.
This yielded a total of 153 OR comparisons. In 152 of those cases the OR was > 1, indicating
very strong support for the idea that infants showed much greater flexibility of affect expression
with protophones than with the stereotyped species-specific signals.
Functional flexibility of vocalization even at the youngest ages of the infants in the study
Figure 2 in the Main text illustrates that at all three ages the infants showed the
flexibility pattern strongly. The patterns also applied in infants at the youngest ages, as indicated
in Supporting Figure 7 below, where data only from infants at 3 and 4 months are displayed.
[Figure: bar chart showing proportion with designated affect (Positivity, Neutrality, Negativity) for Cry, Laugh, Squeal, Vocant and Growl; infants at 3 and 4 months only.]
Supporting Figure 7: Even at the youngest ages, the infants showed the six patterns of differentiation of protophones from the stereotyped species-specific signals, cry and laugh, in flexibility of expression. Compare the results to Figure 2 of the Main text. Protophones in the infants at the youngest ages showed 1) more positivity than cry but 2) less than laugh, 3) more neutrality than either cry or 4) laugh, and 5) less negativity than cry but 6) more than laugh. All these trends were strongly supported by Odds Ratio Analysis. As seen in Figure 2 of the Main text, infants at all the ages showed the same strong pattern.
Observer agreement on six patterns of functional flexibility in protophones
Robust coder agreement was found when the Master coding used for all the analyses
above and in the Main text was compared with an independently coded subsample. New coders,
after minimal training, coded 9 randomly selected sessions (1 for each infant, 3 for each age
group), for both vocal type and facial affect according to the same protocol (see Supporting
Methods) used with the Master coding, except that there was no review or changing of codes by
expert coder-supervisors as had occurred in the last phase of the Master coding. In Supporting
Figure 8 all 3 panels represent coding of the same subset of the data (21% of the total).
Supporting Figure 8: The reliability of the results obtained in the present work is notably supported by the fact that three completely independent codings of 21% of the dataset for both facial affect and vocal type all showed the same key results differentiating the stereotyped species-specific signals and the protophones. Compare the results to Figure 2 of the Main text. The Master coding in Supporting Figure 8 represents judgments for the same 21% of the dataset that was judged by the two reliability coders. In each case, protophones showed 1) more positivity than cry but 2) less than laugh, 3) more neutrality than either cry or 4) laugh, and 5) less negativity than cry but 6) more than laugh. All these trends were strongly supported by Odds Ratio Analysis for all three codings. The average OR (18 comparisons for each of the 3 coders) exceeded 7 for all three coders, and the lowest OR in any of the 54 comparisons was 1.47, indicating that all comparisons for all coders conformed to the same pattern
as in Figure 2 of the Main Text, where the ORs were also all > 1. On at least half the 18 ORs in Supporting Figure 8 for each coder the value of 1 was not included within the 95% CI, indicating strong statistical reliability of the findings across coders.
Additional Contingency Table Analyses illustrating individual differences on protophone
expression of affect, but not on stereotyped species-specific signal expression of affect
The results of Contingency Table Analysis revealing individual patterns for the
protophones are displayed in Figure 5 in the Main text. To amplify the findings reported there,
we provide data in Supporting Table 4, in which the notable individual variability among
infants on protophones for facial affect positivity is contrasted with the obvious similarity among
infants with regard to facial affect positivity for cries and laughs. We began by constructing two
contingency tables of raw data for utterances from each infant: a 2 × 2 table for each infant
represented cry/laugh by positive/negative, and a 3 × 2 table represented the protophones,
squeal/vocant/growl by positive/negative. Utterances judged neutral in facial affect were not
considered in these comparisons (in contrast with Figure 5 of the Main text), given that both cry
and laugh showed too few cases judged neutral to make quantitative comparison interesting.
Adjusted residuals were computed for each table. One of the nine infants produced one laugh
(positive facial affect) but no cries, so her data were not considered in this analysis. The goal was
to determine the distribution of infants on adjusted residual values for each vocal type in order to
compare individual differences across stereotyped species-specific signals as opposed to
protophones in facial affect expression. In this analysis positivity and negativity provide mirror
images of the same pattern—we display results for positivity only, but results for negativity can
be imagined by simply changing the top title to “Negative Facial Affect”, reversing the order of
the column labels (changing to this order: > 1.96, 1.96 to 0, 0 to – 1.96, < – 1.96) and leaving the
cell entries as they are.
Vocal type    Positive Facial Affect
              < –1.96    –1.96 to 0    0 to 1.96    > 1.96
Laugh             0           0            0           8
Cry               8           0            0           0
Squeal            5           1            2           1
Vocant            1           3            2           3
Growl             0           3            5           1
Supporting Table 4: All 8 infants who both cried and laughed in the samples showed very high adjusted residuals on facial affect positivity for contingency tables on stereotyped species-specific signals (cry/laugh) vs. positive/negative facial affect— that is, all infants showed more than 1.96 SD above the expected value for positivity of laughs and all showed more than 1.96 below the expected value for positivity of cries. In contrast, there was substantial variation among the infants for expression of positivity in protophones—vocants, for example, were expressed in 3 infants with positive facial affect at more than 1.96 SD above the expected value, but in one infant at more than 1.96 SD below the expected value; the other 5 infants fell in between on vocants, 2 showing weak tendencies above the expected value for positivity and 3 showing weak values below. In this analysis positivity and negativity provide mirror images of the same pattern—we display results for positivity only, but results for negativity can be imagined by simply changing the top title to “Negative Facial Affect”, reversing the order of the column labels (changing to this order: > 1.96, 1.96 to 0, 0 to – 1.96, < – 1.96) and leaving the cell entries as they are.
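The adjusted residuals underlying Supporting Table 4 follow the standard formula for two-way contingency tables: observed minus expected, divided by its estimated standard error. The sketch below illustrates the computation; the cell counts are invented for illustration, not taken from the study:

```python
import math

def adjusted_residuals(table):
    """Adjusted (standardized) residuals for a two-way contingency
    table: (obs - exp) / sqrt(exp * (1 - row_tot/n) * (1 - col_tot/n)).
    Values beyond +/-1.96 indicate cells deviating from independence
    at roughly the .05 level."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    out = []
    for i, row in enumerate(table):
        out.append([])
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n
            se = math.sqrt(exp * (1 - rows[i] / n) * (1 - cols[j] / n))
            out[i].append((obs - exp) / se)
    return out

# Invented cry/laugh-by-positive/negative table for one infant
res = adjusted_residuals([[20, 5], [3, 22]])
print(round(res[0][0], 2))  # → 4.82
```

In a 2 × 2 table all four adjusted residuals have the same absolute value, which is why positivity and negativity give mirror-image patterns as noted above.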
All these data show that the sharp contrast between flexibility of stereotyped species-
specific signals and protophones illustrated in Figure 2 of the Main text and in Supporting
Table 3 as well as Supporting Figures 6 and 7 also applied to individual differences—there
were scarce individual differences for facial affect expression with stereotyped species-specific
signals, cry and laugh, but notable individual differences for protophones.
Log-linear analyses: Individual differences in the expression of affect in infant vocalization
A typical log-linear analysis begins by defining a series of hierarchical models. The goal
of log-linear analysis is to identify the simplest model that still provides an acceptable fit to the
data. Each model in the series is less complex than the one before it: It has fewer terms—i.e., is
less constrained and so has more degrees of freedom—and consequently generates data that fit
the observed counts less well.
Goodness-of-fit—really, badness-of-fit—is assessed with the likelihood ratio chi-square, or G²: the bigger G² is, the worse the model fits. Values for G² are essentially the same as for the more familiar Pearson χ², but for technical reasons G² is preferred for log-linear analysis; see Bakeman & Robinson (65). The first model in the series—the saturated model—constrains
expected frequencies to match the observed ones exactly and, for that reason, fits the data
perfectly: its G2 = 0 with 0 degrees of freedom. The question then becomes whether a more
parsimonious model will still fit acceptably.
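As a concrete illustration of the statistic described above, a minimal Python sketch (ours, not the ILOG software used for the analyses; the counts are hypothetical):

```python
import math

def g_squared(observed, expected):
    """Likelihood-ratio chi-square: G^2 = 2 * sum(O * ln(O / E)).
    Cells with O = 0 contribute nothing to the sum."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Saturated model: expected frequencies match the observed exactly, so G^2 = 0 on 0 df
obs = [30, 10, 20, 40]
g2_saturated = g_squared(obs, obs)

# A more parsimonious model (here, independence in a flattened 2x2 table) fits worse
g2_indep = g_squared(obs, [20.0, 20.0, 30.0, 30.0])  # larger G^2 means worse fit
```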
Acceptable fit can be assessed in two ways. A common criterion is the significance of G² for the model: if G² is not significant, p > .05, the discrepancies between the observed cell counts and those generated by the model are relatively small, and so we conclude that the model fits acceptably. However, given large counts, this criterion may be too strict, because even relatively small deviations from expected values will result in a G² significantly different from zero.
A second criterion is the magnitude of Q², a comparative-fit or reduction-in-error index analogous to the R² of multiple regression. Knoke and Burke (145) suggested that any model whose Q² is greater than .90 provides satisfactory fit, even if its G² differs significantly from zero. Q² assesses the proportion of a baseline model's badness-of-fit accounted for by the model in question and is defined as:

Q² = (G²base − G²model) / G²base.

When the terms in a model account for over 90% of the baseline badness-of-fit, we conclude that the model fit is acceptable and that the terms deleted to form the model are not consequential.
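The formula can be checked directly against the G² values reported in Supporting Tables 5–7; a one-function Python sketch (ours, reproducing the reported Q² values from the tabled G² values):

```python
def q_squared(g2_base, g2_model):
    """Proportionate reduction in badness-of-fit relative to a base model
    (an R^2 analog): Q^2 = (G2_base - G2_model) / G2_base."""
    return (g2_base - g2_model) / g2_base

# Base-model vs. fitted-model G^2 values from Supporting Tables 5-7
q2_cry = q_squared(1267.8, 55.6)          # cry analysis: .96
q2_laugh = q_squared(574.4, 17.6)         # laugh analysis: .97
q2_protophone = q_squared(811.0, 135.3)   # protophone analysis: .83
```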
Cries. There was little individual variability in the tendency for cries to be coded as negative in facial affect. Of the 27 samples, 19 contained at least one cry. For 13 of the 19, all cries were coded negative; for three of the 19, all but one cry was coded negative; and for the remaining three samples, 3 of 9, 1 of 2, and 0 of 2 cries were coded negative.
A 2×2 table—cry versus protophone by negative versus non-negative affect—was formed for each of the 19 samples. A log-linear analysis of the 2×2×19 table suggested that the association between cry [C] and negative affect [N] was not moderated by sample [S]—that a common pattern characterized all samples.
To test whether only the saturated model fit the data, we deleted the three-way term [NCS], which left a model with three two-way terms: [NC][NS][CS]. This model constrains expected frequencies to match the cross-classifications implied by the three two-way terms, but not the three-way classification implied by the saturated term. The chi-square for this model differed significantly from zero, G²(18, N = 5117) = 55.6, p < .001, which is not surprising given the large N, but its Q²—an R² analog—was .96, suggesting that deleting the saturated term had little effect. Conservatively, the base model used for the Q² computation was [N][CS]: other less-restricted but still plausible base models—e.g., [N][C][S]—would have produced even higher values for Q².
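Models of this kind, with all two-way margins fitted but no three-way term, are conventionally estimated by iterative proportional fitting. A minimal Python sketch of that procedure (our illustration, not the ILOG implementation; the counts are hypothetical, all positive to avoid zero margins):

```python
def ipf_two_way_margins(obs, iters=100):
    """Fit a 3-D table under the all-two-way-margins model (e.g., [NC][NS][CS])
    by iterative proportional fitting: repeatedly rescale the fitted table so
    each of the three two-way margins matches the observed margins."""
    I, J, K = len(obs), len(obs[0]), len(obs[0][0])
    fit = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(iters):
        for i in range(I):                      # match the (i, j) margin
            for j in range(J):
                ratio = sum(obs[i][j]) / sum(fit[i][j])
                for k in range(K):
                    fit[i][j][k] *= ratio
        for i in range(I):                      # match the (i, k) margin
            for k in range(K):
                ratio = (sum(obs[i][j][k] for j in range(J))
                         / sum(fit[i][j][k] for j in range(J)))
                for j in range(J):
                    fit[i][j][k] *= ratio
        for j in range(J):                      # match the (j, k) margin
            for k in range(K):
                ratio = (sum(obs[i][j][k] for i in range(I))
                         / sum(fit[i][j][k] for i in range(I)))
                for i in range(I):
                    fit[i][j][k] *= ratio
    return fit

# Hypothetical 2x2x2 counts: vocal type x affect x sample
obs = [[[10, 5], [3, 12]], [[8, 9], [7, 6]]]
fit = ipf_two_way_margins(obs)
```

The fitted table then supplies the expected frequencies from which G² for the [NC][NS][CS] model is computed.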
An alternative would be to analyze the 2×2×8, negative by cry by infant table, pooling samples over the eight infants who cried. The [NC][NS][CS] model would fit this table somewhat better, but it is more convincing that the [NC][NS][CS] not-moderated-by-sample model—not moderated because the [NCS] term was not required for an acceptably fitting model—still fits the larger 2×2×19 table. This result, that the three-way interaction term is not needed for good fit, indicates that the association between negativity and cry does not vary by sample.
Laughs. There was even less individual variability in the tendency for laughs to be coded positive. Of the 27 samples, 15 contained at least one laugh. All laughs were coded positive for nine of the 15, and for the remaining six, more laughs were coded positive than not. A 2×2 table—laugh versus protophone by positive versus non-positive affect—was formed for each of the 15 samples. A log-linear analysis of the 2×2×15 table suggested that the association between laugh [L] and positive affect [P] was not moderated by sample [S]—that a common pattern characterized all samples.
To test whether only the saturated model fit, we deleted the three-way term [PLS], which left a model with three two-way terms: [PL][PS][LS]. This model constrains expected frequencies to match the cross-classifications implied by the three two-way terms, but not the three-way classification implied by the saturated term. The chi-square for this model did not differ significantly from zero, G²(14, N = 3645) = 17.6, p = .23, and its Q² was .97, suggesting that deleting the saturated term had little effect. Conservatively, the base model used for the Q² computation was [P][LS]: other less-restricted but still plausible base models—e.g., [P][L][S]—would have produced even higher values for Q².
An alternative would be to analyze the 2×2×9, positive by laugh by infant table, pooling samples over infant. The [PL][PS][LS] model would fit this table somewhat better, but it is more convincing that the [PL][PS][LS] not-moderated-by-sample model—not moderated because the [PLS] term was not required for an acceptably fitting model—still fits the larger 2×2×15 table. Again, the result that the three-way interaction term is not needed for good fit indicates that the association between positivity and laugh does not vary by sample.
Protophones. There was considerable individual variability in the facial affect coded for the protophones. For this analysis, a 3×3 table—protophone (squeal, vocant, growl) by facial affect (positive, neutral, negative)—was formed for each of the nine infants. Then, for each infant, expected frequencies were computed for each of the nine protophone-affect pairs. Only two of the nine associations showed commonality across the nine tables: the observed frequency for neutral given squeal was less than expected for all nine infants, and the observed frequency for neutral given vocant was greater than expected for all nine. These results are displayed in Figure 5 of the Main text.
The pattern of individual variability among infants in expression of affect by protophones was supported by log-linear analysis. For the protophone [V] by facial affect [A] by infant [B] table, only the saturated model fit, indicating that the association between protophone and facial affect was moderated by infant. The model with the saturated term deleted did not fit acceptably—its G²(32, N = 6535) = 135.3, p < .001 and Q² = .83—indicating that only the saturated model provided acceptable fit. Unlike with cry or laugh, the three-way interaction term included in the saturated model was needed for acceptable fit to the data.

An alternative would be to analyze the 3×3×27, affect by protophone by sample table. But the model with the saturated term removed would fit this table even worse.
Log-linear Results (computed by ILOG (65))
Supporting Table 5. Negative by Cry by Sample.
[Model]        G²      df  ~p     Delete  ΔG²    Δdf  ~p     Q²    ΔQ²
[NCS]          0.0     0   1.000  —                          1.00  —
[NC][NS][CS]   55.6    18  <.001  NCS     55.6   18   <.001  .96   .04
[NS][CS]       624.8   19  <.001  NC      569.2  1    <.001  .51   .45
[CS][N]        1267.8  37  <.001  NS      643.1  18   <.001  .00   .51
Supporting Table 6. Positive by Laugh by Sample
[Model]        G²     df  ~p     Delete  ΔG²    Δdf  ~p     Q²    ΔQ²
[PLS]          0.0    0   1.000  —                          1.00  —
[PL][PS][LS]   17.6   14  .227   PLS     17.7   14   .227   .97   .03
[PS][LS]       280.6  15  <.001  PL      263.7  1    <.001  .51   .46
[LS][P]        574.4  29  <.001  PS      293.8  14   <.001  .00   .51
Supporting Table 7. Affect by Vocal Type (Protophone) by Baby (Infant)
[Model]        G²     df  ~p     Delete  ΔG²    Δdf  ~p     Q²    ΔQ²
[AVB]          0.0    0   1.000  —                          1.00  —
[AV][AB][VB]   135.3  32  <.001  AVB     135.3  32   <.001  .83   .17
[AB][VB]       275.0  36  <.001  AV      139.7  4    <.001  .66   .17
[VB][A]        811.0  52  <.001  AB      536.0  16   <.001  .00   .66
The role of affect expression in the functional interpretation of infant protophones
Facial affect and illocutionary force of infant utterances. The results reported in the
Main text, Figures 3a–3b, provide evidence that facial affect associated with protophones
strongly predicted the illocutionary forces attributed to infant utterances. To understand how
these data were derived and to review the coding categories in detail, see Supporting Methods:
Illocutionary force coding. Over 80% of the utterances corresponding to the illocutionary
grouping Converse in Figures 3a–3b had been coded illocutionarily as “Continue: Continuation
of protoconversation or vocalization in a game (such as peekaboo)”; 99% of the utterances
corresponding to the Complain grouping in Figures 3a–3b had been coded as Complain or Plea
for help; and over 75% of the utterances corresponding to the Indeterminate grouping had been
coded as Object Directed.
Facial affect and perlocutionary effect of infant utterances. We had not coded perlocutionary effects prior to the review of our paper. In response to a reviewer criticism, we initiated a series of efforts to evaluate the expectation of systematic caregiving responses to infant communications of varying affect, consistent with the extensive literature on parent-infant interaction cited in the Main text. Specifically, we anticipated that caregivers would respond to infant affect expression in ways indicating systematic attempts to maintain or elicit comfortable, happy protoconversation, while attempting to change the situation in cases where the infant’s affect was negative. Codes used in perlocutionary coding are found in Supporting Methods: Perlocutionary effect coding.
We began by reviewing one session of recording from each of the nine infants
representing all three ages, and over 400 caregiver responses. In this effort, perlocutionary
effects were observed only for protophones that had been coded as positive or negative for facial
affect (neutral utterances were not included). As can be seen in Supporting Figure 9, the
caregivers responded in very different ways to protophones produced with positive vs. negative
facial affect. Positive facial affect in protophones yielded parental encouragement to continue the
interaction in the positive vein. Typically the parents praised the infant or attempted to elicit
another turn in conversation by the infant. In contrast, negative affect in vocalization produced
spoken evaluations of the infant’s possible discomfort, soothing, scolding, attempts to distract
the infant, and changes effected by the parent in the infant situation (such as picking the infant up
or initiating a diaper change). The intonations of the parents in these responses often provided strong cues to support perlocutionary coding, e.g., in differentiating soothing (to negative protophones) from eliciting turns (to positive protophones). Such
differentiating intonations in parent vocalizations have been shown to be very similar cross-
linguistically in infant-directed speech (122–124).
[Supporting Figure 9 bar graph: y-axis, Proportion of Caregiver Responses (.00–.60); bars for Pos and Neg infant affect across caregiver-response categories.]
Supporting Figure 9: Perlocutionary effects of infant protophones with positive or negative facial affect. Responses of caregivers to infant protophones that had positive or negative facial affect were coded in terms of the categories indicated on the abscissa (and see Supporting Methods: Perlocutionary effect coding). The positive protophones (as determined by facial affect) yielded the following predominant responses all in the grouping “Encourage”: 1) Encourage: Continue (vocal, facial and postural actions by the parent that encouraged continuation of the positive interaction); 2) Encourage: Elicit Turn (actions designed to elicit a conversational turn from the infant); 3) Encourage: Praise (including phrases such as “that’s such a nice sound”, “oh that’s pretty”, and so on); 4) Encourage: Exultation by the parent (including smiling, laughing, saying “wow” or “hooray”, etc.); and 5) Encourage: Imitation of the infant utterance. The negative protophones yielded the following reactions all in the grouping “Change”: 1) Change: Evaluate change, that is verbal evaluations of possible need for change in the situation (utterances that manifest the parents’ interest in figuring out what was causing the infants’ negativity, for example questions such as “what’s the matter?” or statements such as “I think you may be wet”, etc.) 2) Change: Distract (attempts by the parent to distract the infant,
and thus change the focus of the infant to something positive, such as by holding a new toy up for the infant to see); 3) Change: Scold (including saying “no”, “stop that” or “shh”); 4) Change: Soothe (often manifest in soothingly intoned “oh” or “poor baby”, etc.); and 5) Change: Change situation, physical actions on the part of the parent to change the infant situation (such as picking the infant up, taking the infant to the changing table, etc.). Finally about 10% of the infant protophones, both positive and negative, resulted in no observable response on the part of the caregiver (the Unclear grouping).
After the first round of perlocutionary coding, it was decided that protophones with
neutral affect might also be revealing in terms of impact on perlocutionary effects. Consequently
a second round of coding was conducted as reported in the Main text, Figures 3c–3d. The
results in those figures confirmed the findings of Supporting Figure 9, and also indicated that
infant protophones with neutral affect showed intermediate outcomes between those for
protophones with positive or negative facial affect—neutral protophones produced less
Encouragement to continue protoconversation in responses of parents than positive ones, but
more than negative ones. Also, Figures 3c–3d show that protophones with neutral affect produced
fewer spoken parental evaluations of possible change in the situation or attempts to change the
situation than protophones with negative affect but more than protophones with positive affect.
An important possible concern about the data on perlocutionary effects is that they could
have been influenced by coder biases, since the coders could both see and hear the infant as well
as the parent during the coding of perlocutionary effects. A third round of perlocutionary coding was conducted to check for such potential bias. It involved judgments by a “blind” coder (see Supporting Methods: Perlocutionary effect coding), who was presented with audio of
parent utterances only—in particular the parent utterances that occurred right after infant
protophones and that provided the basis for judgment of perlocutionary effect—to be compared
with those of an original coder, who had judged perlocution based on both audio and video, and
who had been able to see and hear the infant as well as the parent.
The results displayed in Supporting Figure 10 show that the original coder and the blind
coder judged parent reactions in very much the same way. In both cases the tendency for infant
facial affect in protophones to predict the perlocutionary effect as indicated by parent reaction
(and in accord with the extensive literature in parent-infant interaction) was strong and supported
by highly significant odds ratios.
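The odds ratios referred to here come from 2×2 cross-classifications of infant affect by caregiver response; a minimal Python sketch with a Wald confidence interval (our illustration; the counts are hypothetical):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for the 2x2 table [[a, b], [c, d]], with a 95% Wald CI
    computed on the log scale. An OR whose CI excludes 1 is significant."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical: Encourage vs. Change responses following positive- vs.
# negative-affect protophones
or_, lo, hi = odds_ratio_ci(40, 10, 5, 15)
```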
Supporting Figure 10: A check against possible bias in perlocutionary coding: The figure shows responses of caregivers to infant protophones as determined by an original coder and a “blind” coder. 157 caregiver utterances that had been involved in the original coder’s perlocutionary judgments—drawn from three sessions, from three infants at three different ages—were extracted from the recordings and presented to the blind coder for perlocutionary judgment based on audio of the parent only (any utterance with infant voice was excluded from this dataset). Both coders showed strong predictability between the facial affect of the infant utterance (77 positive, 66 neutral, 20 negative) that had preceded the caregiver utterance and the caregiver’s perlocutionary reaction occurring after the infant communication.
The results of these analyses make clear that facial affect accompanying infant
protophones transmits to parents useful cues regarding infant state and well-being. Parents need
to recognize their infants’ states in order to care for them. Natural selection thus appears to
sponsor the evolution and development of reliable signals by infants and reliable responses to
those signals by parents. The special reason for interest in affect expression accompanying the
protophones is that they show flexible association with affect, unlike cry and laughter, which are
much more stable in affect expression during infancy.
Of course there is information in the protophones themselves about infant affect, but our
research shows that affect information in protophones is much more reliably identified through
the accompanying facial expression than through vocal features of the protophones. Thus, intercoder agreement for nine samples judged by two independent observers was much higher on facial affect of protophones, judged from video only (kappa = .77), than on vocal affect, judged from audio only (kappa = .38). Moreover, logistic regression was employed to predict illocutionary
force by affect for the two judges on four sessions that had been coded both for illocutionary
force and separately for facial affect, vocal affect, or affect judged with both facial and vocal
cues. Here the data suggested that in all six comparisons (three for each of the two coders), facial
affect or facial plus vocal affect played a strong and significant role independent of vocal affect
alone in predicting illocutionary force. So, we argue, the protophones manifest the capacity of
the human infant to utilize vocalization very flexibly in expression of emotion through facial
affect or a combination of facial and vocal affect. This flexibility is required in all aspects of
speech, a capacity without which spoken language would be impossible. A key challenge is to
determine whether this capacity for vocal flexibility is new in the hominin line, or whether it
may be rooted in our primate lineage.
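The kappa statistic cited above corrects raw percent agreement for agreement expected by chance; a minimal Python sketch of Cohen's kappa (our illustration; the labels are hypothetical):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders
    labeling the same items, (p_observed - p_chance) / (1 - p_chance)."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Hypothetical affect labels from two coders on the same six utterances
a = ["pos", "pos", "neu", "neg", "neu", "pos"]
b = ["pos", "neu", "neu", "neg", "neu", "pos"]
k = cohens_kappa(a, b)
```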
Robustness of functional flexibility of protophones across contexts
Context of occurrence of infant vocalizations and non-human primate vocalizations.
Research in both infant vocalizations and non-human primate vocalizations profits from
evaluation of the contexts of usage of sounds. Here we use the term “context” broadly to
encompass aspects of the physical or social environment at or near the time of the occurrence of
the vocalization in question. In chimpanzee and bonobo research, for example, the term has been
used to refer to circumstances including whether the group is engaged in travel, preparing for
travel, hunting, whether individuals are engaged in conflict within the group or across groups,
the presence in the environment of an interesting food source and its quality, the presence of
danger, of allies, of higher ranked individuals, of sexually interesting individuals, and so on (45–
47, 49, 125–131). Recent research has addressed context of usage also in grunts of chimpanzees
across the lifespan (32), illustrating that grunts begin in development as presumable effort sounds
associated with physical movement or straining, but later are utilized in contexts that imply (at
least) greeting and come to be differentiated at least in terms of social status of the
communicators. The value of assessing contexts of usage includes (but is not limited to) the
possibility of determining unifying functions (either illocutionary or perlocutionary) that apply
across a variety of contexts (132) as well as perhaps illustrating greater flexibility of usage of
calls than was envisioned by the classical ethologists.
The issue of determining function is tricky of course because contexts can overlap and a
single vocalization may be used in several of the contexts invoked in this kind of research at the
same time. Thus a chimpanzee bark might occur during travel, but several other circumstances of
potential relevance might also apply to the bark, such as the nearby presence of an ally, the
occurrence of audible calls produced by another group of chimpanzees, the nearness of a
potential food source, and so on. How might we know, then, which aspects of the circumstance
are relevant to determining the function of the signal, if there is a unitary function? The question
could be formulated empirically in terms of 1) any aspect of circumstances (endogenous or
exogenous) that regularly correlates with a chimpanzee bark (along with perhaps any effect the
producer might appear to desire or anticipate) and 2) how and with what likelihood any listener
might respond in particular ways to a bark.
In humans these judgments can also be difficult, and importantly the kinds of contexts
that are relevant in non-human primate research are quite different in many instances from those
relevant for humans (especially nowadays). Fortunately we have the advantage of being able to
listen to the things people say as they respond to vocalizations, and to question receivers about
their interpretations. With research in human infancy, the opinions and observed reactions of
adult interactors can be very useful, we think, as illustrated above in the perlocution findings.
However, raw observations of “context” that may help provide insight about the functions of the
vocalizations were also available in our data, with some sessions having been coded prior to the
review of the original submission of this paper. These special codings included observations on
gaze direction and on physical/social contexts. The observations provide perspective on the
robustness of the vocal flexibility we have observed with regard to facial affect.
Gaze direction. We coded gaze direction of the infant during each vocalization where at
least one of the two video angles allowed such judgment (more than 99% of the total dataset).
The coding indicated whether infants vocalized while looking toward another person (most often
the parent) or while not looking at another person. 45% of the vocalizations occurred in the
former circumstance, as can be seen in the tabulated data in Supporting Figure 11. Vocalization
with gaze not directed to another person occurred in both social interaction (e.g., when infant and
parent were looking at a toy together) and in non-social circumstances (e.g., solitary infant play).
The bar graphs in Supporting Figure 11 can be usefully compared with Figure 1 of the
Main text where the total dataset is also represented in a single panel. It can be seen that cries
were overwhelmingly deemed negative, while laughs were overwhelmingly deemed positive in
both gaze circumstances. Also similarly to Figure 1 of the Main text, all the protophones for
both gaze directions showed considerable numbers of utterances with positive affect as well as
considerable numbers with negative affect. The patterns thus indicate in both gaze circumstances
the robustness of the flexibility of protophones with respect to facial affect by comparison with
the inflexibility of cries and laughs.
Supporting Figure 11: All the data were coded for gaze direction. 55% of the vocalizations occurred in circumstances not involving gaze toward a person, as can be seen in the table at the top of the figure. Also the table shows that laughter occurred much more frequently while looking at a person than while not. For cries and the three protophones, there were no major differences in frequency of occurrence between directed and non-directed contexts. The bar graphs indicate that both when infants looked at another person and when they did not, their vocalizations showed patterns of facial affect that included the primary patterns of the data from Figure 1 of the Main text. While cries and laughs showed overwhelmingly negative and positive facial affect respectively in both gaze circumstances, all the protophones showed considerable numbers of instances of all three types of facial affect. Thus while circumstances of
gaze did affect the pattern of data (laughter occurred much more frequently when looking at a person than while not, and positive affect in protophones also occurred considerably more frequently when looking at a person), the data strongly support the robust flexibility of the protophones in terms of facial affect compared with the relative inflexibility of cries and laughs.
While neutral vocalizations were most common for both gaze directions in the
protophones, there were more positively valenced protophones when the infant was looking at a
person, and for squeals, these outnumbered even the instances of neutral affect. The tendency for
positive infant facial expressions to occur preferentially in circumstances of eye contact has been
widely reported (133–135), and blind infants, perhaps predictably based on this tendency, smile
infrequently (136). The tendency for positive affect of protophones to occur most commonly
with gaze toward a person also corresponds to the strong tendency for laughter to occur most
commonly in the same circumstance. These data offer support for the speculation that laughter as
well as smiling may provide key foundations for human sociality (116, 137–139). The data also
suggest that the observation of gaze direction during the protophones supports the suggestion
that positive affect may be a key factor in human communication between parents and infants, a
factor that allows parents considerable opportunity to evaluate infant well-being and to foster it
in the context of face-to-face emotional regulation (44, 140, 141).
Five contexts. To provide further perspective on the robustness of the patterns of facial
affect accompanying the infant vocalizations, we analyzed the coding of each twenty-minute
recording in terms of one of five circumstances. 98% of the infant utterances had been coded as
pertaining to one of the following: 1) Separated: The infant was not engaged in interaction with
anyone—the parent and an experimenter were usually talking to each other, while the infant was
on the other side of the room playing, or on other occasions the infant and parent might be close
together, but their vocalizations were not part of an interaction or bid for interaction with each
other; 2) Interacting on lap: The infant was on an adult’s lap and the adult and infant were
engaged in vocal interaction, or the infant was vocally bidding for the adult to interact;
3) Interacting on changing table: The infant was lying on the changing table with the parent
changing the diaper while the parent was engaged in vocal interaction with the infant, or while
the infant was vocally bidding for interaction with the parent; 4) Interacting from high chair: The
infant was seated in the high chair and was engaged in vocal interaction with an adult or was
vocally bidding for interaction with an adult; and 5) Interacting on floor: The infant was on the
floor and was engaged in vocal interaction with an adult or was vocally bidding for interaction
with an adult. Each entire 20-min segment was coded in these terms, with codes being changed
at any point where the existing code was no longer valid for at least 10 sec. Gaze direction was
not an explicit aspect of the coding of the five contexts. Still, in some instances the judgment of
vocal interaction or bidding for it could have been influenced by gaze direction.
The tabulated data in Supporting Figure 12 show that more than ¼ of all vocalizations
occurred in the separated condition, and this was true of both the protophones and the cries. On
the other hand, as can be seen in the stacked bar graph at the upper right, both cries and laughs
were distributed proportionally very differently from the protophones. All three protophones
were distributed similarly across the five contexts. In sharp contrast only 10% of the laughter
occurred in the separated condition (a pattern that conforms to the inherent sociality of laughter),
though more than ¼ of the protophones occurred in that circumstance. Also, 66% of the cries
occurred during the separated condition (the infant manifesting distress) or during interaction on
the changing table (presumably from discomfort associated with being wet), even though only
35% of the protophones occurred in those circumstances. In general the pattern suggests that the
protophones occurred in ways that were largely unaffected by the context (all three types
occurring in roughly the same proportions across the contexts), while the laughs and cries
were more predictable by context, and much more differentiable from each other, as should be
expected of signals naturally selected for particular functions or selected to express particular
emotions.
Supporting Figure 12: All the vocalizations were coded in terms of five contexts described in the text. Tabulated data at the upper left show that more than ¼ of the vocalizations occurred when the infant was separated. Nearly half occurred during interaction on the floor. The stacked bar graph at the upper right shows that laughs and cries distributed differently from each other with regard to the contexts, while the three protophones showed a consistent pattern unlike either laugh or cry. Providing another indication of robustness in the flexibility of facial affect expression for the protophones, the standard bar graphs (formatted as in Figure 1 of the Main text) show that while cries and laughs were accompanied by overwhelmingly negative and positive facial affect respectively in all five contexts, all the protophones showed considerable numbers of instances of all three types of facial affect in all five contexts. Thus irrespective of these contexts, the protophones appear to manifest the human infant capability to utilize vocalization and facial affect with substantial freedom of association.
The standard (not stacked) bar graphs in Supporting Figure 12 can be usefully
compared with Figure 1 of the Main text in the same way the gaze direction data could be
compared. It can be seen that for all five contexts, cries were overwhelmingly deemed negative,
while laughs were overwhelmingly deemed positive. Also similarly to Figure 1 of the Main
text, all the protophones in all five contexts showed considerable numbers of utterances with
positive affect as well as considerable numbers with negative affect. The patterns thus indicate,
in five contexts, the robustness of the flexibility of protophones with respect to facial affect by
comparison with the inflexibility of cries and laughs. We take this flexibility of affective
expression in vocalizations to be a critical foundation for language. Given that we have
quantified the extent of the flexibility in this article, we hope in the future to see evidence of the
extent to which this kind of flexibility may be manifest in vocalizations of non-human primates,
and perhaps to collaborate in the requisite research.
Facial Affect codes based on the master coding for the audio-video examples
CL1 (Growls)
Movie S1   Neutral
Movie S2   Neutral
Movie S3   Positive
Movie S4   Negative

CL2 (Squeals)
Movie S5   Positive
Movie S6   Neutral
Movie S7   Negative

CL3 (Vocants)
Movie S8   Negative
Movie S9   Positive

CL4 (Vocants)
Movie S10  Neutral
Movie S11  Positive
Movie S12  Negative
Movie S13  Positive
Movie S14  Negative

CL5 (Squeals)
Movie S15  Negative
Movie S16  Positive

CL6 (Squeals)
Movie S17  Neutral
Movie S18  Positive
Movie S19  Negative
SUPPORTING REFERENCES
1. Lorenz K (1951) Ausdrucksbewegungen höherer Tiere. Naturwissenschaften 38:113–116.2. Tinbergen N (1951) The study of instinct (Oxford University Press, Oxford).3. Cheney DL & Seyfarth RM (1999) Mechanisms underlying vocalizations of primates. The design
of animal communication, eds Hauser MD & Konishi M (MIT Press, Cambridge, MA), pp 629–644.
4. Hauser M (1996) The evolution of communication (MIT, Cambridge, MA).5. Jürgens U (1982) A neuroethological approach to the classification of vocalization in the squirrel
monkey. Primate Communication, eds Snowdon CT, Brown CH, & Petersen MR (Cambridge University Press, Cambridge, UK), pp 50–62.
6. Darwin C (1872) The Expression of Emotions in Man and Animals (University of Chicago Press, Chicago).
7. Owren MJ, Amoss RT, & Rendall D (2011) Two organizing principles of vocal production: Implications for nonhuman and human primates. American Journal of Primatology 73:530–544.
8. Sutton D (1979) Mechanisms underlying learned vocal control in primates. Neurobiology of social communication in primates: an evolutionary perspective, eds Steklis HD & Raleigh MJ (Academic Press, New York), pp 45–67.
9. Acebo C & Thoman EB (1995) Role of infant crying in the early mother infant dialogue. Physiology & Behavior 57(3):541–547.
10. Green JA, Jones LE, & Gustafson GE (1987) Perception of cries by parents and nonparents: relation to cry acoustics. Developmental Psychology 23(3):370–382.
11. Gustafson GE & Green JA (1991) Developmental coordination of cry sounds with visual regard and gestures. Infant Behavior and Development 14:51–57.
12. Lafreniere PJ (2000) Emotional Development: A Biosocial Perspective (Wadsworth Press, Belmont, CA).
13. Lester BM & Boukydis CFZ (1992) No language but a cry. Nonverbal vocal communication, eds Papoušek H, Jürgens U, & Papoušek M (Cambridge University Press, New York), pp 145–173.
14. Sroufe LA (1996) Emotional Development: The Organization of Emotional Life in the Early Years (Cambridge University Press, New York).
15. Oller DK (1981) Infant vocalizations: Exploration and reflexivity. Language behavior in infancy and early childhood, ed Stark RE (Elsevier North Holland, New York), pp 85–104.
16. Stark RE (1981) Infant vocalization: A comprehensive view. Infant Mental Health Journal 2(2):118–128.
17. Scheiner E, Hammerschmidt K, Jürgens U, & Zwirner P (2002) Acoustic analyses of developmental changes and emotional expression in the preverbal vocalizations of infants. Journal of Voice 16:509–529.
18. Scheiner E, Hammerschmidt K, Jürgens U, & Zwirner P (2006) Vocal expression of emotions in normally hearing and hearing-impaired infants. Journal of Voice 20(4):585–604.
19. Molemans I (2011) Sounds like babbling: A longitudinal investigation of aspects of the prelexical speech repertoire in young children acquiring Dutch: Normally hearing children and hearing-impaired children with a cochlear implant. Ph.D. dissertation (University of Antwerp, Antwerp, Belgium).
20. Molemans I, Van den Berg R, Van Severen L, & Gillis S (2011) How to measure the onset of babbling reliably. Journal of Child Language:1–30.
21. Konopczynski G (1985) Acquisition du langage. La période charnière et sa structuration mélodique. Bulletin d’audiophonologie. Annales scientifiques de l’Université de Franche-Comté. 11:63–92.
22. Koopmans-van Beinum FJ & van der Stelt JM (1986) Early stages in the development of speech movements. Precursors of early speech, eds Lindblom B & Zetterstrom R (Stockton Press, New York), pp 37–50.
23. Stark RE (1980) Stages of speech development in the first year of life. Child Phonology, vol. 1, eds Yeni-Komshian G, Kavanagh J, & Ferguson C (Academic Press, New York), pp 73–90.
24. Zlatin-Laufer MA & Horii Y (1977) Fundamental frequency characteristics of infant non-distress vocalization during the first 24 weeks. Journal of Child Language 4:171–184.
25. Locke JL (2008) Lipsmacking and babbling: Syllables, Sociality, and Survival. The Syllable in Speech Production, eds Davis BL & Zajdo K (Erlbaum, New York), pp 111–132.
26. MacNeilage PF, Davis BL, Kinney A, & Matyear CL (2000) The motor core of speech: A comparison of serial organization patterns in infants and languages. Child Development 71(1):153–163.
27. Blount BG (1985) "Girney" vocalizations among Japanese macaque females: Context and function. Primates 26(4):424–435.
28. Becker M, Buder EH, & Ward J (1999) Description of the growl vocalization in small-eared bushbaby mothers, Otolemur garnetti. American Journal of Primatology 49:32.
29. Becker ML, Buder EH, & Ward JP (1998) Vocalizations associated with mother-infant interactions in the small-eared bushbaby. American Journal of Primatology 45(2):166–167 (abstract).
30. Dunbar RIM (1996) Grooming, gossip and the evolution of language (Harvard University Press, Cambridge, MA).
31. Morris D (1967) The naked ape (Dell, New York).
32. Laporte MNC & Zuberbühler K (2011) The development of a greeting signal in wild chimpanzees. Developmental Science 14(5):1220–1234.
33. Seyfarth RM, Cheney DL, & Marler P (1980) Vervet monkey alarm calls: Semantic communication in a free-ranging primate. Animal Behaviour 28:1070–1094.
34. Struhsaker TT (1967) Auditory communication among vervet monkeys (Cercopithecus aethiops). Social communication among primates, ed Altmann SA (Chicago Univ. Press, Chicago, IL), pp 281–324.
35. Austin JL (1962) How to do things with words (Oxford Univ. Press, London).
36. Marler P, Evans CS, & Hauser M (1992) Animal signals: reference, motivation or both? Nonverbal vocal communication, eds Papoušek H, Jürgens U, & Papoušek M (Cambridge University Press, New York), pp 66–86.
37. Tomasello M, Carpenter M, Call J, Behne T, & Moll H (2005) Understanding and sharing intentions: The origins of cultural cognition. Behavioral & Brain Sciences 28:675–735.
38. Trevarthen C (1974) Conversations with a two-month-old. New Scientist 2:230–235.
39. Trevarthen C (1979) Communication and cooperation in early infancy: A description of primary intersubjectivity. Before speech: The beginnings of human communication, ed Bullowa M (Cambridge University Press, London), pp 321–347.
40. Ainsworth MDS (1969) Object relations, dependency, and attachment: A theoretical review of the infant-mother relationship. Child Development 40:969–1025.
41. Bowlby J (1969) Attachment and Loss (Basic Books, New York).
42. Feldman R, Greenbaum CW, & Yirmiya N (1999) Mother-infant synchrony as an antecedent of the emergence of self-control. Developmental Psychology 35(5):223–231.
43. Forbes EE, Cohn JF, Allen NB, & Lewinsohn PM (2004) Infant affect during parent-infant interaction at 3 and 6 months: Differences between mothers and fathers and influence of parent history of depression. Infancy 5(1):61–84.
44. Schore AN (2001) Effects of a secure attachment relationship on right brain development, affect regulation, and infant mental health. Infant Mental Health Journal 22(1–2):7–66.
45. Clay Z & Zuberbühler K (2009) Food-associated calling sequences in bonobos. Animal Behaviour 77:1387–1396.
46. Laporte MNC & Zuberbühler K (2010) Vocal greeting behaviour in wild chimpanzee females. Animal Behaviour 80:467–473.
47. Slocombe KE, et al. (2009) Production of food-associated calls in wild male chimpanzees is dependent on the composition of the audience. Behavioral Ecology and Sociobiology 64(12):1959–1966.
48. Crockford C & Boesch C (2003) Context-specific calls in wild chimpanzees, Pan troglodytes verus: Analysis of barks. Animal Behaviour 66:115–125.
49. Crockford C, Herbinger I, Vigilant L, & Boesch C (2004) Wild chimpanzees produce group-specific calls: A case for vocal learning? Ethology 110:221–243.
50. Parr L, Waller BM, & Heintz M (2008) Facial expression categorization by chimpanzees using standardized stimuli. Emotion 8(2):216–231.
51. Parr L, Waller BM, & Vick S (2007) New developments in understanding emotional facial signals in chimpanzees. Current Directions In Psychological Science 16(3):117–122.
52. Oller DK (2000) The Emergence of the Speech Capacity (Lawrence Erlbaum Associates, Mahwah, NJ) p 428.
53. Stark RE, Bernstein LE, & Demorest ME (1993) Vocal communication in the first 18 months of life. Journal of Speech & Hearing Research 36:548–558.
54. Buder EH (1996) Experimental phonology with acoustic phonetic methods: Formant measures from child speech. Proceedings of the UBC International Conference on Phonological Acquisition, eds Bernhardt B, Gilbert J, & Ingram D, pp 254–265.
55. Buder EH & Stoel-Gammon C (1998) Acquisition of language-specific word-initial unvoiced stops: VOT, intensity, and spectral shape in American English and Swedish. Proceedings of the 16th International Congress on Acoustics and 135th meeting of The Acoustical Society of America, pp 2987–2988.
56. Kehoe M, Stoel-Gammon C, & Buder EH (1995) Acoustic correlates of stress in young children's speech. Journal of Speech and Hearing Research 38:338–350.
57. Stoel-Gammon C & Buder EH (1998) The effects of postvocalic voicing on the duration of high front vowels in Swedish and American English: Developmental data. Proceedings of the 16th International Congress on Acoustics and 135th Meeting of the Acoustical Society of America, eds Kuhl PK & Crum LA (Acoustical Society of America, Woodbury, NY), Vol 4, pp 2989–2990.
58. Stoel-Gammon C, Buder EH, & Kehoe MM (1995) Acquisition of vowel duration: A comparison of Swedish and English. Proceedings of the XIIIth International Congress of Phonetic Sciences, eds Elenius K & Branderud P (KTH and Stockholm University, Stockholm), Vol 4, pp 30–37.
59. Stoel-Gammon C, Williams K, & Buder EH (1994) Cross-language differences in phonological acquisition: Swedish and American /t/. Phonetica 51:146–158.
60. Buder EH, Strand EA, & Iddings S (March) A quantitative and graphic acoustic analysis of phonatory instability in ALS dysarthria. Motor Speech Conference.
61. Hartelius L, Nord L, & Buder EH (1995) Acoustic analysis of dysarthria associated with multiple sclerosis. Clinical Linguistics & Phonetics 9(2):95–120.
62. Buder EH, Chorna L, Oller DK, & Robinson R (2008) Vibratory Regime Classification of Infant Phonation. Journal of Voice 22:553–564.
63. Kwon K, Buder EH, Oller DK, & Chorna LB (2007) Classifying Infant Vocalizations Based on Fundamental Frequency (f0). (International Child Phonology Conference, Seattle, WA).
64. Hosmer D & Lemeshow S (2000) Applied Logistic Regression (John Wiley & Sons, New York) 2 Ed.
65. Bakeman R & Robinson BF (1994) Understanding log-linear analysis with ILOG: An interactive approach (Lawrence Erlbaum Associates, Hillsdale, NJ).
66. Warlaumont AS, Oller DK, Buder EH, Dale R, & Kozma R (2010) Data-driven automated acoustic analysis of human infant vocalizations using neural network tools. Journal of the Acoustical Society of America 127:2563–2577.
67. Kent RD & Murray A (1982) Acoustic features of infant vocal utterances at 3, 6, and 9 months. Journal of the Acoustical Society of America 72:353–365.
68. Bakeman R, Adamson LB, & Strisik P (1989) Lags and logs: Statistical approaches to interaction. Interaction in Human Development, eds Bornstein MH & Bruner J (Erlbaum, Hillsdale, NJ), pp 241–260.
69. Bakeman R & Gottman JM (1997) Observing interaction: An introduction to sequential analysis (Cambridge University Press, Cambridge) 2 Ed.
70. Lewedag VL (1995) Patterns of onset of canonical babbling among typically developing infants. Doctoral Dissertation (University of Miami, Coral Gables, FL).
71. Oller DK & Bull D (1984) Vocalizations of deaf infants. (International Conference on Infant Studies, New York).
72. Oller DK, et al. (2007) Diversity and contrastivity in prosodic and syllabic development. Proceedings of the International Congress of Phonetic Sciences, eds Trouvain J & Barry W (International Phonetics Society, Saarbrucken, Germany), pp 303–308.
73. Webber CL & Zbilut JP (2005) Recurrence quantification analysis of nonlinear dynamical systems. Tutorials in Contemporary Nonlinear Methods for the Behavioral Sciences, eds Riley MA & Van Orden GC (National Science Foundation Program in Perception, Action, and Cognition, Web Book).
74. Papoušek M & Papoušek H (1989) Forms and functions of vocal matching in interactions between mothers and their precanonical infants. First Language 9:137–158.
75. Fernald A (1992) Human maternal vocalizations to infants as biologically relevant signals: An evolutionary perspective. The Adapted Mind: Evolutionary Psychology and the Generation of Culture, eds Barkow JH, Cosmides L, & Tooby J (Oxford University Press, Oxford), pp 345–382.
76. Fernald A (1992) Meaningful melodies in mothers' speech to infants. Nonverbal vocal communication: Comparative and developmental approaches. Studies in emotion and social interaction., eds Papoušek H, Jürgens U, & Papoušek M (Cambridge University Press, New York), pp 262–282.
77. Fernald A & O'Neill DK (1993) Peekaboo across cultures: How mothers and infants play with voices, faces and expectations. Parent-Child Play: Descriptions and Implications, ed MacDonald K, pp 259–286.
78. Owren MJ & Goldstein MH (2008) Scaffolds for babbling: Innateness and learning in the emergence of contextually flexible vocal production in human infants. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 169–192.
79. Bornstein MH & Lamb ME (1992) Development in infancy: An introduction (McGraw-Hill, New York) 3rd Ed.
80. Goldstein MH, King AP, & West MJ (2003) Social interaction shapes babbling: Testing parallels between birdsong and speech. Proceedings of the National Academy of Sciences 100(13):8030–8035.
81. Hsu HC & Fogel A (2003) Social regulatory effects of infant non-distress vocalization on maternal behavior. Developmental Psychology 39(6):976–991.
82. van der Stelt JM (1993) Finally a word: a sensori-motor approach of the mother-infant system in its development towards speech (Uitgave IFOTT, Amsterdam, the Netherlands) p 226.
83. Bakeman R & Adamson LB (1984) Coordinating attention to people and objects in mother-infant and peer-infant interaction. Child Development 55:1278–1289.
84. Beebe B, Alson D, Jaffe J, Feldstein S, & Crown C (1988) Vocal congruence in mother-infant play. Journal of Psycholinguistic Research 17:245–259.
85. Cohn JF & Tronick EZ (1987) Mother-infant face-to-face interaction: The sequence of dyadic states at 3, 6, and 9 months. Developmental Psychology 23:68–77.
86. Field T, Healy B, Goldstein S, & Guthertz M (1990) Behavior-state matching and synchrony in mother-infant interactions of nondepressed versus depressed dyads. Developmental Psychology 26:7–14.
87. Hsu HC & Fogel A (2001) Infant vocal development in a dynamic mother-infant communication system. Infancy 2(1):87–109.
88. Jaffe J, Beebe B, Feldstein S, Crown CL, & Jasnow MD (2001) Rhythms of dialogue in infancy: Coordinated timing in development (Univ of Chicago Press, Chicago).
89. Stern DN (1974) Mother and infant at play: The dyadic interaction involving facial, vocal, and gaze behaviors. The effect of the infant on its caregiver, eds Lewis M & Rosenblum LA (Wiley, New York), pp 187–213.
90. Ginsburg GP & Kilbourne BK (1988) Emergence of vocal alternation in mother-infant interchanges. Journal of Child Language 15:221–235.
91. Cohn JF, Campbell SB, Matias R, & Hopkins J (1990) Face-to-face interactions of postpartum depressed and nondepressed mother-infant pairs at 2 months. Developmental Psychology 26(1):15–23.
92. Zlochower AJ & Cohn JF (1996) Vocal timing in face-to-face interaction of clinically depressed and nondepressed mothers and their 4-month-old infants. Infant Behavior & Development 19:371–374.
93. Bloom K (1988) Quality of adult vocalizations affects the quality of infant vocalizations. Journal of Child Language 15:469–480.
94. Goldstein MH & Schwade JA (2008) Social feedback to infants’ babbling facilitates rapid phonological learning. Psychological Science 19:515–522.
95. Buder EH, Warlaumont AS, Oller DK, & Chorna LB (2010) Dynamic indicators of mother-infant prosodic and illocutionary coordination. in Proceedings of Speech Prosody (Speech Prosody, Chicago).
96. Arbib MA, Liebal K, & Pika S (2008) Primate vocalization, gesture, and the evolution of human language. Current Anthropology 49(6):1053–1076.
97. Corballis MC (2002) From Hand to Mouth: The Origins of Language (Princeton University Press, Princeton, NJ).
98. Hewes GW (1973) Primate communication and the gestural origin of language. Current Anthropology 14(1–2):5–24.
99. Tomasello M (1996) The gestural communication of chimpanzees and human children. (Waseda University International Conference Center, Tokyo).
100. Gardner RA, Gardner BT, & Van Cantfort TE eds (1989) Teaching sign language to chimpanzees (SUNY Press, Albany, NY).
101. Call J (2008) How apes use gestures: The issue of flexibility. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 235–252.
102. Trevarthen C (2001) Infant intersubjectivity: Research, theory, and clinical applications. Journal of Child Psychology and Psychiatry 42:3–48.
103. McNeill D, Bertenthal B, Cole J, & Gallagher S (2005) Gesture-first, but no gestures? Behavioral & Brain Sciences 28:138–139.
104. McNeill D (1992) Hand and mind: What gestures reveal about thought (University of Chicago Press, Chicago).
105. Buder EH & Stoel-Gammon C (2002) Young children's acquisition of vowel duration as influenced by language: Tense/lax and final stop consonant voicing effects. Journal of the Acoustical Society of America 111:1854–1864.
106. Buder EH & Stoel-Gammon C (1993) Obtaining valid and reliable acoustic measures of children's vowel productions. American Speech-Language-Hearing Association.
107. Winholtz WS & Titze IR (1997) Conversion of a head-mounted microphone signal into calibrated SPL units. Journal of Voice 11:417–421.
108. Oller DK (2010) All-day recordings to investigate vocabulary development: A case study of a trilingual toddler. Communication Disorders Quarterly 31(4):213–222.
109. Oller DK, et al. (2010) Automated Vocal Analysis of Naturalistic Recordings from Children with Autism, Language Delay and Typical Development. Proceedings of the National Academy of Sciences 107(30):13354–13359.
110. Zimmerman F, et al. (2009) Teaching By Listening: The Importance of Adult-Child Conversations to Language Development. Pediatrics 124:342–349.
111. Milenkovic P (2001) TF32 (University of Wisconsin-Madison, Madison, WI).
112. Lynch MP, Oller DK, Steffens ML, & Buder EH (1995) Phrasing in prelinguistic vocalizations. Developmental Psychobiology 28:3–23.
113. Ekman P & Friesen W (1978) The Facial Action Coding System (Consulting Psychologists Press, Palo Alto, CA).
114. Kojima S & Nagumo S (1996) Early vocal development in a chimpanzee infant. Primate Institute, Inuyama.
115. Panksepp J (2000) The riddle of laughter: Neuronal and psychoevolutionary underpinnings of joy. Current Directions in Psychological Science 9:183–186.
116. Provine RR (1996) Laughter. American Scientist 84:38–45.
117. Cheney DL & Seyfarth RM (1996) Function and intention in the calls of non-human primates. Proceedings of the British Academy 88:59–76.
118. Slocombe KE, Waller B, & Liebal K (2011) The language void: The need for multimodality in primate communication research. Animal Behaviour 81(5):919–924.
119. Locke JL (2006) Parental selection of vocal behavior: Crying, cooing, babbling, and the evolution of language. Human Nature 17:155–168.
120. Hauser MD (1996) The evolution of communication (MIT, Cambridge, MA).
121. Griebel U & Oller DK (2008) Evolutionary forces favoring contextual flexibility. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 9–40.
122. Fernald A (1989) Intonation and communicative intent in mothers' speech to infants: Is the melody the message? Child Development 60:1497–1510.
123. Papoušek M (1994) Vom ersten Schrei zum ersten Wort: Anfänge der Sprachentwickelung in der vorsprachlichen Kommunikation (Verlag Hans Huber, Bern).
124. Papoušek M, Bornstein MH, Nuzzo C, Papoušek H, & Symmes D (1990) Infant responses to prototypical melodic contours in parental speech. Infant Behavior and Development 13:539–545.
125. Crockford C, Wittig RM, Mundry R, & Zuberbühler K (2012) Wild chimpanzees inform ignorant group members of danger. Current Biology 22:142–146.
126. Crockford C & Boesch C (2005) Call combinations in wild chimpanzees. Behaviour 142(4):397–421.
127. Clay Z & Zuberbühler K (2011) Bonobos Extract Meaning from Call Sequences. PLoS One 6(4):1–10.
128. Goodall J (1986) The chimpanzees of Gombe (The Belknap Press of Harvard University Press, Cambridge, MA).
129. Marler P (1976) Social organization, communication and graded signals: The chimpanzee and the gorilla. Growing points in ethology, eds Bateson PPG & Hinde RA (Cambridge University Press, Cambridge, UK), pp 239–280.
130. Slocombe KE & Zuberbühler K (2005) Agonistic screams in wild chimpanzees (Pan troglodytes schweinfurthii) vary as a function of social role. Journal of Comparative Psychology 119(1):67–77.
131. Zuberbühler K, Ouattara K, Bitty A, Lemasson A, & Noë R (2009) The primate roots of human language: Primate vocal behaviour and cognition in the wild. Becoming eloquent: Advances in the emergence of language, human cognition, and modern cultures, eds d'Errico F & Hombert J-M (John Benjamins, Amsterdam), pp 235–266.
132. Notman H & Rendall D (2005) Contextual variation in chimpanzee pant hoots and its implications for referential communication. Animal Behaviour 70:177–190.
133. Messinger D, Fogel A, & Dickson KL (2001) All smiles are positive, but some smiles are more positive than others. Developmental Psychology 37(5):642–653.
134. Sroufe L & Waters E (1976) The ontogenesis of smiling and laughter: A perspective on the organization of development in infancy. Psychological Review 83:173–189.
135. Sroufe LA (1995) Emotional Development: The Organization of Emotional Life in the Early Years (Cambridge University Press, Cambridge).
136. Fraiberg S (1979) Blind infants and their mothers: An examination of the sign system. Before Speech: The beginning of interpersonal communication, ed Bullowa M (Cambridge University Press, Cambridge, UK), pp 147–169.
137. Davila Ross M, Owren MJ, & Zimmermann E (2010) The evolution of laughter in great apes and humans. Communicative & Integrative Biology 3(2):191–194.
138. Dunbar RIM (2004) Language, music and laughter in evolutionary perspective. The Evolution of Communication Systems: A Comparative Approach, eds Oller DK & Griebel U (MIT Press), pp 257–274.
139. Sroufe LA & Wunsch J (1972) The development of laughter in the first year of life. Child Development 43:1326–1344.
140. Feldman R (2007) Parent–Infant Synchrony: Biological Foundations and Developmental Outcomes. Current directions in psychological science 16(6):340–345.
141. Pipp S & Harmon RJ (1987) Attachment as regulation: a commentary. Child Development 58:648–652.
142. Oller DK & Griebel U (2008) The origins of syllabification in human infancy and in human evolution. Syllable Development: The Frame/Content Theory and Beyond, eds Davis B & Zajdo K (Lawrence Erlbaum and Associates, Mahwah, NJ), pp 368–386.
143. Oller DK & Griebel U (2008) Complexity and flexibility in infant vocal development and the earliest steps in the evolution of language. Evolution of Communicative Flexibility: Complexity, Creativity and Adaptability in Human and Animal Communication, eds Oller DK & Griebel U (MIT Press, Cambridge, MA), pp 141–168.
144. Damasio A (1999) The feeling of what happens: Body and emotion in the making of consciousness (Harcourt Brace and Co., New York).
145. Knoke D & Burke PJ (1980) Log-linear Models (Sage, London, UK).