
JSLHR

    Article

A Flexible Analysis Tool for the Quantitative Acoustic Assessment of Infant Cry

    Brian Reggiannini,a Stephen J. Sheinkopf,b Harvey F. Silverman,a Xiaoxue Li,a and Barry M. Lesterb

Purpose: In this article, the authors describe and validate the performance of a modern acoustic analyzer specifically designed for infant cry analysis.

Method: Utilizing known algorithms, the authors developed a method to extract acoustic parameters describing infant cries from standard digital audio files. They used a frame length of 25 ms with a frame advance of 12.5 ms. Cepstral-based acoustic analysis proceeded in 2 phases, computing frame-level data and then organizing and summarizing this information within cry utterances. Using signal detection methods, the authors evaluated the accuracy of the automated system to determine voicing and to detect fundamental frequency (F0) as compared to voiced segments and pitch periods manually coded from spectrogram displays.

Results: The system detected F0 with 88% to 95% accuracy, depending on tolerances set at 10 to 20 Hz. Receiver operating characteristic analyses demonstrated very high accuracy at detecting voicing characteristics in the cry samples.

Conclusions: This article describes an automated infant cry analyzer with high accuracy to detect important acoustic features of cry. A unique and important aspect of this work is the rigorous testing of the system's accuracy as compared to ground-truth manual coding. The resulting system has implications for basic and applied research on infant cry development.

Key Words: cry, acoustics, analysis, infants, developmental disorders

Acoustic analysis of infant cry has been a focus of clinical and developmental research for a number of decades. A variety of approaches to cry analysis have been employed, but each has its drawbacks. Over time, advances in computing have allowed for increased power and flexibility in acoustic analysis, including the ability to utilize robust techniques for the accurate estimation of fundamental frequency and other acoustic features of cry vocalizations. This article describes the development and validation of a modern tool for the acoustic analysis of infant cry.

Applied and clinical studies of infant cry have examined features of cry production that may discriminate babies with specific conditions or medical risks. For example, there has been considerable interest in utilizing infant cry analysis as a measure of developmental status in babies with pre- and perinatal risk factors, such as prenatal substance exposure (e.g., Lester et al., 2002) or premature birth (e.g., Goberman & Robb, 1999). There has also been interest in utilizing infant cry to identify babies at risk for developmental or medical

conditions including hearing impairment (Várallyay, Benyó, Illényi, Farkas, & Kovács, 2004) and, recently, autism spectrum disorders (Esposito & Venuti, 2010; Sheinkopf, Iverson, Rinaldi, & Lester, 2012).

Early studies in this area relied on visual inspection of sound spectrograms in order to describe acoustic features of the cry in specific clinical populations (e.g., Karelitz & Fisichelli, 1962; Lind, Vuorenkoski, Rosberg, Partanen, & Wasz-Höckert, 1970; Prechtl, Theorell, Gramsbergen, & Lind, 1969; Vuorenkoski et al., 1966). Manual inspection of spectrograms has been seen as a gold-standard method for detecting acoustic features in cry sounds, including the timing and onset of cry vocalizations and the fundamental frequency (F0) of cry. Such visual inspection has also been used to describe melodic variations in F0 across an utterance, and voicing or periodicity in the cry utterance. However, although these manual approaches have the advantage of being able to robustly detect F0, the trade-off is that the process is slow, limiting the amount of data that can be analyzed in any one study.

More recent approaches have utilized computer-assisted methods or commercially available speech analysis software packages to code acoustic aspects of infant cry. For example, one approach uses a computer cursor that is moved along a digitally displayed sound spectrogram to quantify aspects of cry duration and selects locations on the spectrogram for acoustic analysis (Goberman & Robb, 1999; Grau, Robb, & Cacace, 1995; Wermke & Robb, 2010; Zeskind & Barr, 1997). In this way, resulting portions of the cry can be subjected to a fast Fourier transform (FFT) to yield information from the

aBrown University, Providence, Rhode Island
bWomen and Infants Hospital of Rhode Island, Providence

    Correspondence to Stephen J. Sheinkopf: [email protected]

Editor: Jody Kreiman
Associate Editor: Megha Sundara

Received November 27, 2011
Revision received June 29, 2012
Accepted January 8, 2013
DOI: 10.1044/1092-4388(2013/11-0298)


power spectrum in the cry (e.g., maximum/minimum F0). As we noted previously, abstracting quantitative data from spectrograms by this method is time consuming (LaGasse, Neal, & Lester, 2005). In addition, these approaches utilize speech analysis tools that were designed to extract acoustic information from adult speech. Given the anatomical differences in the vocal tract of infants, there is a need for tools designed to track and extract the F0 and other acoustic features from infant cries specifically. Finally, advances in computing capacity now allow researchers to utilize methods for quantifying aspects of the sound spectrum that can be expected to yield more accurate estimates of F0 and related parameters.

Automated approaches have the advantage of fast analysis of very large data sets, objective assessment, the ability to quantify multiple data points, and the flexibility to yield derivative measures (e.g., jitter, pitch contours, etc.), which have the potential to increase the applied value and clinical utility of cry assessment. These advantages notwithstanding, past automated approaches have also suffered from weaknesses due to limits in computing power. This has resulted in, for example, difficulties in signal detection and in not being able to pinpoint important cues when multiple analyses are required to do so. Automated detection of the F0 of infant cry is a difficult challenge, and the accuracy of F0 detection has not been fully reported in many approaches.

There are multiple challenges to the design of an automated acoustic analysis of infant cry. Past researchers have developed automated analysis systems intended for the quantification of cry acoustics in larger samples of babies and with a method that minimizes the need for visual inspection or manual processing of data. For example, researchers have utilized automated application of FFTs to detect F0 in cry utterances or across whole cry episodes (Branco, Fekete, Rugolo, & Rehder, 2007; Corwin et al., 1992; Lester et al., 1991). In this general approach, digitized and filtered samples are subjected to an FFT in order to compute the log magnitude spectrum for analysis blocks of specific lengths (e.g., 25 ms). Summary variables for each analysis block can then be aggregated to yield summary statistics for a cry episode and for the individual cry utterances (a single expiratory period) within cry episodes (a series of expiratory periods).

Alternative automated approaches have also been recently described. Manfredi and colleagues (Manfredi, Bocchi, Orlandi, Spaccaterra, & Donzelli, 2009; Manfredi, Tocchioni, & Bocchi, 2006) have described a method that utilizes simple inverse filter tracking (SIFT) of short, fixed-length analysis frames followed by adaptive estimation of F0 using 5- to 15-ms frames, varied in proportion to the changing F0 of the signal. Várallyay et al. (2004) used what they termed a smoothed spectrum method to detect F0 for the purpose of identifying infants with possible hearing loss.

These automated approaches have the potential to speed scientific inquiry, allowing for the study of large numbers of infants with efficient and rapid analyses. In addition, they bypass the need for manual inspection of spectrograms and do not require the time-consuming task of cursor placement and frame selection used in some computer-assisted methods. In this way, automated approaches allow for more rapid

analysis that is less prone to observer bias or coding errors. Also, because these systems were developed specifically for the study of infant cry, there is the assumption that these algorithms accurately track F0 and differentiate voiced from unvoiced utterances. However, with the exception of a study of the efficiency of F0 detection by Várallyay et al. (2004), formal studies of the actual accuracy of measurement have not been conducted. Moreover, the automated approaches described above often utilize older signal processing approaches (FFT, SIFT). More modern approaches may now be employed given advances in computing power.

In this article, we describe the development and validation of a cry analysis tool that utilizes robust methods for voicing determination along with a cepstral analysis for the detection and tracking of F0. Investigating the validity of automated acoustic assessment of cry can be thought of as studying the sensitivity and specificity of an automated method of detecting the signal periodicity that constitutes F0. We have developed a new, robust tool for extracting acoustic information from digitally recorded infant cries, and we evaluated its validity by describing the sensitivity and specificity of the automated system to detect F0 (pitch periods) in comparison to the pitch periods manually coded by a trained observer from a sound spectrogram (oscilloscope view). In addition, we describe a method for categorizing voiced versus typically short, unvoiced utterances, or segments of utterances that are unvoiced. This includes a quantification of the confidence of the voicing determination, which can be adapted by researchers depending on the scientific questions being addressed. Finally, we describe the detailed output from this system that can be easily subjected to statistical analysis for hypothesis testing. This system is detailed and flexible enough to allow researchers to describe infant cries at the utterance level while also producing detailed frame-by-frame acoustic output.

Method

In developing our analysis tool, we began by defining

the range of measures or outputs that we wanted to examine, basing this list on prior cry analysis work. These included parameters to characterize F0, amplitude or energy of cry, timing variables (latency, onset, duration, interutterance interval, etc.), and formants (while acknowledging difficulty in measurement). In addition to the kinds of variables used in prior automated analyses, we also had the aim of using the F0 tracking to model the shape or contour of F0 across each cry utterance in a cry episode. This would be similar to the studies of cry melody in some past research, such as Wermke, Leising, and Stellzig-Eisenhauer (2007), in which F0 was characterized as rising then falling, or having other contours across a cry utterance.

Acoustic characteristics of infant crying are determined by the complex interplay of neural, physiological, and anatomical factors manifested in the properties of the driving function of the cry (Newman, 1988). Notably, periodicity of the glottal motions determines properties such as the pitch or the amount of voicing/excited turbulence in a cry. The shape of the vocal tract determines the resonant frequencies


(formants) of the spectrum of the cry at a given instant. Important acoustic properties of infant cry include F0, defined as the fundamental frequency of the glottal excitation (vibrations of the vocal folds), and the formant frequencies, defined as resonances of the vocal tract. Non-vocal-fold-driven turbulences need also to be detected and categorized in a suitable analysis system.

Our system is run as two sequential programs: Phase I analyzes the digitized data to produce a set of parameters for each consecutive 12.5-ms frame. Phase II takes the Phase I output as input and produces an output record for each consecutive group of frames that has similar properties. The analysis tool is currently implemented in MATLAB, but it is easily adaptable for any embedded processor. The analyzer assumes the input is a .wav file, sampled at 48 ks/s with 16 bits per sample (768 kbits/s). Using these high sampling and quantization parameters ensures that all cues are captured and that there is sufficient headroom for dynamic range differences. In this study, we recorded cry samples using an Olympus DM-520 digital voice recorder (Olympus Imaging America, Inc., Center Valley, PA). This standardized input format can easily be replicated in other studies.

Phase I takes the .wav files and produces a comma-separated value (CSV) file that is readable not only by the Phase II program but also by programs such as Microsoft Excel. In the first phase of the analysis system, all outputs relate to the unit of a fixed-length, fixed-advance frame described by 22 numerical parameters. Thus, because each number has a 32-bit representation, this implies a data rate of only 56.32 kbits/s, a significant reduction. The first two lines of each Phase I output file are headers. The fields for the header record are defined in Table 1. We needed to address the fact that there are many useful (older) infant cry samples that have been recorded on analog tape. However, the recording quality of these tapes may vary. Thus, a preliminary automatic scan of a digitized recording has been designed to ascertain a recording's quality based on background noise (usually hum), signal-to-noise ratio (SNR; as determined by the ratio of an average amplitude for high-energy events to the amplitude of easily identifiable silence regions), and a detection of saturation at some phase of the recording process. The mean value of the recording, an estimate of the dynamic range,

and a classification of the quality of the file (high quality, noisy, low level, analog saturated, digital saturated) are all put into the header file for the Phase I system. The rest of the output file consists of fixed-length records, one record per frame, as defined in Table 2.

We use a fixed frame length of 1,200 samples (25 ms) with a frame advance of 600 samples (12.5 ms) to keep reasonably high resolution in both time and frequency. Given today's technology, the analysis system was designed to be liberal with its use of computation so as to reflect resultant parameters more accurately. Thus, three discrete Fourier transforms are computed for each 1,200-point frame. The middle 768 points are transformed for the F0 estimate as explained below. The full frame (1,200 points) is transformed for amplitude computations, and an interpolated transform of 4,096 points (1,448 zeros, the 1,200-point frame, and 1,448 zeros) is used to detect F0 above 1 kHz (what we term hyper-pitch).
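To make the framing arithmetic concrete, here is a minimal sketch in Python/NumPy (the published tool is in MATLAB; the name frame_spectra and the log-zero guard are illustrative choices, not part of the original system) of the 25-ms/12.5-ms framing and the three per-frame transforms:

```python
import numpy as np

FS, FRAME_LEN, FRAME_ADV = 48000, 1200, 600  # 48 kHz; 25-ms frames; 12.5-ms advance

def frame_spectra(x):
    """Yield, per frame, the three log-magnitude spectra described in the text."""
    n_frames = 1 + (len(x) - FRAME_LEN) // FRAME_ADV
    for i in range(n_frames):
        frame = x[i * FRAME_ADV : i * FRAME_ADV + FRAME_LEN]
        mid768 = frame[216:984] * np.hamming(768)    # middle 768 points -> F0 estimate
        full = frame * np.hamming(FRAME_LEN)         # all 1,200 points -> amplitudes
        padded = np.concatenate([np.zeros(1448), full, np.zeros(1448)])  # 4,096 -> hyper-pitch
        eps = 1e-12                                  # guard against log(0)
        yield (np.log10(np.abs(np.fft.rfft(mid768)) + eps),
               np.log10(np.abs(np.fft.rfft(full)) + eps),
               np.log10(np.abs(np.fft.rfft(padded)) + eps))
```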

The Phase II program takes the Phase I data as input and reduces the data further, separating them into groups of frames having similar properties, which we call sound segments. The CSV output has a record for each of these groups of frames. The concatenated groups of frames are labeled to be one of the following classes:

    1. silence

    2. short utterances (length < 0.5 s, relatively high energy)

    3. long utterances (length > 0.5 s, high energy)

The output from Phase II contains information summarizing utterance-level characteristics of infant cries, and thus the Phase II output is expected to be most useful for studies of crying in various infant populations. Phase I accuracy has been carefully tested for this article because it is upon this phase that the validity of the summary output rests.

    Table 1. Initial header record definitions.

Field  Definition
1      Phase I or Phase II (text)
2      Subject (text)
3      Subject description (text)
4      Number of zero frames
5      Mean value of recording [0, 32767]
6      1% dynamic range (dB)
7      1% dynamic range [0, 32767]
8      5% dynamic range (dB)
9      5% dynamic range [0, 32767]
10     10% dynamic range (dB)
11     10% dynamic range [0, 32767]
12     Quality class

Table 2. Phase I: Definition of fields of the per-frame record.

Field  Definition
1      Frame number
2      Time (ms)
3      F0 (Hz)
4      F0 amplitude (dB)
5      F0 confidence [0, 1]
6      Hyper-pitch (Hz) ([1, 5] kHz range)
7      Hyper-pitch amplitude (dB)
8      Hyper-pitch confidence [0, 1]
9      Peak pitch amplitude (dB)
10     Overall amplitude (dB)
11     Amplitude [0.5, 10] kHz (dB)
12     Amplitude [0, 0.5] kHz (dB)
13     Amplitude [0.5, 1] kHz (dB)
14     Amplitude [1, 2.5] kHz (dB)
15     Amplitude [2.5, 5] kHz (dB)
16     Amplitude [5, 10] kHz (dB)
17     F1 (Hz)
18     Amplitude of F1 (dB)
19     F2 (Hz)
20     Amplitude of F2 (dB)
21     F3 (Hz)
22     Amplitude of F3 (dB)


Phase I System

There are several approaches that can be used for pitch

detection, and the more common of these methods are based on (a) time-event rate detection (Ananthapadmanabha & Yegnanarayana, 1975; Rader, 1964; Smith, 1954, 1957), (b) autocorrelation methods (Dubnowski, Schafer, & Rabiner, 1976; Gill, 1959; Rabiner, 1977; Stone & White, 1963), and (c) frequency domain methods. Time-event rate detection methods are based on the fact that if an event is periodic, then there are extractable time-repeating events that can be counted, and the number of these events per second is related to the frequency. Autocorrelation methods are used as a measure of the consistency or sameness of a signal with itself at different time delays; the peak of the time-delay value is returned as the pitch period. Finally, frequency domain approaches include methods such as comb filters (filters in which a signal is subtracted from itself at different time-delay values; Martin, 1981), tunable infinite impulse response (IIR) filters (Baronin & Kushtuev, 1971), and cepstrum analysis (Bogert, Healy, & Tukey, 1963; Noll, 1967).

The time-event rate detection methods are extremely simple and easy to implement. However, they have immense difficulties dealing with spectrally complex signals such as human speech or a baby's cry. The autocorrelation and the first two frequency domain methods are also more suitable for cleaner signals (e.g., sounds produced by musical instruments). Perhaps the method most widely used for obtaining F0 for adult speech is cepstrum analysis. When applied correctly, it has proven to be a robust method for describing acoustic properties of noninfant vocalizations, and it should be suitable for the complex vocalic signals of infant cry. The resulting cepstral coefficients are the standard features for speech recognition algorithms. Accordingly, we have selected cepstrum analysis to develop the cry analysis algorithm in this project.

It is accepted that a normal infant cry F0 range is 200 Hz to 1 kHz, or a pitch-period range of 5 ms to 1 ms. Because pitch-period estimates are obtained using a modified version of the cepstrum method (Noll, 1967), several pitch periods are required within each frame to make the short time frame appear periodic. Thus, to get a minimum of three pitch periods (and a nice number for an FFT), we selected a fixed frame of 768 points (or 16 ms for 48 kHz sampling) of each 1,200-point frame and a 768-point Hamming window. A larger window will cause the cepstral pitch peak to broaden for the higher F0 values, and a smaller window will not have as good cepstral peaks for low values of F0. The Hamming window will broaden the harmonic peaks but eliminate most of the effects due to sidelobe accumulations. This analysis strategy was decided upon in order to capture four to eight pitch periods per frame. Given the nature of infant cry, greater frame lengths would decrease the reliability of pitch-period estimation. Thus, we had to modify the basic technique in order to compensate for the unique characteristics of infant cry. The first change was to apply a frequency window W[r], effectively limiting the band to be considered to be from 200 Hz to 2,200 Hz, to the log-spectrum before computing the inverse discrete Fourier transform (IDFT). Because energy in voiced speech naturally falls off after 4 kHz, the spectral harmonic

structure is amplitude modulated by the roll-off function, which can cause multiple peaks in the cepstrum when the sampling rate exceeds 8 kHz. Applying a frequency window smoothes the cepstrum, eliminating these modulation effects. The window also deemphasizes low- and high-frequency noise. The effects of the frequency window are depicted in Figure 1, specifically in Panel (c), in which the pitch period is easy to identify, although a second rahmonic is also evident (the term rahmonic refers to harmonics in the cepstral domain).
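A minimal sketch of this frequency-windowed cepstral F0 estimator is given below (Python/NumPy, not the authors' MATLAB; a simple rectangular 200-2,200 Hz band is assumed for W[r], since the exact taper of the window is not specified here):

```python
import numpy as np

FS = 48000

def cepstral_f0(frame768, fmin=200.0, fmax=1000.0):
    """Frequency-windowed cepstrum: returns (F0 in Hz, cepstral peak value)."""
    n = len(frame768)                               # 768 points (16 ms at 48 kHz)
    logmag = np.log(np.abs(np.fft.fft(frame768 * np.hamming(n))) + 1e-12)
    freqs = np.fft.fftfreq(n, d=1.0 / FS)
    W = ((np.abs(freqs) >= 200.0) & (np.abs(freqs) <= 2200.0)).astype(float)
    cep = np.fft.ifft(logmag * W).real              # windowed log-spectrum -> cepstrum
    qmin, qmax = int(FS / fmax), int(FS / fmin)     # 48..240 samples (1- to 5-ms periods)
    q0 = qmin + int(np.argmax(cep[qmin:qmax + 1]))  # quefrency of the pitch peak
    return FS / q0, cep[q0]
```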

It is noted that infants generally do not double or halve their pitch frequency nearly instantaneously during voiced portions of a cry vocalization. Thus, by considering multiple frames at once, many F0 doubling and halving estimation errors can be eliminated. We consider halving and doubling errors to be those that occur for one or two frames, which would imply very rapid changes in pitch frequency. It is these that we try to eliminate, not the longer doubling or halving regions that appear when even or odd harmonics disappear in the spectrogram. A dynamic-programming smoother is a reasonable mechanism to ensure continuity in the F0 estimates at transitions and many other anomalies. This is not a new idea (Secrest & Doddington, 1982), although our implementation is specifically set up for infant cries. In our implementation, 50-frame blocks (0.625 s) are run through the dynamic-programming algorithm after determining F0 and a confidence measure for independent frames. The last 50 frames of the recorded cry constitute the last block. As the number of frames is not likely to be divisible by 50, there is some special processing due to overlap for the last block. All negative cepstral values are truncated to zero, and the accumulated path metric is simply the sum of the 50 cepstral values built in the normal forward part of the dynamic-programming algorithm. The pitch period is allowed to change no more than plus or minus 20 cepstral points (0.416 ms) per frame. The backtracked path is used for the initial estimates for F0. Following the dynamic programming, some further outliers (typically at utterance transitions) are eliminated using a standard five-point median filter. The result is pitch-period estimate q0[i] for Frame i, and pitch frequency (Data Element 3 as in Table 2) is simply F0[i] = fs/q0[i], where fs is the sampling frequency. Data Element 4, pitch energy, is the cepstral value of q0[i], C[q0[i], i].
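The smoother can be sketched as a small Viterbi-style search over the quefrency trellis (Python/NumPy; the block handling and path metric here are simplified relative to the description above):

```python
import numpy as np

def dp_smooth_pitch(C, qmin=48, qmax=240, max_step=20):
    """C: (n_frames, n_quefrencies) cepstra. Returns a smoothed quefrency path
    whose per-frame step is limited to +/- max_step cepstral points."""
    C = np.maximum(C[:, qmin:qmax + 1], 0.0)        # truncate negative values to zero
    n, m = C.shape
    score, back = C[0].copy(), np.zeros((n, m), dtype=int)
    for i in range(1, n):                           # forward pass: best predecessor
        new = np.empty(m)
        for q in range(m):
            lo, hi = max(0, q - max_step), min(m, q + max_step + 1)
            best = lo + int(np.argmax(score[lo:hi]))
            back[i, q], new[q] = best, score[best] + C[i, q]
        score = new
    path = np.empty(n, dtype=int)                   # backtrack the best path
    path[-1] = int(np.argmax(score))
    for i in range(n - 1, 0, -1):
        path[i - 1] = back[i, path[i]]
    return path + qmin

# In the full system this runs on 50-frame blocks, and the resulting F0 track
# (48000 / quefrency) is then passed through a five-point median filter.
```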

Instead of using amplitude alone, the pitch-estimation system is also well suited for making voicing decisions for each frame. Data Element 5 in Table 2 is a pseudoprobability for voicing based on the cepstral analysis. For cepstrum C[q, i] and pitch-period estimate q0[i], the traditional cepstrum method uses C[q0[i]] as a measure of voicing. This measure has been found to fluctuate under different noise conditions, making it difficult to find a reliable threshold for a multi-environment system. Instead, we use an SNR-like measure to make a voicing decision. This measure is based on the height of the cepstral peak with respect to the cepstrum noise level. The window W[r] effectively smoothes the cepstrum of length N by a factor of D, where

D = N / ( Σ_{r=0}^{N-1} W[r] ).   (1)


This smoothing causes peaks in the cepstrum to have a width of approximately D + 1 samples. This information is used to compute the voicing confidence measure V, which is a function of C[q0[i], i] and its surroundings. The cepstrum method searches for evidence of periodicity within a finite pitch-period range based on knowledge of human F0 production. In this method, qmin and qmax are the minimum and maximum pitch-period (quefrency) indices in the search region. These are fixed and do not vary with the frame index i. The voicing-detection algorithm begins by zeroing out all negative C[q] values and all values outside the region q ∈ [qmin, qmax] in the cepstrum C[q, i]. This nonnegative cepstrum is denoted as Ĉ[q, i], and let D̂ = D. Pitch-period estimate q0[i] is chosen to correspond to the maximum value of Ĉ[q, i], as is done in the traditional method. Then, the voicing confidence V[q0[i], i] is defined as

V[q0[i], i] = ( Σ_{r=1}^{R} Σ_{d=-D̂}^{D̂} Ĉ[r·q0[i] + d, i]² ) / ( Σ_{j=qmin}^{qmax} Ĉ[j, i]² ),   (2)

where R is the number of rahmonics to include. It was found that R = 3 was sufficient, because larger rahmonics were often insignificantly small.

V[q0[i], i] is a number between 0 and 1. Values of V[q0[i], i] corresponding to high-quefrency (low-frequency) pitch-period estimates tend to have smaller magnitudes because fewer rahmonics fall within the search interval [qmin, qmax]. The decision threshold, α[q0[i]], depends linearly (from 0.7 at qmin to 0.5 at qmax) on the index of the current pitch-period estimate q0[i]. In the Phase II program, a frame would be labeled as voiced if V[q0[i], i] ≥ α[q0[i]], perhaps along with some amplitude criteria:

α[q0] = (0.2 / (qmax - qmin)) (qmin - q0) + 0.7.   (3)

In addition to being more robust to different noise conditions, V[q0[i], i] also protects against doubling errors by including the magnitude of cepstral content away from the peak. Although doubling errors will not be corrected by using this method, it was ultimately found that ignoring such difficult frames by labeling them unvoiced was sufficient for the task at hand.
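Under this reading of Equations 2 and 3, the confidence and threshold can be sketched as follows (Python/NumPy; d_hat, the peak half-width from Equation 1, and the default quefrency bounds are illustrative assumptions):

```python
import numpy as np

def voicing_confidence(cep, q0, qmin=48, qmax=240, d_hat=4, R=3):
    """SNR-like voicing confidence (Equation 2): rahmonic peak energy over the
    total energy of the nonnegative cepstrum in the search region."""
    C = np.maximum(cep, 0.0)                    # zero negative values ...
    C[:qmin], C[qmax + 1:] = 0.0, 0.0           # ... and values outside [qmin, qmax]
    num = 0.0
    for r in range(1, R + 1):                   # rahmonics at r * q0
        for d in range(-d_hat, d_hat + 1):
            idx = r * q0 + d
            if qmin <= idx <= qmax:
                num += C[idx] ** 2
    den = np.sum(C[qmin:qmax + 1] ** 2) + 1e-12
    return num / den

def alpha_threshold(q0, qmin=48, qmax=240):
    """Decision threshold (Equation 3): 0.7 at qmin, falling linearly to 0.5 at qmax."""
    return 0.2 / (qmax - qmin) * (qmin - q0) + 0.7
```

A frame would then be called voiced when voicing_confidence(...) ≥ alpha_threshold(q0).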

There is a mode in an infant's cry when the fundamental frequency is above 1000 Hz, which we call hyper-pitch (Golub, 1989; LaGasse et al., 2005). Thus we attempt to determine a set of hyper-pitch values for each frame. We use a Hamming-windowed 4,096-point DFT with the full 1,200-point frame data in the center of inserted zeros to compute an interpolated spectrum and search its log magnitude for peaks in the range of 1000 Hz to 5000 Hz. The highest peak P[1] in the range is found first, and, because the lowest hyper-pitch is 1000 Hz, the spectrum is masked from max[1000, P[1, i] - 1000] to

Figure 1. Panel (a): An example of a voiced infant cry spectrum. Panel (b): Nonwindowed cepstrum of the same frame showing the range inspected for rahmonics. Panel (c): Windowed cepstrum showing the range inspected for rahmonics.


min[5000, P[1, i] + 1000] and searched for another peak. This process is repeated until three such peaks have been found, P[k, i], where k denotes the individual elements of the set of three peaks (k ∈ [1, 3]). The set is then reordered to be left to right as P̂[k, i]. It is hypothesized that the three peaks form some harmonic set, and the frequency differences are taken, yielding a hyper-pitch value Fhp[i] = 0.5(P̂[3, i] - P̂[1, i]). If only two peaks can be found, then Fhp[i] = P̂[2, i] - P̂[1, i]. There is a special case when the hyper-pitch is about 1000 Hz and the odd harmonics dominate. In this case, the minimum difference between peaks is taken as the hyper-pitch frequency. An example of a spectrum for a frame driven by hyper-pitch is shown in Figure 2.

The hyper-pitch energy (seventh value in the record) is simply taken as the average of the fundamental hyper-pitch value and two of its harmonics. It is not necessarily that of the average of the peaks. The hyper-pitch confidence (eighth value in the record) is determined in a similar fashion to that of the confidence in the normal pitch range. It is a number between 0 and 1 that correlates well with the validity of the hyper-pitch condition being active in the frame. For this result the power, A, not the log power, is accumulated for the range 1000 Hz to 5000 Hz, and the power in the detected peaks, B (up to four in the range), is also accumulated. The power for a given peak is accumulated over about 30 interpolated points, or about 360 Hz, about the peak. The ratio B/A is the confidence measure.
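The peak-picking and masking procedure might look like the following sketch (Python/NumPy; hyper_pitch is an illustrative name, and the special odd-harmonic case is omitted):

```python
import numpy as np

FS, NFFT = 48000, 4096

def hyper_pitch(logmag):
    """Estimate hyper-pitch from a one-sided 4,096-point log-magnitude spectrum."""
    freqs = np.arange(len(logmag)) * FS / NFFT
    work = np.where((freqs >= 1000.0) & (freqs <= 5000.0), logmag, -np.inf)
    peaks = []
    for _ in range(3):                               # up to three dominant peaks
        k = int(np.argmax(work))
        if not np.isfinite(work[k]):
            break
        peaks.append(freqs[k])
        lo = max(1000.0, freqs[k] - 1000.0)          # mask +/-1000 Hz around the peak
        hi = min(5000.0, freqs[k] + 1000.0)
        work[(freqs >= lo) & (freqs <= hi)] = -np.inf
    peaks.sort()                                     # reorder left to right
    if len(peaks) >= 3:
        return 0.5 * (peaks[2] - peaks[0])           # peaks taken as a harmonic set
    if len(peaks) == 2:
        return peaks[1] - peaks[0]
    return None                                      # no credible hyper-pitch found
```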

Fields 10 to 16 of the record give the amplitudes in dB for the entire band and for the six subbands listed above. The full Hamming-windowed 1,200-point DFT output is used to accumulate the power in each prescribed band (and overall).

Those values are directly converted to dB without any normalization. Thus no information is lost, but differences in recording levels, distance from the microphone, and other aspects of sound acquisition will also affect these data. However, keeping nonnormalized data allows the Phase II system to consider the recording conditions when making its decisions.
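As an illustration, the band amplitudes could be computed as below (Python/NumPy sketch; band edges follow Table 2, and the absence of normalization is deliberate):

```python
import numpy as np

FS, FRAME_LEN = 48000, 1200
BANDS_HZ = [(0, 500), (500, 1000), (1000, 2500), (2500, 5000),
            (5000, 10000), (500, 10000), (0, FS / 2)]  # six subbands + overall

def band_amplitudes_db(frame):
    """Per-band amplitudes in dB from the Hamming-windowed 1,200-point DFT,
    with no level normalization (recording level therefore affects the values)."""
    power = np.abs(np.fft.rfft(frame * np.hamming(FRAME_LEN))) ** 2
    freqs = np.fft.rfftfreq(FRAME_LEN, d=1.0 / FS)
    return [10.0 * np.log10(power[(freqs >= lo) & (freqs < hi)].sum() + 1e-12)
            for lo, hi in BANDS_HZ]
```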

As has been stated, the determination of formants is a very difficult problem for analyzing infant cries because of the high pitch of the cries and thus the sparse harmonics. Formant positions can be estimated, but their precise central values, if somewhat distant from a pitch harmonic, may be hard to obtain. To estimate formants as accurately as possible, we use the interpolated 4,096-point DFT data. After obtaining the log-magnitude spectral data, we apply a low-pass lifter to the data, whose parameters depend upon the pitch value. Then substantial peaks in the smoothed data are taken for the formant positions, and the heights of the peaks are taken for the magnitudes. Figure 3 shows a typical voiced frame. In Panel (a) the smoothed spectrum is shown, whereas in Panel (b) the unsmoothed spectrum is given. The formant positions and their magnitudes take up the last six positions in each record. One should note that the third formant is more arbitrary than the first two and for this reason has really not been used yet in our follow-up work.
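A liftering sketch in the same spirit is shown below (Python/NumPy; the F0-dependent cutoff of half a pitch period is an assumption chosen so the lifter removes the harmonic ripple while keeping the spectral envelope, since the actual lifter parameters are not given):

```python
import numpy as np

def formants_from_spectrum(logmag, f0_hz, fs=48000, nfft=4096, n_formants=3):
    """Low-pass lifter the interpolated log spectrum, then take its largest
    peaks as formant candidates: returns [(frequency Hz, log magnitude), ...]."""
    full = np.concatenate([logmag, logmag[-2:0:-1]])   # rebuild the two-sided spectrum
    cep = np.fft.ifft(full).real
    cutoff = max(1, int(fs / (2.0 * f0_hz)))           # assumed F0-dependent lifter cutoff
    lifter = np.zeros(len(cep))
    lifter[:cutoff], lifter[-cutoff:] = 1.0, 1.0       # keep low (symmetric) quefrencies
    smooth = np.fft.fft(cep * lifter).real[:len(logmag)]
    # Local maxima of the smoothed spectrum, largest first.
    idx = np.where((smooth[1:-1] > smooth[:-2]) & (smooth[1:-1] > smooth[2:]))[0] + 1
    idx = np.sort(idx[np.argsort(smooth[idx])[::-1]][:n_formants])
    return [(k * fs / nfft, smooth[k]) for k in idx]
```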

Phase II

Because this article is meant to describe the Phase I

part of the analyzer and validate this first extraction of infant cry data, the Phase II analyzer is only described somewhat briefly. Phase II output starts with two header records, the

    Figure 2. An example of the spectrum of a hyper-pitch-excited frame and the cues from the peaks.


first being the same one as the Phase I header with the first field changed to read "Phase II." The second contains the 81 Phase II column headings. (Specific definitions of the fields are given in the supplementary material at www.lems.brown.edu/array/download.htm.) The first step in the Phase II processing utilizes the recording quality classification that is contained in the header information from the Phase I prescan. When running Phase II, the user defines which quality classes should be used, and Phase II processing is then performed only on recordings with quality classifications that have been entered by the user. The Phase II data output consists of records, each of which describes a sound segment, where a sound segment is a group of consecutive frames that are similar. The Phase II analyzer takes in the Phase I data and produces an output .csv file with sound segment records of size 81 and an average rate of about 3 sound segments per second. Thus the data rate, using 32-bit numbers, is reduced by a factor of about 7, to 7,776 bits per second. In Phase II, the user makes decisions, the most fundamental of which have to do with the partitioning into these utterances.

The output contains one 81-element record for each of the three sound segment types that were defined previously: long utterance, short utterance, and silence. (The specific field definitions are available in the supplementary material; see above.) All 81 fields are filled for long utterances, and appropriate fields are filled for the other types. The 81 fields quantify file ID and five various classifier outputs, eight timing parameters, six F0 parameters, five hyper-pitch parameters, 13 formant parameters, 15 parameters from fitting a polynomial to the pitch contour, and 28 parameters for amplitudes

from several octave frequency bands. The segmentation is obtained by K-means clustering the 500-Hz to 10-kHz amplitude (dB) data into three classes in a prescan of the whole recording and using the results to classify each frame as one of three classes: 1 = low energy, 2 = transition energy, and 3 = high energy. The important long utterances consist of a contiguous sequence of frames that each has a 500-Hz to 10-kHz amplitude (dB) classified as in the high-energy cluster with a high F0 confidence. Using these frame labels, the change in energy to help with the boundaries, and some extension rules, the partitioning is determined. If a contiguous sequence of high-energy frames is longer than 0.5 s (40 frames), a long utterance is created. If only the length criterion is not met, then that sequence is classified as a short utterance, and if the sequence is of low energy, then the sequence is called a silence. The operational definition of a long utterance is consistent with prior research on infant crying (LaGasse et al., 2005) and allows for analyses of utterances produced in different types of cries (e.g., initial utterances of pain-induced cries can be expected to be longer than 0.5 s, but cry utterances produced in different contexts may be shorter). In our work with sound files of adequate quality, there has been virtually no mislabeling of low-energy cry information as silence.
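The segmentation logic can be sketched as follows (Python/NumPy; the tiny 1-D K-means and run-length rules are one reading of the description, omitting the F0-confidence and boundary-extension refinements):

```python
import numpy as np

def energy_classes(amp_db, n_iter=20):
    """K-means (k=3) on per-frame 500-Hz to 10-kHz amplitudes (1-D NumPy array);
    returns labels 1 = low, 2 = transition, 3 = high energy."""
    centers = np.array([amp_db.min(), np.median(amp_db), amp_db.max()])
    for _ in range(n_iter):
        labels = np.argmin(np.abs(amp_db[:, None] - centers[None, :]), axis=1)
        for k in range(3):
            if np.any(labels == k):
                centers[k] = amp_db[labels == k].mean()
    ranks = np.argsort(np.argsort(centers))     # relabel so class 0 is lowest energy
    return ranks[labels] + 1

def segment(labels, min_frames=40):             # 40 frames = 0.5 s at 12.5-ms advance
    """Split the frame labels into runs and name each run."""
    segments, i = [], 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        if labels[i] == 3:
            kind = 'long utterance' if j - i > min_frames else 'short utterance'
        else:
            kind = 'silence' if labels[i] == 1 else 'transition'
        segments.append((i, j, kind))
        i = j
    return segments
```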

Many infant cries are very intensive, with a large amount of frication in the high-energy long utterances. This can be found in our system by seeing if there is very high energy for a frame but low F0 confidence; the extra frication-sounding energy for this frame tends to mask the cepstral detector. We call this phenomenon voiced frication and extract pertinent information about it for the Phase II

Figure 3. An example of the smoothed function for the determination of formant positions; there is strong influence from the harmonics of F0.


output. Also, many infants exhibit a short air-intake noise (an audible inspiration that typically follows a long cry and/or one produced by an infant under duress) immediately after a long utterance. If sufficiently close (in time) to the end of a long utterance, this period is included in the long utterance but specifically noted as a classifier for the long utterance. An audible inspiration of this type is likely to be perceived as a part of the cry utterance. The use of this classifier retains the full length of the utterance while also allowing the user to examine utterances with this classifier separately. Although the third formant is very suspect, it has been included. Because the contours of the F0 data within an utterance are important, we approximate these contours by a polynomial fit. Using an information-theoretic criterion, we estimate the best order to use for this model. This number is often large, approaching 20 or more. We then restrict the fit to be of order five or fewer, and the best fit is often of the third or fourth order. All the polynomial fitting is done on the F0 data. The class field is a number (1 to 10) descriptor of the shape of the fit: rising, falling, flat, double peak, and so forth. The final 28 fields contain information on the amplitudes. Again, these values have not been normalized in any way. Each of the sound-segment-level statistics has been calculated by going back to the power domain, accumulating properly over the frames of an utterance, and then transforming back to dB.
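The polynomial contour fit just described can be illustrated as follows (Python/NumPy; the exact information-theoretic order criterion is not spelled out above, so an AIC-style penalty stands in for it here as an assumption):

```python
import numpy as np

def fit_f0_contour(f0, max_order=5):
    """Fit polynomials of order 1..max_order to an utterance's F0 track and keep
    the order minimizing an AIC-like criterion. Returns (order, coefficients)."""
    t = np.linspace(0.0, 1.0, len(f0))       # normalized time within the utterance
    best = None
    for order in range(1, max_order + 1):
        coeffs = np.polyfit(t, f0, order)
        rss = float(np.sum((f0 - np.polyval(coeffs, t)) ** 2)) + 1e-12
        aic = len(f0) * np.log(rss / len(f0)) + 2 * (order + 1)  # fit vs. complexity
        if best is None or aic < best[0]:
            best = (aic, order, coeffs)
    return best[1], best[2]
```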

Validation of Pitch-Estimation and Frame Voicing-Decision Algorithms

Interpreted results from older analysis systems most often indicate that timing (lengths and spacing of utterances), F0, and voicing are highly informative features of infant cry production. Moreover, other features of infant cry, such as the contours of F0 across utterances, are dependent on the accuracy of F0 estimation. Therefore, an experiment was conducted to evaluate the performance of the voicing-detection and pitch-estimation algorithms. We identified cry recordings recorded previously in an ongoing longitudinal study (Lester et al., 2002). Cries were elicited and recorded using procedures approved by the hospital institutional review board (IRB). The IRB also approved access to these archival recordings for the purpose of the analyses reported in this paper. Recordings were made of cries elicited by standard methods (LaGasse et al., 2005) from typically developing infants at 1 month of age. Cries were elicited by a specially designed device that applied a painful stimulus (analogous to a rubber band snap) to the sole of the right foot while babies lay supine in a stroller with a unidirectional microphone suspended at a standardized distance above the baby (5 inches). Cry samples were selected from an existing longitudinal data set. A total of 15 cries from 15 individual babies were evaluated, each sample containing between 36 and 42 s of cry data. We coded and analyzed only cries characterized by intense, loud, rhythmic, and sustained vocalizations that were differentiated from brief cries and fusses characteristic of lower states of arousal.

These cries were selected on the basis of the infants being the products of full-term normal pregnancies and receiving scores within normal limits on later assessments of developmental functioning (e.g., Bayley Scales of Infant and Toddler Development [Bayley, 2005] at 24 months of age). Recordings were made in a quiet and controlled setting at a hospital-based developmental assessment center, and thus the recording quality was high and background noise was minimal. Recordings were sampled at 48 kHz with the Olympus direct PCM recorder described above.

Establishing ground truth. Ground truth was established for both the presence of voicing and the corresponding F0 by hand-labeling each cry. Pitch-frequency labels were obtained by hand-marking pitch-period intervals from the time-domain plot of the cry waveform. For this purpose we utilized a software program developed in our lab that conveniently displays both time and frequency plots from .wav files (Silverman, 2011). All labels were affixed by a single person trained to affix time markers at the high-energy peaks that generally allow the denotation of a pitch frequency. Pitch-period labels were affixed for regions of each cry recording determined to be clearly voiced.

The intervals of voicing were also hand labeled using a spectrogram plot, as shown in Figure 4. Intervals were first marked at the frame level, indicating that the region about that particular 12.5-ms frame advance was voiced. Then, the regions indicated by the labels on the frames as voiced were fine-tuned to indicate specific interval types at the resolution of the sampling time by viewing the corresponding time-domain plot. Five different interval types were defined: voiced (V), unvoiced (UV), silence (S), voiced frication (VF), and high voicing (HV). An interval was labeled as V if the spectrogram showed a well-defined harmonic structure, indicating periodicity. An interval was labeled as UV if the spectrogram showed significant energy without the presence of harmonics. S intervals showed very low signal energy. The VF label was assigned when an interval exhibited a harmonic structure in addition to turbulent (frication) noise at nonharmonic frequencies. VFs were given a separate label because it is unclear whether such frames should be labeled as V or UV. Finally, the HV label was assigned to intervals with a very sparse harmonic structure, indicating a very high fundamental frequency (greater than 1 kHz), which we have called hyper-pitch-excited frames.

Table 3 shows the number of frames in the data set corresponding to each of the five voicing classes. The infant cries in this data set consisted mainly of voiced speech. Examples of the HV and UV classes occurred quite infrequently.

The labeling was conducted by a research assistant who was first trained to understand the kinds of patterns that should be labeled and then trained to a criterion level of accuracy by the first author. Once the labeler's accuracy was confirmed on a series of training samples, she then hand coded the cry samples as described above. It was these hand-coded cry samples that were used as the gold standard or ground truth for subsequent analyses of the accuracy of the automated system. Each frame required the careful labeling of 4 to 15 (or more if hyper-pitch) F0 onsets, and some 2,915 frames were hand labeled. To cross-validate the hand-labeled ground truth, the author (X. L.) used the same criterion to


hand label a little less than 10% of the frames (256). The receiver operating characteristic (ROC) curve and an expansion of the knee part of the curve are shown in Figure 5. It may be seen in this figure that about 92% of the ground-truth data agree with the data labeled by X. L. within a 2-Hz tolerance and that there is 98% agreement within a 5-Hz tolerance. We are thus quite confident in our ground-truth data.

Fundamental frequency. The results presented here demonstrate the accuracy of the F0 estimation algorithm. The ground-truth labels were placed at sample indices of consistent peaks bracketing each pitch period during clearly voiced cries. There are clearly multiple pitch periods in each voiced frame. The sample indices were compared with the frame boundaries used by the analysis system to find all frames that were 100% covered by the pitch-period labels. The subset of frames for which hand-marked pitch-period labels were available is represented as v0. The same set of cry recordings were processed by the analysis system, which output the set of estimated voiced frames v. The following analysis was carried out on v ∩ v0, the set of all frames for which the automatic voicing labels, v, and the ground-truth voicing labels, v0, agreed. The set v ∩ v0 contained a total of 2,915 voiced frames.

For each voiced frame in v ∩ v0, the magnitude of the error between the estimated pitch frequency, f, and the ground-truth pitch frequency, F0, was computed. The pitch frequency estimate was considered to be correct if |f - F0| ≤ T for some tolerance T in Hz. One should note that the quantization tolerance in the cepstral domain varies from about 1 Hz at F0 = 200 Hz to about 5 Hz at F0 = 1 kHz. Figure 6 shows the percentage of frames with correct pitch-frequency estimates corresponding to each pitch-frequency tolerance, T. Several operating points are also shown in Table 4. As can be seen, the automated F0 detection had an accuracy of about 90% at a tolerance of 10 Hz and nearly 95% at a tolerance of 20 Hz. We did not see evidence for any systematic disagreement between the hand-coded and automated F0 detection.
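The tolerance sweep reported in Figure 6 and Table 4 amounts to the following computation (Python/NumPy sketch with illustrative names):

```python
import numpy as np

def pct_correct(f_est, f_true, tolerances=(10, 20, 30, 40, 50)):
    """Percentage of voiced frames with |f - F0| <= T, for each tolerance T in Hz."""
    err = np.abs(np.asarray(f_est) - np.asarray(f_true))
    return {T: 100.0 * float(np.mean(err <= T)) for T in tolerances}
```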

Voicing. A separate analysis was carried out to evaluate the voicing-detection capabilities of the system. This analysis was formulated as a simple two-category classification problem, and Figures 7 and 8 give standard ROC curves showing the results. The two figures differ in that Figure 7 includes S frames, whereas Figure 8 does not.

Figures 7 and 8 show that the system is very effective in distinguishing V frames from UV and S frames. As expected, the system achieves much higher error rates when attempting to detect VF, which by definition is a mixture of voicing and turbulent signals. The HV frames were also more difficult to detect, although they occurred infrequently in this data set. Figures 7 and 8 include area under the curve (Az) values demonstrating accurate detection of V sound segments. Az values ranged from .907 to .997 for the analysis that included frames with S and .883 to .995 for frames that did not include S.

Conclusions

We have presented the details of a modern infant cry

analyzer that can be run in near real time on a normal PC platform or could be run in real time on many of today's embedded processors. The design is the result of 2 years of collaborative effort between hospital-based and engineering-based faculty at Brown University. The intent of this collaboration was to produce a system that would have utility for both basic and applied research on infant cry

Figure 4. Ground truth for voicing type was established by hand labeling a spectrogram plot. Intervals were labeled as voiced (V), unvoiced (UV), silence (S), voiced frication (VF), or high voicing (HV).

Table 3. Number of frames in the data set, labeled with each of the five voicing classes.

Voicing class          # of frames
Voiced (V)             27,745
High voiced (HV)       92
Unvoiced (UV)          3,155
Voiced frication (VF)  560
Silence (S)            13,638


production. This system extends and builds upon recent approaches to quantifying acoustic features of infant cry (e.g., Branco et al., 2007; LaGasse et al., 2005; Lester et al., 1991; Manfredi et al., 2009; Várallyay et al., 2004). This automated system is described in detail in order to provide the reader and potential users with a clear understanding of the approach that we used to develop this system. In addition, and quite uniquely, we conducted stringent tests of the accuracy of this automated system as compared to hand-labeled cry spectrograms.

The analysis system has two levels of output. Phase I segments the sound into analysis frames with an advance of 12.5 ms. Each frame is summarized by the system for features that include timing, voicing, F0, amplitude, and formant information. Phase II operates on the Phase I data, making decisions with regard to classifying portions of the sample as cry utterances or silence, which could be a portion of the recording prior to cry onset or could represent time periods between cry utterances. This timing information allows researchers to utilize measures such as latency to cry, which

is of interest for researchers utilizing standard methods to elicit infant cries (LaGasse et al., 2005), and interutterance intervals, which may be useful for classifying different types of infant cries (e.g., pain vs. nonpain cries). In addition to this timing information, the Phase II output yields summary descriptors of cry utterances, including measures of F0, amplitude of cry in various frequency bands, and estimates of formant location. This Phase II output also yields measures of the voiced proportion of each cry utterance. A unique aspect of this output is that it includes a confidence estimate

Figure 5. Panel (a): Receiver operating characteristic (ROC) curve showing agreement between ground-truth hand labeling and author X. L. hand labeling of about 10% of the data used in the validation. Panel (b): Expanded graph of the dotted area of Panel (a).

Figure 6. Percentage of the 2,915 voiced frames with correct pitch-frequency estimates (|f - F0| ≤ T) for several error tolerances (T in Hz).

Table 4. Percentage of the 2,915 voiced frames with correct pitch-frequency estimates (|f - F0| ≤ T) for several error tolerances (T in Hz).

T (Hz)  % Correct frames
10      88.44
20      94.17
30      95.33
40      96.12
50      96.43


for the voicing decision. This is based on an SNR analysis and allows the researcher both full information on how the voicing decision was made and the ability to modify this decision, should the research question call for a more or less stringent definition of voicing.

An additional unique feature of the Phase II output is an automated approach to describing F0 contours across a cry utterance. Some past research has made use of this variation in F0 across utterances to describe melodic aspects of cries, but it has accomplished this task by hand classification of F0 contours from spectrograms (Mampe, Friederici, Christophe, & Wermke, 2009; Wermke, Mende, Manfredi, & Bruscaglioni, 2002). The system described here utilizes a polynomial fit method to classify F0 contours. Initially, the system classifies these contours into one of 10 categories. This output may be used to identify cry utterances with more or less prototypical contours, to characterize the complexity of such F0 variation, or to explore differences in F0 contours related to development or population differences. The validity of an automated acoustic analysis is dependent on its performance accuracy. Therefore, we conducted a substantial experiment that indicates the accuracy of both the voicing and the fundamental frequency detectors. The features that were selected, F0 and voicing, are the ones that have proven to be most discriminating of clinical populations in past literature.

As depicted in Figure 6, about 90% of the automatic estimates were within an F0 tolerance of 10 Hz. The best the estimator does is 96.4% when the tolerance is opened up a bit to 50 Hz. Virtually all errors occur at the boundaries of voiced utterances. The equal-error rate for voiced (vs. unvoiced or silence) frame detection is nearly 99%. The much more difficult to detect hyper-pitch frames are identified with an equal-error rate of about 80%. Past research utilizing automated analyses of infant cry has generally not reported this type of performance analysis. Furthermore, other computer-assisted methods have utilized analyzers designed for adult speech. Validation of a system specifically designed to summarize the acoustic features of infant cry is therefore an advance in the field and a unique strength of the study reported here. The results of our experiments revealed high accuracy of the automatic detectors of F0 and voicing decisions in comparison to gold-standard hand coding from spectrogram displays.

Our careful experiment demonstrates that the analysis system yields an excellent reduced data representation of the desired acoustic features of babies' cries. However, there are some areas of analysis that are a significant challenge for infant cry analysis. In particular, the accurate automatic detection of formants is quite difficult given the high pitch and wide harmonic structure of infant cry (Robb & Cacace, 1995). In adult speech, the shape of the vocal tract determines the resonant frequencies, which are described as formants. For our purposes, we applied a low-pass lifter to the data in order to assist in estimating the location and magnitude of formants in the infant cry. We have described this approach, but we acknowledge that the problem of both the measurement and interpretation of formants in infant cry remains to be fully resolved. An additional challenge is to reliably determine voicing in conditions that we refer to as voiced frication or high voicing portions of a cry utterance. These issues are a reflection of some of the conceptual and methodological challenges to infant cry analysis more generally. Thus, these issues notwithstanding, our interpretation is that this study reports a level of performance accuracy that is quite sufficient

Figure 7. ROC curves giving the voicing-detection performance of the system. V, VF, and HV frames were separately considered to be positives. In each case, both UV and S frames were considered to be negatives. Area under the curve (Az) values were as follows: V/(UV,S) = .997; HV/(UV,S) = .907; VF/(UV,S) = .995.

Figure 8. ROC curves giving the voicing-detection performance of the system. V, VF, and HV frames were separately considered to be positives. Only UV frames were considered to be negatives. Az values were: V/(UV) = .995; HV/(UV) = .883; VF/(UV) = .934.


for both human and automatic interpretation of cries for various phenomena. For example, the automated nature of this analysis system makes possible rapid analysis for large data sets and thus studies of substantial numbers of subjects, allowing for more powerful studies of differences in infant cry associated with various medical or developmental conditions or populations. A highly unique aspect of this system is that it allows researchers to summarize broad characteristics of cry utterances using the Phase II output while also preserving detailed microanalytic data in the Phase I output that would allow for precise characterization of within-utterance variations in cry production.

A number of future directions for this research are possible: It can be applied to questions pertaining to possible individual or group differences in cry production that may help to screen for infants at risk for various developmental disorders, or it may find use in medical applications, such as identifying infants at risk for poor developmental outcomes. Thus, a validated cry analyzer will be useful for continued research on developmental outcomes in at-risk infants, including investigations of neurobehavioral outcomes associated with prenatal environmental risk factors. Moreover, the complex nature of infant cry acoustics has the potential to yield feature patterns that can be used to identify infants at elevated risk for poor developmental outcomes or specific developmental disorders such as autism spectrum disorders. More basic research may also utilize this system in order to study normative aspects of infant cry production with larger samples than has been possible in the past. To this end, we intend to make the MATLAB version available online at www.brown.edu/Departments/Children_at_Risk, so that the general community will have access to a high-quality infant cry analyzer. Future efforts will involve refining the analysis and output of this system, as well as developing a more user-friendly interface to enhance its accessibility for a variety of researchers.

Acknowledgments

We thank Abbie Popa for her assistance on this project. This

study was supported by National Institute on Deafness and Other Communication Disorders Grant 5-R21-DC010925.

References

Ananthapadmanabha, T. V., & Yegnanarayana, B. (1975). Epoch extraction of voiced speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23, 562-569.

Baronin, S. P., & Kushtuev, A. I. (1971). How to design an adaptive pitch determination algorithm (in Russian). Proceedings of the 7th All-Union Acoustical Conference, 1-8.

Bayley, N. (2005). Bayley Scales of Infant Development, Third Edition. San Antonio, TX: PsychCorp.

Bogert, B. P., Healy, M. J. R., & Tukey, J. W. (1963). The quefrency analysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking. In M. Rosenblatt (Ed.), Proceedings of the Symposium on Time Series Analysis (pp. 209-243). New York, NY: Wiley.

Branco, A., Fekete, S. M. W., Rugolo, L. M. S. S., & Rehder, M. I. (2007). The newborn pain cry: Descriptive acoustic spectrographic analysis. International Journal of Pediatric Otorhinolaryngology, 71, 539-546. doi:10.1016/j.ijporl.2006.11.009

Corwin, M. J., Lester, B. M., Sepkoski, C., McLaughlin, S., Kayne, H., & Golub, H. L. (1992). Effects of in utero cocaine exposure on newborn acoustical cry characteristics. Pediatrics, 89, 1199-1203.

Dubnowski, J. J., Schafer, R. W., & Rabiner, L. R. (1976). Real-time digital hardware pitch detector. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24, 2-8.

Esposito, G., & Venuti, P. (2010). Developmental changes in the fundamental frequency (f0) of infant cries: A study of children with autism spectrum disorder. Early Child Development and Care, 180, 1093-1102. doi:10.1080/03004430902775633

Gill, J. S. (1959). Automatic extraction of the excitation function of speech with particular reference to the use of correlation methods. ICA, 3, 217-220.

Goberman, A. M., & Robb, M. P. (1999). Acoustic examination of preterm and full-term infant cries: The long-time average spectrum. Journal of Speech, Language, and Hearing Research, 42, 850-861.

Golub, H. L. (1989). Infant crying: Theoretical and research perspectives. In B. M. Lester & C. Boukydis (Eds.), A physioacoustic model of the infant cry (pp. 59-82). New York, NY: Plenum Press.

Grau, S. M., Robb, M. P., & Cacace, A. T. (1995). Acoustic correlates of inspiratory phonation during infant cry. Journal of Speech and Hearing Research, 38, 373-381.

Karelitz, S., & Fisichelli, V. R. (1962). The cry thresholds of normal infants and those with brain damage: An aid in the early diagnosis of severe brain damage. Journal of Pediatrics, 61, 679-685.

LaGasse, L. L., Neal, A. R., & Lester, B. M. (2005). Assessment of infant cry: Acoustic cry analysis and parental perception. Mental Retardation and Developmental Disabilities Research Reviews, 11, 83-93. doi:10.1002/mrdd.20050

Lester, B. M., Corwin, M. J., Sepkoski, C., Seifer, R., Peucker, M., McLaughlin, S., & Golub, H. L. (1991). Neurobehavioral syndromes in cocaine-exposed newborn infants. Child Development, 62, 694-705.

Lester, B. M., Tronick, E. Z., LaGasse, L., Seifer, R., Bauer, C. R., Shankaran, S., . . . Maza, P. L. (2002). The maternal lifestyle study: Effects of substance exposure during pregnancy on neurodevelopmental outcome in 1-month-old infants. Pediatrics, 110, 1182-1192. doi:10.1542/peds.110.6.1182

Lind, J., Vuorenkoski, V., Rosberg, G., Partanen, T. J., & Wasz-Höckert, O. (1970). Spectrographic analysis of vocal response to pain stimuli in infants with Down's syndrome. Developmental Medicine and Child Neurology, 12, 478-486.

Mampe, B., Friederici, A. D., Christophe, A., & Wermke, K. (2009). Newborns' cry melody is shaped by their native language. Current Biology, 19, 1994-1997. doi:10.1016/j.cub.2009.09.064

Manfredi, C., Bocchi, L., Orlandi, S., Spaccaterra, L., & Donzelli, G. P. (2009). High-resolution cry analysis in preterm newborn infants. Medical Engineering & Physics, 31, 528-532. doi:10.1016/j.medengphy.2008.10.003

Manfredi, C., Tocchioni, V., & Bocchi, L. (2006). A robust tool for newborn infant cry analysis. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1, 509-512. doi:10.1109/IEMBS.2006.259802

Martin, P. (1981). Détection de la F0 par intercorrélation avec une fonction peigne [Detection of F0 by cross-correlation with a comb function]. Journées d'étude sur la Parole, 12, 221-232.

Newman, J. D. (1988). Investigating the physiological control of mammalian vocalizations. In J. D. Newman (Ed.), The physiological control of mammalian vocalizations (pp. 1-5). New York, NY: Plenum Press.

Noll, A. M. (1967). Cepstrum pitch determination. The Journal of the Acoustical Society of America, 40, 1241.

Prechtl, H. F., Theorell, K., Gramsbergen, A., & Lind, J. F. (1969). A statistical analysis of cry patterns in normal and abnormal newborn infants. Developmental Medicine and Child Neurology, 11, 142-152.

Rabiner, L. R. (1977). On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25, 24-33.

Rader, C. M. (1964). Vector pitch detection. The Journal of the Acoustical Society of America, 36, 1463.

Robb, M. P., & Cacace, A. T. (1995). Estimation of formant frequencies in infant cry. International Journal of Pediatric Otorhinolaryngology, 32, 57-67. doi:10.1016/0165-5876(94)01112-b

Secrest, B., & Doddington, G. (1982). Postprocessing techniques for voice pitch trackers. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 7, 172-175.

Sheinkopf, S. J., Iverson, J. M., Rinaldi, M. L., & Lester, B. M. (2012). Atypical cry acoustics in 6-month-old infants at risk for autism spectrum disorder. Autism Research, 5, 331-339.

Silverman, K. (2011). Hmaview: A program for viewing and labeling array time-domain and frequency-domain audio-range signals. Retrieved from www.lems.brown.edu/array/download.html

Smith, C. P. (1954). Device for extracting the excitation function from speech signals. U.S. Patent No. 2,691,137 (issued October 5, 1954; filed June 27, 1952; reissued 1956).

Smith, C. P. (1957). Speech data reduction: Voice communications by means of binary signals at rates under 1000 bits/sec (AFCRC No. DDC-AD-117920).

Stone, R. B., & White, G. M. (1963). Digital correlator detects voice fundamental. Electronics, 36, 26-30.

Várallyay, G., Benyó, Z., Illényi, A., Farkas, Z., & Kovács, L. (2004). Acoustic analysis of the infant cry: Classical and new methods. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1, 313-316.

Vuorenkoski, V., Lind, J., Partanen, T. J., Lejeune, J., Lafourcade, J., & Wasz-Höckert, O. (1966). Spectrographic analysis of cries from children with maladie du cri du chat. Annales Paediatriae Fenniae, 12, 174-180.

Wermke, K., Leising, D., & Stellzig-Eisenhauer, A. (2007). Relation of melody complexity in infant cries to language outcome in the second year of life: A longitudinal study. Clinical Linguistics & Phonetics, 21, 961-973. doi:10.1080/02699200701659243

Wermke, K., Mende, W., Manfredi, C., & Bruscaglioni, P. (2002). Developmental aspects of infants' cry melody and formants. Medical Engineering & Physics, 24(7-8), 501-514. doi:10.1016/S1350-4533(02)00061-9

Wermke, K., & Robb, M. P. (2010). Fundamental frequency of neonatal crying: Does body size matter? Journal of Voice, 24, 388-394. doi:10.1016/j.jvoice.2008.11.002

Zeskind, P. S., & Barr, R. G. (1997). Acoustic characteristics of naturally occurring cries of infants with colic. Child Development, 68, 394-403.

