051013 Speaker Identification-ocr

  • 7/30/2019 051013 Speaker Identification-ocr

    1/17

    Attempted Speaker Identification

    Florida vs. Zimmerman

    Report to:

    Richard W. Mantei

    Assistant State Attorney

    Fourth Judicial Circuit of Florida

    220 E. Bay St., Jacksonville, FL 32202

    March 20, 2013

    Report Prepared by:

    Harry Hollien PhD

    James D. Harnsberger PhD

    Senior Consultants

    Forensic Communication Associates

    REPORT ON ATTEMPTED SPEAKER IDENTIFICATION

    Florida vs. Zimmerman

    INTRODUCTION

    Personnel at Forensic Communication Associates (FCA) were contacted by Mr. Richard
    Mantei, Assistant State Attorney, Fourth Judicial Circuit of Florida, Jacksonville, regarding
    recordings associated with the above-cited case. It was requested that an attempt be made to
    determine whether the male voice found on a 9-11 recording (i.e., the unknown voice) was the
    same as the one recorded on an exemplar CD (the known voice). The person speaking on the
    exemplar was Mr. George Zimmerman. Later, exemplar recordings were requested for Mr.
    Trayvon Martin; the speech on them was also to be compared to the 9-11 utterances.

    MATERIALS RECEIVED

    Two CD recordings were received at FCA. One was of the relevant 9-11 call. It was
    labeled 911 witness call, an address and [illegible]. The second CD contained the voice of
    George Zimmerman. It was dated 2/26/12, dispatch call [illegible]. Both were labeled
    with an FCA number, and digital copies were made on laboratory equipment and a computer.

    FCA personnel then requested additional voice samples of both G. Zimmerman and T.
    Martin. At various later dates, three Zimmerman CDs were received (they were jail calls,
    4/20/12; video interview, 2/27/12; and reenactment audio, 3/22/12). Finally, two DVDs
    taken from Trayvon Martin's phone (only markings) were received at a still later date.
    Identifying marks were placed on these recordings also, and digital copies were made of
    them (i.e., via computer input). The undersigned and a senior technician listened to them in
    their entirety several times. Analysis CDs were constructed (of evidence-exemplar sets and
    pairs) and the samples they contained were processed by means of several aural-perceptual
    speaker identification techniques (see below and Hollien, H., Acoustics of Crime, Plenum,
    1990; Hollien and Hollien, Improving Aural-perceptual Speaker Identification Techniques,
    Studies in Forensic Phonetics, 1995, Wissenschaftlicher, Trier; and Hollien, H., Forensic
    Voice Identification, Academic Press, 2001).

    THE RECORDINGS

    The Evidence Recording.

    As expected, the samples on the 9-11 evidence recording were not at all suitable for
    ordinary speaker identification analyses. First, they were mostly short grunts, calls or cries; a
    few gave the illusion of speech, mostly "help" or "help me." Second, with only two exceptions,
    they were rather faint. Third, since they were recorded at a 9-11 center, other voices were heard
    (they were much louder, of course). In many instances, these voices obliterated and/or
    overlapped those in the background. Fourth, 16 utterances in all could be identified. However,
    only six were found to be potentially useful, and some of their extent was lost when they were
    extracted. Taken as a whole, only a little over 8 sec. of speech was found to be available for
    assessment.

    Exemplars.

    On the other hand, the energy levels of the utterances on the exemplar recordings were
    sufficient and the overall quality of those produced was quite good in all instances. They were
    of the type suitable for speaker identification purposes; that is, they were intelligible and,
    although noise was intermittently present, it rarely masked the speech. The problem, of course,
    was that very few of the utterances they contained actually were suitable for comparison, as
    those involving very short samples produced under high stress were quite rare. Although a
    number of procedures were tried, ultimately the judgments made using the Zimmerman
    re-created cries as exemplars (i.e., when compared to the six 9-11 samples) were the most useful.
    For Martin, several of his high-frequency laughs, exclamations and mocking utterances were
    employed.

    Selection of the Unknown Samples.

    As stated, the major problem was that very little speech/voice material was available for
    processing; a second problem was that the samples were all calls or cries; the third was that
    they were very faint; and the fourth, that most were at least partly masked by the speech of
    other talkers. In all, 16 short calls/cries were identified and very little intelligible speech was
    available: i.e., only one or two instances of "help" or "help me." Of these 16 samples, only six
    provided at least 500 ms of a clear call and, even in these instances, part of the total call had
    to be removed. As stated, just a little more than 8 sec. of phonation was available. Samples this
    brief rarely lead to attempts at speaker identification. Ordinarily, 10 words or 10 seconds of
    speech constitute a bare minimum. However, they were the only unknown samples available and
    the task involved making a determination between but two speakers (i.e., G. Zimmerman and T.
    Martin).

    Preparation of the Recordings.

    Samples of the "unknown" (U) and the "known" (K) speakers were prepared for the
    aural-perceptual comparisons. These procedures included the selection of three sets of samples
    for three different analyses. The first set was mostly for familiarization purposes. It involved a
    compilation of all the calls/cries onto a single recording. It was compared (serially) to a group of
    short speech passages drawn from the several interviews/calls made by Mr. Zimmerman (K).
    Later on, this procedure was independently applied to samples from Mr. Martin's telephone
    calls (K).

    The second procedure was to create six separate CDs, each with a different call or cry
    from the 9-11 recordings, and individually compare them to a variety of short speech samples
    from the K recordings (first to Zimmerman samples, then separately to those by Martin). The
    final procedure (i.e., the third one) was the most important. These same cries/calls were
    individually compared to the cries/calls from the reenactment recording. In Mr. Martin's case,
    samples of laughter, mocking, and high-pitched exclamations were employed. As would be
    expected, all samples were of the best available quality, with noise at a minimum. Both the
    known and the unknown samples were band-pass filtered with both the high-pass and low-pass
    filtering cutoffs set outside the speech range. This procedure was carried out in order to 1)
    minimize any situational differences, 2) reduce distractions and 3) eliminate some of the
    non-speech artifacts present on the recordings. Thus, the highest quality samples possible were
    made available.
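
    The band-pass step described above can be illustrated with a minimal sketch. The report
    does not state the filter type, the cutoff frequencies, or the sampling rate, so the
    first-order filters, the 50 Hz and 10,000 Hz cutoffs, the 44,100 Hz rate, and all function
    names below are assumptions chosen only for illustration; they are not the laboratory's
    actual settings.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate):
    """First-order high-pass filter: attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def low_pass(samples, cutoff_hz, sample_rate):
    """First-order low-pass filter: attenuates content above cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, prev_y = [], 0.0
    for x in samples:
        prev_y = prev_y + alpha * (x - prev_y)
        out.append(prev_y)
    return out


def band_pass(samples, low_cutoff_hz=50.0, high_cutoff_hz=10000.0, sample_rate=44100):
    """Cascade of the two filters. Cutoffs are hypothetical values placed
    outside a nominal speech band, as an assumption for this sketch."""
    return low_pass(high_pass(samples, low_cutoff_hz, sample_rate),
                    high_cutoff_hz, sample_rate)


def tone(freq_hz, seconds=1.0, sample_rate=44100):
    """Unit-amplitude sine wave used as a synthetic test signal."""
    n = int(seconds * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)


# A 1 kHz tone (inside the pass band) survives almost unchanged, while a
# 5 Hz rumble (below the low cutoff) is strongly attenuated.
in_band = peak(band_pass(tone(1000.0))[22050:])
rumble = peak(band_pass(tone(5.0))[22050:])
```

    With these assumed settings, content well inside the pass band is left essentially intact
    while very low-frequency artifacts are reduced, which is the stated purpose of the step.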

    AURAL-PERCEPTUAL SPEAKER IDENTIFICATION

    The aural-perceptual speaker identification procedures employed are those where an
    unknown voice (U), drawn from an evidence recording, is compared to exemplars of a known
    voice (K). As stated above, samples of a number of U-K combinations were placed on a CD in
    pairs for direct and repeated comparison. The undersigned then carried out evaluations which
    were based on a number of heard parameters. In this instance, they only included comparisons
    of: 1) fundamental frequency, 2) voice quality, 3) vocal intensity (variability) patterns, 4)
    vowels, and 5) nasality. Subjective impressions also were logged for consideration. This entire
    procedure was completed, then repeated in its entirety some time later, usually the next day.

    In this case, a speaker identification procedure had to be employed in which an attempt
    was made to match - or not match - Mr. Zimmerman's vocalizations to the six usable cries
    found on the 9-11 call, using utterances (as similar as possible) from the exemplar recordings.
    As stated, the process was carried out as follows. The greatest extent of the cry possible was
    isolated; all extraneous noise was removed; the cry was repeated 8-10 times. A comparison
    recording of a number of Mr. Zimmerman's exemplar utterances was compared in turn - and
    individually - with each of the eight samples. Voice quality, pitch, vowel quality, nasality and
    intensity inflections were the primary judgmental features. Finally, the entire process was
    repeated using the mimicked cries by Mr. Zimmerman. It then was carried out twice for Mr.
    Martin: first with general speech samples, then with the available stress units.

    It also should be noted that, for this case, the usual procedure was further modified.
    Ordinarily, evaluating an identification parameter (pitch, say) was carried out by playing the
    pairs over and over until a decision could be made. Here, the single identification parameter
    remained the same but the specific cry (No. 8, say) was compared to a variety of exemplar
    samples (again, over and over) until the judgment was made. It then was repeated for the five
    other U utterances. Thus, the process here more closely parallels six separate speaker
    identifications with the product for each summed both by cry and feature.

    Please note that the two investigators worked independently and did not compare results
    until after all evaluations had been completed.

    As implied, the (individual) assessments ordinarily obtained are summarized on a
    continuum like the one found in Figure 1. In general, the range of scores making up each
    continuum can be divided roughly as follows: 1) any mean scores in the 0-3 range suggest that a
    match cannot be made and the samples were produced by two different individuals, 2) a score
    of 4-6 is generally neutral but somewhat on the positive side (i.e., toward a match) and 3) those
    that fall within the 7-10 range indicate a positive-to-strong match. It should be stressed again
    that the listed parameters were evaluated one at a time with the complete procedure
    independently replicated a number of times. This method of presentation was adapted for these
    evaluations (see Figures 2-5).
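
    The three score bands described above can be expressed as a short sketch. The report gives
    the bands only roughly (0-3, 4-6, 7-10) and does not say how fractional means such as 3.9 or
    6.9 are assigned, so the cut points at 4.0 and 7.0, the function name, and the band labels
    below are assumptions. They were chosen to agree with how such means are discussed in the
    Results (a mean of 6.9 is described there as very close to, but not reaching, a match).

```python
def interpret_mean_score(mean_score):
    """Map a mean judgment on the 0-10 continuum to one of three rough bands.

    The thresholds are assumptions for illustration: the report states the
    bands only as 0-3, 4-6 and 7-10.
    """
    if not 0.0 <= mean_score <= 10.0:
        raise ValueError("score must lie on the 0-10 continuum")
    if mean_score < 4.0:
        return "non-match"   # samples likely produced by different speakers
    if mean_score < 7.0:
        return "neutral"     # inconclusive, leaning somewhat toward a match
    return "match"           # positive-to-strong match


# Example means taken from Figure 3 (Zimmerman reenactment comparisons).
examples = {mean: interpret_mean_score(mean) for mean in (2.0, 4.8, 6.9)}
```

    Under these assumed cut points, the means of 2.0, 4.8 and 6.9 fall in the non-match, neutral
    and neutral bands respectively; none reaches the match band.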

    RESULTS

    The prepared samples were played (repeatedly) on high quality laboratory equipment.
    The findings and impressions of the undersigned resulted in differing conclusions depending on
    which of the U samples were compared to which of those uttered by the two known (K)
    speakers. As stated, a maximum of only five speech/voice parameters (plus an overall
    judgment) could be used to permit U-K judgments.

    The Bases of the Comparisons.

    1. Pitch. Perceived pitch is the psychophysical correlate of fundamental frequency
       usage. In this case, it refers to the level of those tones produced by the speaker. It
       proved to be one of the weaker contrasts in this evaluation.

    2. Voice Quality. This dimension is a little difficult to define but rather easy to
       demonstrate. Any hearing individual would have little difficulty differentiating a
       violin from a saxophone even though both were played at the same fundamental
       frequency and intensity. The relative differences among the partials (frequencies)
       within the complex musical sounds are what make this discrimination possible. The
       same is true for human speakers, and this characteristic proved to be a major factor
       for these assessments.

    3. Vocal Intensity Patterns. Absolute vocal intensity levels are very difficult to detect
       because even slight differences in the environmental situation, microphone position,
       talker distance, etc., can result in large differences in the absolute level of
       measured or perceived loudness. As with pitch, the intensity variability patterns
       proved to be one of the lesser identification features. Yet they aided in some
       judgments.

    4. Nasality. Detection of the amount of nasality in the cries and exemplar samples
       proved to be helpful.

    5. Vowels. In some cases, vowel formant comparisons of the calls with exemplar
       samples provided enough information to permit graded same-different judgments.

    6. Finally, each of our evaluators provided a general overall assessment of the U-K
       samples. In many cases, these efforts aided in the decision making.

    Specific Results.

    The first of the three sets of judgments (i.e., general speech) for Mr. Zimmerman was
    simply inconclusive and will not be included in the results. The second provided some insight,
    so its results are presented as Figure 2. The third (i.e., the U-K comparisons of the 9-11
    calls/cries vs. those from the reenactment) was the most important. Please see Figure 3. The two
    sets for Mr. Martin parallel those of Mr. Zimmerman to some extent - i.e., short speech
    samples and more relevant samples. They are presented as Figures 4 and 5. Over two
    thousand specific judgments were required to permit the following decisions to be made.

    As can be seen from consideration of the four figures, no robust matches were obtained.
    On the other hand, several rather strong tendencies were found. First, please note the following.
    Call No. 11 proved almost impossible to judge once the masking by other voices was trimmed
    from its borders. Accordingly, data from this sample will not be included on any figures.

    Data on Mr. Zimmerman. The scores for cries/calls Numbers 1 and 8 were so low (see
    Figures 2 and 3) that little-to-no evidence that Mr. Zimmerman made them appeared to exist.
    His scores for call No. 13 were rather mixed - and they were quite variable. Thus, even though
    their mean was close to 5.0, the judgment had to be that they were inconclusive. That is, while
    they graded above the 0-3 range, they still fell far short of a match. On the other hand, the data
    for cry No. 14 and (more so) for cry No. 16 proved to be more toward - but in most instances
    not quite reaching - a match. Indeed, as may be seen from the range data, several of the
    individual scores exceeded the border of the match category - and the mean for No. 16 (see
    Figure 3) also came very close to a match. In short, there is a very good possibility that, under
    normal circumstances, cry No. 14 and, especially, cry No. 16 would be judged to be a match -
    i.e., that Mr. Zimmerman had, indeed, made one or both of those two utterances. In this
    instance, the confidence level only reaches about 65-70%. Nevertheless, it is even much less
    likely that he (George Zimmerman) was not the person who made these two cries.

    Data for Mr. Martin. The data for Mr. Martin are similar in extent but different in
    pattern. Of course, the judgments here are even more difficult to make as they were drawn from
    a telephone call and, unlike those for Mr. Zimmerman, no reenactment samples were available.
    In judging Figures 4 and 5 (based also on two separate analyses), it can be noted 1) that no
    judgments were possible for call No. 11 and 2) that the data for calls No. 13, 14, and 16 quite
    clearly demonstrate that he did not make them. That is, the means of all the many hundreds of
    judgments usually ranged from 1.0 to 3.1. And, even though one score reached 5.5, very few of
    the individual judgments exceeded the non-match category. Thus, even with these restricted
    judgments, there is too little evidence to suggest that he uttered any of these three calls. On the
    other hand, there is some evidence that he was responsible for the first two calls/cries (i.e., No. 1
    and No. 8). Note their mean scores on Figures 4 and 5. They range from 5.8-6.5 for the first
    identification run (Fig. 4) and 6.4-6.5 for the second (Fig. 5). Note also that several of the
    individual judgments are in the 7.0 or above (i.e., match) category. Thus, it may be concluded
    with a nearly 70% confidence level that Mr. Martin produced the first two calls. Again, while
    they did not reach the "definitely match" category, the data do not provide any real evidence
    that he did not make these utterances.

    DISCUSSION

    While the evidence suggests that Mr. Martin produced the first two utterances and Mr.
    Zimmerman made the last two, the confidence level for these relationships is not very robust.
    Yet, conclusions of these low magnitudes are hardly surprising, given the limits and difficulty of
    the evaluation process. It is possible, of course, that more robust data could have been obtained
    if we had been supported in conducting two additional sets of procedures. The first of these
    procedures would have included comparative acoustic analyses of the listed U-K samples. The
    second would have been a perceptual experiment to compare the evidence recordings to
    appropriately sized samples of male speakers who were matched in age, gender, and linguistic
    background to, alternatively, Trayvon Martin and George Zimmerman. These two groups of
    speakers would produce utterances similar to those found on the 9-11 and exemplar recordings.
    The results of these procedures would have aided the undersigned in confirming or not
    confirming the findings reviewed above.

    CONCLUSIONS

    The opinions to follow are based primarily on the aural-perceptual evaluations described
    above. As was stated, even though many problems were evident, the evidence recording
    provided minimum-to-marginal material for identification purposes. Moreover, the exemplar
    recordings contained enough material to permit a number of different judgments to be made.
    Based on the many analyses carried out, the undersigned had to conclude that, while there is
    evidence to suggest that Mr. Martin made the first two calls/cries (Nos. 1 and 8) and that Mr.
    Zimmerman made those identified as 14 and 16, none of these conclusions reached the criterion
    for a match. Neither speaker could be identified as being responsible for the others.

    Finally, it must be conceded that the aural-perceptual method of speaker identification,
    while reasonably well organized and extensive in this case, is somewhat subjective in nature
    and, hence, the possibility of error exists. Nonetheless, the reported data can be defended on the
    basis of the rigorous procedures employed and, hence, the conclusions drawn can be viewed as
    reasonable.

    Respectfully submitted,

    Forensic Communication Associates

    James D. Harnsberger, PhD

    Senior Consultant

    Harry Hollien, Ph.D.

    Senior Consultant

    Figure 1. A sample of the type of summary figure employed in ordinary aural-perceptual
    speaker identification. The structuring of Figures 2-5 is patterned on this one.

    FORENSIC COMMUNICATION ASSOCIATES
    Aural-perceptual Approach to Speaker Identification Score Sheet

    Case Name:                        FCA REF:

    0 = U-K least alike; 10 = U-K most alike

    1. PITCH
       a. Level                0 . . . . 5 . . . . 10
       b. Variability          0 . . . . 5 . . . . 10
       c. Patterns             0 . . . . 5 . . . . 10
    2. VOICE QUALITY
       a. General              0 . . . . 5 . . . . 10
       b. Vocal Fry            0 . . . . 5 . . . . 10
       c. Other                0 . . . . 5 . . . . 10
    3. INTENSITY
       a. Variability          0 . . . . 5 . . . . 10
    4. DIALECT
       a. Regional             0 . . . . 5 . . . . 10
       b. Foreign              0 . . . . 5 . . . . 10
       c. Idiolect             0 . . . . 5 . . . . 10
    5. ARTICULATION
       a. Vowels               0 . . . . 5 . . . . 10
       b. Consonants           0 . . . . 5 . . . . 10
       c. Misarticulations     0 . . . . 5 . . . . 10
       d. Nasality             0 . . . . 5 . . . . 10
    6. PROSODY
       a. Rate                 0 . . . . 5 . . . . 10
       b. Speech Bursts        0 . . . . 5 . . . . 10
       c. Other                0 . . . . 5 . . . . 10

    SCORE RANGE
    MEAN

    Figure 2. Comparison of Mr. Zimmerman's general (but short) samples with the six cries
    or calls drawn from the 9-11 telephone call. Twenty such samples were
    compared to each cry. The data (x) on the continuum are the means of at least
    four of the five features plus a general assessment.

    Cry Number   Perception        Mean Judgment (x on 0-10 continuum)   Range
    No. 1        "wow"             2.4                                   1.0-4.0
    No. 8        "ow"              1.9                                   0-4.5
    No. 11       (mainly cherp)    N/A                                   null
    No. 13       "wyra"            4.7                                   2.5-5.5
    No. 14       "owa"             6.1                                   4.0-6.0
    No. 16       "swaa"            6.6                                   4.0-7.0
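
    One plausible reading of the caption's "means of at least four of the five features plus a
    general assessment" can be sketched as follows. The report does not give the individual
    feature scores or state exactly how they were pooled, so the function name, the use of None
    for a feature that could not be judged, and the example scores below are all hypothetical.

```python
def mean_judgment(feature_scores, overall_score):
    """Average the usable feature scores together with the overall assessment.

    feature_scores maps a feature name to a 0-10 score, or to None when the
    feature could not be judged. At least four usable features are required,
    mirroring the caption's "at least four of the five features".
    """
    usable = [s for s in feature_scores.values() if s is not None]
    if len(usable) < 4:
        raise ValueError("at least four feature scores are required")
    scores = usable + [overall_score]
    return round(sum(scores) / len(scores), 1)


# Hypothetical scores for one cry; intensity could not be judged.
example = mean_judgment(
    {"pitch": 5.0, "voice_quality": 7.0, "intensity": None,
     "vowels": 6.0, "nasality": 6.0},
    overall_score=6.0,
)
```

    Here four usable features plus the overall assessment give five scores, and their mean is
    the single value that would be plotted on the continuum.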

    Figure 3. Comparison of Mr. Zimmerman's reenacted cries with each of the six cries/calls.
    Twenty such samples were matched with each cry. The data (x) on the
    continuum are the means of at least four of the five features plus a general
    assessment.

    Cry Number   Perception        Mean Judgment (x on 0-10 continuum)   Range
    No. 1        "wow"             2.0                                   0-4.0
    No. 8        "ow"              2.1                                   1.0-3.5
    No. 11       (mainly cherp)    N/A                                   null
    No. 13       "wyra"            4.8                                   2.0-5.5
    No. 14       "owa"             6.0                                   5.0-7.0
    No. 16       "swaa"            6.9                                   4.5-7.5

    Figure 4. Comparison of Mr. Martin's general (but short) samples with the six cries or
    calls drawn from the 9-11 telephone call. Twenty such samples were matched
    with each cry. The data (x) on the continuum are the means of at least four of the
    five features plus a general assessment.

    Cry Number   Perception        Mean Judgment (x on 0-10 continuum)   Range
    No. 1        "wow"             5.8                                   3.0-7.0
    No. 8        "ow"              6.5                                   4.0-7.5
    No. 11       (mainly cherp)    N/A                                   null
    No. 13       "wyra"            1.2                                   0-3.2
    No. 14       "owa"             3.9                                   2.5-5.5
    No. 16       "swaa"            2.5                                   1.0-4.0

    Figure 5. Comparison of Mr. Martin's selected samples (shouts/cries) with the six cries or
    calls drawn from the 9-11 telephone call. Twenty such samples were matched
    with each cry. The data (x) on the continuum are the means of at least four of the
    five features plus a general assessment.

    Cry Number   Perception        Mean Judgment (x on 0-10 continuum)   Range
    No. 1        "wow"             6.5                                   4.5-7.5
    No. 8        "ow"              6.4                                   4.5-7.0
    No. 11       (mainly cherp)    N/A                                   null
    No. 13       "wyra"            1.1                                   0.5-3.2
    No. 14       "owa"             3.1                                   2.0-5.0
    No. 16       "swaa"            2.8                                   1.0-4.5
