8/12/2019 Colburn et al. (2002) - ASR
1/6
METHODOLOGY FOR DEVELOPING SPEECH AND NOISE DATABASES
FOR SOLDIER-BORNE AUTOMATIC SPEECH RECOGNITION
Kevin D. Colburn1, Cynthia L. Blackwell2, Randall D. Sullivan3, Gary E. Riccio1, and Olivier Deroo4
1Exponent, Inc.
21 Strathmore Road
Natick, MA 01760
Corresponding author's e-mail: [email protected]
2U.S. Army Soldier and Biological Chemical Command
Natick Soldier Center
Kansas St.
Natick, MA 01760
3The Wexford Group International
931 Front Avenue
Columbus, GA 31901
4Babel Technologies SA
33 Boulevard Dolez
B-7000 Mons, Belgium
Abstract: A soldier-borne Automatic Speech Recognition (ASR) system would enable the dismounted soldier of the near
future to control other electronic subsystems without removing hands from weapon or eyes from target, thus increasing the
soldier's mobility, lethality, and survivability. Soldier-borne ASR systems face two primary challenges to successful
performance: varying background noise, including complex combinations of speech and background noises, and shifts in
user state. Traditionally, ASR systems perform accurately only when the training speech and noise conditions are identical
or substantially similar to the operating speech and noise conditions. A survey of existing databases revealed a lack of operationally relevant speech and noise databases for dismounted infantry operations. The purpose of this paper is to
provide a methodology for creating such databases. We summarize our analyses and conclusions about existing databases,
the definition of various operational noises and individual speech variations that occur on the battlefield, the
experimentation plan for the collection of data, its planned refinement into useful databases, and the planned validation of
those databases.
1. INTRODUCTION
The soldier of the near future will be equipped with a growing number of body-worn electronic subsystems. It is critical to minimize the impact of these systems' control mechanisms on the soldier's physical and cognitive capabilities; a mechanism that enables control of electronic subsystems without removal of hands from weapon or eyes from target will naturally increase soldier mobility, lethality, and survivability. A recent collaboration [Exponent, 2002] between Exponent and the U.S. Army's Natick Soldier Center, addressing System Voice Control (SVC), demonstrated the possibilities for implementing a hands-free, and to some extent eyes-free, control mechanism. Part of the SVC effort involved an
implementing a hands-free, and to some extent eyes-free, control mechanism. Part of the SVC effort involved an
investigation of variations in speech signal and noise and their influence on recognition performance.
Automatic speech recognition (ASR) for the soldier faces two primary challenges to successful performance: varying
background noise and shifts in user state. Battlefield acoustic conditions involve complex combinations of speech and
background noises, including noises that are continuous, impulsive (and often intermittent), stationary, non-stationary, and
of extremely high sound pressure levels. In addition, the soldier may shift states rapidly, e.g., from whispering (to avoid
revealing one's position) to shouting (primarily because of background noise).
ASR systems tend to perform well only when the training speech and noise conditions are identical or substantially
similar to the operating speech and noise conditions. The SVC effort revealed a dearth of operationally valid speech and
battlefield noise databases. Until new databases are created, systems intended for use on the battlefield can be neither
properly trained nor properly evaluated. The SVC effort provided a foundation for an effort to develop such databases and
to demonstrate that their use in the development of robust ASR systems will enable those systems to perform in battlefield
noise conditions. Building on that foundation, we describe a methodology that can achieve the following goals:
- Collect data that will allow soldier-worn ASR systems to be trained with acoustic conditions that are identical or substantially similar to those of the environment in which the systems will operate.
- Collect data needed for databases that can be used by others for ASR research and development that is relevant to Army stakeholders.
- Create the databases according to established standards, and document the process to facilitate replication and elaboration by others.
- Validate the databases in a way that is free of specific ties to a particular ASR system or subsystem vendor.
Once trained with the databases, ASR systems should improve their performance in the traditionally challenging
acoustic conditions experienced by dismounted soldiers in combat and support roles. Reliably high levels of performance in
such conditions are a necessary condition for soldier acceptability of SVC. The resulting databases would also complement
the ongoing refinement of ASR algorithms, systems, and general architecture by creating a test bed for systematically
comparing and evaluating ASR systems and components. Thus, the databases created using this methodology would form
an enabling technology for robust ASR systems.
2. SURVEY OF EXISTING DATABASES
COTS/GOTS databases for speech signal variations that are used for customization of SVC products in the commercial
sector were obtained and categorized in terms of the dimensions of acoustic variation that they represent. Some databases
are composed of continuous speech, which is not ideal for command-style ASR training. Others focus on discrete words
and numbers, which is not ideal for dictation or continuous-speech ASR. COTS/GOTS databases of environmental noise,
relevant to military applications, were evaluated in a similar process to that of the speech databases. One database,
NOISEX, was selected as the best available because of its range of noise types as well as the rigor taken in documenting the
sources and technical details associated with the samples in the database. However, that range of noise types provides
insufficient variation across dismounted-infantry-appropriate operational environments, and the noises differ (drastically in
some cases) in intensity and complexity from the noises that dismounted infantry would actually hear.
The survey clearly revealed that recordings of speech and environmental noise that are appropriate to dismounted-infantry operations are needed.
3. DEFINITION OF OPERATIONAL NOISE: OPERATIONAL ENVIRONMENT FACTORS
The acoustic operating environment for soldier-worn ASR systems will vary by mission, operational mode, duty position,
terrain, and other such factors.
Every mission involves stages, such as planning, movement, and actions on the objective. A stage may exhibit the
preponderance of particular noises or noise types, though the frequency of occurrence of the noises will vary significantly.
Offensive, defensive, and stability missions are likely to be similar acoustically and can be combined in a general combat
category. A support mission is less likely to contain weapons, munitions, or explosives noises, so it can be considered a second category of mission.
ASR systems will be required to perform in all of the operational modes soldiers may encounter in combat (generally
referred to as dismounted, mounted, mounted supported by dismounted, and dismounted supported by mounted), each of
which generates particular noises. Dismounted operations are performed on foot by forces in close combat (offensive,
defensive, and stability) and support. Mounted operations include all operations utilizing ground, air, and maritime
vehicles, all of which generate noise. The mounted mode is also characterized by the firing of weapons organic to the
vehicle during combat operations. Both mounted and dismounted modes are likely to be supported by Joint and coalition
forces and assets (e.g., air and naval fire support), generating a wide variety of noises.
Soldiers in different duty positions will experience different acoustic environments, though many will have common
aspects. Duty position examples include: rifleman/machine gunner; grenadier/special operations soldier; medic; vehicle refueler; engineer (combat/sapper); engineer (bridge builder); and field artilleryman.
The type of terrain expected to be encountered may dictate the inclusion or exclusion of entire classes of vehicles,
weapons, and the like, and the presence or absence of these noise sources will influence the soldier's acoustic environment.
Terrain will also influence the amount of echo and reverberation in the acoustic environment.
4. DEFINITION OF OPERATIONAL NOISE: BATTLEFIELD NOISE SOURCES
The preceding section addressed the factors influencing the set of noises to which a soldier might be exposed. This section
identifies the specific noise sources that populate those sets. The sources that a soldier is likely to encounter in combat are
divided into natural groupings that have different effects on ASR performance: vehicles and aircraft tend to be stationary,
continuous noises compared to weapons, especially machine guns, which are intermittent and impulsive; munitions and
explosives tend to be non-stationary, impulsive, and often generate overpressure at the impact point; and environmental and
infrastructure noises can be stationary, non-stationary, continuous, impulsive, and intermittent, and can introduce effects
such as reverberation.
The noise of weapons, munitions, or explosives rounds impacting various targets is different from the noise the
weapons, munitions, or explosives generate at the firing position, and these target sounds should be collected to replicate
the noise of enemy munitions fired at U.S. forces in battle. One possibility for recording noises at the target position is to
emplace one expendable microphone close to the target position to experience the effects of overpressure and another, non-
expendable microphone at a safe distance to record the entire duration of the noise.
Any of the vehicles typically found on the battlefield can produce noise from its engine, drive train, wheels/tracks, and
airflow around the vehicle. The noise from the vehicle will generally be constant (stationary), though engine speed and
travel speed (via wind and/or wave effects) will affect intensity and acoustic frequency. The firing of weapons mounted on
the vehicle or carried by the rider(s) will produce additional noises. Aircraft typically found on or over the battlefield
include helicopters, fighter jets, and transport planes. Any of the aircraft can produce noise from engines, propellers, rotors,
airflow around the aircraft, and from the firing of weapons mounted on the aircraft or carried by the rider(s).
Other noise sources that can affect speech recognition and should thus be considered for data collection include the
following: weather-related noise such as heavy rain, thunder, and wind; natural terrain, which rarely generates additive
noise, but can attenuate or amplify battlefield noise and create echo and reverberation; urbanized terrain, which like natural
terrain, can either attenuate or amplify battlefield noise, but can also generate noises; and gasoline and diesel generators
used by U.S. forces in combat and support.
Dismounted infantry are, uniquely, exposed directly to all of these noise sources and therefore require new databases
for ASR that accurately reflect their unique acoustic environment.
5. DEFINITION OF OPERATIONAL NOISE: OPERATIONAL SPEECH
Speech typically encountered in the battlefield includes voice communications from leaders, members of squads/crews, and
command groups. Some messages tend to recur. The standard formats of some of these typically recurring messages are
prime examples of the kinds of vocabulary used at the battalion level and below. These messages are predominantly spoken
over a radio or, in some instances, person-to-person. Many could potentially be spoken into a soldier- or leader-worn
computer using ASR.
Standard message format examples are as follows, in order of likelihood of occurrence in combat and stability
operations: Spot Report (SPOTREP), Size, Activity, Location, Unit or Uniform, Time and Equipment (SALUTE), or Size
Activity, Location and Time (SALT) report; call for fire; brevity codes; fire commands for machine-gun teams and anti-
tank weapons; immediate Close Air Support (CAS) request; medical evacuation request (Medevac); Ammunition,
Casualties, and Equipment (ACE) report; Warning Order/Fragmentary Order; SLANT report/combat power report; and
free-text message. We are not aware of any examples of soldiers talking to computers in the current force.
6. INDIVIDUAL VARIATION
The following dimensions are quasi-static and represent variation across individuals: age, accent, gender, and education level. For a database to accurately represent any of these dimensions, the characteristics of both the experimental subjects and of the Army soldier population are necessary. The database would need adequate sample sizes for various subpopulations. Without adequate sample sizes, best practices indicate that databases should be annotated with
descriptions of the subpopulations from which the speech samples are collected. Ambient environmental conditions could
be recorded to some degree, and qualitative observations of respiration rate, anxiety level, and any particular tasks that the
soldier performs could be recorded when possible. The most effective ASR system will take advantage of the likely emergence of smart card technology carried by each soldier. Such a solution would allow for the storage of various data, including a voice template for each soldier. Although this will not eliminate the need for some
general baseline algorithm training to provide a base ASR capability, it will likely obviate the need to train the system on a
large set of data that represents the quasi-static dimensions of variation listed above.
Dynamic dimensions of variation represent variation within individuals, are less likely to be captured in a voice
profile, and thus must be addressed in real-time by any ASR system:
- Fatigue: At Exponent's direction, the University of Massachusetts Exercise Science Department conducted experiments and generated a database of spoken utterances recorded during standing, locomotion, and fatigue. The data from these experiments are still being analyzed. To our knowledge, no other databases have been released that capture speech in combination with locomotion and fatigue.
- Heat and cold: The U.S. Army Research Institute of Environmental Medicine developed the physiological Heat Strain Index (HSI), based on core body temperature and heart rate, for humans [Moran, 1998]. A similar index called the Cold Strain Index (CSI) was then developed [Moran, 1999], though the effect of either heat stress or cold stress on speech production has not been addressed. There may be value in relating HSI or CSI values in test subjects to trends in speech data collected from those subjects, but the required data collection (core body temperature and heart rate) would be outside the scope of the database development effort.
- Cognitive load and anxiety level: These characteristics are ill-suited for ready and quantitative assessment during field speech data collection.
- Intensity: Whispered and high-stress/high-intensity speech are traditionally the most challenging to ASR systems, and therefore should be assigned a high priority for recording. We expect that the Lombard effect [Lombard, 1911] would be pervasive throughout the data collection, given the anticipated background sound pressure levels of the training exercises.
It is probably feasible to collect adequate sample sizes for each of these dimensions. Much of the data would be collected in a laboratory setting.
7. SPEAKER DEPENDENCE IN ASR SYSTEMS
A speaker-independent solution for SVC would avoid the need for an initial training session for each soldier and an additional training session when one soldier needs to use another soldier's system. A speaker-independent system is typically less accurate than a speaker-dependent system and requires far more training data, but is likely to be more noise-robust, especially when the operating or testing conditions do not match exactly the training conditions, as will often be the case with soldiers due to background noise and speaker stress.
Speaker-independent systems allow the recognition of words that were not in the training set, but require many
speakers during the training phase. One type requires a total of at least 200 repetitions of each keyword from a group of at
least 50 different speakers for training and testing; another requires at least 100 phonetically balanced sentences from a
group of at least 50 different speakers; and a third type requires more speakers (preferably 300 or more) and more data.
Speaker-dependent systems are generally capable of accurately recognizing only words that were in the training set; any other word is rejected as an out-of-vocabulary utterance. Furthermore, the system will not be robust to noise that is
substantially different from noise present in its training data. Because a different model is created for each speaker, though,
the quasi-static sources of inter-speaker variation are inherently addressed and thus no other speakers are needed during
training. For isolated word recognition, a minimum of two repetitions of each keyword by the speaker are necessary during
training to achieve good recognition rates.
The amount of data required to train a speaker-independent system is thus much greater than for a speaker-dependent
system. For either system, a sufficient amount of data must be collected to divide the data into separate training and testing
sets. Typically, 10-20% of the data should be set aside for testing. Each set must be large enough to achieve statistical
significance for the accuracy results. This database development effort would likely be conducted with the expectation of being used for speaker-dependent systems. This is an important decision because although a speaker-dependent system can be trained on data gathered for a speaker-independent system, the converse is not possible.
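The hold-out split described above can be sketched as follows; the utterance IDs, the 15% test fraction, and the fixed seed are illustrative assumptions, not values from the paper.

```python
import random

def split_dataset(utterances, test_fraction=0.15, seed=0):
    """Shuffle, then hold out a fixed fraction of utterances for testing."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    items = list(utterances)
    rng.shuffle(items)
    n_test = max(1, round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]   # (training set, testing set)

# Illustrative: 100 hypothetical utterance IDs, 15% held out for testing.
train, test = split_dataset([f"utt_{i:03d}" for i in range(100)])
```

In practice each set would also need to be large enough to support statistically meaningful accuracy comparisons, per the paragraph above.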
The data collection in this effort would be variable, opportunistic, and often unpredictable. Single-word utterances of
relevant vocabulary words would occur frequently, given the verbal exchanges that one should target (as described in
Section 5). It is unrealistic to expect soldiers in training exercises to speak phonetically balanced sentences that are not part
of their typical lexicon.
It would be prudent to begin the effort with a pilot study to demonstrate the ASR performance advantage of recording
speech and noise combinations in the battlefield. The nature of a pilot study would limit the amount of data collected in the
study. After separation into training and testing sets, the amount of pilot data would likely be large enough to train only a
speaker-dependent system. Both speaker-dependent and speaker-independent systems could be tested with the data, though:
one could train a speaker-independent system with a specific training database, as in previous Exponent SVC experiments,
test it on the pilot test data, and then compare the results to those of a speaker-dependent system tested and trained on the
pilot data. In short, one could record both a speaker-dependent training and testing database and a speaker-independent
testing database.
Beyond the pilot study, for the full database development effort, a goal for the remainder of the data collection should
be to collect a sufficient amount of data to enable testing of any combination of speaker-dependent and speaker-
independent systems, as well as noise-canceling algorithms and other front-end additions to the models. Ideally, data would
also be collected that would allow training of speaker-independent systems, but whether a sufficient number of phonetically
balanced sentences from a sufficient number of different speakers could be recorded remains to be seen.
8. LOGISTICS OF DATABASE DEVELOPMENT
Collecting speech and noise data during actual combat operations would be nearly impossible for a variety of reasons, so
data would be collected primarily during live-fire training exercises. For mobile speech collection instrumentation, one or more air conduction microphones would be placed as close as possible to the location where an ASR system close-talk microphone would be placed. The characteristics of the recorded speech signal would vary significantly with microphone placement, while the characteristics of the recorded background noise may vary little with microphone placement, depending on the type of microphone used. Bone conduction microphones would also be considered and, if used, placed in the soldier's helmet.
In order to determine the noise or noise combinations to be recorded, four aspects of the noise or noise combinations
should be considered:
- Probability of occurrence in the soldier's environment: Support operations may be more predictable than combat.
- Degree of challenge posed to ASR: The effect of duration, frequency, intermittence, intensity, etc. on an ASR system's ability to detect contemporaneous speech.
- Ease of collection (equipment, methods, and personnel).
- Probability of occurrence in a training exercise: Likely combinations of noise are easier to predict in a training exercise than in the battlefield because the position and motion of noise sources are somewhat predetermined.
To integrate these factors into the decision process for data collection, one would assign each noise or noise combination a
numerical value in the range of 1 to 10 for each factor, where a value of 10 represents the most favorable state for a factor (e.g., the highest degree of challenge). One would then assign weights to each of the four factors in order of importance.
Then the sum for each noise or noise combination would reflect a composite rating of four weighted factors, and the greater
the value of the sum, the more desirable the noise or noise combination would be for recording.
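The weighted composite rating just described can be sketched as follows; the factor weights, candidate noise names, and per-factor scores are all hypothetical illustrations, not values proposed in the paper.

```python
def composite_rating(scores, weights):
    """Weighted sum of per-factor scores; each score is on the 1-10 scale."""
    return sum(weights[factor] * scores[factor] for factor in weights)

# Assumed weights in a hypothetical order of importance (largest = most important).
weights = {
    "occurrence_in_environment": 4,
    "challenge_to_asr": 3,
    "ease_of_collection": 2,
    "occurrence_in_exercise": 1,
}

# Hypothetical candidate noises with illustrative 1-10 factor scores.
candidates = {
    "machine_gun_plus_tank": {"occurrence_in_environment": 9, "challenge_to_asr": 10,
                              "ease_of_collection": 6, "occurrence_in_exercise": 8},
    "generator_hum":         {"occurrence_in_environment": 7, "challenge_to_asr": 4,
                              "ease_of_collection": 9, "occurrence_in_exercise": 9},
}

# The greater the composite rating, the more desirable the noise for recording.
ranked = sorted(candidates,
                key=lambda name: composite_rating(candidates[name], weights),
                reverse=True)
```

The ranking simply prioritizes candidates by the weighted sum, so the relative ordering of the weights matters more than their absolute values.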
Once the noises and noise combinations to be collected have been identified, other collection factors should be
considered: if possible, five repetitions of each noise or noise combination should be recorded; noises should be recorded at
various distances from the microphone; and if an intermittent noise (such as a machine gun) is fired differently by different
soldiers, then each style of firing should be recorded (i.e., capture whatever occurrence frequencies exist).
The data collection requires a minimum of two teams: a coordination team and a data collection team. The two-person
coordination team would coordinate with the targeted installations and units on, at a minimum, the following: division-
level and above operations section (G3)/Directorate of Plans and Training (DPT)/Directorate of Plans, Training, and
Mobilization (DPTM); brigade- and battalion-level operations section (S3); installation range control and safety office.
The viability of locating portions of the instrumentation off-soldier and off-vehicle would be explored. Unlike the mobile instrumentation, which would capture both speech and noise at the soldier, stationary instrumentation would be the only means of collecting noise data in down-range target locations and would be inexpensive in the event of loss due to munitions effects.
In addition to the acoustic data recording, other data may be collected in order to identify the various components of
speech and noise combinations present in the recordings. This could involve simultaneous GPS signal recording,
simultaneous video and/or additional audio recording, real-time note-taking by data collection personnel, or other methods.
The collected acoustic data would be prepared, annotated, and transcribed according to Linguistic Data Consortium
guidelines. Preparation would involve uploading audio from the recording medium to computer disk and segmentation into
units appropriately sized for analysis. National Institute of Standards and Technology (NIST) SPHERE headers would be
attached to the audio to facilitate research, and the data would be divided into training, development, and evaluation data
sets. Transcription of the data would be conducted according to a modified Hub-5 convention for acoustic databases to be
used in ASR.
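As a rough sketch of the SPHERE-header step mentioned above, the following builds a minimal fixed-size header; the field names and typed-value layout follow common SPHERE conventions, but this is an illustration only and production work should use NIST's own sphere utilities.

```python
def sphere_header(fields, header_size=1024):
    """Build a minimal fixed-size NIST SPHERE header string.

    Sketch of the documented SPHERE layout: "NIST_1A", the header size,
    typed key-value fields, then "end_head", padded to header_size bytes.
    """
    lines = ["NIST_1A", str(header_size)]
    for key, value in fields.items():
        if isinstance(value, int):
            lines.append(f"{key} -i {value}")              # -i marks an integer field
        else:
            value = str(value)
            lines.append(f"{key} -s{len(value)} {value}")  # -sN marks an N-char string
    lines.append("end_head")
    # Pad with spaces so the audio samples start at a fixed byte offset.
    return ("\n".join(lines) + "\n").ljust(header_size)

# Hypothetical field values for a 16 kHz, 16-bit mono PCM recording.
hdr = sphere_header({"sample_rate": 16000, "channel_count": 1,
                     "sample_n_bytes": 2, "sample_coding": "pcm"})
```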
An increase in ASR recognition accuracy when tested in operationally relevant noise conditions would clearly validate
the database generation. In addition, another validation is possible: that of combining noises in the laboratory for training
and testing of ASR systems. As the data are collected, we expect that several noises would occur simultaneously (e.g., the
noise from a tank passing by as a machine gun is fired). The first step in validation would be to record those same noises
individually. Next, they would be combined in the laboratory. Finally, this combination would be used to train an ASR
system. With identical test conditions, similar performance with a system trained on simultaneously-occurring noises versus
a system trained on noises combined in the laboratory would validate the combination method. Widely dissimilar results
would invalidate the method and would necessitate much more extensive data collection in the field.
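The laboratory combination step can be sketched as simple additive mixing of the individually recorded noises; the synthetic "tank hum" and "gun burst" signals below are stand-ins for real field recordings, and the gains, sample rate, and normalization policy are assumptions for illustration.

```python
import numpy as np

def mix_noises(recordings, gains=None):
    """Additively combine individually recorded noises into one signal.

    recordings: equal-length mono arrays at the same (assumed) sample rate.
    gains: optional per-source linear gains used to match field intensities.
    """
    if gains is None:
        gains = [1.0] * len(recordings)
    mixed = sum(g * np.asarray(r, dtype=np.float64)
                for g, r in zip(gains, recordings))
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed   # avoid clipping the sum

# Illustrative one-second stand-ins (not real field recordings):
sr = 8000
t = np.arange(sr) / sr
tank_hum = 0.3 * np.sin(2 * np.pi * 90 * t)            # continuous, stationary
gun_bursts = 0.8 * (np.sin(2 * np.pi * 10 * t) > 0.9)  # intermittent, impulsive
combined = mix_noises([tank_hum, gun_bursts])
```

The validation described above would then compare ASR performance after training on such laboratory mixtures against training on recordings of the same noises occurring simultaneously in the field.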
9. CONCLUSION
This paper delineates a methodology to create operationally valid speech and battlefield noise databases for ASR
systems intended for battlefield use by dismounted infantry. There is a paucity of available speech and noise databases for
military operational environments as compared to other application domains. The value of the databases lies in gaining a higher-fidelity representation of the acoustic environment in which dismounted infantry operate. The methodology leads to
a higher-fidelity representation because speech and noise data would be collected under relatively controllable conditions
that are as similar as possible to the actual operating environment. Creating such databases and using them to train and
evaluate ASR systems would enable sufficiently effective ASR performance to achieve soldier acceptability, and thus to
grant the desired increase in mobility, lethality, and survivability that would come with hands-free operation of electronic
subsystems.
10. REFERENCES
1. Future Warrior Technology Integration: System Voice Control Phase III Final Report. Exponent, Inc. (2002). Natick, MA.
2. Future Warrior Technology Integration: System Voice Control EY2000 Report. Exponent, Inc. (2000). Natick, MA.
3. Lombard, E. (1911). Le signe de l'élévation de la voix. Annales des Maladies de l'Oreille, du Larynx, du Nez et du Pharynx, 37: 101-119.
4. Moran, D.S., Shitzer, A., and Pandolf, K.B. (1998). A physiological strain index to evaluate heat stress. American Journal of Physiology, 275 (Regulatory Integrative Comp. Physiology 44): R129-R134.
5. Moran, D.S., Castellani, J.W., O'Brien, C., Young, A.J., and Pandolf, K.B. (1999). Evaluating physiological strain during cold exposure using a new cold strain index. American Journal of Physiology, 277 (Regulatory Integrative Comp. Physiology 46): R556-R564.