Speech Processing in the Time Domain Using
Sparse Autoencoders for Unsupervised Feature Learning
And An Additive Model of the Peripheral Auditory System
Thomas Bryan
PhD proposal
Electrical and Computer Engineering
Florida Institute of Technology
1. INTRODUCTION
1.1 “Efficient coding” and human sensory perception
Human neural perception is unsurpassed in terms of its ability to process complex neural inputs, particularly in vision
and audio processing. The brain processes information at a relatively slow rate, on the order of tens of Hz, but it does so
with a massively parallel structure. Today, convincing evidence exists [1-4, 28] that the brain stores sparse overcomplete
“basis vector” representations of input stimuli. The basis vectors are overcomplete in that the number of basis vectors is
larger than the effective dimensionality of the input space. The “essence of information” is captured by a linear
superposition of these basis vectors.
It has been postulated that these basis vectors are learned through a process of unsupervised feature learning. Five of the
best-in-class machine learning algorithms have been applied to visual, audio, and touch data, and all five algorithms
are qualitatively similar [4]. They all produce sparse representations of the input data and are trained using unsupervised
feature learning in a greedy layer-by-layer fashion. Finally, they all produce results that match the corresponding
receptive fields in cortical portions of the brain. In this paper we apply one of these, the sparse autoencoder (SAE), to
automatic speech recognition tasks.
Sparse coding has been applied to the output of cochlear filter banks that produce frequency domain representations that
are similar to spectrograms [5, 28]. These time-frequency, or spectrotemporal representations may be processed using
image processing techniques. The learned basis vectors are components of spectrogram-like representations such as
horizontal lines corresponding to formant frequencies. These basis vectors represent the slow envelope of the neurons in
the peripheral auditory system.
The first part of the study is to directly apply the SAE to time domain speech waveforms. Time domain basis vectors
have been derived that have been demonstrated to match the peripheral neural responses in upper mammals [3]. The
learned basis vectors were shown to closely match the cochlear impulse response in mammals and are described by
gammatone filters [24, 25]. As a result of this finding, gammatone filter impulse responses will be used as the kernel
function for wavelet decomposition of speech using a non-orthogonal basis set. SAE input speech feature vectors will be
selected based on an MSE fit to the gammatone wavelets.
The first part of the study makes no assumptions about the model of the peripheral auditory system. It simply selects
portions of speech that are most similar to gammatone impulse responses, normalizes them, and processes them using an
SAE. Speech recognition and dialect detection are performed to validate the approach. This is done by removing the output
stage of the SAE and adding a softmax classifier.
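As a rough illustration of this two-stage procedure, the following Python sketch evaluates a sparse autoencoder cost for a single hidden layer and then reuses that hidden layer as the input of a softmax classifier. The KL-divergence sparsity penalty, the parameters `rho` and `beta`, the layer sizes, and the random weights are all assumptions made for illustration; the proposal does not specify them, and no training loop is shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W, b):
    """Affine map followed by a sigmoid: one output per row of W."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sae_cost(batch, W1, b1, W2, b2, rho=0.05, beta=3.0):
    """Mean reconstruction error plus a sparsity penalty (assumed form)."""
    hidden = [layer(x, W1, b1) for x in batch]
    recon = [layer(h, W2, b2) for h in hidden]
    mse = sum(0.5 * sum((r - xi) ** 2 for r, xi in zip(rx, x))
              for rx, x in zip(recon, batch)) / len(batch)
    # Mean activation of each hidden unit over the batch.
    rho_hat = [sum(h[j] for h in hidden) / len(batch) for j in range(len(W1))]
    return mse + beta * sum(kl_divergence(rho, r) for r in rho_hat)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(x, W1, b1, Wc, bc):
    """SAE output stage removed: the hidden layer feeds a softmax classifier."""
    h = layer(x, W1, b1)
    return softmax([sum(w * hj for w, hj in zip(row, h)) + bj
                    for row, bj in zip(Wc, bc)])

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes and random data, for illustration only.
rng = random.Random(0)
n_in, n_hidden, n_classes = 8, 4, 3
W1, b1 = rand_matrix(n_hidden, n_in, rng), [0.0] * n_hidden
W2, b2 = rand_matrix(n_in, n_hidden, rng), [0.0] * n_in
Wc, bc = rand_matrix(n_classes, n_hidden, rng), [0.0] * n_classes
batch = [[rng.random() for _ in range(n_in)] for _ in range(5)]

cost = sae_cost(batch, W1, b1, W2, b2)      # unsupervised pre-training objective
probs = classify(batch[0], W1, b1, Wc, bc)  # class probabilities after the swap
```

In practice the encoder weights `W1`, `b1` would first be trained to minimize `sae_cost`, and the whole classifier would then be fine-tuned with backpropagation.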
The second part of the study uses an additive model of the peripheral auditory system that incorporates the slow envelope
and the temporal fine structure. In this approach an SAE is used at the output of each filter: there is one SAE for the
slow envelope and one for the temporal fine structure. The temporal fine structure is used directly, so there is no
spectrogram-like representation for it. The composite model has spectrogram-like features coming from the slow
envelope, as well as time domain features due to the temporal fine structure. The additive model should support
functions such as pitch detection and music detection, and should work better than the phase one experiments in the
presence of noise and reverberation.
1.2 The slow envelope and temporal fine structure of the peripheral auditory system
Models of the audio pathway in the human ear are now fairly well understood [6-7]. The cochlear frequency response is
composed of a bank of logarithmically spaced parallel filters. At the output of each filter are stereocilia (hairs) that have
1 to 8 nerve cells attached to them. The hairs half-wave rectify the audio signal, which is subsequently low-pass filtered by
auditory nerve “phase locking” to produce the slow envelope. The rectified output also passes through a band-pass filter
that passes the fundamental and the second harmonic. This is the temporal fine structure.
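The processing chain for a single cochlear channel can be sketched in Python as follows. This is only a minimal illustration: the one-pole filters, the 50 Hz envelope cutoff, the band edges at 0.5 and 2.5 times the channel frequency, and the test tone are all assumed values, since the actual filter bandwidths are not specified here.

```python
import math

def half_wave_rectify(x):
    """Keep the positive half of the waveform, as the hair cells are said to do."""
    return [max(v, 0.0) for v in x]

def one_pole_lowpass(x, cutoff_hz, fs):
    """First-order recursive low-pass filter (an assumed, very simple smoother)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, state = [], 0.0
    for v in x:
        state += alpha * (v - state)
        y.append(state)
    return y

def crude_bandpass(x, low_hz, high_hz, fs):
    """Difference of two one-pole low-pass filters: a crude band-pass stand-in."""
    hi = one_pole_lowpass(x, high_hz, fs)
    lo = one_pole_lowpass(x, low_hz, fs)
    return [a - b for a, b in zip(hi, lo)]

fs = 16000      # sample rate in Hz (assumed)
f0 = 1000.0     # centre frequency of this channel in Hz (assumed)
n_samples = 1600

# Test input: a tone at f0 with a slow 20 Hz amplitude modulation.
signal = [(1.0 + 0.5 * math.sin(2.0 * math.pi * 20.0 * n / fs))
          * math.sin(2.0 * math.pi * f0 * n / fs) for n in range(n_samples)]

rectified = half_wave_rectify(signal)
slow_envelope = one_pole_lowpass(rectified, 50.0, fs)
fine_structure = crude_bandpass(rectified, 0.5 * f0, 2.5 * f0, fs)

# Additive combination of the two streams for this channel.
additive = [e + f for e, f in zip(slow_envelope, fine_structure)]
```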
It has long been known that speech intelligibility comes from the slow envelope [8]. This is the stimulus that cochlear implant
patients perceive. Additionally, it is known that these patients do not have the ability to perceive pitch, to hear music, or to
perceive speech in noisy environments. This is due to the lack of the temporal fine structure.
If the Hilbert phase of the temporal fine structure is taken, this becomes a multiplicative model. This is a modulator
implementation whereby the temporal fine structure is multiplied by the slow envelope to reproduce the audio. If the
temporal fine structure is added to the slow envelope, this becomes the additive model. This is a very simplified model and
is not intended to accurately model the peripheral auditory neural pathways, as the filter bandwidths for the lowpass and
bandpass filters are not clearly understood. Instead, this model is to serve as the basis for a signal processing algorithm
that is inspired by the research in cochlear implants. The proof is in the pudding, as they say: if the model is valid, it
should add pitch detection capability and be more robust to reverberation and interference than models that only
represent the slow envelope.
For the past two decades, the multiplicative model of the temporal fine structure has been adopted by research scientists.
In this model, the temporal fine structure is multiplied by the slow envelope to produce the signals sent to the auditory cortex. This
model uses the Hilbert phase of preprocessed audio signals, which has 180 degree phase flips at the phase
discontinuities and is not consistent with the function of auditory nerve cells. Nevertheless, the multiplicative model based
on the Hilbert phase is the most widely accepted model in use today. A new additive model has recently been proposed
in which the temporal fine structure is used directly and added to the slow envelope [26]. This additive model is
consistent with the neurobiology and is therefore the model that will be adopted for this investigation.
1.3 Entropy and human auditory perception
For decades researchers have studied speech processing based on linguistic features such as vowels, glides,
stops, fricatives, and plosives. Many studies have sought to find out which parts of speech contain the most information.
Many experiments have been conducted over the years to determine whether vowels or consonants contain most of the
information required for perception. It is widely known that written English is understandable without the vowels. This
is due to the high redundancy in the language. This is not true for spoken English. Many studies have sought to remove
vowels, consonants, or the transitions between vowels and consonants, all with no meaningful, distinguishable preference for
any particular linguistic feature.
Recently, studies have been conducted that remove high entropy portions of the speech and correlate
perception with features. In one of these studies [9] the entropy is called Cochlear Scaled Entropy and is defined
as the difference, or dot product, between adjacent frames of data at the output of cochlear frequency response filters.
Other researchers [10-11] have defined audio entropy based on random fractals known as fractional Brownian noises.
These noises are characterized by the frequency domain exponent. For white noise the exponent is 0, for 1/f noise, the
exponent is 1 and for a random walk, the exponent is 2. The entropy increases with decreasing exponent so that white
noise has the highest entropy, or randomness. The random walk has the lowest entropy as the next value of pitch for
example, will be highly correlated with the present one. The noise is used to modulate pitch, loudness and duration.
Psychological experiments have shown that subjects can distinguish entropy in pitch and loudness,
but not in duration. Moreover, similar studies based on MRI measurements have shown a direct correlation between blood
oxygenation levels in the brain [12] and the entropy of fractional Brownian motion. From these experiments, the evidence shows
that brain responses, and perception, are strong functions of the audio entropy of the stimulating signal.
The Cochlear Scaled Entropy and fractional Brownian entropy models define entropy in terms of randomness or change.
While these definitions are consistent with entropy from a thermodynamic standpoint, especially in terms of Brownian
motion, they do not translate well to Shannon’s idea of entropy in terms of coding. Shannon defined channel capacity in
terms of source entropy, and “equivocation” of the channel, and showed that codes could be found that perform at the
limits of the mutual information, the source entropy minus the equivocation. A central theme for Shannon’s coding is
based on having a fixed set of source symbols of length N, and a set of code words of length M, where the code length of
M is greater than the symbol length N. Additionally, the dictionary of code words is of fixed length. Shannon defined the
capacity of the communication channel by properly decoding received code words back into the original transmitted
symbols.
Smith and Lewicki [3] developed an efficient auditory coding algorithm that decomposes speech into a sparse set of
kernel functions, or “codes”. The use of these codes provides a method for defining the source entropy of speech that is
more consistent with Shannon’s definition of entropy as opposed to simply viewing source entropy as high rates of
“change”. Furthermore, the kernel function decomposition method captures the underlying structure of speech and
should maximize the “SNR in the detector”. The algorithm by Smith and Lewicki serves as the basis for speech feature
extraction of “codes” in this study. The method they present is based on an autoencoder with linear activations. The
linear activations will be replaced by sigmoidal activations in hopes of learning higher order statistics from the speech.
1.4 Matched filter speech coding
Modern communication systems are able to extract signals in high noise, high multi-path fading environments and
perform within tenths of a dB of Shannon’s limit [17]. At the core of these communication systems is the matched
filter. The matched filter maximizes the signal to noise ratio in the receiver. It does this by matching the
receive waveform to the transmit waveform, after channel equalization. This same principle can be applied to speech
coding. However, due to the inherent complexity of speech, the “matched filters” will be allowed to overlap in time.
Encoding of speech has three major challenges [13]: 1) time dilation due to differences between fast and slow talkers, 2) high
dynamic range of amplitude variations, and 3) time-frequency resolution problems.
It is not expected that an independent set of dictionary codes may be used for speech, as matched filters are in
communication systems, due to the richness and variability of speech. However, it should be possible to find time domain
overlapped codes that may be summed together to reproduce speech waveforms. The intuition for the existence of
these codes comes from the model of the cochlear filters in the peripheral auditory system [6-7]. From basic
communication theory, the output of a discrete filter is the sum of individually weighted impulse responses.
Furthermore, there are n parallel filters in the cochlea. The output of each individual filter may be
approximated by keeping a limited number of the weighted impulse responses. These outputs may then be summed together to
reproduce a compressed, code-based representation of the speech. The challenge in finding good codes is to find which
weighted impulse responses capture the best features for the given classification task. Features, or code words, for
speaker recognition will likely not be the best for other tasks such as phoneme detection and sentiment or opinion
mining. For example, it will be shown that an SAE is able to reconstruct intelligible speech by random sampling of the
speech waveform. However, the classification accuracy for speaker identification is then no better than random guessing. The
SAE finds an optimum set of initial conditions in a greedy layer-by-layer approach. The output layer of the SAE is
removed, and a softmax classifier is added to the hidden layer. The entire network is then trained with backpropagation.
New input weights emerge, based on the output of the classifier, and these new input weights become the feature vectors
or codes. Finding good codes, then, is a three step process:
1. Raw feature extraction.
2. Sparse autoencoding to obtain good initial conditions for a classification network.
3. Tuning these features via backpropagation.
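The idea that a filter output is a sum of weighted impulse responses, and that keeping only a few of them gives a compressed approximation, can be sketched in Python. The impulse response `h`, the input `x`, and the rule of keeping the largest-magnitude samples are assumptions chosen for illustration; they are not the selection rule of this study.

```python
def convolve(x, h):
    """Direct-form FIR filtering: y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

def sum_of_weighted_impulse_responses(x, h, keep):
    """Superpose copies of h, each shifted to sample k and weighted by x[k],
    using only the `keep` input samples with the largest magnitude."""
    strongest = sorted(range(len(x)), key=lambda k: abs(x[k]), reverse=True)[:keep]
    y = [0.0] * (len(x) + len(h) - 1)
    for k in strongest:
        for m, hm in enumerate(h):
            y[k + m] += x[k] * hm
    return y

def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Hypothetical impulse response and input with three dominant samples.
h = [1.0, 0.6, 0.3, 0.1]
x = [0.05, -0.9, 0.02, 0.7, -0.03, 0.01, 0.8, -0.04]

full = convolve(x, h)
exact = sum_of_weighted_impulse_responses(x, h, keep=len(x))  # all terms kept
sparse = sum_of_weighted_impulse_responses(x, h, keep=3)      # three "codes" only

energy = sum(v * v for v in full)
sparse_error = squared_error(full, sparse)
```

With all terms kept, the superposition matches the filter output up to rounding; with three terms, the residual energy is small compared with the signal energy for this particular input.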
The idea of representing speech as code words is not new. The most widely used method for speech transcription, the
spectrogram, decomposes speech into complex exponentials, or sinusoids.
Spectrograms are one of the most widely used methods today for speech processing. Spectrograms are Short Time
Fourier Transforms (STFTs) [13]. STFTs are based on overlapped, windowed data samples followed by a Fast Fourier
Transform. The STFT decomposes the time domain signal onto windowed sine and cosine vectors of fixed
length. These vectors may be thought of as a dictionary of code words of fixed length. One simple method of calculating
the source entropy of speech would be to threshold the magnitude of each frequency bin. If it exceeds the threshold,
histogram that “code word”, normalize the distribution so that all bins sum to 1, and calculate the entropy.
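A minimal Python sketch of this thresholding procedure is given below. The frame length, hop size, Hann window, threshold value, and the two-tone test signal are assumptions for illustration, and a naive DFT is used in place of an FFT to keep the example self-contained.

```python
import math

def hann(n_len):
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_len) for n in range(n_len)]

def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) transform, for clarity only)."""
    n_len = len(frame)
    mags = []
    for k in range(n_len // 2 + 1):
        re = sum(v * math.cos(2.0 * math.pi * k * n / n_len) for n, v in enumerate(frame))
        im = sum(v * math.sin(2.0 * math.pi * k * n / n_len) for n, v in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def code_word_entropy(signal, frame_len=64, hop=32, threshold=1.0):
    """Histogram the bins whose magnitude exceeds the threshold, then return
    the entropy (in bits) of the normalized histogram and the raw counts."""
    window = hann(frame_len)
    counts = [0] * (frame_len // 2 + 1)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        for k, mag in enumerate(dft_magnitudes(frame)):
            if mag > threshold:
                counts[k] += 1
    total = sum(counts)
    if total == 0:
        return 0.0, counts
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs), counts

# Test signal: two tones centred exactly on bins 4 and 10 of a 64-point frame.
signal = [math.sin(2.0 * math.pi * 4 * n / 64) + math.sin(2.0 * math.pi * 10 * n / 64)
          for n in range(640)]
entropy_bits, counts = code_word_entropy(signal)
```

For this test signal each tone excites its own bin and, through the Hann window, its two neighbours, so six "code words" are used equally often and the entropy is log2(6) bits.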
The problem with STFTs is that there is an inherent time-frequency resolution trade-off due to “windowing”. Long windows lead
to high frequency resolution and poor time resolution. Conversely, short windows lead to good time resolution and poor
frequency resolution. Some data is always lost in the process, as each frequency bin is the dot product between the
speech and the sine/cosine vector. For example, if the speech has a phase inversion halfway through the sample vector,
the dot product might be close to zero while the actual energy is quite high. This can be seen when doing cepstral analysis
and liftering the pitch period of speech [13]. If the window is pitch synchronous, the glottal pulse never disappears.
However, for non-pitch-synchronous windowing, the liftered pitch pulse fades in and out over frames of data.
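The phase inversion example can be checked numerically. The short Python sketch below uses an assumed window of 256 samples and a tone of 8 cycles per window; the values are illustrative only.

```python
import math

N = 256   # samples in the analysis window (assumed)
k = 8     # cycles per window; even, so each half holds a whole number of cycles

sine = [math.sin(2.0 * math.pi * k * n / N) for n in range(N)]
cosine = [math.cos(2.0 * math.pi * k * n / N) for n in range(N)]

# Same tone, but with its phase inverted over the second half of the window.
flipped = [s if n < N // 2 else -s for n, s in enumerate(sine)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Magnitude of the frequency bin at k for the tone with and without the flip.
reference_magnitude = math.hypot(dot(sine, cosine), dot(sine, sine))
flipped_magnitude = math.hypot(dot(flipped, cosine), dot(flipped, sine))

# The energy of the flipped signal is unchanged by the phase inversion.
flipped_energy = dot(flipped, flipped)
```

The two halves of the flipped tone cancel in the dot product, so the bin magnitude collapses toward zero even though the signal energy is the same as that of the unflipped tone.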
The time-frequency resolution problem may be addressed by taking the mel frequencies from the FFTs [13]. Here, the FFT bins
are grouped into logarithmically spaced center frequencies with triangular overlapped frequency bins. This results in
wider bandwidths at high frequencies and narrower bandwidths at low frequencies. This is similar to the frequency
response of wavelets. This addresses the time-frequency problem, but the underlying problem of the loss of
information from the STFT remains.
Returning to the additive model [6-7], it is a natural choice to base the “matched filter” representation of speech on the
impulse responses of the cochlear bandpass filter bank. These impulse responses, or gamma-tones, form the input
stimuli of human auditory perception. Taken collectively, these gamma-tone impulse responses comprise a cochlear-gram
when the magnitude is taken.
Cochlear-grams are frequency domain representations of speech that are based on a model of a set of parallel band-pass
filters in the human ear. They are logarithmically spaced in frequency, similar to MFCCs [13]. Moreover, they are
overlapped in the frequency domain, which is also similar to MFCCs.
It is believed that the human peripheral auditory system efficiently encodes auditory signals sent to the auditory cortex of
the brain [3]. Smith and Lewicki [3] decompose speech based on a set of learned kernel functions. They also show that
gamma-tones closely match the “spiking” models in the auditory nerves of cats.
Using this method, speech is convolved with revcor filter impulse responses, or gamma-tones. Each gamma-tone
impulse response is assigned a temporal coordinate and an amplitude scaling value. The audio signal may be
reconstructed by summing the individual gamma-tones to generate the composite audio waveform. There are a finite
number of parallel cochlear filters, which leads to a fixed dictionary of source encoding code words. This will support
the calculation of source entropy. Moreover, this method is attractive because it addresses the time dilation problem by
repetitive use of similar gamma-tones over long formant frequencies. Additionally, the amplitude problem is addressed
because the gamma-tone decomposition is based on a mean square error fit to the data, which is independent of data
amplitude. The time-frequency problem is addressed in a manner similar to wavelets or MFCCs, in that the higher
frequency components have wider bandwidths and faster impulse responses [13]. Finally, the gamma-tones are known
to model the phase locking output of auditory nerve fibers when driven by AWGN.
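An amplitude-independent mean square error fit to a gamma-tone kernel can be sketched in Python as follows. The fourth-order gamma-tone form, the Glasberg and Moore ERB approximation, the bandwidth factor 1.019, the 20 ms kernel length, and the unit-norm scaling are standard choices assumed here for illustration; they are not taken from this proposal.

```python
import math

def unit_norm(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def erb(fc_hz):
    """Equivalent rectangular bandwidth (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone(fc_hz, fs, duration_s=0.02, order=4):
    """Unit-norm gamma-tone impulse response: t^(n-1) exp(-2 pi b t) cos(2 pi fc t)."""
    b = 1.019 * erb(fc_hz)
    n_samples = int(duration_s * fs)
    g = [(n / fs) ** (order - 1)
         * math.exp(-2.0 * math.pi * b * n / fs)
         * math.cos(2.0 * math.pi * fc_hz * n / fs)
         for n in range(n_samples)]
    return unit_norm(g)

def fit_error(segment, kernel):
    """Mean square error between the unit-norm segment and the unit-norm kernel."""
    s = unit_norm(segment)
    return sum((a - b) ** 2 for a, b in zip(s, kernel)) / len(kernel)

fs = 16000
kernel_1k = gammatone(1000.0, fs)
kernel_3k = gammatone(3000.0, fs)

# Hypothetical speech segment: the 1 kHz kernel plus a small perturbation.
segment = [g + 0.002 * math.sin(0.3 * n) for n, g in enumerate(kernel_1k)]
loud_segment = [10.0 * v for v in segment]

err_quiet = fit_error(segment, kernel_1k)
err_loud = fit_error(loud_segment, kernel_1k)   # same fit despite 10x amplitude
err_wrong = fit_error(segment, kernel_3k)       # poorer fit to the wrong kernel
```

Because the segment is normalized before the comparison, scaling it by any positive gain leaves the fit error unchanged, which is the amplitude independence referred to above.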
2. LITERATURE REVIEW
2.1 Historical background on sparse representations
Ernst Mach [17] was part of the first generation of physiological psychologists, and is considered a forerunner of
modern neural networks by modern psychologists. Mach discovered that “the eye has a mind of its own” and that we
perceive not direct stimuli, but relations of stimuli. Mach believed that we do not perceive reality; rather, we perceive
the aftereffects of the nervous system’s adaptation to new stimuli. Mach believed that animals survive in their
environment based on their ability to adapt to a wider range of temporal and spatial surroundings. Over time organisms
develop memory that better enables them to adapt. As they evolve, communication becomes possible and they are able to
learn from others. From this learning, science is created. Mach developed the theory of the “economy of thought” [18]:
that the simplest, most parsimonious theories economize memory by using abstract concepts and laws instead of
attending to the details of individual events.
Tolman, a “field theorist”, put forth the idea of cognitive maps [19]. Tolman performed psychological experiments
on rats in a maze, and theorized that the rats learn a cognitive map of the maze. This is substantially different from the
stimulus/response psychological theories of the time, which were based on a telephone switchboard school of thought. In this
school of thought, good “connections” get reinforced, and bad “connections” get disconnected.
Craik introduced the concept of mental models [20] and was one of the first practitioners of cognitive science. Craik
believed that the mind constructs small scale models of reality that are used to anticipate events. He believed the process
of perception and reasoning takes place in three steps:
1. The translation of the external process into words, numbers, or other symbols, which can function as a model of
the world.
2. A process of reasoning from these symbols leading to others.
3. The retranslation back from these symbols into external processes, or at least a recognition that they
correspond to external processes.
Stimulation of sense organs results in neural patterns; reasoning stimulates other neural patterns that are retranslated into
excitation of the motor organs.
The study of cognitive fields was formalized after Shannon introduced information theory. Philosophers, physiologists,
and psychologists began to apply Shannon’s concepts of redundancy and efficiency to the primary processing in
sensory organs. “Efficient coding” soon became a central theme in how sensory inputs are mapped to neural codes that
capture the statistical structure of sensory inputs. The application of efficient coding resulted in sparse
representations of features that are learned through unsupervised feature learning. Moreover, these features closely
match the physiology of the sensory organs.
2.2 Claude Shannon 1948 A Mathematical Theory of Communication, Bell System Technical Journal
In this epic work, Shannon develops information theory for modern communication theory, motivated by the invention of
Pulse Code Modulation and Pulse Position Modulation for telephony. He defines information, based on the work of
Hartley, as the negative log of the probability of an event. If $\log_2$ is the basis, the information content is in bits. Shannon
defines a block diagram for the communications channel in terms of an information source, a transmitter, a channel with
noise added, a receiver, and a destination that receives a message. In this model the information source can be anything
from television signals to telephony or radio. The transmitter maps the information source to the channel. Examples of a
transmitter are FM modulators, or a voice encoder system. The channel is any medium the signal passes through
between the transmitter and the receiver. Note that the channel in Shannon’s definition is a probabilistic model, and not
the physical channel typically thought of by communications engineers. The receiver performs the inverse function of
the transmitter: it reconstructs the message and passes it along to the destination, the person the message is intended
for.
Shannon classifies communications systems as discrete, continuous, or mixed, and goes on to focus only on the discrete
case for the first half of the paper. His version of discrete is not a sampled data system. Instead, Shannon defines a discrete
system as one in which both the message and the signal are sequences of discrete symbols. Telegraphy is a discrete system where the
message is a sequence of letters, and the symbols are dots, dashes, and spaces. With the discrete communication system in
mind, Shannon derives the capacity of the communication channel. He first introduces the “Discrete Noiseless Channel”.
For the discrete noiseless channel, a Teletype has 32 symbols of equal duration. If all the symbols are equally likely, the
maximum capacity of the channel is $\log_2 32 = 5$ bits/symbol $\times$ $n$ symbols/second $= 5n$ bits/sec. Normalizing the symbol rate
to 1 symbol/sec yields 5 bits/Hz.
In the case of the English language the number of symbols is 27: 26 letters plus the space required between words. If all symbols
were equally likely, the average information per symbol would be $\log_2 27 = 4.7549$ bits/symbol.
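These two figures follow directly from the base-2 logarithm of the alphabet size, as the short Python check below shows. The function names and the example symbol rate of 10 symbols per second are illustrative assumptions.

```python
import math

def max_bits_per_symbol(n_symbols):
    """Information per symbol when all n_symbols are equally likely."""
    return math.log2(n_symbols)

def capacity_bits_per_second(n_symbols, symbols_per_second):
    """Noiseless capacity for equal-duration, equally likely symbols."""
    return max_bits_per_symbol(n_symbols) * symbols_per_second

teletype_bits = max_bits_per_symbol(32)   # 32-symbol Teletype alphabet
english_bits = max_bits_per_symbol(27)    # 26 letters plus the space

# Hypothetical symbol rate of 10 symbols per second.
teletype_capacity = capacity_bits_per_second(32, 10)
```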
Not all symbols are equally likely, and not all symbol sequences may be allowed. For this discussion, Shannon shows
state diagrams for communications sources. Shannon defers the details to the appendix but states that if a state diagram is known
along with its transition probabilities, the channel capacity can be calculated.
The next topic covered is the discrete information source. Here statistical knowledge of the source is used to
aid in reducing the required capacity of the channel. For example, accounting for the frequency of occurrence of each letter
in the English language reduces the required channel capacity from 4.7549 bits/symbol to
4.1811 bits/symbol. Further reductions occur by restricting the allowable 2 letter or 3 letter sequences.
Even greater reduction is achieved using word repetition rates and valid sequences of words. By using allowable word
sequences, English text contains about 1.5 bits/symbol.
Shannon next talks about using Markov models for discrete sources and describes these models as ergodic processes. He
loosely states that each sequence generated by these processes will have the same statistics.
In section 6 of the paper, entitled Choice, Uncertainty and Entropy, Shannon introduces the definition of entropy as a
search for a measure of how much “choice” is involved in selecting one of many possible outcomes. He labels the “choice” $H$, and
lays out three properties for $H$:
1. $H$ should be continuous in the $p_i$.
2. If all $p_i$ are equal, and there are n possibilities, then $H$ should be a monotonically increasing function of n, as
there are more choices with increasing n.
3. If a choice is broken down into successive choices, $H$ should be the weighted sum of the individual choices.
The only function that meets these three properties is
$$H = -K \sum_i p_i \log p_i$$
Shannon sets K=1, and calls H entropy, the average number of bits of information per symbol.
When an outcome is known, p = 1 and H = 0:
$$H = -\left(0 \log_2 0 + 1 \log_2 1\right) = 0 \text{ bits/symbol}$$
Here $\log_2 0 = -\infty$, but the product $0 \log_2 0$ is treated as 0.
When all outcomes are equally likely independent events, the entropy is maximized. For the binary case:
$$H = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1 \text{ bit/symbol}$$
The entropy of a joint event is:
$$H(x, y) = -\sum_{i,j} p(i, j) \log p(i, j)$$
An inequality for joint entropy follows from the entropies of the individual events:
$$H(x) = -\sum_{i,j} p(i, j) \log \sum_j p(i, j)$$
$$H(y) = -\sum_{i,j} p(i, j) \log \sum_i p(i, j)$$
$$H(x, y) \le H(x) + H(y)$$
The equality holds when both events are independent, when
$$p(i, j) = p(i)\,p(j)$$
Finally, conditional entropy is defined based on the following idea. For the joint event x and y, given that x
assumes a value i, there is a conditional probability that y assumes a value j, given by
$$p_i(j) = \frac{p(i, j)}{\sum_j p(i, j)}$$
Conditional entropy is defined as
$$H_x(y) = -\sum_{i,j} p(i, j) \log p_i(j)$$
This is the average uncertainty about y that remains once x is known.
Substituting the value of $p_i(j)$,
$$H_x(y) = -\sum_{i,j} p(i, j) \log \left( \frac{p(i, j)}{\sum_j p(i, j)} \right)$$
$$H_x(y) = -\sum_{i,j} p(i, j) \log p(i, j) + \sum_{i,j} p(i, j) \log \sum_j p(i, j)$$
$$H_x(y) = H(x, y) - H(x)$$
$$H(x, y) = H(x) + H_x(y)$$
$$H(x) + H(y) \ge H(x, y) = H(x) + H_x(y)$$
Hence,
$$H(y) \ge H_x(y)$$
From this Shannon concludes that the uncertainty of y is never increased by knowledge of x; it can only be
decreased, or remain equal when x and y are independent.
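The relations between joint, marginal, and conditional entropy can be verified numerically. The Python sketch below uses a hypothetical 2 by 2 joint distribution chosen only for illustration, with x indexing the rows and y indexing the columns.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(i, j): rows index x, columns index y.
joint = [[0.4, 0.1],
         [0.2, 0.3]]

p_x = [sum(row) for row in joint]
p_y = [sum(joint[i][j] for i in range(len(joint))) for j in range(len(joint[0]))]

H_x = entropy(p_x)
H_y = entropy(p_y)
H_xy = entropy([p for row in joint for p in row])

# Conditional entropy of y given x, computed directly from its definition.
H_y_given_x = -sum(joint[i][j] * math.log2(joint[i][j] / p_x[i])
                   for i in range(len(joint))
                   for j in range(len(joint[0]))
                   if joint[i][j] > 0)
```

For this distribution the joint entropy equals the entropy of x plus the conditional entropy of y given x, and the conditional entropy of y is smaller than its unconditional entropy because x and y are not independent.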
In the next section Shannon talks about the entropy of a source. The source entropy is maximized when all symbols are
equally likely. The ratio between the actual entropy and the maximum entropy is called the relative entropy. This is the
maximum compression possible. The redundancy is 1 minus the relative entropy. The redundancy of English, considering
statistical structure over spans of up to about eight
letters, is about 50%, so half the letters are chosen freely, and half are determined by the structure of the language.
The Fundamental Theorem for a Noiseless Channel is introduced in the next section. Here the channel capacity C is
defined in bits per second. The entropy H is in units of bits per symbol. The theorem states that communication is possible at
rates up to $\frac{C}{H} - \epsilon$ symbols per second, where $\epsilon$ is an arbitrarily small number. For a simple example, if H = 5 bits/symbol
and the capacity is 5 bits per second, then communication is possible at up to 1 symbol per second.
Part II of the paper is called The Discrete Channel With Noise. The first section is called Equivocation and Channel
Capacity. Here Shannon discusses a noisy communication channel that is making 1% errors. The source is sending 1’s
and 0’s with equal probability at a rate of 1000 symbols per second. Since the source produces symbols with equal probability, the source
rate is 1000 bits per second. The expected rate out of the receiver is not 990 bits/second. Likewise, if the channel is
making 50% errors, the received rate is not 500 bits/second. The received rate is
$$R = H(x) - H_y(x)$$
Here Shannon refers to $H_y(x)$ as the equivocation of the channel. If a 0 is received, the a posteriori probability that a 0
was sent is .99, and the probability that a 1 was sent is .01. The situation is similar if a 1 is received.
The equivocation for 1% errors is
$$H_y(x) = -\left[0.99 \log_2 0.99 + 0.01 \log_2 0.01\right] = 0.081 \text{ bits/symbol}$$
$$R = (1 - 0.081) \times 1000 = 919 \text{ bits/sec}$$
For 50% errors the equivocation is
$$H_y(x) = -\left[0.5 \log_2 0.5 + 0.5 \log_2 0.5\right] = 1 \text{ bit/symbol}$$
$$R = (1 - 1) \times 1000 = 0 \text{ bits/sec}$$
It is tempting to assume that a channel making 50% errors still delivers 500 bits/second, since half of the received symbols
are correct by chance. However, from the mathematical definition of
conditional entropy, there is 1 bit of equivocation, and no information is received.
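Both cases can be reproduced with a few lines of Python. The sketch assumes equally likely binary symbols and a symmetric channel, so that the equivocation equals the binary entropy of the error probability; the function names are illustrative.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary event with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def received_rate(error_prob, source_bits_per_second=1000.0):
    """R = H(x) - H_y(x) for equally likely binary symbols on a symmetric channel."""
    return (1.0 - binary_entropy(error_prob)) * source_bits_per_second

equivocation_1pct = binary_entropy(0.01)   # about 0.081 bits/symbol
rate_1pct = received_rate(0.01)            # about 919 bits/sec
rate_50pct = received_rate(0.5)            # 0 bits/sec: nothing gets through
```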
Shannon’s fundamental theorem of a discrete channel with noise can be summed up by
$$C = \max\left[H(x) - H_y(x)\right]$$
The max argument corresponds to the best source encoder for a particular source. Since capacity is a positive number in
terms of bits/symbol, capacity is the entropy of the source minus the equivocation of the channel. Shannon goes on to
say that as long as the equivocation is less than the entropy of the source, efficient codes exist that make communication
possible.
While Shannon invented the field of information theory for communication systems, it was quickly adopted by
researchers and scientists studying principles of human perception. Based on the work of Shannon, it was soon
hypothesized that “efficient coding” takes place in the peripheral auditory and visual systems in order to provide
maximum information content to the auditory and visual cortexes of the brain.
2.3 H.B. Barlow 1961 Possible Principles underlying the Transformations of Sensory messages Chapter 13 In:
Sensory Communication, W. Rosenblith (Ed.), M.I.T. Press, pp. 217-234.
Barlow studied how frog neurons respond to visual stimuli. He was interested in the hunting or snapping response, as
well as the escape response from artificially generated visual inputs. Barlow was one of the first to apply information
theory to sensory information in the nervous systems of animals and reptiles. His fundamental theory is that redundancy
is removed from visual sensory input before it is sent on the neural pathways to the brain. His hypothesis has three
assertions:
1. Incoming messages have certain “passwords” that carry key significance to the animal.
2. There are filters or recoders, whose pass characteristics may be controlled with requirements from other parts
of the nervous system.
3. “They recode sensory messages, extracting high relative entropy from highly redundant sensory input.”
Barlow’s “password” idea came from studies of frogs in which a snapping, or hunting response was elicited based on
visual stimulation. The idea of feedback came from studies of cat retinas and how they react to visual stimulation.
Barlow’s third assertion is later explained as stripping away redundancy to produce “economy of thought” in order to
bring simplicity and order to complex sensory input.
Barlow goes on to model the recoders with the following simplifying assumptions:
1. “Sensory pathways are treated as noiseless systems using discrete signals.”
2. “The discrete signals are single impulses in particular nerve fibers in particular time intervals. For any one fiber,
and time interval, the impulse is present or absent, so the code is binary.”
3. “The constraints on the capacity of the nerve pathway are the number of fibers F, the number of discrete time
intervals R, and average number of impulses per second per fiber I. The average number of impulses per fiber is
assumed to be a variable constraint.”
Barlow then talks about intrinsic noise that is present on the neural pathways, and then proceeds to ignore it, as it
might obscure his fundamental point.
Barlow then talks about the capacity of a neural pathway in terms of 10 nerve fibers carrying binary codes over an
interval of 1/10 of a second. He assumes all messages are independent and mutually exclusive so that the average
information or entropy is
$$H_{av} = -\sum_m p_m \log p_m$$
For a given symbol time T, the information flow per unit time is
$$\frac{H_{av}}{T}$$
The information rate for a fiber bundle is given as
$$C = -F R \left[ \frac{I}{R} \log \frac{I}{R} + \left(1 - \frac{I}{R}\right) \log\left(1 - \frac{I}{R}\right) \right]$$
The relative entropy is
$$\frac{H_{av}}{T C}$$
And the redundancy is
$$1 - \frac{H_{av}}{T C}$$
Barlow’s main point in this discussion is that the neural pathway is of limited capacity C. Therefore, efficient
codes must exist to remove the redundancy from the input “messages”.
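The capacity of the fiber bundle can be evaluated numerically. The Python sketch below assumes the capacity expression reads C = -F R [(I/R) log2(I/R) + (1 - I/R) log2(1 - I/R)], with F fibers, R time intervals per second, and I impulses per second per fiber, as defined in the simplifying assumptions above. The base-2 logarithm, the impulse rates, and the information flow of 50 bits/sec are assumed values for illustration.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a binary event with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1.0 - q) * math.log2(1.0 - q))

def pathway_capacity(fibers, intervals_per_second, impulses_per_second):
    """C = F * R * h(I / R): each fiber carries one binary choice per interval."""
    q = impulses_per_second / intervals_per_second
    return fibers * intervals_per_second * binary_entropy(q)

def redundancy(info_flow, capacity):
    """One minus the ratio of the actual information flow to the capacity."""
    return 1.0 - info_flow / capacity

# 10 fibers with 10 intervals per second (a symbol time of 1/10 of a second).
capacity_dense = pathway_capacity(10, 10, 5)    # impulse in half of the intervals
capacity_sparse = pathway_capacity(10, 10, 1)   # impulse in one tenth of them

# Hypothetical information flow of 50 bits/sec on the dense pathway.
example_redundancy = redundancy(50.0, capacity_dense)
```

Under these assumptions the capacity is largest when impulses occupy half of the intervals and falls as the impulse rate is constrained, which is why the average impulse rate acts as a constraint on the code.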
Barlow then goes on to discuss simple redundancy reducing codes. The purpose of these redundancy codes is for an
“economy of impulses” or for “emphasizing the unusual”. From this discussion, Barlow predicted the following for the
neural impulse code model.
Usual events should be represented by a decrease in number of impulses.
Codes should exist in accordance with the probabilities of the events encountered.
Codes should respond to complex features of the inputs, not to properties that are simple in physical or anatomical terms.
Barlow further assumes that redundancy reduction takes place in several stages. Redundancy reduction is a process of
subdivision. Higher levels of redundancy reduction should remove more complex forms of redundancy. To emphasize the need
for redundancy reduction, Barlow points out that the number of cells in the cortical region of the brain far outnumbers
the number of cells in the nerve channel. From this he concludes that an enormous reduction in the number of impulses
“seems to be possible” without any loss of information.
Barlow summarizes by saying the paper is not based on physiological models, and that these are hypotheses only. From
this he justifies the “password” hypothesis based on laboratory experiments in which animals respond to stimuli. The
control hypothesis also comes from laboratory observations of animals’ reactions to stimuli. Barlow devotes most of the
paper to redundancy reduction on the neural pathways, and concludes with the following statement:
“To strip the redundancy from the previous pages, what I have said is this: it is foolish to investigate sensory
mechanisms blindly- one must also look at the ways in which animals make use of their senses. It would be surprising if
the use to which they are put was not reflected in the design of the sense organs and their nervous pathways-”
Barlow doesn’t formulate a model for the reduction in redundancy, other than to show a simple example of 10 nerve
fibers. These fibers use a binary on/off code, with a symbol time of 1/10 of a second. He does use Shannon’s theory to
show that, under the simplifying assumptions, the information capacity of the nerve fibers is limited.
2.4 Atick, Joseph J., 1992 Could information theory provide an ecological theory of sensory processing? Network:
Computation in Neural Systems 22.1-4 (2011): 4-44.
Atick studied the ganglion cells of the retina in detail. He develops one of the first neural networks that model the
behavior of the ganglion cells. Atick discusses the statistical compositions of natural images and demonstrates that the
functions of the simple cells in the primary visual cortex capture these same statistics. Atick expounds on the theory of
minimum entropy, or factorial, codes that minimize the dependencies between symbols. Consider a message $w$ composed of $l$ symbols $c_1, \ldots, c_l$, each taking one of $N$ values. The probability of a message is
$$P(w) = P(c_1, \ldots, c_l)$$
If the symbols are independent, then the probability of a message is simply
$$P(w) = P(c_1) P(c_2) \cdots P(c_l)$$
The entropy $H(w)$ can then be written as the sum over the symbol (or pixel) entropies $H(c_i)$ as:
$$H(w) = -\sum_{i=1}^{l} \sum_{m=1}^{N} P(c_i = m) \log_2 P(c_i = m) \equiv \sum_{i=1}^{l} H(c_i)$$
Atick points out that in general the symbols are not statistically independent, and so the total entropy does not equal
the sum of the individual pixel entropies. In general the message entropy is:
$$H(w) \le \sum_{i=1}^{l} H(c_i)$$
If all symbols in a message are equally likely, $P(c_i = m) = 1/N$ for all $m$, the maximum entropy is achieved, which is
denoted as the capacity $C$:
$$\max\{H(w)\} = \max\left\{\sum_{i=1}^{l} H(c_i)\right\} = l \log_2 N \equiv C$$
The redundancy is given as:
$$R = 1 - \frac{H(w)}{C}$$
Atick states there are two contributions to redundancy:
$$R = \frac{1}{C}\left(C - \sum_{i=1}^{l} H(c_i)\right) + \frac{1}{C}\left(\sum_{i=1}^{l} H(c_i) - H(w)\right)$$
The first term is the redundancy due to unequal symbol probabilities. This term goes to zero if all the symbols are
equally likely. The second term is used to describe redundancy due to joint entropies of statistically correlated symbols.
If all the symbols are independent the second term vanishes. Codes that reduce or eliminate redundancy are called
minimum redundancy codes. This is the approach taken for communications systems where the goal is to operate as
close as possible to the channel capacity. Codes that minimize the difference, $\sum_{i} H(c_i) - H(w)$, are called minimum
entropy codes. The goal of minimum entropy codes is to eliminate the statistical dependencies between symbols at the
expense of decreasing the entropy of the individual symbol probabilities. The goal is to express the joint probability of
the symbols as a product of independent symbol probabilities, resulting in “factorial” codes. This can only be done by
tolerating some redundancy in symbol probabilities. For neural coding, Atick notes that some redundancy adds
robustness in the presence of noise, and can therefore be tolerated.
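The two contributions to redundancy can be checked numerically on a toy message. The sketch below assumes a message of two binary "pixels" with a hypothetical joint distribution; the function names and the numbers are illustrative and do not come from Atick's paper.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def redundancy_terms(joint, n_values):
    """Split R = 1 - H(w)/C into its two contributions.

    joint maps each message, a tuple (c_1, ..., c_l), to its probability.
    """
    l = len(next(iter(joint)))
    capacity = l * math.log2(n_values)
    h_message = entropy(joint.values())
    h_symbols = 0.0
    for i in range(l):
        marginal = {}
        for message, p in joint.items():
            marginal[message[i]] = marginal.get(message[i], 0.0) + p
        h_symbols += entropy(marginal.values())
    unequal_probabilities = (capacity - h_symbols) / capacity
    symbol_dependencies = (h_symbols - h_message) / capacity
    return unequal_probabilities, symbol_dependencies, 1.0 - h_message / capacity

# Hypothetical joint distribution: two binary pixels that tend to agree.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
term1, term2, total = redundancy_terms(joint, n_values=2)
print(f"unequal probabilities: {term1:.3f}, dependencies: {term2:.3f}, total: {total:.3f}")
```

In this example each pixel is equally likely to be 0 or 1, so the first term vanishes and all of the redundancy comes from the correlation between the pixels, which is the part a factorial code aims to remove.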
2.5 B. Olshausen and D. Field 1996 Emergence of simple-cell receptive field properties by learning a sparse code
for natural images Nature vol. 381, pp. 607-609
Olshausen and Field build upon the work of Atick and develop minimum entropy codes that capture the underlying
statistics of natural images. Moreover, these codes behave in the same manner as the receptive fields of the primary
visual cortex.
Olshausen and Field have extensive knowledge of the receptive fields in simple cells in the primary visual cortex of
mammals. They describe these fields as localized, oriented, and bandpass, comparable to the basis functions of wavelets.
They develop a sparse basis vector encoding scheme that captures the properties of localized, oriented and bandpass.
This paper introduces the SAE as a method of extracting a sparse basis vector representation of natural images. The
autoencoder performs in a manner similar to ICA in that it learns a sparse basis vector representation with high kurtosis.
Additionally, the features extracted from the SAE look very similar to the “edges” learned by ICA.
An image may be constructed from a weighted linear summation of basis vectors.
$$I(x, y) = \sum_{i} a_i \phi_i(x, y)$$
The goal of image encoding is to find a statistically independent set of basis vectors $\phi_i$ that form a complete code that
spans a set of natural images. The notion of statistical independence here is the same as the “minimum entropy” or factorial
codes of Barlow (1989) and Atick (1992).
A discussion of Principal Component Analysis (PCA) concludes that this method is not consistent with the
neurobiology. PCA produces an orthonormal representation of the images that captures the directions of maximum
variance. The pairwise data points are decorrelated, $\langle a_i a_j \rangle = \langle a_i \rangle \langle a_j \rangle$. However, the data that results from PCA do not
match any known receptive fields, as the data is not localized. Furthermore, natural images containing curved edges do
not render well with orthogonal components. In short, PCA does not capture the statistical structure of the data.
Olshausen goes on to say the statistical structure of images may be seen when the joint entropy is less than the sum of
the individual entropies, $H(a_1, a_2, \ldots, a_n) < \sum_i H(a_i)$. The strategy of reducing the individual entropies is suggested as a
means to gain statistical independence. This is attributed to Barlow’s (1989) minimum entropy code. It is then
conjectured that natural images have “sparse structure” and that any image may be made with a small number of basis
vectors selected from a large dictionary.
The cost function for the sparse code is introduced as:
$$E = -[\text{preserve information}] - \lambda\,[\text{sparseness of } a_i]$$
Where,
$$[\text{preserve information}] = -\sum_{x,y} \left[ I(x, y) - \sum_i a_i \phi_i(x, y) \right]^2$$
and,
$$[\text{sparseness of } a_i] = -\sum_i S\left(\frac{a_i}{\sigma}\right)$$
Where $\sigma$ is a scaling factor. Several different penalty functions $S(x)$ were tried through experimentation, $-e^{-x^2}$, $\log(1 + x^2)$, and $|x|$, and all produced similar results.
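The cost function E is minimized by gradient descent averaged over a number of images. The derivatives of $a_i$ are
found from:
$$\dot{a}_i = \sum_{x,y} \phi_i(x, y) I(x, y) - \sum_j \left[ \sum_{x,y} \phi_i(x, y) \phi_j(x, y) \right] a_j - \frac{\lambda}{\sigma} S'\left(\frac{a_i}{\sigma}\right)$$
The learning rule for updating $\phi_i$ is then:
$$\Delta \phi_i(x_m, y_n) = \eta \left\langle a_i \left[ I(x_m, y_n) - \sum_j a_j \phi_j(x_m, y_n) \right] \right\rangle$$
Here $\eta$ is a learning rate parameter, and $\sum_j a_j \phi_j(x_m, y_n)$ is the reconstructed image.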
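The two update rules can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation: it assumes the penalty S(x) = log(1 + x^2), a single flattened 4-pixel "image", two hypothetical basis vectors, and plain Euler steps for the coefficient dynamics. The averaging over many images in the learning rule is omitted.

```python
import math

def reconstruct(a, phi):
    """Reconstructed image sum_i a_i * phi_i, with pixels flattened into a list."""
    return [sum(a[i] * phi[i][p] for i in range(len(a))) for p in range(len(phi[0]))]

def squared_error(image, a, phi):
    """Sum of squared differences between the image and its reconstruction."""
    return sum((x - r) ** 2 for x, r in zip(image, reconstruct(a, phi)))

def update_coefficients(image, a, phi, lam, sigma, step):
    """One Euler step of da_i/dt = b_i - sum_j C_ij a_j - (lam/sigma) S'(a_i/sigma)."""
    new_a = []
    for i in range(len(a)):
        b_i = sum(pi * x for pi, x in zip(phi[i], image))
        c_a = sum(sum(pi * pj for pi, pj in zip(phi[i], phi[j])) * a[j]
                  for j in range(len(a)))
        u = a[i] / sigma
        s_prime = 2.0 * u / (1.0 + u * u)   # derivative of S(u) = log(1 + u^2)
        new_a.append(a[i] + step * (b_i - c_a - (lam / sigma) * s_prime))
    return new_a

def update_basis(image, a, phi, eta):
    """Delta phi_i = eta * a_i * (image - reconstruction), for a single image."""
    residual = [x - r for x, r in zip(image, reconstruct(a, phi))]
    return [[v + eta * a[i] * e for v, e in zip(phi[i], residual)]
            for i in range(len(a))]

# Hypothetical toy data: a 4-pixel image and two unit-norm basis vectors.
image = [1.0, 0.5, -0.5, -1.0]
phi = [[0.5, 0.5, -0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]
a = [0.0, 0.0]
for _ in range(50):
    a = update_coefficients(image, a, phi, lam=0.1, sigma=1.0, step=0.1)
phi_next = update_basis(image, a, phi, eta=0.01)
print("coefficients:", [round(v, 3) for v in a])
print("error before/after basis update:",
      round(squared_error(image, a, phi), 4), round(squared_error(image, a, phi_next), 4))
```

The sparseness term pulls each coefficient slightly toward zero, so the coefficients settle just short of the exact least-squares fit, and the basis update then moves the basis vectors to absorb part of the remaining residual.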
The magic of this algorithm is in the “sparsification” penalty term, which punishes individual feature vectors for
overuse, while at the same time finding the best set of $\phi_i$’s that minimize the reconstruction error. This penalty strategy
does not directly enforce a policy of equally likely feature vectors. However, indirectly it does, by forcing the “usage
percentage” for any one feature vector below a threshold. If the threshold is low enough, and very uncommon feature
vectors are not used, the overall population of feature vectors is “herded” into a small usage percentage window.
The algorithm allows for an overcomplete representation, as the number of feature vectors can be greater than the size
of the input. The vectors are non-orthogonal, which allows for a richer representation of complex images from simple
feature vectors. In short, the feature vectors capture the underlying statistical structure of the images.
This paper concludes that the basis vectors are very similar to the receptive fields. Additionally, the entropy after
training is 4.0 bits as compared to 4.6 bits before training, and the Kurtosis has increased from 7 to 20, for an image
reconstruction error that is 10% of the variance of the original image.
It is pointed out that other cost functions and optimizations produce similar results. This will be seen later when the
sparsity penalty is based on the Kullback-Leibler divergence. The Kullback-Leibler penalty term directly forces equally
likely feature vector usage. Additionally, a different cost function will be used that minimizes the mean square error
between the input and the reconstructed output. The training will be based on back-propagation, so the gradients will
also be different. However, this approach will be shown to produce indistinguishable results from the feature vectors
learned here. This paper does not address audio processing. However, it is a landmark paper that ties the theory of
minimum entropy codes to the neurobiology of optical cortical fields. Moreover, it presents the first unsupervised
learning algorithm that finds sparse basis vector representations of natural images consistent with the known properties of
simple cells in the primary visual cortex of mammals.
2.6 E.Smith and M.Lewicki 2006 Efficient Auditory Coding Nature vol. 439|23
Smith and Lewicki introduce an audio encoding algorithm that closely matches the physiology of auditory nerve fibers
in mammals. RevCor, or reverse correlation filters are the linear equivalent of the impulse response of auditory nerve
fibers. The representation of speech is based on time and amplitude weighting from a set of kernel functions. The audio
stream may be represented by:
$$x(t) = \sum_{m} \sum_{i} s_i^m \phi_m(t - \tau_i^m) + \epsilon(t)$$
Where $\tau_i^m$ and $s_i^m$ are the temporal position and amplitude scaling of the ith component of the mth kernel $\phi_m$. The error
term $\epsilon(t)$ is the residual, or the difference between the original audio waveform and the coded signal. The “matching
pursuit” algorithm performs two important functions:
It is used to find the amplitude and time positions, $s_i^m$ and $\tau_i^m$.
The kernel function is updated.
The matching pursuit algorithm iteratively decomposes the audio signal by projecting the input data onto a set of kernel
functions. The kernel with the largest inner product is subtracted from the input vector, and its time and amplitude are
recorded. The algorithm is halted when the amplitudes of the kernels fall below a threshold. The paper says it halted
when $s_i^m$ fell below 0.1, using TIMIT data.
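The decomposition loop described above can be sketched as follows. This is a minimal brute-force illustration that assumes unit-norm kernels and an exhaustive search over all shifts; the kernels, the test signal and the function name are hypothetical, and the kernel learning step is not included.

```python
def matching_pursuit(signal, kernels, threshold, max_iter=100):
    """Greedy decomposition of a signal into shifted, scaled unit-norm kernels.

    Returns a list of events (kernel index m, shift tau, amplitude s) and the residual.
    """
    residual = list(signal)
    events = []
    for _ in range(max_iter):
        best = None
        for m, kernel in enumerate(kernels):
            for tau in range(len(residual) - len(kernel) + 1):
                # Inner product of the kernel with the residual at this shift.
                s = sum(residual[tau + k] * kernel[k] for k in range(len(kernel)))
                if best is None or abs(s) > abs(best[2]):
                    best = (m, tau, s)
        # Halt once the largest remaining amplitude falls below the threshold.
        if best is None or abs(best[2]) < threshold:
            break
        m, tau, s = best
        for k, value in enumerate(kernels[m]):
            residual[tau + k] -= s * value
        events.append(best)
    return events, residual

# Hypothetical test signal: two unit-norm kernels placed at shifts 3 and 7.
k1 = [0.5, 0.5, -0.5, -0.5]
k2 = [0.5, -0.5, 0.5, -0.5]
signal = [0.0] * 12
for k in range(4):
    signal[3 + k] += 2.0 * k1[k]
    signal[7 + k] += -1.0 * k2[k]
events, residual = matching_pursuit(signal, [k1, k2], threshold=0.1)
print(events)
```

Each event records which kernel matched, where it matched and with what amplitude, which corresponds to the triplet (m, tau, s) in the signal model above.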
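The equation for $x(t)$ above can be rewritten as
$$p(x|\Phi) = \int p(x|\Phi, s)\, p(s)\, ds \approx p(x|\Phi, \hat{s})\, p(\hat{s})$$
Here $\hat{s}$, an approximation to the posterior maximum, comes from the set of coefficients generated by matching pursuit. It is
assumed that the noise in $p(x|\Phi, \hat{s})$ is Gaussian, and that $p(\hat{s})$ is sparse. The kernel function is optimized by doing
gradient ascent on the approximate log data probability,
$$\frac{\partial}{\partial \phi_m} \log p(x|\Phi) = \frac{\partial}{\partial \phi_m} \left[ \log p(x|\Phi, \hat{s}) + \log p(\hat{s}) \right] = -\frac{1}{2\sigma_\epsilon^2} \frac{\partial}{\partial \phi_m} \left[ x - \sum_{m} \sum_{i} \hat{s}_i^m \phi_m(t - \tau_i^m) \right]^2 = \frac{1}{\sigma_\epsilon^2} \sum_{i} \hat{s}_i^m \left[ x - \hat{x} \right]_{\tau_i^m}$$
Here $[x - \hat{x}]_{\tau_i^m}$ is the residual over the extent of kernel $\phi_m$ at position $\tau_i^m$. The update equation is just the weighted average of the
residual error.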
The quantized amplitudes were treated as samples from a random variable. The entropy was estimated from histograms of the
quantized values. This method produces the number of bits/symbol. The rate is then the number of code
words per second times the number of bits/symbol.
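The rate calculation can be illustrated with a short sketch. It assumes the quantized amplitudes are available as a list of discrete values; the function names and the example data are hypothetical.

```python
import math
from collections import Counter

def bits_per_symbol(quantized):
    """Entropy estimate, in bits/symbol, from the histogram of quantized values."""
    counts = Counter(quantized)
    n = len(quantized)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def code_rate(quantized, duration_sec):
    """Rate in bits per second: code words per second times bits per symbol."""
    return (len(quantized) / duration_sec) * bits_per_symbol(quantized)

# Hypothetical example: 100 code words in one second, drawn evenly from 4 levels.
amplitudes = [0, 1, 2, 3] * 25
print(bits_per_symbol(amplitudes), code_rate(amplitudes, duration_sec=1.0))
```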
The algorithm was run on two categories of natural sounds as well as speech. The natural sounds were divided into transient
sounds, such as cracking twigs and crunching leaves, and ambient sounds, such as rain,
rushing water and wind, or rustling sounds.
The coding efficiency is defined as the number of codes required to reach an arbitrary level of fidelity. The learned
kernels and gammatone wavelets are shown to be much more efficient than FFTs or Daubechies wavelets
for SNRs below 30 dB. At 15 dB SNR, the learned kernels have a signal rate of 8 kbps, while the wavelets and FFTs
are operating at 30 kbps. Therefore, the coding efficiency is more than three times greater than that of the wavelets or FFTs.
The matching pursuit algorithm is complicated in that it allows the kernel functions to grow and shrink in length
over time. This allows a set of kernel functions to emerge that behave like gammatone wavelets, whereby higher
frequency kernels have shorter lengths, and lower frequency kernels have longer responses. The paper claims that kernel
functions that are initialized with Gaussian noise look remarkably similar to cochlear impulse responses in the peripheral
audio pathways of mammals.
2.7 Rosso O.A., Martin M.T., Figliola A., Keller K., Plastino A. (2006). EEG Analysis using wavelet-based
information tools. Journal of Neuroscience Methods 153 163-182
This paper uses wavelet-based information tools in order to predict epileptic seizures from EEG data. The general
concept of order and randomness of signals is discussed and Shannon’s entropy is given as a measure of the statistical
complexity of a signal. Spectral Entropy, which is based on the entropy of the short-time Fourier transform (STFT) is
shown to be inadequate to capture the time evolution of EEG signals, due to the “windowing” problem. Another problem
with FFT based methods is that EEG signals are not stationary, and FFT’s must be averaged; or have the autocorrelation
taken in advance, in order to represent the power spectral density of the signal faithfully. The solution is to use the
orthogonal discrete wavelet transform, or ODWT.
The ODWT follows the time evolution of frequency patterns with optimal time-frequency resolution and makes no
assumptions about the stationarity of the signal. The ensuing entropy from the wavelet transform is called “Shannon’s
wavelet entropy”. An in-depth discussion of different measures of statistical complexity based on entropy models from
Tsallis (TWS), escort-Tsallis (GWS), and Renyi (RWS) is presented.
The wavelet model is reviewed and is based on a quickly vanishing oscillating mother kernel of the following form:
$$\psi_{a,b}(t) = |a|^{-1/2}\, \psi\left(\frac{t - b}{a}\right)$$
Where $a, b$ are scale and translation parameters that allow stretching and time translation of the mother
kernel, $\psi$. The continuous wavelet transform (CWT) is given as:
$$(W_\psi S)(a, b) = |a|^{-1/2} \int_{-\infty}^{\infty} S(t)\, \psi^{*}\left(\frac{t - b}{a}\right) dt = \langle S, \psi_{a,b} \rangle$$
The CWT can be inverted to reproduce the input signal. The CWT produces an infinite number of coefficients, is
difficult to compute and produces a highly redundant representation of the signal. To overcome the limitations, the
discrete wavelet transform (DWT) is introduced and provides a non-redundant, highly efficient wavelet representation of
the original signal. The DWT uses an orthonormal representation based on a mother wavelet with $a_j = 2^{-j}$, $b_{j,k} = 2^{-j} k$,
with $j, k \in \mathbb{Z}$, the set of integers. The resolution levels are represented by $j$, and the kernel number by $k$. The family
$\psi_{j,k}(t) = 2^{j/2} \psi(2^j t - k)$ constitutes an orthonormal basis of $L^2(\mathbb{R})$ and produces a wavelet series $\langle S, \psi_{j,k} \rangle = C_j(k)$ with as many
coefficients as there are samples in the original signal, without the loss of any information. For a sampled data
system with equally spaced time samples and M total samples,
$$S(t) = \sum_{j=-N_j}^{-1} \sum_{k} C_j(k)\, \psi_{j,k}(t) = \sum_{j=-N_j}^{-1} r_j(t)$$
Where $N_j = \log_2(M)$. The wavelet coefficients are $C_j(k) = \langle S, \psi_{j,k} \rangle$ and the energy at each resolution level $j = -1, -2, \ldots, -N_j$ is
$$E_j = \|r_j\|^2 = \sum_k |C_j(k)|^2$$
The total energy is
$$E_{tot} = \|S\|^2 = \sum_{j<0} \sum_k |C_j(k)|^2 = \sum_{j<0} E_j$$
The relative wavelet energy (RWE) is $p_j = E_j / E_{tot}$. The sum $\sum_j p_j = 1$, and the set is represented by the probability
distribution $P^{(W)} = \{p_j\}$. The Shannon wavelet entropy is given as:
$$S_{WS}[P^{(W)}] = -\sum_{j<0} p_j \ln(p_j)$$
For a completely random waveform the frequency distribution will be uniform, and the RWE will be equal at all
resolution levels. In this case the entropy is maximized for all representations of entropy: Tsallis and escort-Tsallis (TWS
and GWS), and Renyi (RWS).
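The relative wavelet energy and the Shannon wavelet entropy are simple to compute once the DWT coefficients are grouped by resolution level. The sketch below assumes the coefficients have already been computed by some wavelet transform and are passed in as one list per level; the function names and the example values are hypothetical.

```python
import math

def relative_wavelet_energy(coeffs_by_level):
    """p_j = E_j / E_tot, where E_j is the sum of squared coefficients at level j."""
    energies = [sum(c * c for c in level) for level in coeffs_by_level]
    total = sum(energies)
    return [e / total for e in energies]

def shannon_wavelet_entropy(coeffs_by_level):
    """-sum_j p_j ln(p_j) over the resolution levels (natural logarithm)."""
    return -sum(p * math.log(p)
                for p in relative_wavelet_energy(coeffs_by_level) if p > 0)

# Hypothetical coefficients: equal energy at all four levels versus a single active level.
uniform = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
narrowband = [[2.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(shannon_wavelet_entropy(uniform), shannon_wavelet_entropy(narrowband))
```

With equal energy at every level the entropy reaches its maximum, ln(4) for four levels, while a signal concentrated in one level gives zero entropy.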
This paper uses the wavelet transform and wavelet entropy to follow the frequency-time series of EEG’s for patients
undergoing epileptic seizures. The central concepts of efficient coding of time-frequency non-stationary signals and their
corresponding entropy are uniquely tied together via an orthonormal DWT representation of the signal.
The basic concept of the wavelet will be used for a non-orthogonal representation of speech, based on gammatones as
the kernel function. Additionally, the concept of entropy established will serve as the basis for entropy calculation of the
efficient encoding process.
2.8 McLeod, S. A. (2008). Selective Attention. Retrieved from http://www.simplypsychology.org/attention-models.html
Theories of auditory cognition from a psychological perspective are discussed and the models proposed by Broadbent
and Treisman are reviewed. The models deal with selective attention and describe it in terms of an information
bottleneck in which the brain cannot simultaneously process all the sensory inputs it is receiving. The brain is only able
to concentrate on one task at a time.
Broadbent conducted “dichotic” listening experiments in which two different messages are sent simultaneously to each
ear. The subject can only focus attention on one message at a time. Broadbent sent 3 digit numbers to each ear
simultaneously, and subjects reported better results by interleaving left and right ears, rather than reporting just one ear.
Broadbent developed his model based on these experiments. Broadbent believed that all sensory information enters a
sensory buffer. Inputs to the buffer are filtered based on physical characteristics, and only select information is allowed
to pass to the output. Inputs in the buffer that are not passed quickly decay. Broadbent believed that non-shadowed or
unattended messages are removed at an early stage in processing. Semantic processing occurs in Broadbent’s model only
after the physical filtering occurs. Criticism of Broadbent’s model is based on being able to hear your name when not
paying attention in the cocktail party scenario.
Treisman modified Broadbent’s model with an attenuation theory, which states the unattended messages are attenuated,
not lost in the short term memory buffer. The messages in the short term buffer are processed based on physical
characteristics, syllabic patterns and individual words. Treisman postulated that a dictionary of words exists with different
triggering thresholds. Words like ‘fire’ and ‘help’ and your name have low attenuation thresholds, and are thus allowed
to pass the filter stage. Treisman’s work is criticized because it does not address how the semantic processing takes
place, nor has the attenuation model ever been validated. Deutsch and Deutsch theorized that the processing of key
words takes place ahead of the physical characteristic filter.
3. PROPOSAL
Ernst Mach believed that “the eye has a mind of its own”. In a similar fashion, the ear must also have a mind of its own.
Another way to say it is the ear only passes along messages that the brain is interested in. The models of Broadbent,
Treisman, and Deutsch and Deutsch suggest that the preprocessing stages in auditory cognition are carried out by a physical
characteristic filter.
The idea of Deutsch and Deutsch that a dictionary of key words bypasses the physical characteristic filter does not make
sense. What is more likely is that the ear encodes key words into code patterns that the brain prioritizes. As Ernst Mach
pointed out, relationships between stimuli are what we respond to. The physical characteristics, such as best correlations
to the “auditory spikes” of Smith and Lewicki, might be the way the ear encodes data. Furthermore, the brain might
respond to differences in combinations of the codes. The ear may have evolved, or learned, to decompose the auditory
stimuli into a summation of gammatone wavelets. It makes intuitive sense that the gammatone wavelets will cluster
around formants in vowels. If the distances between code words carry the information, then a very powerful code
emerges that represents “differences between code clusters” of physical codes. In this case the encoding for a single
vowel would be represented for all speakers as distances between codes. For example, harmonically related gammatone
wavelets are expected to cluster around formant frequencies of vowels, whereas broadband high frequency pulse
combinations should occur around plosives. Smith and Lewicki found their spiking model of auditory nerve fibers
closely matches gammatone wavelets. It follows that physical characteristic filtering in the models of Broadbent and Treisman
might also be based on gammatone wavelets.
Phase 1 of this study is to apply the SAE to natural images and to show that the SAE learns basis vectors that appear as
“edges”. This will verify that this implementation of the SAE learns “edges” similar to Olshausen and Field’s representations.
The entropy, kurtosis and reconstruction SNR are evaluated for sparsity parameter ρ = .01 and ρ = .05. It will be
shown that for low sparsity ρ = .01, the kurtosis is high, the entropy is low and the SNR of the reconstructed “image
patches” is low. Increasing the sparsity parameter to ρ = .05 will demonstrate that the basis vectors are no longer
“edges”, but look more like shaded regions. The kurtosis is reduced, the entropy goes up and the SNR is increased.
Histograms of the pixels are also shown which show the “peakedness” of the distributions follow the kurtosis.
In the second part of the Phase 1 investigation the SAE is next applied to randomly selected “voice patches” in order to
determine if high kurtosis representations have a pattern similar to “edges” for images. The “voice patches” are
displayed as images in which the time domain is mapped into an x,y grid so that the time slices appear as column
vectors. This enables viewing the time domain as images so that the structure of the basis vectors becomes apparent.
Moreover, the ability of the SAE to reconstruct the “voice patches” is easily observed in a manner similar to viewing the
reconstructed images from part 1. The kurtosis, entropy and reconstructed SNR are investigated for sparsity parameters
of ρ = .01 and ρ = .05. The basis vectors are found to appear as vertical stripes, with the width of the stripes varying
in proportion to the frequency of the speech waveform.
The Phase 2 investigation in this paper is to decompose speech into a linear superposition of wavelet functions whose
kernels are gammatone impulse responses. The kernel functions correspond to a bank of gammatone filters. The portions
of raw speech that have the best mean square error fit to gammatone wavelets are normalized and used as input vectors
to a SAE, provided they reduce the variance of the speech waveform by a detection threshold. This prevents low
amplitude gammachirp matches from being selected based on a MSE fit to the data. The normalized gammatone is then
subtracted from the speech waveform. Once the decomposition is complete for a portion of speech, the data is
reconstructed and the residual error is measured. The SAE will then encode the raw speech that was selected by a good
match to a gammatone wavelet into a new set of feature vectors. After training the output layer of the SAE is removed,
and a softmax classifier is hooked up to the hidden layer of neurons. The complete network is then trained with
backpropagation using supervised learning.
This approach solves the amplitude attenuation problem of Treisman, because the MSE fit to the kernel function is
what selects the “essence of speech”. If the MSE is below threshold, a good match is proclaimed and the raw data is
scaled up to the amplitude of the “hitting gammatone wavelets”, so that the SAE doesn’t waste resources learning many
different amplitudes of the same basic kernel. The metric for data selection is signal energy reduction by kernel
subtraction. The difficulty with the decomposition is how to initialize the algorithm, and how to subsequently choose
which gammachirp gives the next best MSE fit to the partially decomposed data. It is conjectured that this is the
process the neural pathways in the ear use to encode data, and that they have a very efficient search algorithm for
decomposing audio data into kernel MSE projections.
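The selection step can be sketched under a few stated assumptions. The code below uses the standard gammatone impulse response, t^(n-1) exp(-2 pi b t) cos(2 pi f t), sampled and scaled to unit norm, and accepts a speech segment only if subtracting the best-fitting kernel removes at least a given fraction of the segment energy. The parameter values, the function names and the particular normalization of the accepted patch are illustrative guesses, not the proposal's final design.

```python
import math

def gammatone(fc_hz, bandwidth_hz, fs_hz, n_samples, order=4):
    """Unit-norm samples of t^(n-1) exp(-2 pi b t) cos(2 pi fc t)."""
    kernel = []
    for k in range(n_samples):
        t = k / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth_hz * t)
        kernel.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    norm = math.sqrt(sum(v * v for v in kernel))
    return [v / norm for v in kernel]

def fit_kernel(segment, kernel):
    """Least-squares amplitude and the fraction of segment energy the fit removes."""
    amplitude = sum(x * g for x, g in zip(segment, kernel))   # valid for a unit-norm kernel
    energy = sum(x * x for x in segment)
    reduction = amplitude * amplitude / energy if energy > 0.0 else 0.0
    return amplitude, reduction

def select_patch(segment, kernel, min_reduction):
    """Return (normalized patch, residual) if the match passes the threshold, else None."""
    amplitude, reduction = fit_kernel(segment, kernel)
    if amplitude == 0.0 or reduction < min_reduction:
        return None
    # Assumed normalization: rescale so the matched kernel has unit amplitude.
    patch = [x / abs(amplitude) for x in segment]
    residual = [x - amplitude * g for x, g in zip(segment, kernel)]
    return patch, residual

# Hypothetical kernel: 1 kHz center frequency, 125 Hz bandwidth, 400 samples at 16 ksps.
kernel = gammatone(1000.0, 125.0, 16000.0, 400)
segment = [3.0 * g for g in kernel]
amplitude, reduction = fit_kernel(segment, kernel)
print(round(amplitude, 6), round(reduction, 6))
```

Because the acceptance test is a ratio of energies, a quiet segment that matches the kernel shape well is accepted just as readily as a loud one, and the rescaling keeps the SAE from having to learn many amplitudes of the same kernel.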
The first part of this study is to validate this concept using speaker recognition. The second part of the study is to apply
the methods of part 1 to the additive model proposed by Les Atlas, of the slow envelope and the temporal fine structure
(TFS) of human audition.
3.1 SAE design equations
The SAE is a single layer neural network that is trained with backpropagation using gradient descent [22]. The block
diagram from [22] is shown in figure 1. The targets for the backpropagation algorithm are the inputs, so the network
learns the identity function. The number of neurons in the hidden layer may be less than the number of input features,
enabling the network to compress the input data. Additionally, a sparsity parameter based on Kullback-Leibler
divergence (KLD), is introduced to the cost function, which penalizes the neurons for activity above or below a
threshold. Often, the sparsity parameter threshold is set below 10%, so the number of active neurons becomes quite low.
This leads to even higher amounts of data compression. Moreover, the KLD, sparsity forcing function directly controls
the percentage of the time, individual neurons fire. This enforces the policy of minimum entropy coding, which removes
redundancy and simultaneously forces approximately equally likely independent symbols.
Figure 1 SAE Block diagram
The neurons in the hidden layer are soft-limiter, non-linear functions. For unipolar data a sigmoid function is used:
$$f(z) = \frac{1}{1 + e^{-z}}$$
For bipolar data, a hyperbolic tangent function is used:
$$f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
The responses for the activation functions are shown in figure 2.
Figure 2 Sigmoid and tanh activations
The inputs to each neuron are the inputs, multiplied by input weights. Additionally, a bias term is added whose input always
has a unity value. Ignoring the input weight subscript, the input for neuron 1 is given as,
$$z_1 = x_1 w_{11} + x_2 w_{12} + \ldots + x_n w_{1n} + b_1$$
Likewise for $z_2$,
$$z_2 = x_1 w_{21} + x_2 w_{22} + \ldots + x_n w_{2n} + b_2$$
For the nth neuron input,
$$z_n = x_1 w_{n1} + x_2 w_{n2} + \ldots + x_n w_{nn} + b_n$$
The weights may be arranged into a row matrix with the inputs arranged in columns so that the inputs to a neuron may
be compactly written in matrix form (with the bias absorbed as a weight on the unity input) as:
$$z^{(2)} = W^{(1)} x$$
The activation at the output of the hidden layer is,
$$a^{(2)} = f(z^{(2)})$$
While the input to the output layer is,
$$z^{(3)} = W^{(2)} a^{(2)}$$
And the output is given as:
$$a^{(3)} = f(z^{(3)})$$
Note that the SAE has two non-linearity functions, one in layer two and one in the output layer. This should allow the
network to capture higher order statistics from the input data, leading to a higher kurtosis in the coded representation.
Moreover, the entropy should be lower than the original data due to the minimum entropy coding principle.
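The forward pass can be written in a few lines. The sketch below keeps the bias as an explicit vector rather than folding it into the weight matrix, and uses plain Python lists so that it is self-contained; the function names and the tiny example network are hypothetical.

```python
import math

def sigmoid(z):
    """Unipolar soft-limiter f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, bias, inputs, activation):
    """a = f(W x + b) for one layer; weights is a list of rows, one row per neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]

def forward(x, W1, b1, W2, b2, activation=sigmoid):
    """Hidden activations a2 and output a3 of a single-hidden-layer autoencoder."""
    a2 = layer(W1, b1, x, activation)
    a3 = layer(W2, b2, a2, activation)
    return a2, a3

# Hypothetical network: 2 inputs, 1 hidden neuron, 2 outputs, all weights zero.
a2, a3 = forward([1.0, -1.0], [[0.0, 0.0]], [0.0], [[0.0], [0.0]], [0.0, 0.0])
print(a2, a3)
```

Passing `math.tanh` as the `activation` argument gives the bipolar variant of the same network.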
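The cost function for the SAE is given by:
$$J(W, b) = \frac{1}{2m} \sum_{i=1}^{m} \left\| a^{(3)}(x^{(i)}) - x^{(i)} \right\|^2 + \frac{\lambda}{2} \|W\|^2 + \beta \sum_{j=1}^{s} KL(\rho \| \hat{\rho}_j)$$
The KL term penalizes the difference between a threshold or desired value $\rho$ of the average frequency of activation, and
the estimated activation rate, $\hat{\rho}_j$, for the jth out of $s$ hidden neurons. The KL penalty term is given by:
$$KL(\rho \| \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}$$
The average activation of hidden node $j$ over the $m$ inputs $x^{(i)}$ is:
$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(2)}(x^{(i)})$$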
The cost function averages the error over m input vectors using batch gradient descent. Each update epoch spans the
entire data set. Ridge regularization is used, which prevents overfitting of the data by adding the squared Euclidean norm
of the weights. A regularization parameter λ controls the amount of regularization.
The output of the KL penalty term for ρ = .2 is shown from [23] in figure 3. A sparsity parameter β controls the
amount of influence the KL penalty term has on the overall cost function.
Figure 3 KL penalty for ρ = .2
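The three terms of the cost function can be evaluated as in the following sketch. It assumes the network outputs and hidden activations have already been computed for every input, uses natural logarithms in the KL term, and stores the weights as a list of layers, each a list of rows; these data layouts and the function names are illustrative assumptions.

```python
import math

def kl_divergence(rho, rho_hat):
    """KL(rho || rho_hat) between two Bernoulli activation rates."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def mean_activation(hidden):
    """rho_hat_j: average of hidden[i][j] over the m inputs i."""
    m = len(hidden)
    return [sum(row[j] for row in hidden) / m for j in range(len(hidden[0]))]

def sparse_cost(inputs, outputs, hidden, weights, lam, beta, rho):
    """Reconstruction error + ridge weight penalty + KL sparsity penalty."""
    m = len(inputs)
    error = sum(sum((o - x) ** 2 for o, x in zip(out, inp))
                for out, inp in zip(outputs, inputs)) / (2.0 * m)
    decay = (lam / 2.0) * sum(w * w for layer in weights for row in layer for w in row)
    sparsity = beta * sum(kl_divergence(rho, r) for r in mean_activation(hidden))
    return error + decay + sparsity

# The penalty is zero when a neuron fires at the target rate and grows on either side.
for rho_hat in (0.05, 0.2, 0.5):
    print(rho_hat, round(kl_divergence(0.2, rho_hat), 4))
```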
The equations for updating the weights and biases in layer $l$ using gradient descent are given as:
$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$$
$$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b)$$
Where, α is the learning rate parameter for gradient descent. Note that regularization is applied to the weights for each
layer l, but not the bias terms. The equations for calculating the partial derivatives address a credit assignment problem,
whereby the contribution to the overall cost function for each neuron is accounted for. The derivation is shown in the
appendix.
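A single update for one layer might look like the sketch below. It assumes the backpropagated gradients of the error and sparsity terms are already available as `grad_W` and `grad_b` (hypothetical names), and adds the derivative of the ridge term, lambda times the weight, to the weights only.

```python
def gradient_step(W, b, grad_W, grad_b, alpha, lam):
    """One gradient descent update for the weights and biases of a single layer.

    grad_W and grad_b hold the gradients of the error and sparsity terms; the
    ridge penalty contributes lam * W to the weight gradient and nothing to the bias.
    """
    new_W = [[w - alpha * (g + lam * w) for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    new_b = [bi - alpha * gi for bi, gi in zip(b, grad_b)]
    return new_W, new_b

# Hypothetical one-neuron layer with two inputs.
W, b = gradient_step([[1.0, -2.0]], [0.5], [[0.0, 0.0]], [1.0], alpha=0.1, lam=0.5)
print(W, b)
```

With zero data gradient the weights simply shrink toward zero by the factor (1 - alpha * lam), while the bias moves only in response to its own gradient.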
The SAE is trained by making a forward pass through the network. For batch training all the inputs are stacked in a
column vector. The weights are in row vectors so that input to the activations z, is a vector of column vectors. In Matlab
the sigmoid function is applied to the entire z vector. The cost function is calculated on all the inputs, and the estimate $\hat{\rho}$
of the activations for each neuron in the hidden layer is calculated by taking the mean across all the input samples.
3.2 Sparse Autoencoders applied to Natural Images
Examples of prewhitened natural images are shown in figure 4. The images are 512x512 pixels.
Figure 4: Example of Natural Images
Randomly selected “image patches” that are 8x8 pixels are shown in figure 5. For this example there were 10000 image patches selected. The SAE for this example has 25 hidden nodes, with a sparsity parameter of .01. The learned basis vectors are shown in figure 5.
Figure 5: SAE Basis Vectors are “Edges” Learned from Natural Images for Sparsity Parameter ρ = .01
The reconstruction SNR for the learned basis vectors is about 1 dB for both the training and test sets. A sample of 25 image patches is shown in figure 6, with the corresponding SAE reconstruction on the right. There are 64 pixels for each image patch with 25 hidden neurons in the SAE. The SAE sparsity parameter is .01 for this example. Note the SAE’s ability to generalize the intensity pattern of the images. The data compression for this example is 64:(.01*25) = 256:1.
Figure 6: Image Patches Left and SAE Rendering on Right for ρ = .01
Figure 7: 256 Bin Histogram of Image Patches Left, SAE Test Data Rendering Right for ρ = .01
The entropy, kurtosis and SNR of the image patches and the SAE rendering of train/test data are shown in table 1. Notice that the entropy has decreased, so that fewer bits/pixel now represent the same data. Moreover, the kurtosis has increased, which makes the probability distribution of the pixel intensity more peaked.
Table 1: Entropy and Kurtosis and SNR of Natural Image Patches
and SAE Training/Test Data Rendering for ρ = .01
Metric | Raw Image Patches | SAE rendering Train | SAE rendering Test
Entropy (bits) | 7.1 | 5.7 | 5.7
Kurtosis | 4.8 | 19.5 | 21.6
SNR (dB) | - | 1.1 | 1.0
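As an illustration of how the entropy and kurtosis entries of such a table can be computed, the sketch below uses a 256-bin histogram for the entropy and the non-excess (Pearson) definition of kurtosis, for which a Gaussian scores 3. Both choices, and the function names, are assumptions; the exact definitions used to produce the table are not stated in the text.

```python
import numpy as np

def histogram_entropy_bits(values, bins=256):
    """Shannon entropy, in bits, of the histogram of the values."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def pearson_kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    v = np.asarray(values, dtype=float).ravel()
    d = v - v.mean()
    return float((d**4).mean() / (d**2).mean() ** 2)

# Synthetic stand-ins for pixel data: a peaked (Laplacian) versus a Gaussian distribution.
rng = np.random.default_rng(0)
gaussian = rng.standard_normal(200000)
peaked = rng.laplace(size=200000)
```

A more peaked, heavier-tailed distribution such as the Laplacian gives a larger kurtosis than the Gaussian, which is the direction of change reported for the SAE renderings.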
Increasing the sparsity parameter ρ significantly changes the characteristics of the basis vectors. The basis vectors for sparsity parameters of ρ = .02 and ρ = .05 are shown in figure 8. The basis vectors for a .02 sparsity parameter still resemble “edges”. However, for a sparsity parameter of .05, the “edges” are not as apparent, and the basis vectors appear more as shaded regions.
Figure 8: Basis Vectors for ρ = .02 Left, and ρ = .05 Right
Figure 9: Image Patches Left and SAE Rendering on Right for ρ = .05
Table 2: Entropy and Kurtosis and SNR of Natural Image Patches
and SAE Training/Test Data Rendering for ρ = .05
Metric | Raw Image Patches | SAE rendering Train | SAE rendering Test
Entropy (bits) | 7.1 | 6.4 | 6.4
Kurtosis | 4.8 | 7.3 | 7.4
SNR (dB) | - | 3.6 | 3.5
Increasing the sparsity parameter ρ decreases the compression ratio to 64:(.05*25) = 51.2:1. Additionally, the entropy goes up, the kurtosis decreases, and the SNR of the reconstructed image increases marginally. Histograms of the raw image distributions and the SAE reconstructed image are shown in figure 10 for a sparsity parameter of ρ = .05. Note that the peakedness of the histogram of the SAE reconstructed images is reduced. This is consistent with a decrease in the kurtosis.
Figure 10: 256 Bin Histogram of Image Patches Left and SAE Test Data Rendering Right for ρ = .05
3.3 Sparse autoencoders applied to randomly selected “voice patches”
In this section, the SAE is used to learn basis vectors for randomly selected samples of speech from the TIMIT data set. The data is based on 10k voice patches taken from the training folder of DR1, the New England region of the TIMIT corpus. There are 38 speakers in the DR1 training set, and each speaker has ten sentences. Therefore, 27 voice patches are randomly selected from each of the 380 sentences to give ~10k voice patches. The “voice patches” are 400 samples long, so that the duration at 16ksps is 25msec. The motivation for this experiment is to run the SAE on randomly selected “voice patches” to determine whether a set of basis vectors emerges that is similar to the “edges” found from randomly sampling images. The basis vectors for randomly sampled speech appear as “stripes” and are shown in figure 11. For this exercise the number of hidden nodes is kept at 25 while the number of inputs is 400. The compression ratio is 400:(.01*25) = 1600:1. An example of 25 voice patches and the reconstructed “images” are shown in figure 12. The gray pixelated examples are portions of speech that are quiet. The SAE rendering does not appear to do too well based on the results of figure 12. The gray examples are composed of out-of-phase basis vectors, and do not completely turn into shaded regions.
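The random sampling of voice patches can be sketched as follows. The function name and the synthetic "sentences" are placeholders; in the experiment the sentences would be the 380 TIMIT DR1 training waveforms, which are not loaded here.

```python
import numpy as np

def sample_voice_patches(sentences, patches_per_sentence=27, patch_len=400, seed=0):
    """Draw fixed-length patches at uniformly random start positions in each sentence."""
    rng = np.random.default_rng(seed)
    patches = []
    for s in sentences:
        s = np.asarray(s, dtype=float)
        starts = rng.integers(0, len(s) - patch_len + 1, size=patches_per_sentence)
        patches.extend(s[i:i + patch_len] for i in starts)
    return np.stack(patches, axis=1)   # one patch per column

# Synthetic stand-in for 380 sentences (random noise, 4000 samples each).
demo_rng = np.random.default_rng(1)
sentences = [demo_rng.standard_normal(4000) for _ in range(380)]
patches = sample_voice_patches(sentences)   # shape (400, 27 * 380)
```

Because the start positions are uniform, quiet and high-energy portions of speech are sampled with equal probability, which is the limitation discussed at the end of this section.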
The entropy, kurtosis and reconstruction SNR are shown in table 3 for a sparsity parameter of ρ = .01. The SAE does not compress the data for randomly selected voice patches, as can be seen from the entropy.
Figure 11: SAE Basis Vectors are “Stripes” Learned from Random Speech Samples Sparsity Parameter ρ = .01
Figure 12: 25 Randomly Selected Voice Patches Left and SAE Rendering on Right ρ = .01
Figure 13: Randomly Selected Voice Patches Left and SAE Rendering on Right
Table 3: Entropy and Kurtosis & SNR of TIMIT DR1 Voice Patches
and SAE Training/Test Data Rendering
Metric | Random Voice Patches | SAE rendering Train | SAE rendering Test
Entropy (bits) | 5.7 | 5.8 | 5.8
Kurtosis | 17.7 | 31.4 | 31.8
SNR (dB) | - | 0.1 | 0.0
The sparsity parameter is adjusted to ρ = .05, and the basis vectors are shown in figure 14. The basis vectors lose their stripes. However, the reconstructed signal now appears closer to the original “image”, as shown in figure 15.
Figure 14: SAE Basis Vectors Lose Their “Stripes” when Learned from Random Speech Samples, Sparsity Parameter ρ = .05
Figure 15: 25 Randomly Selected Voice Patches Left and SAE Rendering on Right ρ = .05
The entropy, kurtosis and SNR performance can be seen for ρ = .05 in table 4. The entropy is the same or higher than the original “image”, while the kurtosis is reduced from the ρ = .01 case. The SNR is improved slightly by 1dB.
Figure 16: Randomly Selected Voice Patches Left and SAE Rendering on Right ρ = .05
Table 4: Entropy and Kurtosis & SNR of TIMIT DR1 Voice Patches
and SAE Training/Test Data Rendering for ρ = .05
Metric | Random Voice Patches | SAE rendering Train | SAE rendering Test
Entropy (bits) | 5.8 | 5.8 | 6.0
Kurtosis | 17.4 | 21.9 | 21.9
SNR (dB) | - | 1.1 | 1.1
In this section randomly selected “voice patches” were compressed using a SAE. The basis vectors and time domain
“voice patches” are viewed as images in order to evaluate the quality of the SAE performance. The SAE does not do a
good job of representing the data, but neither did the SAE do a good job for randomly sampled images. The results of
this experiment show that the SAE basis vectors for speech appear as vertical stripes, in the same way that “edges”
are the predominant characteristic for images processed with a sparsity parameter of ρ = .01.
Random sampling of images and speech is not a very good choice for raw feature selection. However, the basis vectors
that are learned at low sparsity settings do appear to capture the underlying
characteristics of the data. In the next sections, other methods of selecting features for the SAE will be based on using
these “primitive” basis vectors.
3.4 Adaptation of Lewicki and Smith’s matching pursuit algorithm using SAE’s.
The beauty of using SAE’s at low sparsity settings is that the learned basis vectors are generated by random sampling
with no regard to any features other than “image size” and are completely generated by the data itself. In this section, the
data will first be convolved (filtered) with primitive basis vectors in order to sample speech more intelligently. The
random sampler makes no distinction between the quiet portions of speech and the high energy portions of speech. As a
result the SAE must waste resources learning to represent the quiet portions.
3.5 Speaker identification using gammatone wavelet decomposition speech sampling SAE and softmax classifier
This validation investigation uses four speakers selected at random from the TIMIT DR1 TRAINING corpus. The DR1
folder, which has speakers from New England, was selected. Each speaker says ten sentences. The speakers are not
repeated in the TEST section of the TIMIT corpora, so the ten sentences are divided into five sentences for training and five
for testing. The speakers are taken in pairs, so there are six combinations of speaker pairs. Unsupervised feature learning is used to train the
SAE with both speakers’ voice patches. This is not expected to be the optimum solution, but it shows that massive
amounts of data may be collected to train the SAE. Then, relatively small amounts of data can be used to train the
network using supervised learning. For optimum performance the SAE should be run on each speaker individually. This
will allow the SAE to extract basis vectors that best capture the individual characteristics of the speaker.
A voice activity detector (VAD) is used to remove the silent portions of speech, so that the classifier does not waste resources
learning silent periods between words. This study is about speaker identification, so it is not critical to recover all the
speech, but to focus on the larger voice active portions that contain the majority of the speech energy. The VAD
functionality is shown in figures 17 and 18.
The steps for the VAD are:
• The sample rate for the TIMIT data set is 16ksps.
• Band-pass filter the TIMIT sentences with a 100-600 Hz FIR filter. This is the preprocessing filter that is used for
“Efficient Auditory Coding” [2].
• Generate the voice gate by band-pass filtering with a 200-800Hz FIR filter.
• Take the absolute value of the filtered data.
• Add a small offset so that the minimum of the absolute value drops below 0.
• Take the sign of the shifted absolute value. This makes a zero crossing detector.
• Fill in the gaps of the zero crossing detector by filtering with an FIR filter of 200 ones.
• Remove small gates less than 1000 samples long, since short speech samples are not wanted for the machine learning
algorithms.
• Extend the voice patches so that their length is a multiple of 400 samples.
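The gate-generation steps listed above can be sketched in Python with NumPy only. The filter length (257 taps), the offset value (0.02) and all function names are assumptions made for illustration; the text does not give the values used in the study, and the 100-600 Hz preprocessing filter is omitted here.

```python
import numpy as np

def lowpass_fir(fc, fs, numtaps=257):
    """Hamming-windowed sinc low-pass, normalized to unit gain at DC."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2 * fc / fs * n) * np.hamming(numtaps)
    return h / h.sum()

def bandpass_fir(f_lo, f_hi, fs, numtaps=257):
    """Band-pass built as the difference of two low-pass filters."""
    return lowpass_fir(f_hi, fs, numtaps) - lowpass_fir(f_lo, fs, numtaps)

def voice_gate(x, fs=16000, offset=0.02, fill_len=200, min_gate=1000, patch_len=400):
    y = np.convolve(x, bandpass_fir(200.0, 800.0, fs), mode="same")
    detector = np.sign(np.abs(y) - offset) > 0          # sign of the shifted absolute value
    filled = np.convolve(detector.astype(float), np.ones(fill_len), mode="same") > 0
    padded = np.concatenate(([0], filled.astype(int), [0]))
    edges = np.flatnonzero(np.diff(padded))             # alternating gate starts and stops
    gate = np.zeros(len(x), dtype=bool)
    for start, stop in zip(edges[::2], edges[1::2]):
        length = stop - start
        if length < min_gate:                           # drop short gates
            continue
        length = -(-length // patch_len) * patch_len    # round up to a multiple of patch_len
        gate[start:min(len(x), start + length)] = True
    return gate

# Synthetic check: silence, a 500 Hz tone, then silence again.
fs = 16000
t = np.arange(4000) / fs
x = np.concatenate([np.zeros(4000), 0.5 * np.sin(2 * np.pi * 500 * t), np.zeros(4000)])
gate = voice_gate(x, fs)
```

The gated speech is then `x[gate]`, or the individual runs of `True` values if the voice patches are to be kept separate.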
Fig 17 FIR 200-800Hz data on left, zero crossing detector on the right
Fig 18. Filtered zero Crossing detector on left, gated voice signal on right
3.6 Gammatone design equations
Gammatone filters were conceived as a simple fit to the mammalian cochlea [23, 24]. They are the product of a gamma
envelope and a single tone. The equation is given as:
$$g(t) = a\,t^{\,n-1}\,e^{-2\pi b t}\cos(2\pi f t + \phi)$$
Where:
f = filter center frequency
n = filter order
b = filter bandwidth
a = filter amplitude
φ = filter initial phase
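A gammatone impulse response of the standard form a·t^(n-1)·exp(-2πbt)·cos(2πft + φ) can be generated directly from the parameters defined above. In the sketch below the default order n = 4 and the bandwidth rule b = 1.019·ERB(f), with ERB(f) = 24.7 + 0.108·f Hz, are common choices from the auditory modeling literature and are assumptions here; they are not necessarily the settings used in this study. The 265-sample length matches the truncation used in phase 1.

```python
import numpy as np

def gammatone_kernel(fc, fs=16000, num_samples=265, n=4, b=None, a=1.0, phi=0.0):
    """Sampled gammatone impulse response with center frequency fc in Hz."""
    t = np.arange(num_samples) / fs
    if b is None:
        b = 1.019 * (24.7 + 0.108 * fc)   # assumed ERB-based bandwidth in Hz
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

g = gammatone_kernel(1000.0)
spectrum = np.abs(np.fft.rfft(g, 16000))   # zero-padded so that bins are 1 Hz apart
peak_hz = int(np.argmax(spectrum))
```

Because the bandwidth grows with the center frequency, the kernels become shorter in time as the center frequency increases, which is the wavelet-like behavior noted in this section.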
Gammatone filters have an ERB response, or Equivalent Rectangular Bandwidth. The gammatone filterbank has a
logarithmically spaced frequency separation, with logarithmically increasing bandwidth. This is very similar to the
frequency spacing of MFCC’s, or mel-frequency cepstral coefficients [13], which are widely used in ASR today in systems like Sphinx 4
[24]. Additionally, they are similar to wavelets in that the time response gets shorter for higher frequencies. The total
rectangular bandwidth remains constant, so low frequencies have a narrow bandwidth, while higher frequencies have a
broader bandwidth. As the bandwidth increases, the amplitude is lowered to maintain the same ERB.
The gammatone impulse response for a filter with a 100 Hz center frequency, and the frequency domain responses for a gammatone
filterbank of 12 filters with center frequencies from 100Hz to 6kHz, are shown in figure 19.
Figure 19 Gammatone 100Hz center frequency Impulse Response left. Frequency domain for a 12 Gammatone
filterbank Fmin=100Hz, Fmax =6kHz right
3.7 Phase 1: Proof of concept speaker recognition using gammatone wavelet decomposition sampling, SAE and
softmax classifier
Gammatone decomposition of speech was used to derive the auditory spike model in the paper “Efficient Auditory
Coding” [3]. The study used the entire TIMIT corpus for finding the auditory spike model. The study did not perform
any classification exercises using the efficient encoding scheme, but was limited to faithful reproduction of the audio
waveforms in the time domain. The learned kernel functions in the time and frequency domain, reprinted from Lewicki
and Smith, are shown in figure 20. The kernel functions are shown to closely match cochlear revcor filter impulse
responses. The impulse responses of the kernel functions get shorter with increasing center frequency. This corresponds to
wider bandwidth at higher center frequencies.
[Figure 19 plot axes. Left panel: linear amplitude versus time in seconds. Right panel: amplitude in dB versus frequency in Hz.]
Figure 20 a, Learned auditory Spikes in red, and Cochlear Revcor Filter responses in blue (gammatone
wavelets).
b, bandwidth and center frequencies of Cochlear revcor filters in blue and Auditory Spikes in red.
Lewicki and Smith had two parts to their study, learning the kernel functions, and using the kernel functions to
efficiently encode speech. The proof of concept phase will be to use gammatone wavelet kernel functions for speech
decomposition and sampling. The phase 1 study will use data selected by this sampling technique to perform automatic
speaker recognition. Normalized speech feature vectors will be selected based on a MSE fit to gammatone wavelets. To
simplify the investigation, the gammatone wavelets will be of a fixed size. This will allow batch gradient descent to be
used for training.
A voice activity detector will be used in phase 1 to select “voice patches”. The feature vectors for each voice patch are
saved in Matlab cells. This allows classification scores to be computed for individual feature vectors (gammatone wavelets MSE
“matches”), for voice patches, and for whole sentences.
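One way to obtain scores at the three levels is to let the per-feature-vector decisions vote within each voice patch, and the voice patch decisions vote within each sentence. The text does not state how the scores are combined, so the majority vote below is purely an assumed illustration, written in Python rather than Matlab, with nested lists standing in for the cell arrays.

```python
import numpy as np

def majority_vote(labels):
    """Most frequent label; ties are resolved in favor of the smallest label."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical sentence: three voice patches, each holding the predicted speaker
# label (0 or 1) of every feature vector in that patch.
sentence = [[0, 1, 0, 0], [1, 0, 0], [0, 0, 1, 1, 0]]
patch_labels = [majority_vote(p) for p in sentence]
sentence_label = majority_vote(patch_labels)
```

Under such a scheme a sentence can be classified correctly even when a sizeable fraction of its individual feature vectors is misclassified.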
It should be noted that many decomposition strategies can be used to decompose speech based on gammatone wavelet
kernel functions. For this study, the raw speech is first filtered by the gammatone filter bank. After filtering, the largest
peaks from each filter’s output are normalized, and a MSE fit to the gammatone wavelet for that filter is calculated. If
the MSE is below a threshold, the raw speech is selected by compensating for the group delay of that filter. Additionally,
it is normalized, and the resulting feature vector is selected as input for the SAE. The SAE creates a new set of basis
vectors from the gammatone wavelet sampled speech that are tuned to the classification task set up by the softmax. For
example, in the case of speaker recognition, the feature vectors will capture differences in voice characteristics. Another
task to be investigated will be to perform dialect recognition based on the eight different regions from the TIMIT corpus.
It is postulated that the process of efficient encoding of speech based on an MSE fit to kernel functions best captures the
essential features of speech needed for learning. The SAE sets up the initial conditions for the composite network to
learn during training. The approach of using audio kernel functions to learn basis vector representations is consistent
with the theories of Mach, Barlow, Atick, Olshausen and Field, and Smith and Lewicki.
For the phase 1 investigation, the gammatone filter bank is implemented as a bank of FIR filters of fixed length. The FIR
filter taps have the reverse correlation (revcor) weights corresponding to the gammatone wavelets.
The gammatones are defined as a set of Equivalent Rectangular Bandwidth 5th order IIR filters [25]. For the phase 1
investigation, the impulse response is truncated to 16.6 msec, or 265 samples at a 16kHz sample rate.
The impulse responses for a bank of 64 filters are shown in figure 21.
Figure 21 Gammatone wavelets for 64 filters, Fs = 16ksps, Fmin = 100Hz, Fmax = 6kHz
Gammatone wavelet decomposition is given by the following steps:
1. Filter the output of the voice activity detector with each filter in the filter bank.
2. Find the peaks of the filtered output for each filter that are greater than 60% of the maximum peak.
3. Take the dot product of the impulse response of the filter and the speech data corresponding to the peak.
4. Normalize the dot product by dividing by the norm of the data times the norm of the impulse response.
5. Multiply the impulse response of the filter by the normalized dot product.
6. Compute the residual by subtracting the scaled impulse response from the speech sample.
7. If the variance of the residual is less than 93% of the variance of the speech data, retain the speech sample as a
feature vector.
8. Scale the feature vector by 1/normalization coefficient used for the dot product.
9. Save the feature index for reconstruction.
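Steps 2 through 7 can be sketched for a single candidate segment and a single kernel as follows. Several details are interpretations rather than facts from the text: peaks are taken on the absolute value of the filter output, both the segment and the kernel are scaled to unit norm before the residual is formed, and the unit-norm segment is returned as the feature vector. The function names are hypothetical.

```python
import numpy as np

def peaks_above(y, fraction=0.6):
    """Indices of local maxima of |y| that exceed a fraction of the largest value."""
    y = np.abs(np.asarray(y, dtype=float))
    is_peak = (y[1:-1] > y[:-2]) & (y[1:-1] >= y[2:])
    idx = np.flatnonzero(is_peak) + 1
    return idx[y[idx] > fraction * y.max()]

def match_segment(segment, kernel, max_residual_fraction=0.93):
    """Return (feature, coefficient, residual) if the kernel fits the segment, else None."""
    x = np.asarray(segment, dtype=float)
    h = np.asarray(kernel, dtype=float)
    x_norm, h_norm = np.linalg.norm(x), np.linalg.norm(h)
    if x_norm == 0 or h_norm == 0:
        return None
    c = float(x @ h) / (x_norm * h_norm)     # steps 3-4: normalized dot product
    x_unit, h_unit = x / x_norm, h / h_norm
    residual = x_unit - c * h_unit           # steps 5-6: subtract the scaled kernel
    if residual.var() < max_residual_fraction * x_unit.var():   # step 7
        return x_unit, c, residual
    return None

# Hypothetical kernel: a 1 kHz gammatone-like burst, 265 samples at 16 kHz.
t = np.arange(265) / 16000
kernel = t**3 * np.exp(-2 * np.pi * 135 * t) * np.cos(2 * np.pi * 1000 * t)
hit = match_segment(3.0 * kernel, kernel)    # a scaled copy of the kernel is accepted
```

With unit-norm inputs and zero-mean signals, the 93% variance test is roughly equivalent to requiring that the squared normalized dot product exceeds 0.07.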
Once the voice signal has been decomposed, it may be reconstructed by simply adding scaled gammatone impulse
responses of the correct gain and index together to regenerate the speech signal. The signal to noise ratio of the
reconstructed signal may be found by subtracting the reconstructed signal from the original speech signal to reveal a
residual. The SNR is computed by taking the ratio of the variance of the original speech to the variance of the
residual.
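The reconstruction and the SNR calculation described above amount to an overlap-add of scaled kernels followed by a variance ratio. A minimal sketch, in which the atom format (kernel index, start sample, gain) and the function names are assumptions:

```python
import numpy as np

def reconstruct(n_samples, kernels, atoms):
    """Sum scaled kernels at their saved positions. atoms: (kernel_index, start, gain)."""
    y = np.zeros(n_samples)
    for k, start, gain in atoms:
        h = kernels[k]
        stop = min(n_samples, start + len(h))
        y[start:stop] += gain * h[:stop - start]
    return y

def snr_db(original, reconstructed):
    """Ratio of the variance of the original to the variance of the residual, in dB."""
    residual = original - reconstructed
    return 10 * np.log10(np.var(original) / np.var(residual))

# Toy example with one 3-sample kernel used twice.
kernels = [np.array([1.0, 2.0, 1.0])]
y = reconstruct(6, kernels, [(0, 2, 0.5), (0, 3, 1.0)])   # [0, 0, 0.5, 2, 2.5, 1]
```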
Various attempts were made to get the best possible SNR with the fewest basis vectors. The original number of basis
vectors used was sixty four. After much experimentation and trial and error, the optimum number of basis vectors that
yielded good signal to noise performance was found to be forty two. This produces signal to noise ratios of
approximately ten decibels. The gammatone wavelets were also truncated in the time domain to produce the best
possible signal to noise ratios. The best performance was found empirically to be 265 samples per gammatone wavelet.
Some typical speech signals, with the reconstructions and the residuals, are shown in figures 22-25.
Figure 22 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; female speaker
FSJK1
Figure 23 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; male speaker
MPSW0
Figure 24 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; male speaker
MEDR0
[Plots for three speech examples. Each example has three panels: “Gammatone Decomposition” (Original Voice Data and Gammatone Reconstruction), “Original Waveform versus Residual after Gammatone Decomp” (Gammatone Reconstruction Residual and Original Voice Data), and “Gammatone Decomposition Basis Vector Histogram”. Panel titles report: Number Basis Vectors: 29, SNR: 11.5452; Number Basis Vectors: 30, SNR: 11.2418; Number Basis Vectors: 27, SNR: 9.2851.]
Figure 25 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; female speaker
FMEM0
3.8 Phase 1 Enhancements
The validation phase was based on decomposing speech similar to the method of Smith and Lewicki using gammatone
kernel wavelets. For convenience all the wavelets were the same length so that a single SAE could be used to test the
approach. For the enhancement phase, the wavelets will now be of variable length and will operate on the output of a
bank of gammatone filters. This is an in-between step from the proof-of-concept phase, to the additive model
implementation. This method has the advantage that the decompositions are now used to construct the individual filter
outputs, and therefore, are operating on data that is at a higher signal-to-noise ratio.
Filter 1 Wavelet Decomp Sampler SAE 1
… SOFTMAX
Filter N Wavelet Decomp Sampler SAE N
Figure 26 Phase I enhancement block diagram showing stacking of wavelet feature vectors
The block diagram for the phase I enhancements is shown in figure 26 above.
The gammatone wavelet decomposition sampler will now operate in the time domain on data that has been filtered by a
gammatone filter bank.
[Plots for one speech example with three panels: “Gammatone Decomposition” (Original Voice Data and Gammatone Reconstruction), “Original Waveform versus Residual after Gammatone Decomp” (Gammatone Reconstruction Residual and Original Voice Data), and “Gammatone Decomposition Basis Vector Histogram”. Panel titles report: Number Basis Vectors: 38, SNR: 10.8768.]
3.9 Phase II: Speaker recognition using additive model of the slow envelope and Temporal Fine Structure with a
SAE/Softmax classifier.
The phase I study showed that feature extraction of speech based on gammachirp sampling in combination with a
SAE/Softmax classifier was very effective in speaker identification, especially at low SNRs. The phase II investigation
is based on an actual model of the human peripheral audio system. This model has been proposed by Les Atlas and is
based on years of research on cochlear implants working with Brian Moore at Cambridge University [26]. The model
is shown in figure 27.
ERB Filters (Gammatones) → Half wave rectify → Lowpass = SE; Bandpass = TFS → SE+TFS → Sum over ERB filters
Figure 27 Block diagram of the Additive model of the slow envelope and Temporal Fine Structure
The output of the gammatone filter bank is followed by a halfwave rectifier. The slow envelope is a lowpass filtered
version of the halfwave rectified signal. The temporal fine structure is a bandpass version of the halfwave rectified signal
in which only the fundamental and second harmonic are allowed to pass. The slow envelope and TFS are added together
in this model for each gammatone filter output. These are then summed to get composite speech. The outputs of a
12-filter gammatone filter bank are shown in figure 17 in the results section.
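The processing of one gammatone filter output in the additive model can be sketched as follows. The 50 Hz envelope cutoff, the 0.5·fc to 2.5·fc pass band (chosen so that the fundamental and second harmonic of the rectified signal pass), the 257-tap filter length and the function names are all assumptions made for illustration; the model description above does not specify them.

```python
import numpy as np

def lowpass_fir(fc, fs, numtaps=257):
    """Hamming-windowed sinc low-pass, normalized to unit gain at DC."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2 * fc / fs * n) * np.hamming(numtaps)
    return h / h.sum()

def bandpass_fir(f_lo, f_hi, fs, numtaps=257):
    """Band-pass built as the difference of two low-pass filters."""
    return lowpass_fir(f_hi, fs, numtaps) - lowpass_fir(f_lo, fs, numtaps)

def additive_channel(band, fc, fs=16000, se_cutoff=50.0):
    """Slow envelope, temporal fine structure and their sum for one filter output."""
    rect = np.maximum(band, 0.0)                                     # half-wave rectifier
    se = np.convolve(rect, lowpass_fir(se_cutoff, fs), mode="same")  # slow envelope
    tfs = np.convolve(rect, bandpass_fir(0.5 * fc, 2.5 * fc, fs), mode="same")
    return se, tfs, se + tfs

# Synthetic filter output: a steady 1 kHz tone lasting one second.
fs = 16000
t = np.arange(fs) / fs
se, tfs, channel_out = additive_channel(np.sin(2 * np.pi * 1000 * t), 1000.0, fs)
```

Composite speech would then be obtained by summing `channel_out` over all gammatone filters. For the steady tone used here the slow envelope settles near 1/π, the mean of a half-wave rectified unit sinusoid.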
A goal of phase II is to eliminate the VAD, as this will fail in high noise environments. The simple energy detection in a
frequency band is easily captured by noise. One outcome of the phase II approach will be to develop a very robust VAD
based on gammatone decomposition and a feedforward SAE/Softmax network. A significant difference between phase I
and phase II is that the decompositions take place at the filter outputs, one for the slow envelope and another for the
TFS. This is operating at a higher SNR due to the reduction in bandwidth compared to the phase I investigation, which used
the entire speech bandwidth.
The phase II investigation has the following steps:
1. Repeat Phase I, but now use the additive model. Decompose the slow envelope using the impulse responses for
each of the slow envelope filter responses. Decompose the fast envelope using the impulse responses for each
of the TFS outputs. The gammatone wavelets will now be of different lengths. There will be one SAE for each
filter output.
2. Expand the number of speakers to include all speakers in a region from the TIMIT corpus. Do this for each of
the 8 regions.
3. Identify the dialects in each of the 8 regions.
4. Identify all speakers in the TIMIT corpus by first isolating them to a region, based on dialect. Then identify the
individual in that region.
5. Develop a VAD SAE/Softmax that uses two classes: 1) Speech Present, 2) No Speech Present. Train the
speech detector based on
Slow Envelope ERB 1
TFS ERB 1
Slow Envelope ERB n
TFS ERB n
4. RESULTS
4.1 Phase 1: Proof of concept speaker recognition using gammatone wavelet decomposition sampling, SAE and a
softmax classifier
The classification accuracy for individual feature vectors, for clusters of feature vectors that compose a voice
patch, and for whole sentences is shown in tables 1-6 below. The SNR is varied for each of
the six speaker combinations. The input vector size for all gammatone wavelets is 265, which corresponds to a sample
time of 16.6msec per feature vector. The fixed length of 265 allows batch processing, as all the feature vectors are
stacked into column vectors. There are 84 hidden nodes in the SAE with a sparsity constraint of 4/84 active nodes in the
network on average. The number of hidden nodes and the sparsity constraint were found by a computer search for
the knee of the curve in terms of SAE reconstruction error. The training and test vectors were decomposed
using an infinite SNR. Noise is added after the decomposition to the test data only, to show the SNR performance of the
classifier. Adding noise to the data before decomposition would dramatically affect the VAD performance for both the
training and test data and was therefore omitted; at low SNR the VAD will capture voice patches based on noise and not
signals. The intent of the proof of concept phase is to show that the SAE/Softmax classifier used for image processing can
also be adapted for voice processing. Furthermore, the efficient encoding decomposition technique is used to extract
meaningful features that are used not just for autoencoding, but to perform an actual classification task.
The training and test data are forward passed through the network and the SNRs of the SAE are recorded. The SNR for the
training data is simply the SNR due to reconstruction error. The SNR of the test data is based on AWGN plus the
reconstruction error.
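Adding white Gaussian noise to the test vectors at a prescribed SNR can be sketched as below. The function name is hypothetical, and measuring the signal power over the whole array (rather than per feature vector) is an assumption; the text does not say how the noise level was set.

```python
import numpy as np

def add_awgn(x, snr_db, rng):
    """Return x plus white Gaussian noise scaled to the requested SNR in dB."""
    signal_power = np.mean(x**2)
    noise_power = signal_power / 10 ** (snr_db / 10)
    return x + np.sqrt(noise_power) * rng.standard_normal(x.shape)

# Hypothetical test set: 265-sample feature vectors stacked as columns.
rng = np.random.default_rng(0)
test_vectors = rng.standard_normal((265, 500))
noisy_10db = add_awgn(test_vectors, 10.0, rng)
measured_db = 10 * np.log10(np.mean(test_vectors**2) / np.mean((noisy_10db - test_vectors)**2))
```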
The results show that unsupervised feature learning using the SAE and softmax classifier produces 100% accuracy on
sentences at 3dB and 40dB SNRs. The pair MPSW0_FMEM0 at 10dB SNR produced a 90%
classification score. It is not clear why the 3dB SNR outperforms the 10dB SNR in terms of classification accuracy. The
training was done at infinite SNR in direct analogy to a matched filter in a communications system. A BPSK modem, for
example, with a root raised cosine channel filter has a known impulse response. This channel filter impulse response is
used as a matched filter to make symbol decisions. Perhaps higher noise levels allow the classifier to generalize better
and achieve similar performance to the high SNR performance at 40dB.
This proof of concept study focused on speaker identification. The goal was to determine whether the method of SAE and
softmax classifier used for image processing has a direct implementation in voice processing. The gammatone wavelet
decomposition sampler was introduced to provide a pre-processing method for feature extraction. The results are
encouraging, as the performance at low SNR appears to be as good as in the noiseless case.
SAE Parameters (all three runs): Sparsity Parameter: 0.047619; Hidden Size: 84; Lambda: 3.000000e-03; Beta-Sparsity Penalty: 3; Number Iterations SAE L-BFGS to run: 400
SOFTMAX Classification Score (all three runs): Number of Classes: 2.000000; Soft Max Lambda: 0.003000
MPSW0_MEDR0 | SNR = 40dB | SNR = 10dB | SNR = 3dB
SNR in dB, Sparse Auto Encoder training | 14.366336 | 14.359360 | 14.370852
SNR in dB, Sparse Auto Encoder test | 14.353541 | 10.257157 | 6.523747
Test Accuracy Feature Vector | 57.431144% | 56.971012% | 56.471439%
Test Accuracy VoicePatch | 82.278481% | 81.012658% | 81.012658%
Test Accuracy Sentence | 100.000000% | 100.000000% | 100.000000%
Table 1 MPSW0_MEDR0 Speaker Recognition SNR performance
SAE Parameters (all three runs): Sparsity Parameter: 0.047619; Hidden Size: 84; Lambda: 3.000000e-03; Beta-Sparsity Penalty: 3; Number Iterations SAE L-BFGS to run: 400
SOFTMAX Classification Score (all three runs): Number of Classes: 2.000000; Soft Max Lambda: 0.003000
FSJK1_FMEM0 | SNR = 3dB | SNR = 10dB | SNR = 40dB
SNR in dB, Sparse Auto Encoder training | 14.438659 | 14.495097 | 14.476532
SNR in dB, Sparse Auto Encoder test | 6.616162 | 10.468286 | 14.465558
Test Accuracy Feature Vector | 61.496447% | 62.848658% | 63.023650%
Test Accuracy VoicePatch | 75.949367% | 72.151899% | 72.151899%
Test Accuracy Sentence | 100.000000% | 100.000000% | 100.000000%
Table 2 FSJK1_FMEM0 Speaker Recognition SNR performance
SAE Parameters (all three runs): Sparsity Parameter: 0.047619; Hidden Size: 84; Lambda: 3.000000e-03; Beta-Sparsity Penalty: 3; Number Iterations SAE L-BFGS to run: 400
SOFTMAX Classification Score (all three runs): Number of Classes: 2.000000; Soft Max Lambda: 0.003000
MEDR0_FSJK1 | SNR = 3dB | SNR = 10dB | SNR = 40dB
SNR in dB, Sparse Auto Encoder training | 14.344862 | 14.326024 | 14.295548
SNR in dB, Sparse Auto Encoder test | 6.484468 | 10.268590 | 14.283003
Test Accuracy Feature Vector | 64.209192% | 65.255151% | 65.800317%
Test Accuracy VoicePatch | 96.052632% | 94.736842% | 96.052632%
Test Accuracy Sentence | 100.000000% | 100.000000% | 100.000000%
Table 3 MEDR0_FSJK1 Speaker Recognition SNR performance
SAE Parameters (all three runs): Sparsity Parameter: 0.047619; Hidden Size: 84; Lambda: 3.000000e-03; Beta-Sparsity Penalty: 3; Number Iterations SAE L-BFGS to run: 400
SOFTMAX Classification Score (all three runs): Number of Classes: 2.000000; Soft Max Lambda: 0.003000
MEDR0_FMEM0 | SNR = 3dB | SNR = 10dB | SNR = 40dB
SNR in dB, Sparse Auto Encoder training | 14.465636 | 14.446876 | 14.447896
SNR in dB, Sparse Auto Encoder test | 6.643830 | 10.432653 | 14.435708
Test Accuracy Feature Vector | 59.515741% | 60.286401% | 60.362921%
Test Accuracy VoicePatch | 69.512195% | 68.292683% | 65.853659%
Test Accuracy Sentence | 100.000000% | 100.000000% | 100.000000%
Table 4 MEDR0_FMEM0 Speaker Recognition SNR performance
MPSW0_FMEM0                                SNR = 3dB      SNR = 10dB     SNR = 40dB
SAE Parameters
Sparsity Parameter                         0.047619       0.047619       0.047619
Hidden Size                                84             84             84
Lambda                                     3.000000e-03   3.000000e-03   3.000000e-03
Beta-Sparsity Penalty                      3              3              3
Number Iterations SAE L-BFGS to run        400            400            400
SNR in dB, Sparse Auto Encoder training    14.440929      14.438910      14.457319
SNR in dB, Sparse Auto Encoder test        6.621928       10.426857      14.444083
SOFTMAX Classification Score
Number of Classes                          2.000000       2.000000       2.000000
Soft Max Lambda                            0.003000       0.003000       0.003000
Test Accuracy Feature Vector               59.952995%     60.226279%     60.073240%
Test Accuracy VoicePatch                   73.170732%     73.170732%     67.073171%
Test Accuracy Sentence                     100.000000%    90.000000%     100.000000%
Table 5 MPSW0_FMEM0 Speaker Recognition SNR performance
MPSW0_FSJK1                                SNR = 3dB      SNR = 10dB     SNR = 40dB
SAE Parameters
Sparsity Parameter                         0.047619       0.047619       0.047619
Hidden Size                                84             84             84
Lambda                                     3.000000e-03   3.000000e-03   3.000000e-03
Beta-Sparsity Penalty                      3              3              3
Number Iterations SAE L-BFGS to run        400            400            400
SNR in dB, Sparse Auto Encoder training    14.091070      14.133589      14.142304
SNR in dB, Sparse Auto Encoder test        6.474035       10.209106      14.130692
SOFTMAX Classification Score
Number of Classes                          2.000000       2.000000       2.000000
Soft Max Lambda                            0.003000       0.003000       0.003000
Test Accuracy Feature Vector               63.289308%     64.031447%     64.742138%
Test Accuracy VoicePatch                   97.014925%     98.507463%     98.507463%
Test Accuracy Sentence                     100.000000%    100.000000%    100.000000%
Table 6 MPSW0_FSJK1 Speaker Recognition SNR performance
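The SNR conditions in Tables 3 to 6 rely on additive white Gaussian noise (AWGN) at a chosen level. How the reported "SNR in dB" figures were computed is not stated in this section, so the following is only a minimal sketch under stated assumptions: SNR is taken as 10 log10 of signal energy over error energy, the noise is scaled from the measured signal power, and the 440 Hz tone, the 16 kHz rate and the function names are illustrative choices rather than the experimental setup.

```python
import math
import random


def add_awgn(signal, target_snr_db, rng):
    """Return the signal plus white Gaussian noise scaled to the target SNR in dB."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def snr_db(reference, estimate):
    """SNR in dB of an estimate against a reference (error = estimate - reference)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((e - r) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
fs = 16000              # illustrative sample rate
clean = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(fs)]

# Measured SNR of the noisy tone for the three SNR targets used in the tables.
measured = {target: snr_db(clean, add_awgn(clean, target, rng))
            for target in (3, 10, 40)}
```

The same `snr_db` helper could in principle be applied to a reconstruction instead of a noisy input, since it only compares a reference with an estimate.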
4.2 Phase II: Speaker recognition using additive model of the slow envelope and Temporal Fine Structure with a
SAE/Softmax classifier.
The slow envelopes and temporal fine structure (TFS) at the outputs of a 12-channel gammatone filter bank are shown in Figure 15 below.
[Figure 15 panels: for each of the 12 gammatone channels, one plot titled "SLOW ENVELOPE ERB Center Freq = f" and one plot titled "Temporal Fine Structure ERB Center Freq = f", for f = 100, 180, 275, 400, 550, 730, 950, 1240, 1580, 2000, 2500 and 3200 Hz.]
Figure 15 Slow envelope and TFS for a 12 gammatone filter bank
5. SCHEDULE
5.1 Phase I: Proof-of-concept speaker recognition using gammatone wavelet decomposition sampling (GWDS),
SAE and softmax classifier. Completed for this proposal, Spring 2014
5.2 Phase I enhancements: Complete Fall 2014
1. Modify GWDS to operate on the output of each filter rather than on the input speech. Change the wavelets to have
variable lengths. Modify the SAE structure so that there is one SAE for each filter output. Repeat the Phase I AWGN
speaker recognition experiments.
2. Expand the number of speakers from the TIMIT DR1 region to 10 in the training set. Extract
feature vectors from all 10.
3. Add entropy calculations for the pre- and post-coded speech data.
4. Train the SAE on all ten speakers; this will generate a region-specific set of basis vectors.
5. Repeat 2 and 3 above for another dialect region from the TIMIT corpus.
6. Use supervised learning to classify the regional dialect for these two TIMIT regions with
AWGN at SNRs of 3, 10 and 30 dB.
5.3 Phase II: Speaker recognition using additive model of the slow envelope and Temporal Fine Structure with a
SAE/Softmax classifier. Complete Spring 2015
1. Implement additive model decomposition sampling. This will have 2 SAEs for each filter output. Also, the
decomposition takes place on the filter outputs, not the input speech. Find basis vectors for the slow envelope
and temporal fine structure at the output of each filter. Repeat dialect and speaker identification versus AWGN
and compare with the results from 5.2.
6. APPENDIX
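6.1 Max Entropy for the binary symmetric channel derivation
$$H = -\left(p\log_2 p + q\log_2 q\right), \qquad q = 1 - p,$$
So that
$$H_{\max} = \max_{p}\left\{-\left[p\log_2 p + (1-p)\log_2(1-p)\right]\right\}$$
Taking the derivative of H with respect to p and setting it equal to 0 yields:
$$\frac{\partial H}{\partial p} = -\frac{\partial\left[p\log_2 p\right]}{\partial p} - \frac{\partial\left[\log_2(1-p)\right]}{\partial p} + \frac{\partial\left[p\log_2(1-p)\right]}{\partial p}$$
$$\frac{\partial H}{\partial p} = -\left(\frac{p}{p\ln 2} + \log_2 p\right) + \frac{1}{(1-p)\ln 2} - \frac{p}{(1-p)\ln 2} + \log_2(1-p)$$
$$-\frac{1}{\ln 2} - \log_2 p + \frac{1-p}{(1-p)\ln 2} + \log_2(1-p) = 0$$
$$-\log_2 p + \log_2(1-p) = 0$$
$$\log_2(1-p) = \log_2 p$$
$$2^{\log_2(1-p)} = 2^{\log_2 p}$$
$$1 - p = p$$
$$p = 0.5$$
This can easily be extended to the case when n is greater than 2.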
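The result can be checked numerically. The following is a minimal sketch, with hypothetical function names, that evaluates the binary entropy on a grid of p values and also illustrates the n greater than 2 case with a four-symbol source.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary source with symbol probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # p*log2(p) tends to 0 as p tends to 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in probs if q > 0.0)


# Grid search over p in [0, 1] with step 0.001: the maximum sits at p = 0.5.
grid = [k / 1000.0 for k in range(1001)]
p_star = max(grid, key=binary_entropy)
h_max = binary_entropy(p_star)

# For n = 4 symbols, the uniform distribution has higher entropy than a skewed one.
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])
h_skewed = entropy([0.7, 0.1, 0.1, 0.1])
```

The grid search is only a sanity check on the closed-form derivation; it does not replace it.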
6.2 Backpropagation partial derivatives for updating the weights and bias
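The autoencoder uses the inputs as the output targets. The output error is given as
$$e_j = \hat{x}_j - a_j$$
For the autoencoder,
$$\hat{x}_j = x_j, \quad \text{so that} \quad e_j = x_j - a_j$$
The error energy for the j-th output node for the n-th training epoch is:
$$E_j(n) = \frac{1}{2} e_j^2(n)$$
The total instantaneous error energy is
$$E(n) = \frac{1}{2}\sum_j e_j^2(n)$$
To update the output weights, find the partial derivative of the total instantaneous error energy with respect to the output
weights. Here z_j(n) is the weighted input to output node j, a_j(n) = f(z_j(n)) is its activation, and w_ji(n) is the weight from hidden node i, with activation a_i(n), to output node j. Applying the chain rule,
$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial a_j(n)}\,\frac{\partial a_j(n)}{\partial z_j(n)}\,\frac{\partial z_j(n)}{\partial w_{ji}(n)}$$
The individual factors are
$$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n), \qquad \frac{\partial e_j(n)}{\partial a_j(n)} = -1, \qquad \frac{\partial a_j(n)}{\partial z_j(n)} = f'(z_j(n)), \qquad \frac{\partial z_j(n)}{\partial w_{ji}(n)} = a_i(n)$$
The local gradient for the output layer, which gives the sensitivity of the error energy to the weighted input of each node, is by definition
$$\delta_j(n) = \frac{\partial E(n)}{\partial z_j(n)} = \frac{\partial E(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial a_j(n)}\,\frac{\partial a_j(n)}{\partial z_j(n)}$$
$$\delta_j(n) = -e_j(n)\, f'(z_j(n))$$
For a sigmoid function,
$$f'(z_j(n)) = a_j(n)\left(1 - a_j(n)\right)$$
For a tanh function,
$$f'(z_j(n)) = 1 - a_j^2(n)$$
The output weight update is then
$$\Delta w_{ji} = \delta_j(n)\, a_i(n)$$
which is the product of the four chain-rule factors, that is, the partial derivative of E(n) with respect to w_ji(n).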
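The output-layer result can be verified against a finite-difference estimate of the gradient. The following is a minimal sketch, assuming a sigmoid output layer; the layer sizes, weights, activations and function names are illustrative values that do not come from the experiments. In an autoencoder the `target` vector would be the input itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, hidden):
    """Output activations a_j = f(z_j), with z_j = sum_i w_ji * a_i + b_j."""
    return [sigmoid(sum(w * a for w, a in zip(row, hidden)) + b)
            for row, b in zip(weights, bias)]


def error_energy(weights, bias, hidden, target):
    """Total instantaneous error energy E = 1/2 * sum_j e_j^2, with e_j = x_j - a_j."""
    outputs = forward(weights, bias, hidden)
    return 0.5 * sum((x - a) ** 2 for x, a in zip(target, outputs))


def output_weight_gradient(weights, bias, hidden, target):
    """Analytic gradient dE/dw_ji = delta_j * a_i, with delta_j = -e_j * f'(z_j)."""
    outputs = forward(weights, bias, hidden)
    deltas = [-(x - a) * a * (1.0 - a) for x, a in zip(target, outputs)]
    return [[d * a_i for a_i in hidden] for d in deltas]


def numerical_gradient(weights, bias, hidden, target, eps=1e-6):
    """Central finite-difference estimate of dE/dw_ji."""
    grad = []
    for j, row in enumerate(weights):
        grad_row = []
        for i in range(len(row)):
            plus = [r[:] for r in weights]
            minus = [r[:] for r in weights]
            plus[j][i] += eps
            minus[j][i] -= eps
            diff = (error_energy(plus, bias, hidden, target)
                    - error_energy(minus, bias, hidden, target))
            grad_row.append(diff / (2.0 * eps))
        grad.append(grad_row)
    return grad


hidden = [0.2, 0.7, 0.1]                        # hidden activations a_i
target = [0.9, 0.3]                             # targets x_j
weights = [[0.5, -0.4, 0.1], [-0.3, 0.8, 0.2]]  # w_ji, one row per output node
bias = [0.05, -0.1]

analytic = output_weight_gradient(weights, bias, hidden, target)
numeric = numerical_gradient(weights, bias, hidden, target)
max_diff = max(abs(a - n)
               for row_a, row_n in zip(analytic, numeric)
               for a, n in zip(row_a, row_n))
```

A small `max_diff` indicates that the analytic expression and the numerical estimate agree, which is a common way to check a backpropagation implementation before handing the gradient to an optimizer such as L-BFGS.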
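The weight update equations for the hidden layer are more complicated due to the lack of an explicit error term.
The local gradient is defined as
$$\delta_j(n) = \frac{\partial E(n)}{\partial z_j(n)} = \frac{\partial E(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial a_j(n)}\,\frac{\partial a_j(n)}{\partial z_j(n)}$$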
REFERENCES
[1] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–9, 1996.
[2] J.J. Hunt, P. Dayan, G.J. Goodhill. Sparse Coding Can Predict Primary Visual Cortex Receptive Field Changes Induced by Abnormal Visual Input. PLoS Comput Biol 9(5): e1003005. 2013.
[3] E.C. Smith and M.S. Lewicki. Efficient auditory coding. Nature, 439(7079):978–82, 2006.
[4] A. Saxe, M. Bhand, R. Mudur, B. Suresh, A. Ng. Unsupervised learning of primary cortical receptive fields and receptive field plasticity. NIPS, pages 1971-1979, 2011.
[5] H. Lee, Y. Largman, P. Pham, A. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. Advances in Neural Information Processing Systems (NIPS) 22, 2009.
[6] B.C. Moore. The Role of Temporal Fine Structure Processing in Pitch Perception, Masking, and Speech Perception for Normal-Hearing and Hearing-Impaired People. JARO 9:399-406, 2008.
[7] X. Li. Temporal Fine Structure and Applications to Cochlear Implants. PhD dissertation, University of Washington, 2013.
[8] R.V. Shannon, F. Zeng, V. Kamath, J. Wygonski, M. Ekelid. Speech recognition with primarily temporal cues. Science, vol. 270, pp 303-304, 1995.
[9] C.E. Stilp, K.R. Kluender. Cochlear-scaled Entropy, not consonants, vowels, or time, best predicts speech intelligibility. PNAS vol. 107, no. 27, 2010.
[10] M. Schmuckler, D. Gilden. Auditory perception of fractal contours. J Exp Psychol Hum Percept Perform 19: 641–660, 1993.
[11] T. Overath, R. Cusack, S. Kumar, K. von Kriegstein, J. Warren, M. Grube, R. Carlyon, T.D. Griffiths. An Information Theoretic Characterization of Auditory Encoding. PLoS Biol 5(11): e288, 2007.
[12] A.D. Patel, E. Balaban. Temporal patterns of human cortical activity reflect tone sequence structure. Nature, 404, 2000.
[13] T. Quatieri. Discrete-Time Speech Signal Processing. Prentice Hall PTR, 2002.
[14] V.B. Mountcastle. An organizing principle for cerebral function: the unit module and distributed system. Pages 7-50. MIT Press, Cambridge, MA, 1978.
[15] M. Kendrick. Tasting the Light: Device Lets the Blind “See” with their Tongues. Scientific American, Aug 13, 2009.
[16] J.G. Proakis. Digital Communications, 5th Edition. McGraw Hill, 2007.
[17] P. Pojman. Ernst Mach. The Stanford Encyclopedia of Philosophy. Winter edition, 2011.
[18] E.C. Banks. The Philosophical Roots of Ernst Mach’s Economy of Thought. Synthese Vol. 139, Issue 1, pp 23-53, 2004.
[19] E.C. Tolman. Cognitive Maps in Rats and Men. The Psychological Review, 55(4) 189-208, 1948.
[20] K. Craik. The Nature of Explanation. Cambridge University Press, 1943.
[21] S. Haykin. Neural Networks and Learning Machines, third edition. Pearson Education, Inc. Prentice Hall, 2009.
[22] M. Shu, A. Fyshe. Sparse Autoencoders for Word Decoding from Magnetoencephalography. Not yet published.
[23] Andrew Ng. Unsupervised Feature Learning and Deep Learning Tutorial. http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial. Stanford University Web site, 2013.
[24] R. Patterson, I. Nimmo-Smith. An Efficient Auditory Filterbank Based on the Gammatone Function. Institute of Acoustics on Auditory Modelling, 1987.
[25] Malcolm Slaney (1998) "Auditory Toolbox Version 2", Technical Report #1998-010, Interval Research Corporation, 1998.
[26] Les Atlas (2012) Decomposition of speech and sound into Modulations and Carriers. http://msrvideo.vo.msecnd.net/rmcvideos/173320/dl/173320.pdf , Microsoft Research & University of Washington.
[27] Rosso O.A., Martin M.T., Figliola A., Keller K., Plastino A. (2006). EEG Analysis using wavelet-based information tools. Journal of Neuroscience Methods 153, 163-182.
[28] Klein D.J., König P., Körding K. (2003) Sparse Spectrotemporal Coding of Sounds. EURASIP Journal on Applied Signal Processing 2003:7, 659-667.