Speech Processing in the Time Domain Using Sparse … · 2018-11-16

Speech Processing in the Time Domain Using

Sparse Autoencoders for Unsupervised Feature Learning

And An Additive Model of the Peripheral Auditory System

Thomas Bryan

PhD proposal

Electrical and Computer Engineering

Florida Institute of Technology

1. INTRODUCTION

1.1 “Efficient coding” and human sensory perception

Human neural perception is unsurpassed in terms of its ability to process complex neural inputs, particularly in vision

and audio processing. The brain processes information at a relatively slow rate, on the order of tens of Hz, but it does so

with a massively parallel structure. Today, convincing evidence exists [1-4, 28] that the brain stores sparse overcomplete

“basis vector” representations of input stimuli. The basis vectors are overcomplete as the number of basis vectors is

larger than the effective dimensionality of the input space. The “essence of information” is captured by a linear

superposition of these basis vectors.

It has been postulated that these basis vectors are learned through a process of unsupervised feature learning. Five of the

best-in-class machine learning algorithms have been applied to visual, audio, and tactile data, and all five algorithms

are qualitatively similar [4]. They all produce sparse representations of the input data, and are trained using unsupervised

feature learning in a greedy layer-by-layer fashion. Finally, they all produce results that match the corresponding

receptive field in cortical portions of the brain. In this paper we apply one of these, the sparse autoencoder (SAE), to

automatic speech recognition tasks.

Sparse coding has been applied to the output of cochlear filter banks that produce frequency domain representations that

are similar to spectrograms [5, 28]. These time-frequency, or spectrotemporal representations may be processed using

image processing techniques. The learned basis vectors are components of spectrogram-like representations such as

horizontal lines corresponding to formant frequencies. These basis vectors represent the slow envelope of the neurons in

the peripheral auditory system.

The first part of the study is to directly apply the SAE to time domain speech waveforms. Time domain basis vectors

have been derived that have been demonstrated to match the peripheral neural responses in upper mammals [3]. The

learned basis vectors were shown to closely match the cochlear impulse response in mammals and are described by

gammatone filters [24, 25]. As a result of this finding, gammatone filter impulse responses will be used as the kernel

function for wavelet decomposition of speech using a non-orthogonal basis set. SAE input speech feature vectors will be

selected based on an MSE fit to the gammatone wavelets.

The first part of the study makes no assumptions about the model of the peripheral audio system. It simply selects

portions of speech that are most similar to gammatone impulse responses, normalizes them and processes them using an

SAE. Speech recognition and dialect detection are performed to validate the approach. This is done by removing the output

stage of the SAE and adding a softmax classifier.

The second part of the study uses an additive model of the peripheral audio system that incorporates the slow envelope

and the temporal fine structure. In this approach an SAE is used at the output of each filter, and there is one SAE for the

slow envelope, and one for the temporal fine structure. The temporal fine structure is used directly, so there is not a

spectrogram-like representation for it. The composite model has spectrogram-like features coming from the slow

envelope, as well as time-domain features due to the temporal fine structure. The additive model should support

functions such as pitch detection, music detection, and should work better than the phase one experiments in the

presence of noise and reverberation.

1.2 The slow envelope and temporal fine structure of the peripheral auditory system

Models of the audio pathway in the human ear are now fairly well understood [6-7]. The cochlear frequency response is

composed of a bank of logarithmically spaced parallel filters. At the output of each filter are stereocilia (hairs) that have

1 to 8 nerve cells attached to them. The hairs half-wave rectify the audio signal, which is subsequently low-pass filtered by

the “phase locking” of the auditory nerves to produce the slow envelope. The rectified output also passes through a band-pass filter

that passes the fundamental and the second harmonic. This is the fine temporal structure.
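
The signal path described above can be sketched in a few lines of Python. This is a minimal illustration, not the model used in the study: the moving-average low-pass filter, the 50 Hz envelope cutoff, the synthetic test signal, and the choice to take the fine structure as the rectified signal minus its envelope (rather than a true band-pass filter) are all assumptions made for the example.

```python
import numpy as np

def moving_average(x, width):
    """Crude low-pass filter: centered moving average of `width` samples."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def envelope_and_fine_structure(band, fs, env_cutoff_hz=50.0):
    """Split one filter-bank channel into slow envelope and fine structure.

    Assumptions (not from the text): the low-pass stage is a moving average
    spanning one period of `env_cutoff_hz`, and the fine structure is the
    rectified signal minus its slow envelope.
    """
    rectified = np.maximum(band, 0.0)              # half-wave rectification
    width = max(1, int(round(fs / env_cutoff_hz)))
    envelope = moving_average(rectified, width)    # slow envelope
    fine = rectified - envelope                    # residual fast component
    return envelope, fine

fs = 16000
t = np.arange(fs) / fs
# Hypothetical channel output: 1 kHz carrier with 4 Hz amplitude modulation.
band = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
envelope, fine = envelope_and_fine_structure(band, fs)
additive = envelope + fine   # additive combination of the two components
```

In this simplified split the additive combination returns the rectified channel output exactly; with a real band-pass stage the sum would only approximate it.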

It has long been known that speech intelligibility comes from the slow envelope [8]. This is the stimulus that cochlear implant

patients perceive. Additionally, it is known that they do not have the ability to perceive pitch, to hear music, or to

perceive speech in noisy environments. This is due to the lack of the temporal fine structure.

If the Hilbert phase of the fine temporal structure is taken, this becomes a multiplicative model. This is a modulator

implementation whereby the temporal fine structure is multiplied by the slow envelope to reproduce the audio. If the

temporal fine structure is added to the slow envelope, this becomes the additive model. This is a very simplified model and

is not intended to accurately model the peripheral audio neural pathways, as the filter bandwidths for the lowpass and

bandpass filters are not clearly understood. Instead, this model is to serve as the basis for a signal processing algorithm

that is inspired by the research in cochlear implants. The proof is in the pudding, as they say: if the model is valid, it

should add pitch detection capability, and be robust to reverberation and interference as compared to models that only

represent the slow envelope.

For the past two decades, the multiplicative model of the temporal fine structure has been adopted by research scientists.

In this model, the fine temporal structure is multiplied by the slow envelope to produce signals to the auditory cortex. This

model uses the Hilbert phase of preprocessed audio signals, which has 180 degree phase flips at the phase

discontinuities, and is not consistent with the functions of auditory nerve cells. However, the multiplicative model, based

on the Hilbert phase, is the most widely accepted model in use today. A new additive model has recently been proposed

that uses the fine temporal structure directly and adds it to the slow envelope [26]. This new additive model is

consistent with the neurobiology and is therefore the model that will be adopted for this investigation.

1.3 Entropy and human auditory perception

For decades, researchers have studied speech processing based on linguistic features, such as vowels, glides,

stops, fricatives and plosives. Many studies have sought to find out what parts of speech contain the most information.

Many experiments have been conducted over the years to determine if vowels or consonants contain most of the

information required for perception. It is widely known that written English is understandable without the vowels. This

is due to the high redundancy in the language. This is not true for spoken English. Many studies have sought to remove

vowels, consonants, or transitions between vowels and consonants, all with no meaningful, distinguishable preference for

any particular linguistic feature.

Recently, studies have been conducted that remove high entropy portions of the speech and these studies correlate

perception to features. Entropy for one of the studies [9] has been called Cochlear Scaled Entropy, and has been defined

as the difference, or dot product, between adjacent frames of data at the output of cochlear frequency response filters.

Other researchers [10-11] have defined audio entropy based on random fractals known as fractional Brownian noises.

These noises are characterized by the frequency domain exponent. For white noise the exponent is 0, for 1/f noise, the

exponent is 1 and for a random walk, the exponent is 2. The entropy increases with decreasing exponent so that white

noise has the highest entropy, or randomness. The random walk has the lowest entropy as the next value of pitch for

example, will be highly correlated with the present one. The noise is used to modulate pitch, loudness and duration.

There have been psychological experiments done in which subjects can distinguish entropy between pitch, and loudness,

but not duration. Moreover, similar studies have shown a direct correlation to blood oxygenation levels in the brain [12]

to entropy for fractional Brownian motion based on MRI measurements. From these experiments, the evidence shows

that the brain responses, and perception, are strong functions of the audio entropy of the stimulating signal.
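
The relationship between the spectral exponent and the predictability of the next sample can be illustrated numerically. The sketch below is an assumption-laden example rather than the procedure used in the cited studies: it shapes white Gaussian noise in the frequency domain to obtain a 1/f^exponent power spectrum, and uses the lag-1 correlation as a simple stand-in for "how strongly the next value depends on the present one". The function names are hypothetical.

```python
import numpy as np

def power_law_noise(n, exponent, seed=0):
    """Return n samples whose power spectrum falls off as 1/f**exponent."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    scale = np.ones_like(freqs)
    scale[1:] = freqs[1:] ** (-exponent / 2.0)   # amplitude ~ f^(-exponent/2)
    scale[0] = 0.0                                # drop the DC term
    x = np.fft.irfft(spectrum * scale, n)
    return x / np.std(x)

def lag1_correlation(x):
    """Correlation between each sample and the next one."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

# Exponent 0: white noise, 1: 1/f noise, 2: random-walk-like noise.
for exponent in (0, 1, 2):
    x = power_law_noise(4096, exponent)
    print(exponent, round(lag1_correlation(x), 3))
```

The lag-1 correlation grows with the exponent, which matches the statement that the random walk is the most predictable and white noise the least.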

The Cochlear Scaled Entropy and the fractional Brownian entropy models define entropy in terms of randomness or change.

While these definitions are consistent with entropy from a thermodynamic standpoint, especially in terms of Brownian

motion, they do not translate well to Shannon’s idea of entropy in terms of coding. Shannon defined channel capacity in

terms of source entropy, and “equivocation” of the channel, and showed that codes could be found that perform at the

limits of the mutual information, the source entropy minus the equivocation. A central theme for Shannon’s coding is

based on having a fixed set of source symbols of length N, and a set of code words of length M, where the code length of

M is greater than the symbol length N. Additionally, the dictionary of code words is of fixed length. Shannon defined the

capacity of the communication channel by properly decoding received code words back into the original transmitted

symbols.

Smith and Lewicki [3] developed an efficient auditory coding algorithm that decomposes speech into a sparse set of

kernel functions, or “codes”. The use of these codes provides a method for defining the source entropy of speech that is

more consistent with Shannon’s definition of entropy as opposed to simply viewing source entropy as high rates of

“change”. Furthermore, the kernel function decomposition method captures the underlying structure of speech and

should maximize the “SNR in the detector”. The algorithm by Smith and Lewicki serves as the basis for speech feature

extraction of “codes” in this study. The method they present is based on an autoencoder with linear activations. The

linear activations will be replaced by sigmoidal activations in hopes of learning higher order statistics from the speech.

1.4 Matched filter speech coding

Modern communication systems are able to extract signals in high-noise, high-multipath fading environments and

perform within tenths of a dB of Shannon’s limit [17]. At the core of these communications systems is the matched

filter. The matched filter optimally maximizes the signal to noise ratio in the receiver. It does this by matching the

receive waveform to the transmit waveform, after channel equalization. This same principle can be applied to speech

coding. However, due to the inherent complexity of speech the “matched filters” will be allowed to overlap in time.

Encoding of speech has three major challenges [13]: 1) time dilation due to differences in fast and slow talkers, 2) high

dynamic range of amplitude variations, 3) time frequency resolution problems.

It is not expected that an independent set of dictionary codes may be used for speech like matched filters in

communication systems, due to the richness and variability of speech. However, it should be possible to find time domain

overlapped codes that may be summed together that reproduce speech waveforms. The intuition for the existence of

these codes comes from the model of the cochlear filters in the peripheral auditory system [6-7]. From basic

communication theory, the output of a discrete filter is the sum of individually weighted impulse responses.

Furthermore, there are n parallel filters in the cochlea. The output of each individual filter may be

approximated by keeping a limited number of the weighted impulse responses. These outputs may then be summed together to

reproduce a compressed, code-based representation of the speech. The challenge in finding good codes is to find which

weighted impulse responses capture the best features for the given classification task. Features, or code words, for

speaker recognition will likely not be the best for other tasks such as phoneme detection and sentiment or opinion

mining. For example, it will be shown that an SAE is able to reconstruct intelligible speech by random sampling of the

speech waveform. However, the classification accuracy for speaker identification is no better than random guessing. The

SAE finds an optimum set of initial conditions, in a greedy layer-by-layer approach. The output layer of the SAE is

removed, and a softmax classifier is added to the hidden layer. The entire network is then trained with backpropagation.

New input weights emerge, based on the output of the classifier, and these new input weights become the feature vectors

or codes. Finding good codes, then, is a three step process:

1. Raw feature extraction.

2. Sparse autoencoding to obtain good initial conditions for a classification network.

3. Tuning these features via backpropagation.
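
The structure of steps 2 and 3 can be sketched as follows. This is only an outline of the data flow under stated assumptions: the layer sizes, the sparsity target `rho`, the penalty weight `beta`, the random weights, and the random input frames are hypothetical, and no training loop is shown. The sketch defines a sparse autoencoder cost (reconstruction error plus a KL-divergence sparsity penalty) and then reuses the encoder weights under a softmax layer in place of the decoder.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_loss(x, W1, b1, W2, b2, rho=0.05, beta=3.0):
    """Reconstruction error plus a KL sparsity penalty on mean hidden activity."""
    h = sigmoid(x @ W1 + b1)             # hidden code
    x_hat = sigmoid(h @ W2 + b2)         # reconstruction (SAE output stage)
    rho_hat = h.mean(axis=0)             # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    mse = 0.5 * np.mean(np.sum((x_hat - x) ** 2, axis=1))
    return mse + beta * kl

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classifier_forward(x, W1, b1, Wc, bc):
    """SAE encoder (W1, b1) kept; decoder replaced by a softmax layer (Wc, bc)."""
    return softmax(sigmoid(x @ W1 + b1) @ Wc + bc)

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 64, 100, 4    # hypothetical sizes (overcomplete code)
x = rng.random((8, n_in))                 # stand-in for normalized speech frames
W1 = 0.1 * rng.standard_normal((n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_in)); b2 = np.zeros(n_in)
Wc = 0.1 * rng.standard_normal((n_hidden, n_classes)); bc = np.zeros(n_classes)

loss = sae_loss(x, W1, b1, W2, b2)                # step 2: quantity to minimize
probs = classifier_forward(x, W1, b1, Wc, bc)     # step 3: network to fine-tune
```

In a full implementation, `W1` and `b1` would first be fit by minimizing `sae_loss`, and then all of `W1`, `b1`, `Wc`, and `bc` would be updated by backpropagation on the classification loss.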

The idea of representing speech as code words is not new. The most widely used method for speech transcription uses a

decomposition of speech into complex exponentials, or sinusoids, in the form of spectrograms.

Spectrograms are one of the most widely used methods today for speech processing. Spectrograms are Short Time

Fourier Transforms (STFT’s) [13]. STFT’s are based on overlapped windowed data samples, followed by a Fast Fourier

Transform. The STFT performs a decomposition of the time-domain signal into windowed sine and cosine vectors of fixed

length. These vectors may be thought of as a dictionary of code words of fixed length. One simple method of calculating

the source entropy of speech would be to threshold the magnitude of each frequency bin. If it exceeds the threshold,

histogram that “code word”, normalize the distribution so that all bins sum to 1, and calculate the entropy.
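
A minimal sketch of this entropy calculation is given below. The frame length, hop size, Hann window, and the use of a threshold relative to each frame's peak magnitude are assumptions chosen for the example; the text does not specify them. The function name and test signals are hypothetical.

```python
import numpy as np

def stft_codeword_entropy(x, frame_len=256, hop=128, threshold=0.1):
    """Entropy (bits) of the histogram of above-threshold STFT bins."""
    window = np.hanning(frame_len)
    counts = np.zeros(frame_len // 2 + 1)
    for start in range(0, len(x) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(window * x[start:start + frame_len]))
        # Count a "code word" for every bin above a relative threshold.
        counts += mag > threshold * mag.max()
    if counts.sum() == 0:
        return 0.0
    p = counts / counts.sum()            # normalize so all bins sum to 1
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)                      # few active code words
noise = np.random.default_rng(0).standard_normal(fs)     # many active code words
h_tone = stft_codeword_entropy(tone)
h_noise = stft_codeword_entropy(noise)
```

A pure tone activates only a handful of bins and so yields a low entropy, while white noise spreads its counts over nearly all bins and approaches the maximum of log2 of the number of bins.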

The problem with STFT’s is that there is an inherent time-frequency resolution trade-off due to “windowing”. Long windows lead

to high frequency resolution and poor time resolution. Conversely, short windows lead to good time resolution and poor

frequency resolution. Some data is always lost in the process, as each frequency bin is the dot product between the

speech and the sine/cosine vector. For example, if the speech has a phase inversion halfway through the sample vector,

the dot product might be close to zero while the actual energy is quite high. This can be seen when doing cepstral analysis

and liftering the pitch period of speech [13]. If the window is pitch synchronous, the glottal pulse never disappears.

However, for non-pitch-synchronous windowing, the liftered pitch pulse fades in and out over frames of data.
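
The phase-inversion example can be checked directly. In the sketch below, the window length and the analysis bin are arbitrary choices for illustration. A sinusoid that sits exactly on a DFT bin is compared with the same sinusoid whose sign is flipped halfway through the window.

```python
import numpy as np

n = 256                                   # window length (arbitrary choice)
k = 8                                     # analysis bin: 8 cycles per window
t = np.arange(n)
basis = np.exp(-2j * np.pi * k * t / n)   # complex sinusoid for DFT bin k

steady = np.cos(2 * np.pi * k * t / n)
flipped = steady.copy()
flipped[n // 2:] *= -1                    # 180-degree phase inversion at midpoint

steady_bin = abs(np.dot(steady, basis))   # projection onto bin k
flipped_bin = abs(np.dot(flipped, basis))
energy_steady = np.sum(steady ** 2)
energy_flipped = np.sum(flipped ** 2)     # same energy as the steady tone
```

The two signals carry identical energy, but the projection of the flipped signal onto bin `k` cancels between the two halves of the window, so that bin reports essentially nothing.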

The time-frequency resolution may be addressed by taking the mel frequencies from the FFT’s [13]. Here, the FFT bins

are arranged into logarithmically spaced center frequencies with triangular overlapped frequency bins. This results in

wider bandwidths at high frequencies, and narrower bandwidths at low frequencies. This is similar to the frequency

response of wavelets. This addresses the time-frequency problem, but the underlying problem with the loss of

information from the STFT remains.

Returning to the additive model [6-7], it is a natural choice to base the “matched filter” representation of speech on the

impulse responses of the cochlear bandpass filter bank. These impulse responses, or gamma-tones, form the input

stimuli of human auditory perception. Taken collectively, these gamma-tone impulse responses comprise a cochlear-

gram when the magnitude is taken.

Cochlear-grams are frequency domain representations of speech that are based on a model of a set of parallel band-pass

filters in the human ear. They are logarithmically spaced in frequency similar to MFCC’s [13]. Moreover, they are

overlapped in the frequency domain as well, which is also similar to MFCC’s.

It is believed that the human peripheral auditory system efficiently encodes auditory signals sent to the auditory cortex of

the brain [3]. Smith and Lewicki [3] decompose speech based on a set of learned kernel functions. They also show that

gamma-tones closely match the “spiking” models in the auditory nerves of cats.

Using this method, speech is convolved with revcor filter impulse responses, or gamma-tones. Each gamma-tone

impulse response is assigned a temporal coordinate, and an amplitude scaling value. The audio signal may be

reconstructed by summing the individual gamma-tones to generate the composite audio waveform. There are a finite

number of parallel cochlear filters, which leads to a fixed dictionary of source encoding code words. This will support

the calculation of source entropy. Moreover, this method is attractive because it addresses the time dilation problem by

repetitive use of similar gamma-tones over long formant frequencies. Additionally, the amplitude problem is addressed

because the gamma-tone decomposition is based on a mean square error fit to the data, which is independent of data

amplitude. The time-frequency problem is addressed in a manner similar to wavelets or MFCC’s, in that the higher

frequency components have wider bandwidths and faster impulse responses [13]. Finally, the gamma-tones are known

to model the phase-locking output of auditory nerve fibers when driven by AWGN.
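
One step of such a decomposition can be sketched as below. This is a simplified illustration and not the algorithm of Smith and Lewicki: the fourth-order gammatone form, the ERB-based bandwidth, the 25 ms kernel duration, the three center frequencies, and the function names are assumptions. With unit-norm kernels, the least-squares amplitude of a kernel at a given offset is simply its inner product with the signal at that offset.

```python
import numpy as np

def gammatone(fc, fs, duration=0.025, order=4, bandwidth=None):
    """Unit-norm gammatone impulse response with center frequency fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    if bandwidth is None:
        bandwidth = 1.019 * (24.7 + 0.108 * fc)   # ERB-based width (assumption)
    g = (t ** (order - 1) * np.exp(-2 * np.pi * bandwidth * t)
         * np.cos(2 * np.pi * fc * t))
    return g / np.linalg.norm(g)

def best_kernel(signal, kernels):
    """Return (kernel index, time offset, amplitude) of the best single fit."""
    best = (None, None, 0.0)
    for idx, g in enumerate(kernels):
        corr = np.correlate(signal, g, mode="valid")  # inner product per offset
        offset = int(np.argmax(np.abs(corr)))
        if abs(corr[offset]) > abs(best[2]):
            best = (idx, offset, float(corr[offset]))
    return best

fs = 16000
kernels = [gammatone(fc, fs) for fc in (500.0, 1000.0, 2000.0)]
# Hypothetical test signal: the 1 kHz kernel, scaled by 3, placed at sample 500.
signal = np.zeros(2000)
signal[500:500 + len(kernels[1])] += 3.0 * kernels[1]
idx, offset, amp = best_kernel(signal, kernels)
```

Repeating this step on the residual `signal - amp * shifted_kernel` would give a matching-pursuit style decomposition in which each code is a (kernel, time, amplitude) triple.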

2. LITERATURE REVIEW

2.1 Historical background on sparse representations

Ernst Mach [17] was part of the first generation of physiological psychologists, and is considered a forerunner of

modern neural networks by modern psychologists. Mach discovered that “the eye has a mind of its own” and that we

perceive not direct stimuli, but relations of stimuli. Mach believed that we do not perceive reality; rather, we perceive

the after effects of the nervous system’s adaptation to new stimuli. Mach believed that animals survive in their

environment based on their ability to adapt to a wider range of temporal and spatial surroundings. Over time organisms

develop memory that better enables them to adapt. As they evolve, communication becomes possible and they are able to

learn from others. From this learning, science is created. Mach developed the theory of the “economy of thought” [18],

in which the simplest, most parsimonious theories economize memory by using abstract concepts and laws instead of

attending to the details of individual events.

Tolman, a “field theorist”, put forth the idea of cognitive maps [19]. Tolman performed psychological experiments

on rats in a maze, and theorized that the rats learn a cognitive map of the maze. This is substantially different from the

stimulus/response psychological theories of the time, which were based on a telephone switchboard school of thought. In this

school of thought, good “connections” get reinforced, and bad “connections” get disconnected.

Craik introduced the concept of mental models [20] and was one of the first practitioners of cognitive science. Craik

believed that the mind constructs small scale models of reality that are used to anticipate events. He believed the process

of perception and reasoning takes place in three steps:

1. The translation of the external process into words, numbers, or other symbols, which can function as a model of

the world.

2. A process of reasoning from these symbols leading to others.

3. The retranslation back from these symbols into external processes, or at least a recognition that they

correspond to external processes.

Stimulation of sense organs results in neural patterns; reasoning stimulates other neural patterns that are retranslated into

excitation of the motor organs.

The study of cognitive fields was formalized after Shannon introduced information theory. Philosophers, physiologists,

and psychologists began to apply Shannon’s concepts of redundancy and efficiency to the primary processing in

sensory organs. “Efficient coding” soon became a central theme in how sensory inputs are mapped to neural codes that

capture the statistical structure of sensory inputs. The application of efficient coding resulted in the sparse

representations of features that are learned through unsupervised feature learning. Moreover, these features closely

match the physiology of the sensory organs.

2.2 Claude Shannon 1948 A Mathematical Theory of Communication, Bell System Technical Journal

In this epic work, Shannon develops information theory for modern communication theory based on the invention of

Pulse Code Modulation and Pulse Position Modulation for telephony. He defines information, based on the work of

Hartley, as the negative log of the probability of an event. If the logarithm is taken to base 2, the information content is in bits. Shannon

defines a block diagram for the communications channel in terms of an information source, a transmitter, a channel with

noise added, a receiver, and a destination that receives a message. In this model the information source can be anything

from television signals, to telephony or radio. The transmitter maps the information source to the channel. Examples of a

transmitter are FM modulators, or a voice encoder system. The channel is any medium the signal passes through

between the transmitter and the receiver. Note that the channel in Shannon’s definition is a probabilistic model, and not

the physical channel typically thought of by communications engineers. The receiver performs the inverse function of

the transmitter, and reconstructs the message and passes it along to the destination, the person the message is intended

for.

Shannon classifies communications systems as discrete, continuous, or mixed, and goes on to focus only on the discrete

case for the first half of the paper. His version of discrete is not a sampled data system. Instead Shannon defines discrete

as a system in which both the message and the signal are sequences of discrete symbols. Telegraphy is a discrete system where the

message is a sequence of letters, and the symbols are dots, dashes and spaces. With the discrete communication system in

mind Shannon derives the capacity for the communication channel. He first introduces the “Discrete Noiseless Channel”.

For the discrete noiseless channel, a Teletype has 32 symbols of equal duration. If all the symbols are equally likely, the

maximum capacity of the channel is $\log_2 32 = 5$ bits/symbol $\times$ $n$ symbols/second $= 5n$ bits/sec. Normalizing the symbol rate

to 1 symbol/sec yields 5 bits/Hz.

In the case of the English language the number of symbols is 26, with a space required between words. If all symbols were equally likely, the

average information per symbol would be $\log_2 27 = 4.7549$ bits/symbol.
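
These two figures are easy to reproduce. The short sketch below uses only the Python standard library; the helper name `capacity_bits_per_second` and the example symbol rate are hypothetical.

```python
import math

teletype_bits = math.log2(32)    # 32 equally likely symbols
english_bits = math.log2(27)     # 26 letters plus the space

def capacity_bits_per_second(n_symbols, symbols_per_second):
    """Noiseless capacity when all n_symbols are equally likely."""
    return math.log2(n_symbols) * symbols_per_second

print(teletype_bits, round(english_bits, 4))
print(capacity_bits_per_second(32, 10))   # 10 symbols/sec is an example rate
```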

Not all symbols are equally likely, and not all symbol sequences may be allowed. For this discussion, Shannon shows

state diagrams for communications sources. Shannon defers to the appendix but states that if a state diagram is known

with its transition probabilities, the channel capacity can be calculated.

The next topic covered is the discrete information source. Here statistical knowledge of the source is determined that will

aid in reducing the required capacity of the channel. For example, the capacity of the channel for the English language

with the proper frequency of occurrence for each letter reduces the required channel capacity from 4.7549 bits/symbol to

4.1811 bits/symbol. Further reductions in allowable sequences occur by taking 2 letter sequences or 3 letter sequences.

Even greater reduction is achieved using word repetition rates, and valid sequences of words. By using allowable word

sequences, English text contains about 1.5 bits/symbol.

Shannon next talks about using Markov models for discrete sources and describes these models as ergodic processes. He

loosely states that each sequence generated by these processes will have the same statistics.

In section 6 of the paper, entitled Choice Uncertainty and Entropy, Shannon introduces the definition of entropy as a

search for a method to determine the best choice from one of many possible outcomes. He labels the “choice” $H$, and

lays out three properties for H:

1. H should be continuous in the $p_i$.

2. If all $p_i$ are equal, and there are n possibilities, then H should be a monotonically increasing function of n, as

there are more choices with increasing n.

3. If choice is broken down into successive choices, H should be the weighted sum of the individual choices.

The only function that meets these three properties is

$$H = -K \sum_{i=1}^{n} p_i \log p_i$$

Shannon sets K=1, and calls H entropy, the average number of bits of information per symbol.

When an outcome is known, p = 1 and H = 0:

$$H = -(0 \log_2 0 + 1 \log_2 1) = -(0 + 0) = 0 \text{ bits/symbol}$$

Here, although $\log_2 0 = -\infty$, the product $0 \log_2 0$ is treated as 0.

When all outcomes are equally likely independent events the entropy is maximized. For the binary case:

$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit/symbol}$$

The entropy of a joint event is:

$$H(x, y) = -\sum_{i,j} p(i,j) \log p(i,j)$$

An inequality for joint entropy is:

$$H(x) = -\sum_{i,j} p(i,j) \log \sum_{j} p(i,j)$$

$$H(y) = -\sum_{i,j} p(i,j) \log \sum_{i} p(i,j)$$

$$H(x, y) \le H(x) + H(y)$$

The equality holds when both events are independent, when

$$p(i,j) = p(i)\,p(j)$$

Finally, conditional entropy is defined based on the following idea. For the joint events x and y, given that x

assumes a value i, the conditional probability that y assumes a value j is

$$p_i(j) = \frac{p(i,j)}{\sum_{j} p(i,j)}$$

Conditional entropy is defined as

$$H_x(y) = -\sum_{i,j} p(i,j) \log p_i(j)$$

This is the average uncertainty about y that remains once x is known.

Substituting the value of $p_i(j)$,

$$H_x(y) = -\sum_{i,j} p(i,j) \log \frac{p(i,j)}{\sum_{j} p(i,j)}$$

$$H_x(y) = -\sum_{i,j} p(i,j) \log p(i,j) + \sum_{i,j} p(i,j) \log \sum_{j} p(i,j)$$

$$H_x(y) = H(x,y) - H(x)$$

$$H(x,y) = H_x(y) + H(x)$$

$$H(x) + H(y) \ge H(x,y) = H(x) + H_x(y)$$

Hence,

$$H(y) \ge H_x(y)$$

From this Shannon concludes that the uncertainty of y is never increased by the knowledge of x, it can only be

decreased, or equal when x and y are independent.
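
The entropy relations in this section can be verified numerically. The sketch below is an illustration only: the joint distribution `joint` is a made-up example, and the conditional entropy is computed from the identity relating joint, marginal, and conditional entropy rather than from its defining sum.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])         # hypothetical p(i, j)
H_xy = entropy(joint)                  # joint entropy H(x, y)
H_x = entropy(joint.sum(axis=1))       # marginal entropy H(x)
H_y = entropy(joint.sum(axis=0))       # marginal entropy H(y)
H_y_given_x = H_xy - H_x               # conditional entropy H_x(y)

print(entropy([0.5, 0.5]), entropy([1.0, 0.0]))
print(H_xy, H_x + H_y, H_y, H_y_given_x)
```

For this dependent pair, the joint entropy is strictly less than the sum of the marginals and the conditional entropy of y is strictly less than its marginal entropy; for an independent pair both relations become equalities.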

In the next section Shannon talks about the entropy of a source. The source entropy is maximized when all symbols are

equally likely. The ratio between the actual entropy and the maximum entropy is called the relative entropy. This is the

maximum compression possible. The redundancy is 1 minus the relative entropy. The redundancy of English over eight

letters is about 50%, so half the letters are chosen freely, and half are determined by the structure of the language.

The Fundamental Theorem for a Noiseless Channel is introduced in the next section. Here the channel capacity C is

defined in bits per second. The entropy is in units of bits per symbol. The theorem states that communication is possible for

rates up to $C/H - \epsilon$ symbols per second, where $\epsilon$ is an arbitrarily small number. For a simple example, if H = 5 bits/symbol

and the capacity is 5 bits per second then communications is possible up to 1 symbol per second.

Part II of the paper is called The Discrete Channel With Noise. The first section is called Equivocation and Channel

Capacity. Here Shannon discusses a noisy communication channel that is making 1% errors. The source is sending 1’s

and 0’s with equal probability at a rate of 1000 symbols per second. Since the source produces symbols with equal probability the source

rate is 1000 bits per second. The expected rate out of the receiver is not 990 bits/second. Likewise, if the channel is

making 50% errors, the received rate is not 500 bits/second. The receiver rate is

$$R = H(x) - H_y(x)$$

Here Shannon refers to $H_y(x)$ as the equivocation of the channel. If a 0 is received, the a posteriori probability that a zero

was sent is .99, and is .01 that a 1 was sent. The situation is similar if a 1 is received.

The equivocation for 1% errors is

$$H_y(x) = -[0.99 \log_2 0.99 + 0.01 \log_2 0.01] = 0.081 \text{ bits/symbol}$$

$$R = (1 - 0.081) \times 1000 = 919 \text{ bits/sec}$$

For 50% errors the equivocation is

$$H_y(x) = -[0.5 \log_2 0.5 + 0.5 \log_2 0.5] = 1 \text{ bit/symbol}$$

$$R = (1 - 1) \times 1000 = 0 \text{ bits/sec}$$

It is tempting to assume that a channel making 50% errors still delivers half of the information, because half of the received symbols are correct by chance. However, from the mathematical definition of

conditional entropy, there is 1 bit of equivocation per symbol, and no information is received.
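
The two worked cases can be reproduced with a few lines of Python. The sketch assumes a binary symmetric channel with equiprobable inputs, so that the source entropy is 1 bit/symbol and the equivocation equals the binary entropy of the error probability; the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary event with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def received_rate(error_prob, symbols_per_second=1000):
    """Source entropy minus equivocation, times the symbol rate (bits/sec)."""
    return (1.0 - binary_entropy(error_prob)) * symbols_per_second

print(round(binary_entropy(0.01), 3), round(received_rate(0.01)))
print(received_rate(0.5))
```

Under the same assumptions, a channel that makes 100% errors has zero equivocation, since every received symbol can simply be inverted.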

Shannon’s fundamental theorem of a discrete channel with noise can be summed up by

$$C = \max\,[H(x) - H_y(x)]$$

The max argument corresponds to the best source encoder for a particular source. Since capacity is a positive number in

terms of bits/symbols, capacity is the entropy of the source minus the equivocation of the channel. Shannon goes on to

say that as long as the equivocation is less than the entropy of the source, efficient codes exist that make communication

possible.

While Shannon invented the field of information theory for communication systems, it was quickly adapted by

researchers and scientists studying principles of human perception. Based on the work of Shannon, it was quickly

hypothesized that “efficient coding” in the peripheral auditory and optical systems took place in order to provide

maximum information content to the auditory and vision cortexes of the brain.

2.3 H.B. Barlow 1961 Possible Principles underlying the Transformations of Sensory messages Chapter 13 In:

Sensory Communication, W. Rosenblith (Ed.), M.I.T. Press, pp. 217-234.

Barlow studied how frog neurons respond to visual stimuli. He was interested in the hunting or snapping response, as

well as the escape response from artificially generated visual inputs. Barlow was one of the first to apply information

theory to sensory information in the nervous systems of animals and reptiles. His fundamental theory is that redundancy

is removed from visual sensory input before it is sent on the neural pathways to the brain. His hypothesis has three

assertions:

1. Incoming messages have certain “passwords” that carry key significance to the animal.

2. There are filters or recoders, whose pass characteristics may be controlled with requirements from other parts

of the nervous system.

3. “They recode sensory messages, extracting high relative entropy from highly redundant sensory input.”

Barlow’s “password” idea came from studies of frogs in which a snapping, or hunting response was elicited based on

visual stimulation. The idea of feedback came from studies of cat retinas and how they react to visual stimulation.

Barlow’s third assertion is later explained as stripping away redundancy to produce “economy of thought” in order to

bring simplicity and order to complex sensory input.

Barlow goes on to model the recoders with the following simplifying assumptions:

1. “Sensory pathways are treated as noiseless systems using discrete signals.”

2. “The discrete signals are single impulses in particular nerve fibers in particular time intervals. For any one fiber,

and time interval, the impulse is present or absent, so the code is binary.”

3. “The constraints on the capacity of the nerve pathway are the number of fibers F, the number of discrete time

intervals R, and average number of impulses per second per fiber I. The average number of impulses per fiber is

assumed to be a variable constraint.”

Barlow then talks about intrinsic noise that is present on the neural pathways, and then proceeds to ignore it as this

might obscure his fundamental point.

Barlow then talks about the capacity of a neural pathway in terms of 10 nerve fibers carrying binary codes over an

interval of 1/10 of a second. He assumes all messages are independent and mutually exclusive so that the average

information or entropy is

$$H_{av} = -\sum_{m} P_m \log P_m$$

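For a given symbol rate T, the information flow per unit time is

$$H_{av}\,T$$

The information rate for a fiber bundle is given as

$$C = -F R \left[ \frac{I}{R} \log \frac{I}{R} + \left(1 - \frac{I}{R}\right) \log\left(1 - \frac{I}{R}\right) \right]$$

The relative entropy is

$$\frac{H\,T}{C}$$

And the redundancy is

$$1 - \frac{H\,T}{C}$$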

Barlow’s main point in this discussion is that the neural pathway has a limited capacity C. Efficient codes must therefore exist to remove the redundancy from the input “messages”.
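The quantities in this discussion can be illustrated numerically. The following is a minimal sketch, assuming that T is a rate in messages per second and that each time slot on a fiber holds an impulse with probability I/R; the function names and the numbers are hypothetical and are not taken from Barlow's paper.

```python
import math


def message_entropy(probs):
    """Average information per message, H_av = -sum(P_m * log2(P_m)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def bundle_capacity(fibers, slots_per_s, impulses_per_s):
    """Capacity in bits/s of a bundle of binary fibers.

    Each fiber offers `slots_per_s` time intervals per second and fires
    `impulses_per_s` times per second on average, so a slot holds an
    impulse with probability p = impulses_per_s / slots_per_s.
    """
    p = impulses_per_s / slots_per_s
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a slot that is always empty or always full carries nothing
    bits_per_slot = -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    return fibers * slots_per_s * bits_per_slot


def redundancy(h_av, message_rate, capacity):
    """Redundancy 1 - (H_av * T) / C, with T assumed to be messages per second."""
    return 1.0 - (h_av * message_rate) / capacity


# Hypothetical numbers, chosen only for illustration.
capacity = bundle_capacity(fibers=10, slots_per_s=10, impulses_per_s=5)
h_av = message_entropy([0.5, 0.25, 0.25])
print(capacity, h_av, redundancy(h_av, message_rate=10, capacity=capacity))
```

With these assumed numbers the messages use only a small fraction of the bundle capacity, which is the kind of gap that a redundancy-reducing code is meant to close.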

Barlow then goes on to discuss simple redundancy reducing codes. The purpose of these redundancy codes is for an

“economy of impulses” or for “emphasizing the unusual”. From this discussion, Barlow predicted the following for the

neural impulse code model.

Usual events should be represented by a decrease in number of impulses.

Codes should exist in accordance with the probabilities of the events encountered.

Codes should respond to complex features of the inputs, not to properties that are simple in physical or anatomical terms.

Barlow further assumes that redundancy reduction takes place in several stages. Redundancy reduction is a process of subdivision: higher stages should remove more complex forms of redundancy. To emphasize the need for redundancy reduction, Barlow points out that the number of cells in the cortical region of the brain far outnumbers the number of cells in the nerve channel. From this he concludes that an enormous reduction in the number of impulses “seems to be possible” without any loss of information.

Barlow summarizes by saying the paper is not based on physiological models and that these are hypotheses only. He justifies the “password” hypothesis with laboratory experiments in which animals respond to stimuli. The control hypothesis also comes from laboratory observations of animals’ reactions to stimuli. Barlow devotes most of the paper to redundancy reduction on the neural pathways, and concludes with the following statement:

“To strip the redundancy from the previous pages, what I have said is this: it is foolish to investigate sensory

mechanisms blindly- one must also look at the ways in which animals make use of their senses. It would be surprising if

the use to which they are put was not reflected in the design of the sense organs and their nervous pathways-”

Barlow doesn’t formulate a model for the reduction in redundancy, other than to show a simple example of 10 nerve fibers. These fibers use a binary on/off code, with a symbol time of 1/10 of a second. He does use Shannon’s theory to show, under the simplifying assumptions, that the information capacity of the nerve fibers is limited.

2.4 Atick, Joseph J., 1992 Could information theory provide an ecological theory of sensory processing? Network:

Computation in Neural Systems 22.1-4 (2011): 4-44.

Atick studied the ganglion cells of the retina in detail. He develops one of the first neural networks that model the behavior of the ganglion cells. Atick discusses the statistical composition of natural images and demonstrates that the functions of the simple cells in the primary visual cortex capture these same statistics. Atick expounds on the theory of minimum entropy, or factorial, codes that minimize the statistical dependencies between symbols. Consider a message w composed of l symbols. The probability of the message is

$$P(w) = P(c_1, \ldots, c_l)$$

If the symbols are independent, then the probability of a message is simply

$$P(w) = P(c_1)\,P(c_2) \cdots P(c_l)$$

The entropy $H(w)$ can then be written as the sum over the symbol (or pixel) entropies $H_i$, where each symbol takes one of N values:

$$H(w) = -\sum_{i=1}^{l} \sum_{c_i=1}^{N} P(c_i) \log_2 P(c_i) \equiv \sum_{i=1}^{l} H_i$$

Atick points out that in general the symbols are not statistically independent, so the total entropy does not equal the sum of the individual pixel entropies. In general the message entropy is:

$$H(w) \le \sum_{i=1}^{l} H_i$$

If all symbol values are equally likely, $P(c_i) = 1/N$ for all $i$, the maximum entropy is achieved, which is denoted as the capacity C:

$$\max\{H(w)\} = \max\left\{\sum_{i=1}^{l} H_i\right\} = l \log_2 N \equiv C$$

The redundancy is given as:

$$R = 1 - \frac{H(w)}{C}$$

Atick states there are two contributions to redundancy:

$$R = \frac{1}{C}\left(C - \sum_{i=1}^{l} H_i\right) + \frac{1}{C}\left(\sum_{i=1}^{l} H_i - H(w)\right)$$

The first term is the redundancy due to unequal symbol probabilities. This term goes to zero if all the symbols are equally likely. The second term describes redundancy due to statistical correlations between symbols. If all the symbols are independent, the second term vanishes. Codes that reduce or eliminate redundancy are called minimum redundancy codes. This is the approach taken for communications systems, where the goal is to operate as close as possible to the channel capacity. Codes that minimize the difference between the sum of the symbol entropies and the message entropy, $\sum_i H_i - H(w)$, are called minimum entropy codes. The goal of minimum entropy codes is to eliminate the statistical dependencies between symbols at the expense of decreasing the entropy of the individual symbols. The goal is to express the joint probability of the symbols as a product of independent symbol probabilities, resulting in “factorial” codes. This can only be done by tolerating some redundancy in symbol probabilities. For neural coding, Atick notes that some redundancy adds robustness in the presence of noise, and can therefore be tolerated.
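The two contributions to redundancy can be checked on a toy message. The sketch below assumes a message of two binary symbols that agree 90% of the time; the joint distribution and the function names are hypothetical and serve only to show that the two terms add up to the total redundancy.

```python
import math


def entropy_bits(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)


def marginal(joint, index):
    """Marginal distribution of the symbol at position `index`."""
    out = {}
    for message, p in joint.items():
        out[message[index]] = out.get(message[index], 0.0) + p
    return out


def redundancy_terms(joint, n_values):
    """Return (unequal-probability term, dependency term, total redundancy)."""
    length = len(next(iter(joint)))
    capacity = length * math.log2(n_values)      # C = l * log2(N)
    sum_h = sum(entropy_bits(marginal(joint, i)) for i in range(length))
    h_message = entropy_bits(joint)              # joint (message) entropy
    unequal = (capacity - sum_h) / capacity
    dependent = (sum_h - h_message) / capacity
    total = 1.0 - h_message / capacity
    return unequal, dependent, total


# Two binary symbols that agree 90% of the time (assumed toy distribution).
correlated = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
print(redundancy_terms(correlated, n_values=2))
```

In this example each symbol is individually equiprobable, so the first term is essentially zero and all of the redundancy comes from the dependency between the symbols, which is the part a minimum entropy (factorial) code targets.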


2.5 B. Olshausen and D. Field 1996 Emergence of simple-cell receptive field properties by learning a sparse code for natural images Nature vol. 381, pp. 607-609

Olshausen and Field build upon the work of Atick and develop minimum entropy codes that capture the underlying

statistics of natural images. Moreover, these codes behave in the same manner as the receptive fields of the primary

visual cortex.

Olshausen and Field have extensive knowledge of the receptive fields in simple cells in the primary visual cortex of

mammals. They describe these fields as localized, oriented, and bandpass, comparable to the basis functions of wavelets.

They develop a sparse basis vector encoding scheme that captures the properties of localized, oriented and bandpass.

This paper introduces the SAE as a method of extracting a sparse basis vector representation of natural images. The autoencoder performs in a manner similar to ICA in that it learns a sparse basis vector representation with high kurtosis. Additionally, the features extracted from the SAE look very similar to the “edges” learned by ICA.

An image may be constructed from a weighted linear summation of basis vectors:

$$I(x, y) = \sum_i a_i\, \phi_i(x, y)$$

The goal of image encoding is to find a statistically independent set of basis vectors $\phi_i$ that form a complete code spanning a set of natural images. The notion of statistical independence here is the same as the “minimum entropy” or factorial codes of Barlow (1989) and Atick (1992).

A discussion of Principal Component Analysis (PCA) concludes that this method is not consistent with the neurobiology. PCA produces an orthonormal representation of the images that captures the directions of maximum variance. The coefficients are pairwise decorrelated, $\langle a_i a_j \rangle = \langle a_i \rangle \langle a_j \rangle$. However, the components that result from PCA do not match any known receptive fields, as they are not localized. Furthermore, natural images containing curved edges do not render well with orthogonal components. In short, PCA does not capture the statistical structure of the data.

Olshausen goes on to say that the statistical structure of images may be seen when the joint entropy is less than the sum of the individual entropies, $H(a_1, a_2, \ldots, a_n) < \sum_i H(a_i)$. The strategy of reducing the individual entropies is suggested as a means to gain statistical independence. This is attributed to Barlow’s (1989) minimum entropy code. It is then conjectured that natural images have “sparse structure” and that any image may be made with a small number of basis vectors selected from a large dictionary.

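The cost function for the sparse code is introduced as:

$$E = -[\text{preserve information}] - \lambda\,[\text{sparseness of } a_i]$$

where

$$[\text{preserve information}] = -\sum_{x,y} \left[ I(x, y) - \sum_i a_i\, \phi_i(x, y) \right]^2$$

and

$$[\text{sparseness of } a_i] = -\sum_i S\!\left(\frac{a_i}{\sigma}\right)$$

Here $\sigma$ is a scaling factor and $S$ is a nonlinear penalty function. Several different penalty functions were tried through experimentation, $-e^{-x^2}$, $\log(1 + x^2)$, and $|x|$, and all produced similar results.

The cost function E is minimized by gradient descent averaged over a number of images. The coefficients $a_i$ for each image are found from:

$$\dot{a}_i = \sum_{x,y} \phi_i(x, y)\, I(x, y) - \sum_j \left[ \sum_{x,y} \phi_i(x, y)\, \phi_j(x, y) \right] a_j - \frac{\lambda}{\sigma}\, S'\!\left(\frac{a_i}{\sigma}\right)$$

The learning rule for updating $\phi_i$ is then:

$$\Delta \phi_i(x_m, y_n) = \eta \left\langle a_i \left[ I(x_m, y_n) - \sum_j a_j\, \phi_j(x_m, y_n) \right] \right\rangle$$

Here $\eta$ is a learning rate parameter, and $\sum_j a_j\, \phi_j(x_m, y_n)$ is the reconstructed image.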

The magic of this algorithm is in the “sparsification” penalty term, which punishes individual feature vectors for overuse, while at the same time finding the set of basis vectors $\phi_i$ that minimizes the reconstruction error. This penalty strategy does not directly enforce a policy of equally likely feature vectors. However, it does so indirectly, by forcing the “usage percentage” for any one feature vector below a threshold. If the threshold is low enough, and very uncommon feature vectors are not used, the overall population of feature vectors is “herded” into a small usage percentage window.

The algorithm allows for an overcomplete representation, as the number of feature vectors can be greater than the size of the input. The vectors are non-orthogonal, which allows for a richer representation of complex images from simple feature vectors. In short, the feature vectors capture the underlying statistical structure of the images.
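The coefficient inference step of this kind of sparse code can be sketched on a toy problem. The example below is a sketch only: it assumes a two-pixel "image", three hand-picked unit-norm basis vectors (an overcomplete set), and the log(1 + x^2) penalty. All names and parameter values are hypothetical, and the basis vectors are held fixed rather than learned.

```python
import math


def reconstruct(phi, a):
    """Weighted sum of basis vectors: sum_i a_i * phi_i."""
    n_pixels = len(phi[0])
    return [sum(a[i] * phi[i][p] for i in range(len(phi))) for p in range(n_pixels)]


def cost(image, phi, a, lam, sigma):
    """E = squared reconstruction error + lam * sum_i S(a_i / sigma), S(x) = log(1 + x^2)."""
    recon = reconstruct(phi, a)
    error = sum((image[p] - recon[p]) ** 2 for p in range(len(image)))
    sparseness = sum(math.log(1.0 + (ai / sigma) ** 2) for ai in a)
    return error + lam * sparseness


def coefficient_step(image, phi, a, lam, sigma, rate):
    """One gradient-descent step on E with respect to the coefficients a_i."""
    recon = reconstruct(phi, a)
    residual = [image[p] - recon[p] for p in range(len(image))]
    updated = []
    for i, ai in enumerate(a):
        # Projection of the residual onto phi_i (b_i - sum_j C_ij a_j).
        match = sum(phi[i][p] * residual[p] for p in range(len(image)))
        u = ai / sigma
        s_prime = 2.0 * u / (1.0 + u * u)  # derivative of log(1 + u^2)
        # The factor 2 from the squared error is kept explicitly here; it
        # could equally be absorbed into the step size and lam.
        updated.append(ai + rate * (2.0 * match - (lam / sigma) * s_prime))
    return updated


def infer(image, phi, lam=0.1, sigma=1.0, rate=0.05, steps=200):
    """Run gradient descent from a = 0 for a fixed number of steps."""
    a = [0.0] * len(phi)
    for _ in range(steps):
        a = coefficient_step(image, phi, a, lam, sigma, rate)
    return a


r = 1.0 / math.sqrt(2.0)
phi = [[1.0, 0.0], [0.0, 1.0], [r, r]]  # three basis vectors for two pixels
image = [1.0, 1.0]
a = infer(image, phi)
print(a, cost(image, phi, a, 0.1, 1.0))
```

In this toy case the diagonal basis vector, which matches the image best, ends up carrying the largest coefficient. A full implementation would alternate this inference step with the basis-vector update averaged over many image patches.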

This paper concludes that the basis vectors are very similar to the receptive fields. Additionally, the entropy after

training is 4.0 bits as compared to 4.6 bits before training, and the Kurtosis has increased from 7 to 20, for an image

reconstruction error that is 10% of the variance of the original image.

It is pointed out that other cost functions and optimizations produce similar results. This will be seen later when the sparsity penalty is based on the Kullback-Leibler divergence. The Kullback-Leibler penalty term directly forces equally likely feature vector usage. Additionally, a different cost function will be used that minimizes the mean square error between the input and the reconstructed output. The training will be based on back-propagation, so the gradients will also be different. However, this approach will be shown to produce results that are indistinguishable from the feature vectors learned here. This paper does not address audio processing. However, it is a landmark paper that ties the theory of minimum entropy codes to the neurobiology of optical cortical fields. Moreover, it presents the first unsupervised learning algorithm that finds sparse basis vector representations of natural images consistent with the known properties of simple cells in the primary visual cortex of mammals.

2.6 E.Smith and M.Lewicki 2006 Efficient Auditory Coding Nature vol. 439|23

Smith and Lewicki introduce an audio encoding algorithm that closely matches the physiology of auditory nerve fibers in mammals. RevCor, or reverse correlation, filters are the linear equivalent of the impulse response of auditory nerve fibers. The representation of speech is based on time and amplitude weighting from a set of kernel functions. The audio stream may be represented by:

$$x(t) = \sum_{m} \sum_{i} s_i^m\, \phi_m(t - \tau_i^m) + \epsilon(t)$$

where $\tau_i^m$ and $s_i^m$ are the temporal position and amplitude scaling of the ith instance of the mth kernel $\phi_m$. The error term $\epsilon(t)$ is the residual, or the difference between the original audio waveform and the coded signal. The “matching pursuit” algorithm performs two important functions:

It is used to find the amplitudes and time positions, $s_i^m$ and $\tau_i^m$.

The kernel function is updated.

The matching pursuit algorithm iteratively decomposes the audio signal by projecting the input data onto a set of kernel functions. The kernel with the largest inner product is subtracted from the input vector, and its time and amplitude are recorded. The algorithm is halted when the amplitudes of the kernels fall below a threshold. The paper reports halting when $s_i^m$ fell below 0.1, using TIMIT data.
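The greedy loop described above can be sketched in a few lines. This is a minimal illustration, assuming fixed unit-norm kernels, integer sample positions, and an exhaustive search over all positions; the kernels and the test signal are hypothetical, and the kernel update step is not included.

```python
import math


def normalize(kernel):
    """Scale a kernel to unit energy so inner products equal amplitudes."""
    norm = math.sqrt(sum(k * k for k in kernel))
    return [k / norm for k in kernel]


def matching_pursuit(signal, kernels, threshold=0.1, max_spikes=100):
    """Greedy decomposition into (kernel index, position, amplitude) spikes.

    Returns the list of spikes and the final residual.
    """
    residual = list(signal)
    spikes = []
    for _ in range(max_spikes):
        best = None
        for m, kernel in enumerate(kernels):
            for pos in range(len(residual) - len(kernel) + 1):
                # Inner product of the kernel with the residual at this shift.
                amp = sum(k * residual[pos + n] for n, k in enumerate(kernel))
                if best is None or abs(amp) > abs(best[2]):
                    best = (m, pos, amp)
        if best is None or abs(best[2]) < threshold:
            break  # remaining structure is below the halting threshold
        m, pos, amp = best
        for n, k in enumerate(kernels[m]):
            residual[pos + n] -= amp * k  # subtract the best-matching kernel
        spikes.append(best)
    return spikes, residual


# Toy signal built from two planted kernel instances (assumed values).
kernels = [normalize([1.0, 2.0, 1.0]), normalize([1.0, -1.0])]
signal = [0.0] * 12
for n, k in enumerate(kernels[0]):
    signal[2 + n] += 3.0 * k
for n, k in enumerate(kernels[1]):
    signal[8 + n] += -2.0 * k
spikes, residual = matching_pursuit(signal, kernels)
print(spikes)
```

The exhaustive search is quadratic in the signal length and is used here only for clarity; a practical implementation would compute the correlations with convolution and update them incrementally after each subtraction.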

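The equation for $x(t)$ above can be rewritten in probabilistic form as

$$p(x \mid \Phi) = \int p(x \mid \Phi, s)\, p(s)\, ds \approx p(x \mid \Phi, \hat{s})\, p(\hat{s})$$

Here $\hat{s}$, an approximation to the posterior maximum, comes from the set of coefficients generated by matching pursuit. It is assumed that the noise in $p(x \mid \Phi, \hat{s})$ is Gaussian, and that $p(\hat{s})$ is sparse. The kernel functions are optimized by doing gradient ascent on the approximate log data probability,

$$\frac{\partial}{\partial \phi_m} \log p(x \mid \Phi) = \frac{\partial}{\partial \phi_m} \left[ \log p(x \mid \Phi, \hat{s}) + \log p(\hat{s}) \right] = -\frac{1}{2\sigma_\epsilon^2} \frac{\partial}{\partial \phi_m} \left[ x - \sum_m \sum_i \hat{s}_i^m\, \phi_m(t - \tau_i^m) \right]^2 = \frac{1}{\sigma_\epsilon^2} \sum_i \hat{s}_i^m \left[ x - \hat{x} \right]_{\tau_i^m}$$

Here $[x - \hat{x}]_{\tau_i^m}$ is the residual error over the extent of kernel $\phi_m$ at position $\tau_i^m$. The update is therefore a weighted average of the residual error.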

The quantized amplitudes were treated as samples from a random variable. The entropy was estimated from histograms of the quantized values. This method produces the number of bits/symbol. The rate is then the number of code words per second times the number of bits/symbol.
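The rate estimate described above can be sketched as follows. The quantization step, the function names, and the sample values are assumptions made for illustration; the paper's actual quantizer is not reproduced here.

```python
import math
from collections import Counter


def quantize(values, step):
    """Map each amplitude to an integer quantization level."""
    return [round(v / step) for v in values]


def entropy_bits(symbols):
    """Histogram estimate of the entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bit_rate(symbols, duration_s):
    """Rate = (code words per second) * (bits per code word)."""
    return (len(symbols) / duration_s) * entropy_bits(symbols)


# Hypothetical spike amplitudes observed over 2 seconds of audio.
levels = quantize([0.11, 0.12, 0.52, 0.49], step=0.1)
print(entropy_bits(levels), bit_rate(levels, duration_s=2.0))
```

The histogram estimate is only as good as the number of samples behind it, so short recordings will tend to understate the entropy of a finely quantized variable.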

The algorithm was run on two categories of natural sounds as well as speech. The first category was made up of transient sounds such as cracking twigs and crunching leaves. The other category was made up of ambient sounds such as rain, rushing water and wind, or rustling sounds.

The coding efficiency is defined as the number of codes required to reach an arbitrary level of fidelity. Learned kernels and gammatone wavelets are shown to be much more efficient than FFTs or Daubechies wavelets for SNRs below 30 dB. At 15 dB SNR, the learned kernels have a signal rate of 8 kbps, while the wavelets and FFTs are operating at 30 kbps. Therefore, the coding efficiency of the learned kernels is more than three times greater than that of the wavelets or FFTs.

The matching pursuit algorithm is complicated in that it allows the kernel functions to grow and shrink in length over time. This allows a set of kernel functions to emerge that behave like gammatone wavelets, whereby higher frequency kernels have shorter lengths and lower frequency kernels have longer responses. The paper claims that kernel functions that are initialized with Gaussian noise look remarkably similar to cochlear impulse responses in the peripheral audio pathways of mammals.

2.7 Rosso O.A., Martin M.T., Figliola A., Keller K., Plastino A. (2006). EEG Analysis using wavelet-based

information tools. Journal of Neuroscience Methods 153 163-182

This paper uses wavelet-based information tools in order to predict epileptic seizures from EEG data. The general

concept of order and randomness of signals is discussed and Shannon’s entropy is given as a measure of the statistical

complexity of a signal. Spectral Entropy, which is based on the entropy of the short-time Fourier transform (STFT) is

shown to be inadequate to capture the time evolution of EEG signals, due to the “windowing” problem. Another problem

with FFT based methods is that EEG signals are not stationary, and FFT’s must be averaged; or have the autocorrelation

taken in advance, in order to represent the power spectral density of the signal faithfully. The solution is to use the

orthogonal discrete wavelet transform, or ODWT.

The ODWT follows the time evolution of frequency patterns with optimal time-frequency resolution and makes no assumptions about the stationarity of the signal. The ensuing entropy from the wavelet transform is called “Shannon’s wavelet entropy”. An in-depth discussion of different measures of statistical complexity is presented, based on entropy models from Tsallis (TWS), escort-Tsallis (GWS), and Renyi (RWS).

The wavelet model is reviewed and is based on a quickly vanishing oscillating mother kernel of the following form:

$$\psi_{a,b}(t) = |a|^{-1/2}\, \psi\!\left(\frac{t - b}{a}\right)$$

where $a$ and $b$ are scale and translation parameters that allow stretching and time translation of the mother kernel $\psi$. The continuous wavelet transform (CWT) is given as:

$$(W_\psi f)(a, b) = |a|^{-1/2} \int_{-\infty}^{\infty} f(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt = \langle f, \psi_{a,b} \rangle$$

The CWT can be inverted to reproduce the input signal. The CWT produces an infinite number of coefficients, is difficult to compute and produces a highly redundant representation of the signal. To overcome these limitations, the discrete wavelet transform (DWT) is introduced, which provides a non-redundant, highly efficient wavelet representation of the original signal. The DWT uses an orthonormal representation based on a mother wavelet with $a_j = 2^{-j}$ and $b_{j,k} = 2^{-j} k$, with $j, k \in \mathbb{Z}$, the set of integers. The resolution levels are indexed by $j$, and the kernel number by $k$. The family $\psi_{j,k}(t) = 2^{j/2}\, \psi(2^{j} t - k)$ produces a wavelet series $\langle f, \psi_{j,k} \rangle = C_j(k)$ with as many coefficients as there are samples in the original signal, without the loss of any information, and constitutes an orthonormal basis of $L^2(\mathbb{R})$. For a sampled data system with equally spaced time samples and M total samples,

$$f(t) = \sum_{j=-N_J}^{-1} \sum_{k} C_j(k)\, \psi_{j,k}(t) = \sum_{j=-N_J}^{-1} r_j(t)$$

where $N_J = \log_2(M)$. The wavelet coefficients are $C_j(k) = \langle f, \psi_{j,k} \rangle$, and the energy at each resolution level $j = -1, -2, \ldots, -N_J$ is

$$E_j = \|r_j\|^2 = \sum_k |C_j(k)|^2$$

The total energy is

$$E_{tot} = \|f\|^2 = \sum_{j<0} \sum_k |C_j(k)|^2 = \sum_{j<0} E_j$$

The relative wavelet energy (RWE) is $p_j = E_j / E_{tot}$. The RWE values sum to one, $\sum_j p_j = 1$, and define the probability distribution $P^{(W)} = \{p_j\}$. The Shannon wavelet entropy is given as:

$$S_W[P^{(W)}] = -\sum_{j<0} p_j \ln(p_j)$$

For a completely random waveform the frequency distribution will be uniform, and the RWE will be equal at all resolution levels. At this point the entropy is maximized for all representations of entropy: Tsallis (TWS), escort-Tsallis (GWS) and Renyi (RWS).
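The chain from wavelet coefficients to relative wavelet energy to Shannon wavelet entropy can be sketched with the simplest orthonormal wavelet. The example below assumes the Haar wavelet and a signal whose length is a power of two; the reviewed paper does not necessarily use Haar, and the function names are hypothetical. The energy of the final approximation (mean) coefficient is left out of the distribution.

```python
import math


def haar_dwt(signal):
    """Orthonormal Haar decomposition of a power-of-two-length signal.

    Returns (details, approximation), where details[0] is the finest
    resolution level (j = -1) and details[-1] the coarsest (j = -N_J).
    """
    approx = list(signal)
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        half = len(approx) // 2
        details.append([(approx[2 * i] - approx[2 * i + 1]) / root2 for i in range(half)])
        approx = [(approx[2 * i] + approx[2 * i + 1]) / root2 for i in range(half)]
    return details, approx[0]


def relative_wavelet_energy(details):
    """p_j = E_j / E_tot, with E_j the sum of squared coefficients at level j."""
    energies = [sum(c * c for c in level) for level in details]
    total = sum(energies)
    if total == 0.0:
        raise ValueError("signal has no energy in the detail coefficients")
    return [e / total for e in energies]


def shannon_wavelet_entropy(p):
    """S_W = -sum_j p_j * ln(p_j), taking 0 * ln(0) as 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0.0)


samples = [4.0, 2.0, 5.0, 5.0, 1.0, 0.0, 3.0, 7.0]
details, approx = haar_dwt(samples)
p = relative_wavelet_energy(details)
print(p, shannon_wavelet_entropy(p))
```

A signal whose energy sits in a single resolution level gives zero entropy, while equal energy at every level gives the maximum value, the natural logarithm of the number of levels.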

This paper uses the wavelet transform and wavelet entropy to follow the frequency-time series of EEGs for patients undergoing epileptic seizures. The central concepts of efficient coding of time-frequency non-stationary signals and their corresponding entropy are uniquely tied together via an orthonormal DWT representation of the signal.

The basic concept of the wavelet will be used for a non-orthogonal representation of speech, based on gammatones as

the kernel function. Additionally, the concept of entropy established will serve as the basis for entropy calculation of the

efficient encoding process.

2.8 McLeod, S. A. (2008). Selective Attention. Retrieved from http://www.simplypsychology.org/attention-models.html

Theories of auditory cognition from a psychological perspective are discussed and the models proposed by Broadbent and Treisman are reviewed. The models deal with selective attention and describe it in terms of an information bottleneck in which the brain cannot simultaneously process all the sensory inputs it is receiving. The brain is only able to concentrate on one task at a time.

Broadbent conducted “dichotic” listening experiments in which two different messages are sent simultaneously to each

ear. The subject can only focus attention on one message at a time. Broadbent sent 3 digit numbers to each ear

simultaneously, and subjects reported better results by interleaving left and right ears, rather than reporting just one ear.

Broadbent developed his model based on these experiments. Broadbent believed that all sensory information enters a

sensory buffer. Inputs to the buffer are filtered based on physical characteristics, and only select information is allowed

to pass to the output. Inputs in the buffer that are not passed quickly decay. Broadbent believed that non-shadowed or

unattended messages are removed at an early stage in processing. Semantic processing occurs in Broadbent’s model only

after the physical filtering occurs. Criticism of Broadbent’s model is based on being able to hear your name when not

paying attention in the cocktail party scenario.

Treisman modified Broadbent’s model with an attenuation theory, which states that the unattended messages are attenuated, not lost, in the short-term memory buffer. The messages in the short-term buffer are processed based on physical characteristics, syllabic patterns and individual words. Treisman postulated that a dictionary of words exists with different triggering thresholds. Words like ‘fire’ and ‘help’ and your name have low attenuation thresholds, and are thus allowed to pass the filter stage. Treisman’s work is criticized because it does not address how the semantic processing takes place, nor has the attenuation model ever been validated. Deutsch and Deutsch theorized that the processing of key words takes place ahead of the physical characteristic filter.

3. PROPOSAL

Ernst Mach believed that “the eye has a mind of its own”. In a similar fashion, the ear must also have a mind of its own. Another way to say it is that the ear only passes along messages that the brain is interested in. The models of Broadbent, Treisman, and Deutsch and Deutsch suggest that the preprocessing stages in auditory cognition take place in a physical characteristic filter.

The idea of Deutsch and Deutsch that a dictionary of key words bypasses the physical characteristic filter does not make sense. What is more likely is that the ear encodes key words into code patterns that the brain prioritizes. As Ernst Mach pointed out, relationships between stimuli are what we respond to. The physical characteristics, such as best correlations to the “auditory spikes” of Smith and Lewicki, might be the way the ear encodes data. Furthermore, the brain might respond to differences in combinations of the codes. The ear may have evolved, or learned, to decompose the auditory stimuli into a summation of gammatone wavelets. It makes intuitive sense that the gammatone wavelets will cluster around formants in vowels. If the distances between code words carry the information, then a very powerful code emerges that represents “differences between code clusters” of physical codes. In this case the encoding for a single vowel would be represented for all speakers as distances between codes. For example, harmonically related gammatone wavelets are expected to cluster around formant frequencies of vowels, whereas broadband high frequency pulse combinations should occur around plosives. Smith and Lewicki found that their spiking model of auditory nerve fibers closely matches gammatone wavelets. It follows that physical characteristic filtering in the models of Broadbent and Treisman might also be based on gammatone wavelets.

Phase 1 of this study is to apply the SAE to natural images and to show that the SAE learns basis vectors that appear as “edges”. This will verify that this implementation of the SAE learns “edges” similar to Olshausen and Field’s representations. The entropy, kurtosis and reconstruction SNR are evaluated for sparsity parameters ρ = .01 and ρ = .05. It will be shown that for a low sparsity parameter, ρ = .01, the kurtosis is high, the entropy is low and the SNR of the reconstructed “image patches” is low. Increasing the sparsity parameter to ρ = .05 will demonstrate that the basis vectors are no longer “edges”, but look more like shaded regions. The kurtosis is reduced, the entropy goes up and the SNR is increased. Histograms of the pixels are also shown, which show that the “peakedness” of the distributions follows the kurtosis.

In the second part of the Phase 1 investigation the SAE is applied to randomly selected “voice patches” in order to determine if high kurtosis representations have a pattern similar to “edges” for images. The “voice patches” are displayed as images in which the time domain is mapped into an x,y grid so that the time slices appear as column vectors. This enables viewing the time domain as images so that the structure of the basis vectors becomes apparent. Moreover, the ability of the SAE to reconstruct the “voice patches” is easily observed in a manner similar to viewing the reconstructed images from part 1. The kurtosis, entropy and reconstructed SNR are investigated for sparsity parameters of ρ = .01 and ρ = .05. The basis vectors are found to appear as vertical stripes, with the width of the stripes varying in proportion to the frequency of the speech waveform.

The Phase 2 investigation in this paper is to decompose speech into a linear superposition of wavelet functions whose kernels are gammatone impulse responses. The kernel functions correspond to a bank of gammatone filters. The portions of raw speech that have the best mean square error fit to gammatone wavelets are normalized and used as input vectors to an SAE, provided they reduce the variance of the speech waveform by a detection threshold. This prevents low amplitude gammachirp matches from being selected based on an MSE fit to the data. The normalized gammatone is then subtracted from the speech waveform. Once the decomposition is complete for a portion of speech, the data is reconstructed and the residual error is measured. The SAE will then encode the raw speech that was selected by a good match to a gammatone wavelet into a new set of feature vectors. After training, the output layer of the SAE is removed, and a softmax classifier is connected to the hidden layer of neurons. The complete network is then trained with backpropagation using supervised learning.
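A gammatone kernel of the kind used as a matching template can be sketched as follows. This is an illustration under stated assumptions: it uses the common fourth-order gammatone form t^(n-1) exp(-2 pi b t) cos(2 pi f t) with the bandwidth b tied to the Glasberg and Moore equivalent rectangular bandwidth. The sample rate, duration, and function names are hypothetical and are not taken from this proposal.

```python
import math


def erb_hz(center_hz):
    """Equivalent rectangular bandwidth (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def gammatone_kernel(center_hz, sample_rate_hz, duration_s, order=4):
    """Unit-energy gammatone impulse response sampled at `sample_rate_hz`."""
    bandwidth = 1.019 * erb_hz(center_hz)
    n_samples = int(duration_s * sample_rate_hz)
    raw = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * bandwidth * t)
        raw.append(envelope * math.cos(2.0 * math.pi * center_hz * t))
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # unit energy, so projections are amplitudes


def effective_length(kernel, fraction=0.99):
    """Number of samples needed to capture `fraction` of the kernel energy."""
    total = sum(v * v for v in kernel)
    running = 0.0
    for n, v in enumerate(kernel):
        running += v * v
        if running >= fraction * total:
            return n + 1
    return len(kernel)


low = gammatone_kernel(200.0, 16000.0, 0.05)
high = gammatone_kernel(2000.0, 16000.0, 0.05)
print(effective_length(low), effective_length(high))
```

Because the bandwidth grows with center frequency, the high-frequency kernel decays much faster than the low-frequency one, which matches the short high-frequency and long low-frequency kernels described for the learned auditory codes.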

This approach solves the amplitude attenuation problem of Treisman, because the MSE fit to the kernel function is what selects the “essence of speech”. If the MSE is below threshold, a good match is declared and the raw data is scaled up to the amplitude of the “hitting gammatone wavelets”, so that the SAE doesn’t waste resources learning many different amplitudes of the same basic kernel. The metric for data selection is signal energy reduction by kernel subtraction. The difficulty with the decomposition is how to initialize the algorithm, and how to subsequently choose which gammachirp gives the next best MSE fit to the partially decomposed data. It is conjectured that this is the process the neural pathways in the ear use to encode data, and that they have a very efficient search algorithm for decomposing audio data into kernel MSE projections.

The first part of this study is to validate this concept using speaker recognition. The second part of the study is to apply

the methods of part 1 to the additive model proposed by Les Atlas, of the slow envelope and the temporal fine structure

(TFS) of human audition.

3.1 SAE design equations

The SAE is a neural network with a single hidden layer that is trained with backpropagation using gradient descent [22]. The block diagram from [22] is shown in figure 3. The targets for the backpropagation algorithm are the inputs, so the network learns the identity function. The number of neurons in the hidden layer may be less than the number of input features, enabling the network to compress the input data. Additionally, a sparsity penalty based on the Kullback-Leibler divergence (KLD) is introduced to the cost function, which penalizes the neurons for average activity above or below a threshold. Often, the sparsity parameter threshold is set below 10%, so the number of active neurons becomes quite low. This leads to even higher amounts of data compression. Moreover, the KLD sparsity forcing function directly controls the percentage of the time that individual neurons fire. This enforces the policy of minimum entropy coding, which removes redundancy and simultaneously forces approximately equally likely independent symbols.

Figure 1 SAE Block diagram

The neurons in the hidden layer are soft-limiter, non-linear functions. For unipolar data a sigmoid function is used:

$$f(z) = \frac{1}{1 + e^{-z}}$$

For bipolar data, a hyperbolic tangent function is used:

$$f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

The responses for the activation functions are shown in figure 4.

Figure 2 Sigmoid and tanh activations
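The two activation functions can be written directly from their definitions. This is a small sketch for reference; the function names are illustrative, and a production implementation would normally call a library routine that guards against overflow for large |z|.

```python
import math


def sigmoid(z):
    """Unipolar soft limiter with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_activation(z):
    """Bipolar soft limiter with output in (-1, 1)."""
    ez, emz = math.exp(z), math.exp(-z)
    return (ez - emz) / (ez + emz)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh_activation(z))
```

The two functions are related by tanh(z) = 2 * sigmoid(2z) - 1, so the choice between them mainly reflects whether the data is scaled to [0, 1] or to [-1, 1].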

The input to each neuron is the sum of the inputs multiplied by their input weights. Additionally, a bias term is added whose input always has a unity value. Ignoring the layer index on the weights, the input for neuron 1 is given as,

$$z_1 = x_1 w_{11} + x_2 w_{12} + \ldots + x_n w_{1n} + b_1$$

Likewise for $z_2$,

$$z_2 = x_1 w_{21} + x_2 w_{22} + \ldots + x_n w_{2n} + b_2$$

For the nth neuron input,

$$z_n = x_1 w_{n1} + x_2 w_{n2} + \ldots + x_n w_{nn} + b_n$$

The weights may be arranged as the rows of a matrix and the inputs as a column vector, so that the inputs to the neurons may be compactly written in matrix form (the bias terms are omitted here for brevity) as:

$$z^{(2)} = W^{(1)} x$$

The activation at the output of the hidden layer is

$$a^{(2)} = f(z^{(2)})$$

while the input to the output layer is

$$z^{(3)} = W^{(2)} a^{(2)}$$

and the output is given as:

$$a^{(3)} = f(z^{(3)})$$

Note that the SAE has two non-linearity functions, one in layer two and one in the output layer. This should allow the network to capture higher order statistics from the input data, leading to a higher kurtosis in the coded representation. Moreover, the entropy should be lower than the original data due to the minimum entropy coding principle.
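The forward pass through the two layers can be sketched with plain lists. The layer sizes and weight values below are hypothetical, and sigmoid activations are assumed in both layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, inputs, biases):
    """z_k = sum_i w_ki * x_i + b_k for every neuron k (one row of weights each)."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x, w1, b1, w2, b2):
    """Return the hidden activations a2 and the reconstruction a3."""
    z2 = affine(w1, x, b1)
    a2 = [sigmoid(z) for z in z2]
    z3 = affine(w2, a2, b2)
    a3 = [sigmoid(z) for z in z3]
    return a2, a3


# Hypothetical network: 3 inputs, 2 hidden neurons, 3 outputs.
w1 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]]
b1 = [0.0, 0.1]
w2 = [[0.2, -0.1], [0.4, 0.3], [-0.3, 0.1]]
b2 = [0.0, 0.0, 0.0]
hidden, output = forward([1.0, 0.5, 0.0], w1, b1, w2, b2)
print(hidden, output)
```

For batch training the same computation is applied to every input vector; in a matrix language this becomes a single matrix product per layer followed by an element-wise activation.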

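The cost function for the SAE is given by:

$$J(W, b) = \frac{1}{2m} \sum_{i=1}^{m} \left\| \hat{y}^{(i)} - y^{(i)} \right\|^2 + \frac{\lambda}{2} \|W\|^2 + \beta \sum_{j=1}^{k} KL(\rho \,\|\, \hat{\rho}_j)$$

where $\hat{y}^{(i)}$ is the network output for the ith input and $y^{(i)}$ is the corresponding target, which for the autoencoder is the input itself. The KL penalty term penalizes the difference between a threshold, or desired value $\rho$ of the average frequency of activation, and the estimated activation rate $\hat{\rho}_j$ for the jth out of $k$ hidden neurons. The KL penalty term is given by:

$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}$$

The average activation of hidden node $j$ over the $m$ inputs $x^{(i)}$ is:

$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(2)}\!\left(x^{(i)}\right)$$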

The cost function averages the error over m input vectors using batch gradient descent. Each update epoch spans the entire data set. Ridge regularization is used, which prevents overfitting of the data by adding the squared Euclidean norm of the weights to the cost. A regularization parameter λ controls the amount of regularization.

The output of the KL penalty term for ρ = .2 is shown from [23] in figure 5. A sparsity parameter β controls the amount of influence the KL penalty term has on the overall cost function.

Figure 3 KL penalty for ρ = .2
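The three terms of the cost function can be evaluated as in the sketch below. It assumes that the network outputs and hidden activations have already been computed for each sample, and that the weights are held as a list of per-layer matrices; these conventions, and the function names, are illustrative.

```python
import math


def kl_penalty(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sae_cost(outputs, targets, hidden, weights, lam, beta, rho):
    """Reconstruction error + weight decay + sparsity penalty.

    outputs, targets: one vector per sample.
    hidden: one vector of hidden activations per sample.
    weights: list of weight matrices (lists of rows); biases are not decayed.
    """
    m = len(targets)
    recon = sum(sum((y_hat - y) ** 2 for y_hat, y in zip(o, t))
                for o, t in zip(outputs, targets)) / (2.0 * m)
    decay = 0.5 * lam * sum(w * w for layer in weights for row in layer for w in row)
    n_hidden = len(hidden[0])
    # Average activation of each hidden neuron over the batch.
    rho_hat = [sum(h[j] for h in hidden) / m for j in range(n_hidden)]
    sparsity = beta * sum(kl_penalty(rho, r) for r in rho_hat)
    return recon + decay + sparsity


print(kl_penalty(0.2, 0.2), kl_penalty(0.2, 0.5), kl_penalty(0.2, 0.05))
```

The penalty is zero when the average activation equals the target ρ and grows as the average moves away from it in either direction, rising steeply as the average approaches 0 or 1.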

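The equations for updating the weights and biases in layer $l$ using gradient descent are given as:

$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$$

$$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b)$$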

Where, α is the learning rate parameter for gradient descent. Note that regularization is applied to the weights for each

layer l, but not the bias terms. The equations for calculating the partial derivatives address a credit assignment problem,

whereby the contribution to the overall cost function for each neuron is accounted for. The derivation is shown in the

appendix.
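As a reference point, one standard form of the required partial derivatives is sketched below. It follows the common backpropagation treatment of a sparse autoencoder with sigmoid units, and its notation, with error terms written as $\delta$, is an assumption that may differ from the derivation in the appendix.

```latex
% Derivative of the sparsity penalty with respect to the average activation:
\frac{\partial}{\partial \hat{\rho}_j} KL(\rho \,\|\, \hat{\rho}_j)
  = -\frac{\rho}{\hat{\rho}_j} + \frac{1 - \rho}{1 - \hat{\rho}_j}

% Error term of hidden neuron j, where \delta_i^{(3)} is the output-layer error
% and f'(z) = f(z)\,(1 - f(z)) for the sigmoid:
\delta_j^{(2)} = \left[ \sum_i W_{ij}^{(2)} \delta_i^{(3)}
  + \beta \left( -\frac{\rho}{\hat{\rho}_j} + \frac{1 - \rho}{1 - \hat{\rho}_j} \right) \right]
  f'\!\left(z_j^{(2)}\right)

% Gradients averaged over the m training samples (sample index omitted):
\frac{\partial J}{\partial W_{ij}^{(l)}}
  = \frac{1}{m} \sum_{\text{samples}} a_j^{(l)} \delta_i^{(l+1)} + \lambda W_{ij}^{(l)},
\qquad
\frac{\partial J}{\partial b_i^{(l)}}
  = \frac{1}{m} \sum_{\text{samples}} \delta_i^{(l+1)}
```

The sparsity derivative vanishes when $\hat{\rho}_j = \rho$, so the extra term only pushes on hidden neurons whose average activation has drifted from the target.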

The SAE is trained by first making a forward pass through the network. For batch training, the input vectors are stacked as the columns of a matrix. The weights are stored as row vectors, so that the input to the activations, z, is a matrix of column vectors, one per input sample. In Matlab the sigmoid function is applied element-wise to the entire z matrix. The cost function is calculated over all the inputs, and the estimate $\hat{\rho}$ of the activation of each neuron in the hidden layer is calculated by taking the mean across all the input samples.

3.2 Sparse Autoencoders applied to Natural Images

Examples of prewhitened natural images are shown in figure 1. The images are 512x512 pixels.

Figure 4: Example of Natural Images

Randomly selected “image patches” that are 8x8 pixels are shown in figure 2. For this example there were 10000 image patches selected. The SAE for this example has 25 hidden nodes, with a sparsity parameter of .01. The learned basis vectors are shown in figure 2.


Figure 5: SAE Basis Vectors are “Edges”

Learned from Natural Images for Sparsity Parameter ρ = .01

The reconstruction SNR for the learned basis vectors is approximately 1 dB for both the training and test sets. A sample of 25 image patches is shown in figure 3, with the corresponding SAE reconstruction on the right. There are 64 pixels for each image patch with 25 hidden neurons in the SAE. The SAE sparsity parameter is .01 for this example. Note the SAE’s ability to generalize the intensity pattern of the images. The data compression for this example is 64:(.01*25) = 256:1.

Figure 6: Image Patches Left and SAE Rendering on Right for ρ = .01

Figure 7: 256 Bin Histogram of Image Patches Left, SAE Test Data Rendering Right for ρ = .01

The entropy, kurtosis and SNR of the image patches and of the SAE rendering of the train/test data are shown in table 1. Notice that the entropy has decreased, so that fewer bits/pixel now represent the same data. Moreover, the kurtosis has increased, which makes the probability distribution of the pixel intensity more peaked.

Table 1: Entropy, Kurtosis and SNR of Natural Image Patches and SAE Training/Test Data Rendering for ρ = .01

                  Raw Image Patches    SAE Rendering Train    SAE Rendering Test
Entropy (bits)    7.1                  5.7                    5.7
Kurtosis          4.8                  19.5                   21.6
SNR (dB)          -                    1.1                    1.0
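The entropy and kurtosis statistics reported in the table can be computed along the following lines. This is a sketch under stated assumptions: the pixel values are assumed to lie in the range [0, 1), the entropy is taken over a 256 bin histogram (matching the histograms shown in the figures), and the kurtosis is the non-excess form, for which a Gaussian gives 3. The text does not state which kurtosis definition was used.

```python
import math

def histogram_entropy_bits(values, n_bins=256, lo=0.0, hi=1.0):
    """Shannon entropy in bits of a histogram of `values` over [lo, hi)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - lo) / (hi - lo) * n_bins)
        counts[min(max(k, 0), n_bins - 1)] += 1  # clamp out-of-range values
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def kurtosis(values):
    """Fourth central moment divided by the squared variance (non-excess)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2)
```

A more peaked distribution concentrates its mass in fewer bins, which lowers the histogram entropy and raises the kurtosis.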

Increasing the sparsity parameter ρ significantly changes the characteristics of the basis vectors. The basis vectors for sparsity parameters of ρ = .02 and ρ = .05 are shown in figure 8. The basis vectors for a .02 sparsity parameter still resemble “edges”. However, for a sparsity parameter of .05, the “edges” are not as apparent, and the basis vectors appear more as shaded regions.

Figure 8: Basis Vectors for ρ = .02 Left, and ρ = .05 Right

Figure 9: Image Patches Left and SAE Rendering on Right for ρ = .05

Table 2: Entropy, Kurtosis and SNR of Natural Image Patches and SAE Training/Test Data Rendering for ρ = .05

                  Raw Image Patches    SAE Rendering Train    SAE Rendering Test
Kurtosis          4.8                  7.3                    7.4
SNR (dB)          -                    3.6                    3.5

Increasing the sparsity parameter ρ decreases the compression ratio to 64:(.05*25) = 51.2:1. Additionally, the entropy goes up, the kurtosis decreases, and the SNR of the reconstructed image increases marginally. Histograms of the raw image distributions and of the SAE reconstructed images are shown in figure 10 for a sparsity parameter of ρ = .05. Note that the peakedness of the histogram of the SAE reconstructed images is reduced. This is consistent with the decrease in the kurtosis.

Figure 10: 256 Bin Histogram of Image Patches Left and SAE Test Data Rendering Right for ρ = .05

3.3 Sparse autoencoders applied to randomly selected “voice patches”

In this section, the SAE is used to learn basis vectors for randomly selected samples of speech from the TIMIT data set. The data consists of ~10k voice patches taken from the DR1 (New England) training folder of the TIMIT corpus. There are 38 speakers in the DR1 training set, and each speaker has ten sentences. Therefore, 27 voice patches are randomly selected from each of the 380 sentences to give ~10k voice patches. The “voice patches” are 400 samples long, so that the duration at 16ksps is 25msec. The motivation for this experiment is to run the SAE on randomly selected “voice patches” to determine whether a set of basis vectors emerges that is similar to the “edges” found from randomly sampling images. The basis vectors for randomly sampled speech appear as “stripes” and are shown in figure 11. For this exercise the number of hidden nodes is kept at 25, while the number of inputs is 400. The compression ratio is 400:(.01*25) = 1600:1. An example of 25 voice patches and the reconstructed “images” is shown in figure 12. The gray pixelated examples are portions of speech that are quiet. The SAE rendering does not appear to do too well, based on the results of figure 12. The gray examples are composed of out-of-phase basis vectors, and do not completely turn into shaded regions.

The entropy, kurtosis and reconstruction SNR are shown in table 3 for a sparsity parameter of ρ = .01. As can be seen from the entropy, the SAE does not compress the data for randomly selected voice patches.

Figure 11: SAE Basis Vectors are “Stripes” Learned from Random Speech Samples Sparsity Parameter ρ = .01

Figure 12: 25 Randomly Selected Voice Patches Left and SAE Rendering on Right ρ = .01

Figure 13: Randomly Selected Voice Patches Left and SAE Rendering on Right

Table 3: Entropy, Kurtosis and SNR of TIMIT DR1 Voice Patches and SAE Training/Test Data Rendering for ρ = .01

                  Raw Voice Patches    SAE Rendering Train    SAE Rendering Test
Kurtosis          17.7                 31.4                   31.8
SNR (dB)          -                    0.1                    0.0

The sparsity parameter is adjusted to ρ = .05, and the resulting basis vectors are shown in figure 14. The basis vectors lose their stripes. However, the reconstructed signal now appears closer to the original “image”, as shown in figure 15.

Figure 14: SAE Basis Vectors Lose Their “Stripes”, Learned from Random Speech Samples, Sparsity Parameter ρ = .05

Figure 15: 25 Randomly Selected Voice Patches Left and SAE Rendering on Right ρ = .05

The entropy, kurtosis and SNR performance can be seen for ρ = .05 in table 4. The entropy is the same or higher than the original “image”, while the kurtosis is reduced from the ρ = .01 case. The SNR is improved slightly by 1dB.

Figure 16: Randomly Selected Voice Patches Left and SAE Rendering on Right ρ = .05

Table 4: Entropy, Kurtosis and SNR of TIMIT DR1 Voice Patches and SAE Training/Test Data Rendering for ρ = .05

                  Raw Voice Patches    SAE Rendering Train    SAE Rendering Test
Kurtosis          17.4                 21.9                   21.9
SNR (dB)          -                    1.1                    1.1

In this section, randomly selected “voice patches” were compressed using a SAE. The basis vectors and time domain “voice patches” are viewed as images in order to evaluate the quality of the SAE performance. The SAE does not do a good job of representing the data, but neither did the SAE do a good job for randomly sampled images. The results of this experiment show that the SAE basis vectors for speech appear as vertical stripes, in the same way that “edges” are the predominant characteristic for images processed with a sparsity parameter of ρ = .01.

Random sampling of images and speech is not a very good choice for raw feature selection. However, the basis vectors that are learned at low sparsity settings do appear to capture the underlying characteristics of the data. In the next sections, other methods of selecting features for the SAE will be based on using these “primitive” basis vectors.

3.4 Adaptation of Lewicki and Smith’s matching pursuit algorithm using SAE’s.

The beauty of using SAE’s at low sparsity settings is that the learned basis vectors are generated by random sampling, with no regard to any features other than “image size”, and are completely generated by the data itself. In this section, the data will first be convolved (filtered) with primitive basis vectors in order to sample speech more intelligently. The random sampler makes no distinction between the quiet portions of speech and the high energy portions of speech. As a result, the SAE must waste resources learning to represent the quiet portions.

3.5 Speaker identification using gammatone wavelet decomposition speech sampling SAE and softmax classifier

This validation investigation uses four speakers selected at random from the TIMIT DR1 TRAINING corpus. The DR1 folder was selected, which has speakers from New England. Each speaker says ten sentences. The speakers are not repeated in the TEST section of the TIMIT corpus, so the ten sentences are divided into five for training and five for testing. The speakers are taken in pairs, so there are six combinations of speaker pairs. Unsupervised feature learning is used to train the SAE with both speakers’ voice patches. This is not expected to be the optimum solution, but it shows that massive amounts of data may be collected to train the SAE. Then, relatively small amounts of data can be used to train the network using supervised learning. For optimum performance the SAE should be run on each speaker individually. This will allow the SAE to extract basis vectors that best capture the individual characteristics of the speaker.

A voice activity detector (VAD) is used to remove the silent portions of speech, so that the classifier does not waste resources learning the silent periods between words. This study is about speaker identification, so it is not critical to recover all the speech, but rather to focus on the larger voice active portions that contain the majority of the speech energy. The VAD functionality is shown in figures 17 and 18.

The steps for the VAD are:

• The sample rate for the TIMIT data set is 16ksps.

• Band-pass filter the TIMIT sentences with a 100-600 Hz FIR filter. This is the preprocessing filter that is used for “Efficient Auditory Coding” [2].

• Generate the voice gate by band-pass filtering with a 200-800 Hz FIR filter.

• Take the absolute value of the filtered data.

• Add a small negative offset so that the minimum of the absolute value drops below 0.

• Take the sign of the shifted absolute value. This makes a zero crossing detector.

• Fill in the gaps of the zero crossing detector by filtering with a FIR filter of 200 ones.

• Remove small gates less than 1000 samples long, since short speech samples are not wanted for the machine learning algorithms.

• Extend the voice patches so that their length is a multiple of 400 samples (n*400).
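The gating steps above (everything after the band-pass filter) can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: the input is taken to be the already filtered signal, the function and parameter names are hypothetical, the fill filter is applied causally, and extended patches are not checked for overlap with a following gate.

```python
def voice_gate(filtered, offset, fill_len=200, min_len=1000, patch_len=400):
    """Return (start, stop) sample ranges that the gate marks as speech.

    `filtered` is the band-pass filtered signal (list of floats) and `offset`
    is the small threshold subtracted from its absolute value. Names and the
    causal alignment of the fill filter are assumptions made for this sketch.
    """
    n = len(filtered)
    # Rectify, shift by the offset and keep the sign: 1 where |x| exceeds the offset.
    active = [1 if abs(x) - offset > 0 else 0 for x in filtered]

    # Fill gaps with a FIR of `fill_len` ones (a running sum), then re-threshold.
    filled = []
    running = 0
    for i in range(n):
        running += active[i]
        if i >= fill_len:
            running -= active[i - fill_len]
        filled.append(1 if running > 0 else 0)

    # Collect contiguous runs of ones as candidate gates.
    gates, start = [], None
    for i, g in enumerate(filled):
        if g and start is None:
            start = i
        elif not g and start is not None:
            gates.append((start, i))
            start = None
    if start is not None:
        gates.append((start, n))

    patches = []
    for s, e in gates:
        if e - s < min_len:
            continue  # drop gates that are too short to be useful
        # Round the gate length up to a whole number of patches ...
        length = (e - s + patch_len - 1) // patch_len * patch_len
        if s + length > n:
            # ... or down, if rounding up would run past the end of the signal.
            length = (n - s) // patch_len * patch_len
        if length:
            patches.append((s, s + length))
    return patches
```

Each returned range can then be cut into 400 sample voice patches for the later processing stages.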

Fig 17 FIR 200-800Hz data on left, zero crossing detector on the right

Fig 18. Filtered zero Crossing detector on left, gated voice signal on right

3.6 Gammatone design equations

Gammatone filters were conceived as a simple fit to the response of the mammalian cochlea [23, 24]. They are the product of a gamma-distribution envelope and a single tone. The equation is given as:

$$g(t) = a\,t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)$$

Where:

$f$ = filter center frequency

$n$ = filter order

$b$ = filter bandwidth

$a$ = filter amplitude

$\phi$ = filter initial phase
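The gammatone equation can be sampled directly to obtain a finite impulse response. The sketch below assumes a 16 kHz sample rate and a 265 sample length, as used elsewhere in this proposal; the default order n = 4, the Glasberg and Moore ERB approximation, and the bandwidth scaling b = 1.019·ERB are common choices from the literature and are assumptions here, not values taken from this work.

```python
import math

def erb_hz(f_hz):
    """Glasberg and Moore ERB approximation (an assumption, not from this proposal)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone(f, b, n=4, a=1.0, phi=0.0, fs=16000, n_samples=265):
    """Sample g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*f*t + phi)."""
    out = []
    for k in range(n_samples):
        t = k / fs  # time of sample k in seconds
        out.append(a * t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)
                   * math.cos(2.0 * math.pi * f * t + phi))
    return out

# Impulse response for a 100 Hz center frequency.
h = gammatone(f=100.0, b=1.019 * erb_hz(100.0))
```

In practice the amplitude `a` would be chosen to normalize the gain of each filter in the bank.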

Gammatone filters have an ERB, or Equivalent Rectangular Bandwidth, response. The gammatone filterbank has a logarithmically spaced frequency separation, with logarithmically increasing bandwidth. This is very similar to the frequency spacing of MFCCs, or mel-frequency cepstral coefficients [13], which are widely used in ASR today in systems like Sphinx 4 [24]. Additionally, they are similar to wavelets, in that the time response gets shorter for higher frequencies. The total rectangular bandwidth remains constant, so low frequencies have a narrow bandwidth, while higher frequencies have a broader bandwidth. As the bandwidth increases, the amplitude is lowered to maintain the same ERB.

The gammatone impulse response for a filter with a 100 Hz center frequency, and the frequency domain responses for a gammatone filterbank of 12 filters with center frequencies from 100Hz to 6kHz, are shown in figure 19.

Figure 19 Gammatone 100Hz center frequency impulse response, left. Frequency domain response for a 12 filter gammatone filterbank, Fmin = 100Hz, Fmax = 6kHz, right

3.7 Phase 1: Proof of concept speaker recognition using gammatone wavelet decomposition sampling, SAE and softmax classifier

Gammatone decomposition of speech was used to derive the auditory spike model in the paper “Efficient Auditory Coding” [3]. The study used the entire TIMIT corpus to find the auditory spike model. The study did not perform any classification exercises using the efficient encoding scheme, but was limited to faithful reproduction of the audio waveforms in the time domain. The learned kernel functions in the time and frequency domain, reprinted from Lewicki and Smith, are shown in figure 20. The kernel functions are shown to closely match cochlear revcor filter impulse responses. The impulse responses of the kernel functions get shorter with increasing center frequency. This corresponds to wider bandwidth at higher center frequencies.

Figure 20 (a) Learned auditory spikes in red, and cochlear revcor filter responses in blue (gammatone wavelets). (b) Bandwidth and center frequencies of cochlear revcor filters in blue and auditory spikes in red.

Lewicki and Smith had two parts to their study, learning the kernel functions, and using the kernel functions to

efficiently encode speech. The proof of concept phase will be to use gammatone wavelet kernel functions for speech

decomposition and sampling. The phase 1 study will use data selected by this sampling technique to perform automatic

speaker recognition. Normalized speech feature vectors will be selected based on a MSE fit to gammatone wavelets. To

simplify the investigation, the gammatone wavelets will be of a fixed size. This will allow batch gradient descent to be

used for training.

A voice activity detector will be used in phase 1 to select “voice patches”. The feature vectors for each voice patch are saved in Matlab cells. This allows classification scores to be computed for individual feature vectors (gammatone wavelet MSE “matches”), for voice patches, and for whole sentences.

It should be noted that many strategies can be used to decompose speech based on gammatone wavelet kernel functions. For this study, the raw speech is first filtered by the gammatone filter bank. After filtering, the largest peaks from each filter’s output are normalized, and an MSE fit to the gammatone wavelet for that filter is calculated. If the MSE is below a threshold, the raw speech is selected by compensating for the group delay of that filter. Additionally, it is normalized, and the resulting feature vector is selected as input for the SAE. The SAE creates a new set of basis vectors from the gammatone wavelet sampled speech that are tuned to the classification task set up by the softmax. For example, in the case of speaker recognition, the feature vectors will capture differences in voice characteristics. Another task to be investigated will be dialect recognition based on the eight different regions of the TIMIT corpus.

It is postulated that the process of efficient encoding of speech based on an MSE fit to kernel functions best captures the essential features of speech needed for learning. The SAE sets up the initial conditions for the composite network to learn during training. The approach of using audio kernel functions to learn basis vector representations is consistent with the theories of Mach, Barlow, Atick, Olshausen and Field, and Smith and Lewicki.

For the phase 1 investigation, the gammatone filter bank is implemented as a bank of FIR filters of fixed length. The FIR filter taps have the reverse correlation (revcor) weights corresponding to the gammatone wavelets. The gammatones are defined in [25] as a set of Equivalent Rectangular Bandwidth 5th order IIR filters. For the phase 1 investigation, the impulse response is truncated to 16.6 msec, or 265 samples at a 16kHz sample rate. The impulse responses for a bank of 64 filters are shown in figure 21.

Figure 21 Gammatone wavelets for 64 filters, Fs = 16ksps, Fmin = 100Hz, Fmax = 6kHz

Gammatone wavelet decomposition is given by the following steps:

1. Filter the output of the voice activity detector with each filter in the filter bank.

2. Find the peaks of the filtered output for each filter that are greater than 60% of the maximum peak.

3. Take the dot product of the impulse response of the filter and the speech data corresponding to the peak.

4. Normalize the dot product by dividing by the norm of the data times the norm of the impulse response.

5. Multiply the impulse response of the filter by the normalized dot product.

6. Compute the residual by subtracting the normalized impulse response from the speech sample.

7. If the variance of the residual is less than 93% of the variance of the speech data, retain the speech sample as a feature vector.

8. Scale the feature vector by 1/normalization coefficient used for the dot product.

9. Save the feature index for reconstruction.
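Steps 3 to 8 above, applied to a single candidate speech segment and a single filter impulse response, might be sketched as follows. The interpretation is an assumption: both vectors are scaled to unit norm before the fit, so that the normalized dot product acts as the gain of the impulse response, and the function name and return values are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def variance(v):
    """Population variance of a list of floats."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def match_segment(segment, kernel, max_residual_fraction=0.93):
    """Fit `kernel` to `segment`; return (feature, gain, scale) or None.

    feature: the unit-norm speech segment retained as a feature vector
    gain:    normalized dot product between segment and kernel
    scale:   norm of the original segment, kept for reconstruction
    """
    seg_norm, ker_norm = norm(segment), norm(kernel)
    if seg_norm == 0.0 or ker_norm == 0.0:
        return None
    x = [s / seg_norm for s in segment]           # normalized speech data
    h = [k / ker_norm for k in kernel]            # normalized impulse response
    gain = sum(a * b for a, b in zip(x, h))       # steps 3 and 4
    residual = [a - gain * b for a, b in zip(x, h)]  # steps 5 and 6
    if variance(residual) < max_residual_fraction * variance(x):  # step 7
        return x, gain, seg_norm                  # step 8
    return None
```

A segment that is a scaled copy of the impulse response gives a gain of 1 and a zero residual, while a segment orthogonal to it is rejected.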

Once the voice signal has been decomposed, it may be reconstructed by simply adding together scaled gammatone impulse responses of the correct gain and index to regenerate the speech signal. The signal to noise ratio of the reconstructed signal may be found by subtracting the reconstructed signal from the original speech signal to reveal a residual. The SNR is computed by taking the ratio of the variance of the original speech to the variance of the residual.
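The reconstruction and SNR calculation described above can be sketched as follows. The tuple layout used for the stored decomposition (start index, gain, impulse response) and the function names are assumptions made for illustration.

```python
import math

def variance(v):
    """Population variance of a list of floats."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def reconstruct(n_samples, atoms):
    """Sum scaled impulse responses; atoms is a list of (start, gain, kernel)."""
    out = [0.0] * n_samples
    for start, gain, kernel in atoms:
        for k, h in enumerate(kernel):
            if 0 <= start + k < n_samples:
                out[start + k] += gain * h
    return out

def snr_db(original, reconstructed):
    """10*log10 of the variance of the original over the variance of the residual."""
    residual = [o - r for o, r in zip(original, reconstructed)]
    res_var = variance(residual)
    if res_var == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(variance(original) / res_var)
```

A residual with one hundredth of the variance of the original signal corresponds to an SNR of 20 dB.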

Various attempts were made to get the best possible SNR with the fewest basis vectors. The original number of basis vectors used was sixty four. After much experimentation, the number of basis vectors that yielded good signal to noise performance was found to be forty two. This produces signal to noise ratios of approximately ten decibels. The gammatone wavelets were also truncated in the time domain to produce the best possible signal to noise ratios. The best performance was found empirically at 265 samples per gammatone wavelet. Some typical speech signals, with the reconstructions and the residuals, are shown in figures 22-25.

Figure 22 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; female speaker FSJK1

Figure 23 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; male speaker MPSW0

Figure 24 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; male speaker MEDR0

[Plot panels: original voice data versus gammatone reconstruction (Gammatone Decomposition Number Basis Vectors: 29, SNR: 11.5452); original waveform versus residual after gammatone decomposition (SNR: 11.5452); gammatone decomposition basis vector histogram (Number Basis Vectors: 29).]

Figure 25 Gammatone wavelets reconstructed signal, left; residual, center; histogram of basis vectors, right; female speaker FMEM0

3.8 Phase 1 Enhancements

The validation phase was based on decomposing speech similar to the method of Smith and Lewicki using gammatone

kernel wavelets. For convenience all the wavelets were the same length so that a single SAE could be used to test the

approach. For the enhancement phase, the wavelets will now be of variable length and will operate on the output of a

bank of gammatone filters. This is an in-between step from the proof-of-concept phase, to the additive model

implementation. This method has the advantage that the decompositions are now used to construct the individual filter

outputs, and therefore, are operating on data that is at a higher signal-to-noise ratio.

Filter 1 → Wavelet Decomp Sampler → SAE 1 ┐
   …                                      ├→ SOFTMAX
Filter N → Wavelet Decomp Sampler → SAE N ┘

Figure 26 Phase I enhancement block diagram showing stacking of wavelet feature vectors

The block diagram for the phase I enhancements is shown in figure 26 above.

The gammatone wavelet decomposition sampler will now operate in the time domain on data that has been filtered by a

gammatone filter bank.

3.9 Phase II: Speaker recognition using additive model of the slow envelope and Temporal Fine Structure with a SAE/Softmax classifier.

The phase I study showed that feature extraction of speech based on gammatone wavelet decomposition sampling, in combination with a SAE/Softmax classifier, was very effective in speaker identification, especially at low SNRs. The phase II investigation is based on an actual model of the human peripheral auditory system. This model has been proposed by Les Atlas and is based on years of research on cochlear implants, working with Brian Moore at Cambridge University [26]. The model is shown in figure 27.

…

ERB Filters (Gammatones) → Half wave rectify → Lowpass = SE; Bandpass = TFS → SE + TFS → Sum over ERB

Figure 27 Block diagram of the Additive model of the slow envelope and Temporal Fine Structure

The output of the gammatone filter bank is followed by a halfwave rectifier. The slow envelope (SE) is a lowpass filtered version of the halfwave rectified signal. The temporal fine structure (TFS) is a bandpass version of the halfwave rectified signal in which only the fundamental and second harmonic are allowed to pass. The slow envelope and TFS are added together in this model for each gammatone filter output. These are then summed to get composite speech. The outputs of a 12 gammatone filter bank are shown in figure 15 in the results section.

A goal of phase II is to eliminate the VAD, as it will fail in high noise environments. Simple energy detection in a frequency band is easily captured by noise. One outcome of the phase II approach will be a very robust VAD based on gammatone decomposition and a feedforward SAE/Softmax network. A significant difference between phase I and phase II is that the decompositions take place at the filter outputs, one for the slow envelope and another for the TFS. This operates at a higher SNR, due to the reduction in bandwidth, compared to the phase I investigation that used the entire speech bandwidth.

The phase II investigation has the following steps:

1. Repeat Phase I, but now use the additive model. Decompose the slow envelope using the impulse responses for each of the slow envelope filter responses. Decompose the fast envelope using the impulse responses for each of the TFS outputs. The gammatone wavelets will now be of different lengths. There will be one SAE for each filter output.

2. Expand the number of speakers to include all speakers in a region from the TIMIT corpus. Do this for each of

the 8 regions.

3. Identify the dialects in each of the 8 regions.

4. Identify all speakers in the TIMIT corpus by first isolating them to a region, based on dialect. Then identify the

individual in that region.

5. Develop a VAD SAE/Softmax that uses two classes, 1). Speech Present 2). No Speech Present. Train the

speech detector based on

Slow Envelope ERB 1

TFS ERB 1

Slow Envelope ERB n

TFS ERB n

4. RESULTS

4.1 Phase 1: Proof of concept speaker recognition using gammatone wavelet decomposition sampling, SAE and a softmax classifier

The classification accuracy for individual feature vectors, for clusters of feature vectors that compose a voice patch, and for whole sentences is shown in tables 1-6 below. The SNR is varied for each of the six speaker combinations. The input vector size for all gammatone wavelets is 265, which corresponds to a sample time of 16.6msec per feature vector. The fixed length of 265 allows batch processing, as all the feature vectors are stacked into column vectors. There are 84 hidden nodes in the SAE, with a sparsity constraint of 4/84 active nodes in the network on average. The number of hidden nodes and the sparsity constraint were found by a computer search for the knee of the curve in terms of SAE reconstruction error. The training and test vectors were decomposed at infinite SNR. Noise is added after the decomposition, to the test data only, to show the SNR performance of the classifier. Adding noise to the data before decomposition would dramatically affect the VAD performance for both the training and test data, and was therefore omitted: at low SNR the VAD will capture voice patches based on noise and not signals. The intent of the proof of concept phase is to show that the SAE/Softmax classifier used for image processing can also be adapted for voice processing. Furthermore, the efficient encoding decomposition technique is used to extract meaningful features that are used not just for autoencoding, but to perform an actual classification task.

The training and test data are forward passed through the network and the SNRs of the SAE are recorded. The SNR for the training data is simply the SNR due to reconstruction error. The SNR of the test data is based on AWGN plus the reconstruction error.

The results show that unsupervised feature learning using the SAE and softmax classifier produces 100% accuracy on sentences at 3 dB and 40 dB SNR. The pair MPSW0_FMEM0 at 10 dB SNR produced a 90% sentence classification score. It is not clear why the 3 dB SNR case outperforms the 10 dB SNR case in terms of classification accuracy. The training was done at infinite SNR, in direct analogy to a matched filter in a communications system. A BPSK modem, for example, with a root raised cosine channel filter has a known impulse response. This channel filter impulse response is used as a matched filter to make symbol decisions. Perhaps higher noise levels allow the classifier to generalize better and achieve performance similar to the high SNR performance at 40 dB.

This proof of concept study focused on speaker identification. The goal was to determine whether the method of SAE and softmax classifier used for image processing has a direct implementation in voice processing. The gammatone wavelet decomposition sampler was introduced to provide a pre-processing method for feature extraction. The results are encouraging, as the performance at low SNR appears to be as good as in the noiseless case.

SAE parameters (all three runs): Hidden Size: 84; Lambda: 3.000000e-03; Number Iterations SAE L-BFGS to run: 400
Additional parameters reported for the 40 dB run: Sparsity Parameter: 0.047619; Beta-Sparsity Penalty: 3; Number of Classes: 2; Soft Max Lambda: 0.003000

MPSW0_MEDR0                           SNR = 40 dB    SNR = 10 dB    SNR = 3 dB
SNR in dB, SAE training               14.366336      14.359360      14.370852
SNR in dB, SAE test                   14.353541      10.257157      6.523747
Test Accuracy Feature Vector (%)      57.431144      56.971012      56.471439
Test Accuracy VoicePatch (%)          82.278481      81.012658      81.012658
Test Accuracy Sentence (%)            100.000000     100.000000     100.000000

Table 1 MPSW0_MEDR0 Speaker Recognition SNR performance

SAE parameters (all three runs): Hidden Size: 84; Lambda: 3.000000e-03; Number Iterations SAE L-BFGS to run: 400
Additional parameters reported for the 10 dB and 40 dB runs: Sparsity Parameter: 0.047619; Beta-Sparsity Penalty: 3

FSJK1_FMEM0                           SNR = 3 dB     SNR = 10 dB    SNR = 40 dB
SNR in dB, SAE training               14.438659      14.495097      14.476532
SNR in dB, SAE test                   6.616162       10.468286      14.465558
Test Accuracy Feature Vector (%)      61.496447      62.848658      63.023650
Test Accuracy VoicePatch (%)          75.949367      72.151899      72.151899
Test Accuracy Sentence (%)            100.000000     100.000000     100.000000

Table 2 FSJK1_FMEM0 Speaker Recognition SNR performance

SAE parameters (all three runs): Hidden Size: 84; Lambda: 3.000000e-03; Number Iterations SAE L-BFGS to run: 400

MEDR0_FSJK1                           SNR = 3 dB     SNR = 10 dB    SNR = 40 dB
SNR in dB, SAE training               14.344862      14.326024      14.295548
SNR in dB, SAE test                   6.484468       10.268590      14.283003
Test Accuracy Feature Vector (%)      64.209192      65.255151      65.800317
Test Accuracy VoicePatch (%)          96.052632      94.736842      96.052632
Test Accuracy Sentence (%)            100.000000     100.000000     100.000000

Table 3 MEDR0_FSJK1 Speaker Recognition SNR performance

SAE parameters (all three runs): Hidden Size: 84; Lambda: 3.000000e-03; Number Iterations SAE L-BFGS to run: 400

MEDR0_FMEM0                           SNR = 3 dB     SNR = 10 dB    SNR = 40 dB
SNR in dB, SAE training               14.465636      14.446876      14.447896
SNR in dB, SAE test                   6.643830       10.432653      14.435708
Test Accuracy Feature Vector (%)      59.515741      60.286401      60.362921
Test Accuracy VoicePatch (%)          69.512195      68.292683      65.853659
Test Accuracy Sentence (%)            100.000000     100.000000     100.000000

Table 4 MEDR0_FMEM0 Speaker Recognition SNR performance

SAE parameters (all three runs): Hidden Size: 84; Lambda: 3.000000e-03; Number Iterations SAE L-BFGS to run: 400

MPSW0_FMEM0                           SNR = 3 dB     SNR = 10 dB    SNR = 40 dB
SNR in dB, SAE training               14.440929      14.438910      14.457319
SNR in dB, SAE test                   6.621928       10.426857      14.444083
Test Accuracy Feature Vector (%)      59.952995      60.226279      60.073240
Test Accuracy VoicePatch (%)          73.170732      73.170732      67.073171
Test Accuracy Sentence (%)            100.000000     90.000000      100.000000

Table 5 MPSW0_FMEM0 Speaker Recognition SNR performance

SAE parameters (all three runs): Hidden Size: 84; Lambda: 3.000000e-03; Number Iterations SAE L-BFGS to run: 400

MPSW0_FSJK1                           SNR = 3 dB     SNR = 10 dB    SNR = 40 dB
SNR in dB, SAE training               14.091070      14.133589      14.142304
SNR in dB, SAE test                   6.474035       10.209106      14.130692
Test Accuracy Feature Vector (%)      63.289308      64.031447      64.742138
Test Accuracy VoicePatch (%)          97.014925      98.507463      98.507463
Test Accuracy Sentence (%)            100.000000     100.000000     100.000000

Table 6 MPSW0_FSJK1 Speaker Recognition SNR performance


SAE/Softmax classifier.

The slow envelopes and TFS of a 12 gammatone filter bank are shown in figure 15 below.

[Plot panels: slow envelope and Temporal Fine Structure for each ERB filter output; the legible panel titles give ERB center frequencies of 100Hz, 180Hz, 275Hz and 730Hz.]

Figure 15 Slow envelope and TFS for a 12 gammatone filter bank

5. SCHEDULE

5.1 Phase I: Proof-of-concept speaker recognition using gammatone wavelet decomposition sampling (GWDS),

SAE and softmax classifier. Completed for this proposal, Spring 2014

5.2 Phase I enhancements: Complete Fall 2014

1. Modify GWDS to operate on the output of each filter rather than on the input speech. Change the wavelets to have variable lengths. Modify the SAE structure so that there is one SAE for each filter output. Repeat the Phase I AWGN speaker recognition experiments.

2. Expand the number of speakers in the TIMIT DR1 region to include 10 speakers in the training set. Extract feature vectors from all 10.

3. Add entropy calculations for the pre- and post-coded speech data.

4. Train the SAE on all ten speakers; this will generate a region-specific set of basis vectors.

5. Repeat items 2 and 3 above for another region from the TIMIT corpus.

6. Use supervised learning to classify the regional dialects of these two TIMIT regions with AWGN levels of 3, 10 and 30 dB.


SAE/Softmax classifier. Complete Spring 2015

1. Implement additive model decomposition sampling. This will have 2 SAEs for each filter output. Also, the decomposition takes place on the filter outputs, not the input speech. Find basis vectors for the slow envelope and temporal fine structure at the output of each filter. Repeat dialect and speaker identification versus AWGN and compare the results with those from 5.2.

6. APPENDIX

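6.1 Max Entropy for the binary symmetric channel derivation

\[ H = -\left(p \log_2 p + q \log_2 q\right), \qquad q = 1 - p, \]

So that

\[ H = \max_{p} \left\{ -\left[\, p \log_2 p + (1-p)\log_2(1-p) \,\right] \right\} \]

Taking the derivative of H with respect to p and setting it equal to 0 yields:

\[ \frac{\partial H}{\partial p} = -\frac{\partial (p \log_2 p)}{\partial p} - \frac{\partial \log_2(1-p)}{\partial p} + \frac{\partial \left(p \log_2(1-p)\right)}{\partial p} \]

\[ \frac{\partial H}{\partial p} = -\left(\frac{p}{p \ln 2} + \log_2 p\right) + \frac{1}{(1-p)\ln 2} - \left(\frac{p}{(1-p)\ln 2}\right) + \log_2(1-p) \]

\[ -\frac{1}{\ln 2} - \log_2 p + \frac{1-p}{(1-p)\ln 2} + \log_2(1-p) = 0 \]

\[ -\log_2 p + \log_2(1-p) = 0 \]

\[ \log_2(1-p) = \log_2 p \]

\[ 2^{\log_2(1-p)} = 2^{\log_2 p} \]

\[ 1 - p = p \]

\[ p = 0.5 \]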

This can be easily extended to the case when n is greater than 2.
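The result can be checked numerically. The following is a minimal sketch, not part of the original derivation: the function names and the grid resolution are illustrative assumptions. It evaluates the binary entropy on a grid, compares the closed-form derivative log2(1 - p) - log2(p) obtained above, and shows one instance of the n > 2 case, where the uniform distribution over n symbols gives log2(n) bits.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def binary_entropy(p):
    """Entropy of a two-symbol source with probabilities p and 1 - p."""
    return entropy_bits([p, 1.0 - p])


def dH_dp(p):
    """Derivative of the binary entropy after the 1/ln(2) terms cancel."""
    return math.log2(1.0 - p) - math.log2(p)


# Grid search over the open interval (0, 1); the grid spacing is an assumption.
grid = [i / 1000 for i in range(1, 1000)]
p_star = max(grid, key=binary_entropy)

print("argmax of H on the grid:", p_star)
print("H at the maximum (bits):", binary_entropy(p_star))
print("dH/dp at p = 0.5:", dH_dp(0.5))

# n > 2: a uniform distribution over n symbols, compared with a skewed one.
n = 4
uniform = [1.0 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]
print("uniform entropy:", entropy_bits(uniform), "log2(n):", math.log2(n))
print("skewed entropy:", entropy_bits(skewed))
```

The derivative is positive below p = 0.5 and negative above it, so the stationary point found in the derivation is a maximum rather than a minimum.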

6.2 Backpropagation partial derivatives for updating the weights and bias

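The autoencoder uses the inputs as the output targets. The output error is given as,

\[ e_j = \hat{x}_j - a_j \]

For the autoencoder,

\[ \hat{x}_j = x_j \]

So that,

\[ e_j = x_j - a_j \]

The error energy for the j-th output node for the n-th training epoch is:

\[ \mathcal{E}_j(n) = \frac{1}{2} e_j^2(n) \]

The total instantaneous error energy is,

\[ \mathcal{E}(n) = \frac{1}{2} \sum_j e_j^2(n) \]

To update the output weights, find the partial derivative of the total instantaneous error energy with respect to the output weights. Applying the chain rule,

\[ \frac{\partial \mathcal{E}(n)}{\partial W_{ji}(n)} = \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \, \frac{\partial e_j(n)}{\partial a_j(n)} \, \frac{\partial a_j(n)}{\partial z_j(n)} \, \frac{\partial z_j(n)}{\partial W_{ji}(n)} \]

\[ \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} = e_j(n) \]

And,

\[ \frac{\partial e_j(n)}{\partial a_j(n)} = -1 \]

Also,

\[ \frac{\partial a_j(n)}{\partial z_j(n)} = f'(z_j(n)) \]

Finally,

\[ \frac{\partial z_j(n)}{\partial W_{ji}(n)} = a_i(n) \]

The local gradient for the output layer, which gives the sensitivities for the activations, is by definition,

\[ \delta_j(n) = \frac{\partial \mathcal{E}(n)}{\partial z_j(n)} = \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \, \frac{\partial e_j(n)}{\partial a_j(n)} \, \frac{\partial a_j(n)}{\partial z_j(n)} \]

\[ \delta_j(n) = -e_j(n) f'(z_j(n)) \]

For a sigmoid function,

\[ f'(z_j(n)) = a_j(n)\left(1 - a_j(n)\right) \]

For a tanh function,

\[ f'(z_j(n)) = 1 - a_j^2(n) \]

The output weight update is then,

\[ \Delta W_{ji} = \delta_j(n) \, a_i(n) \]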
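The output-layer expressions can be verified with a small numerical experiment. The sketch below is illustrative only: the layer sizes, weights, activations and function names are hypothetical and do not come from the proposal's experiments. It assumes a sigmoid output layer with z_j = sum_i W_ji a_i + b_j, computes delta_j = -e_j f'(z_j) and the gradient delta_j a_i, and compares the gradient with a central finite difference of the total instantaneous error energy.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W, b, a_in):
    """Output activations a_j = f(z_j), with z_j = sum_i W[j][i] * a_in[i] + b[j]."""
    return [sigmoid(sum(w * a for w, a in zip(row, a_in)) + bj)
            for row, bj in zip(W, b)]


def error_energy(W, b, a_in, target):
    """Total instantaneous error energy: 0.5 * sum_j (x_j - a_j)^2."""
    a_out = forward(W, b, a_in)
    return 0.5 * sum((x - a) ** 2 for x, a in zip(target, a_out))


def output_gradients(W, b, a_in, target):
    """Return (delta, grad_W): delta_j = -e_j * a_j * (1 - a_j), grad_W[j][i] = delta_j * a_i."""
    a_out = forward(W, b, a_in)
    delta = [-(x - a) * a * (1.0 - a) for x, a in zip(target, a_out)]
    grad_W = [[d * a for a in a_in] for d in delta]
    return delta, grad_W


def numerical_grad(W, b, a_in, target, j, i, eps=1e-6):
    """Central finite difference of the error energy with respect to W[j][i]."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[j][i] += eps
    W_minus[j][i] -= eps
    return (error_energy(W_plus, b, a_in, target)
            - error_energy(W_minus, b, a_in, target)) / (2.0 * eps)


# Hypothetical values: 3 hidden activations feeding 2 output nodes.
a_hidden = [0.2, 0.7, 0.5]
x_target = [0.1, 0.9]
W = [[0.3, -0.4, 0.1],
     [-0.2, 0.5, 0.6]]
b = [0.05, -0.1]

delta, grad_W = output_gradients(W, b, a_hidden, x_target)
max_diff = max(abs(grad_W[j][i] - numerical_grad(W, b, a_hidden, x_target, j, i))
               for j in range(len(W)) for i in range(len(a_hidden)))
print("largest |analytic - numerical| gradient difference:", max_diff)
```

Since z_j depends on the bias b_j with unit slope, the corresponding bias gradient in this sketch is delta_j itself. A gradient-descent step would subtract a learning-rate multiple of these gradients from the weights and biases; the learning rate is not specified in this appendix.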

The weight update equations for the hidden layer are more complicated due to the lack of an explicit error term.

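The local gradient is defined as,

\[ \delta_j(n) = \frac{\partial \mathcal{E}(n)}{\partial z_j(n)} = \frac{\partial \mathcal{E}(n)}{\partial e_j(n)} \, \frac{\partial e_j(n)}{\partial a_j(n)} \, \frac{\partial a_j(n)}{\partial z_j(n)} \]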

REFERENCES

[1] B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–9, 1996.

[2] J.J. Hunt, P. Dayan, G.J. Goodhill. Sparse Coding Can Predict Primary Visual Cortex Receptive Field Changes Induced by Abnormal Visual Input. PLoS Biol 9(5): e1003005, 2013.

[3] E.C. Smith and M.S. Lewicki. Efficient auditory coding. Nature, 439(7079):978–82, 2006.

[4] A. Saxe, M. Bhand, R. Mudur, B. Suresh, A. Ng. Unsupervised learning of primary cortical receptive fields and receptive field plasticity. NIPS, pages 1971-1979, 2011.

[5] H. Lee, Y. Largman, P. Pham, A. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. Advances in Neural Information Processing Systems (NIPS) 22, 2009.

[6] B.C. Moore. The Role of Temporal Fine Structure Processing in Pitch Perception, Masking, and Speech Perception for Normal-Hearing and Hearing-Impaired People. JARO 9:399-406, 2008.

[7] X. Li. Temporal Fine Structure and Applications to Cochlear Implants. PhD dissertation, University of Washington, 2013.

[8] R.V. Shannon, F. Zeng, V. Kamath, J. Wygonski, M. Ekelid. Speech recognition with primarily temporal cues. Science, vol. 270, pp. 303-304, 1995.

[9] C.E. Stilp, K.R. Kluender. Cochlear-scaled Entropy, not consonants, vowels, or time, best predicts speech intelligibility. PNAS vol. 107, no. 27, 2010.

[10] M. Schmuckler, D. Gilden. Auditory perception of fractal contours. J Exp Psychol Hum Percept Perform 19: 641–660, 1993.

[11] T. Overath, R. Cusack, S. Kumar, K. Kriegstein, J. Warren, M. Grube, R. Carlyon, D. Griffiths. An Information Theoretic Characterization of Auditory Encoding. PLoS Biol 5(11): e288, 2007.

[12] A.D. Patel, E. Balaban. Temporal patterns of human cortical activity reflect tone sequence structure. Nature, 440. 2000.

[13] T. Quatieri. Discrete-Time Speech Signal Processing. Prentice Hall PTR, 2002.

[14] V.B. Mountcastle. An organizing principle for cerebral function: the unit module and distributed system. Pages 7-50. MIT Press, Cambridge, MA, 1978.

[15] M. Kendrick. Tasting the Light: Device Lets the Blind “See” with their Tongues. Scientific American, Aug 13, 2009.

[16] J.G. Proakis. Digital Communications, 5th Edition. McGraw Hill, 2007.

[17] P. Pojman. Ernst Mach. The Stanford Encyclopedia of Philosophy. Winter edition 2011.

[18] E.C. Banks. The Philosophical Roots of Ernst Mach’s Economy of Thought. Synthese Vol. 139, Issue 1, pp. 23-53, 2004.

[19] E.C. Tolman. Cognitive Maps in Rats and Men. The Psychological Review, 55(4), 189-208, 1948.

[20] K. Craik. The Nature of Explanation. Cambridge University Press, 1943.

[21] S. Haykin. Neural Networks and Learning Machines, third edition. Pearson Education, Inc. Prentice Hall, 2009.

[22] M. Shu, A. Fyshe. Sparse Autoencoders for Word Decoding from Magnetoencephalography. Not yet published.

[23] Andrew Ng. Unsupervised Feature Learning and Deep Learning Tutorial. http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial. Stanford University Web site. 2013.

[24] R. Patterson, I. Nimmo-Smith. An Efficient Auditory Filterbank Based on the Gammatone Function. Institute of Acoustics on Auditory Modelling, 1987.

[25] Malcolm Slaney (1998) "Auditory Toolbox Version 2", Technical Report #1998-010, Interval Research Corporation, 1998.

[26] Les Atlas (2012) Decomposition of speech and sound into Modulations and Carriers. http://msrvideo.vo.msecnd.net/rmcvideos/173320/dl/173320.pdf, Microsoft Research & University of Washington.

[27] Rosso O.A., Martin M.T., Figliola A., Keller K., Plastino A. (2006). EEG Analysis using wavelet-based information tools. Journal of Neuroscience Methods 153, 163-182.

[28] Klein D.J., König P., Körding K. (2003) Sparse Spectrotemporal Coding of Sounds. EURASIP Journal on Applied Signal Processing 2003:7, 659-667.
