Trends in Large-Scale Data Analysis Mark Liberman University of Pennsylvania


ABSTRACT:

For hundreds of years, scientists and engineers have solved problems of prediction, classification and optimization using physical and statistical models. Computing technology has brought an exponential explosion of data collection and storage, and corresponding changes in modeling and analysis methods. Some of these changes are just easier and faster versions of established techniques, but others represent the development of a fundamentally new concept of information processing, under evocative but vague headings like "Neural Networks", "Deep Learning" and "Artificial Intelligence". This talk will sketch the nature, promise and problems of these developments.

5/3/2019 PREM Symposium 2019 2


What I do: Science and technology of language and speech

What I don’t do: Science and technology of materials (Innovative or otherwise)

So why am I here at the PREM 2019 Symposium on “Data Science for Innovative Materials”?

1. Jorge invited me
2. Some interesting aspects of “Data Science” are shared across application areas

(I think…)


Some (shared?) history:

Evolution (or oscillation?) from physical & mathematical modeling

… to statistical modeling

… to “deep learning”

… to ???


Mid-20th-century history of my field --

Thoughts → Words → Vocal Gestures → Sounds → Words → Thoughts

Vocal Gestures → Vocal Tract States = Acoustic Transfer Functions

Voice Source + Acoustic Transfer Function → Sounds

1940-1975: Focusing on the physics of speech communication

Chiba, T and Kajiyama, M. “The Vowel: Its Nature and Structure”, Tokyo-Kaiseikan Pub. Co., Ltd., Tokyo (1941)


“Formant” Model (Chiba & Kajiyama 1941, Fant 1951, …) –

Assumptions:
1. Source-filter independence
   (Source = larynx; Filter = supra-laryngeal vocal tract)
2. Vocal tract acoustics = plane waves propagating along the axis of a radially-symmetrical tube
   (closed at the larynx, open at the lips)

Results:
1. Filter transfer function is the sum of a set of complex resonances = “formants”
   (caused by standing waves in the hypothesized tube)
2. Only 3 of these resonances are materially affected by (smooth) changes in the tube diameter
   (and the effects of higher resonances can therefore be approximated by a single term)
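
As a rough illustration of where the standing-wave resonances come from: a uniform tube closed at one end and open at the other resonates at odd multiples of a quarter wavelength, F_n = (2n − 1)·c / (4·L). The sketch below assumes a 17.5 cm tube and a sound speed of 350 m/s. Both numbers, and the function name, are illustrative assumptions rather than values from the talk.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=350.0):
    """Resonant frequencies (Hz) of a uniform tube closed at one end, open at the other."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, n_resonances + 1)]

# Assumed tube length of 17.5 cm (illustrative, not from the slides).
formants = tube_resonances(0.175)
print([round(f) for f in formants])  # [500, 1500, 2500]
```

A real vocal tract is not uniform, so actual formants move away from these evenly spaced values as the tube shape changes.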


Dunn, Hugh K. "The calculation of vowel resonances, and an electrical vocal tract." The Journal of the Acoustical Society of America 22, no. 6 (1950): 740-753:

By treating the vocal tract as a series of cylindrical sections, or acoustic lines, it is possible to use transmission line theory in finding the resonances. With constants uniformly distributed along each section, resonances appear as modes of vibration of the tract taken as a whole. […] An electrical circuit based on the transmission line analogy has been made to produce acceptable vowel sounds. This circuit is useful in confirming the general theory and in research on the phonetic effects of articulator movements. The possibility of using such a circuit as a phonetic standard for vowel sounds is discussed.


Gunnar Fant. Transmission Properties of the Vocal Tract with Application to the Acoustic Specifications of Phonemes. Acoustics Laboratory, Massachusetts Institute of Technology, 1951.

Gunnar Fant. Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations. No. 2. Walter de Gruyter, 1960.


Explorations of more complete dynamic physical models --

Flanagan, James L., Kenzo Ishizaka, and Kathy L. Shipley. "Synthesis of speech from a dynamic model of the vocal cords and vocal tract." Bell System Technical Journal 54, no. 3 (1975): 485-506.

We describe a computer model of the human vocal cords and vocal tract that is amenable to dynamic control by parameters directly identified in the human physiology. The control format consequently provides an efficient, parsimonious description of speech information. The control parameters represent subglottal lung pressure, vocal-cord tension and rest opening, vocal-tract shape, and nasal coupling. Using these inputs, we synthesize vowel-consonant-vowel syllables to demonstrate the dynamic behavior of the cord/tract model. We show that inherent properties of the model duplicate phenomena observed in human speech; in particular, cord/tract acoustic interaction, cord vibration, and tract-wall radiation during occlusion, and voicing onset-offset behavior. Finally, we describe an approach to deriving the physiological controls automatically from printed text, and we present sentence-length synthesis obtained from a preliminary system.


Problems:
1. Static sound-to-tube inversion is an underdetermined problem
   (even for longitudinal plane waves in radially-symmetrical tubes without wall losses)
2. Solutions for rapidly-changing articulatory kinematics are harder
3. Dynamic models (robot talkers) are even harder
   (and we have little idea how the physics and physiology really work)

Results:
1. Despite widely-held belief in the crucial role of dynamic articulatory models
   (and many attempts to use them in speech technology)
   there have never been any engineering applications.
2. Engineers’ interest in such models faded after 1980 or so.


The formant model has had a less negative history, because:
1. The model fits the data (sort of, sometimes)
2. It yields a big reduction in dimensionality --
   Formants are 3 slowly-varying inexact numbers, ~100*3 per second
   Digital audio for speech must be sampled at least 8000 times per second
3. The 3 formant dimensions “make sense” phonetically
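
The size of that reduction can be worked out from the two rates quoted above. This is a back-of-the-envelope sketch that counts numbers per second only; it ignores how many bits each number needs, and the constant names are made up for the example.

```python
FORMANTS = 3                # formant values per analysis frame
FRAMES_PER_SECOND = 100     # "~100*3 per second" on the slide
SAMPLES_PER_SECOND = 8000   # minimum waveform sampling rate quoted on the slide

formant_rate = FORMANTS * FRAMES_PER_SECOND
reduction = SAMPLES_PER_SECOND / formant_rate
print(formant_rate, round(reduction, 1))  # 300 26.7
```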


But there are still many problems with this approach:
1. The underlying physical model leaves out many details
   • The vocal tract is not radially symmetrical
   • There are source-tract interactions
   • The nasal cavity creates additional poles and zeros
   • Near closures create zeros
   • There are subglottal resonances during the open phase of glottal oscillation
2. The underlying physical model may have more serious (non-linear) problems
   • There are apparently complex spatially-separated swirling flows
     (aerodynamic rather than acoustic phenomena)
   • The expected longitudinal standing waves seem to be absent
3. Estimation of formant parameters from sound is catastrophically unstable
   (even in synthetic data, where by construction the model fits perfectly,
   arbitrarily small differences in input yield large differences in parameter estimation)
4. Similar problems exist for excitation parameters (e.g. “pitch” and “voice quality”)


Teager, H. M., and S. M. Teager. "Evidence for nonlinear sound production mechanisms in the vocal tract." In Speech production and speech modelling, pp. 241-261. Springer, Dordrecht, 1990.

Much of what speech scientists believe about the mechanisms of speech production and hearing rests less on an experimental base than on a centuries-old faith in linear mathematics. Based on experimental evidence we believe that the momentum waves, or the interactions of the inertia-laden flows leading to various modes of oscillation, within the vocal tract are neither passive nor acoustic. Measurements of flow within the vocal tract indicate that acoustic impedance, or the pressure-flow ratio, is violated. The pressure across any cross section of the tract is constant and does not exhibit the differentials expected from the markedly different separated flows across that same cross section.


“Consider a perfectly spherical cow, radiating milk isotropically…”


So after the mid-1970s:

Engineers mostly abandoned human-defined physical models (time functions of vocal tract parameters or formants) in favor of general spectral parameters

for which small input differences → small output differences

They chose overall statistical models that are simple enough for their (many) parameters to be learned from data,

and speech “perception” or “production” can be treated as global optimization of probabilities.


e.g. MFCCs = “Mel Frequency Cepstrum Coefficients”

Mel = nonlinear warping of frequency scale
Modeled on human auditory psychophysics
Approximates information density of speech
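
The talk does not give a formula for the warping. One widely used variant is mel = 2595·log10(1 + f/700), and the sketch below assumes that variant. The helper names, the 8 bands, and the 4000 Hz upper limit are illustrative choices, not details from the slides.

```python
import math

def hz_to_mel(f_hz):
    """Assumed mel formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_max_hz):
    """Band edges in Hz that are equally spaced on the mel scale."""
    top = hz_to_mel(f_max_hz)
    return [mel_to_hz(top * i / n_bands) for i in range(n_bands + 1)]

edges = mel_band_edges(8, 4000.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print([round(w) for w in widths])
```

Equal steps in mel give bands that widen with frequency, so the low frequencies get finer resolution than the high ones.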


“Cepstrum” – a signal processing pun

= cosine transform of the log amplitude spectrum

The spectrum of the spectrum – from the frequency domain to the quefrency domain.

Why?

Tends to remove correlations among nearby elements in smooth spectra – allows use of diagonal covariance matrices in statistical modeling.
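
A minimal sketch of that definition, using only the standard library: take the amplitude spectrum of a short frame, take its log, then apply a cosine transform. It assumes a naive DFT and an unnormalised DCT-II, and it omits the mel filterbank that a full MFCC front end would apply before the log. The function names and the `floor` parameter are hypothetical.

```python
import cmath
import math

def amplitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT, fine for a short frame)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def dct2(values):
    """Unnormalised DCT-II, the cosine transform used here."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]

def cepstrum(frame, floor=1e-10):
    """Cosine transform of the log amplitude spectrum."""
    log_amp = [math.log(max(a, floor)) for a in amplitude_spectrum(frame)]
    return dct2(log_amp)

# A decaying toy frame; doubling its gain should change only coefficient 0.
frame = [0.9 ** t for t in range(16)]
quiet = cepstrum(frame)
loud = cepstrum([2.0 * x for x in frame])
print(round(loud[0] - quiet[0], 3), round(loud[1] - quiet[1], 3))
```

Overall gain ends up in the zeroth coefficient alone, which is one reason the representation is convenient for statistical modeling.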


And time-series methods like “Hidden Markov Models” = stochastic functions of a Markov chain:

where we can infer the hidden state sequence from observations via Bayes’ Rule and Viterbi decoding…
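
Viterbi decoding can be sketched in a few lines for a discrete HMM. The toy model below (two "sticky" states A and B that mostly emit "a" and "b" respectively) is invented for illustration. The sketch works in the log domain and assumes every probability is non-zero.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden state sequence for a discrete HMM (log domain)."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
            for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(trans_p[p][s]))
            new_best[s] = (best[prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: states tend to persist, emissions are noisy.
states = ("A", "B")
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
emit_p = {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}}

print(viterbi(list("aabaa"), states, start_p, trans_p, emit_p))  # ['A', 'A', 'A', 'A', 'A']
print(viterbi(list("aabbb"), states, start_p, trans_p, emit_p))  # ['A', 'A', 'B', 'B', 'B']
```

In the first sequence the isolated "b" is explained as noise rather than as a state change. That is the global optimization over local decisions that the statistical approach relies on.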


And there’s an efficient technique for learning system parameters from training data:

Baum, Leonard E., Ted Petrie, George Soules, and Norman Weiss. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains." The annals of mathematical statistics 41, no. 1 (1970): 164-171.

Liporace, L. A. PTAH on continuous multivariate functions of Markov chains. Technical Report 80193, Institute for Defense Analysis, Communication Research Department, 1976.


This approach worked – at a cost:

1. Over-simplified models (independence assumptions, etc.)
2. Enormous complexity (many millions of parameters)
3. Many detailed options for architectures and estimation methods
   • Choice requires optimization over a complex algorithmic space
   • Progress depends on thousands of small improvements

…but progress happened!


Hill-climbing in DARPA Speech-To-Text programs:


Four lessons from that experience:

1. Learning is better than programming;
2. Global optimization of gradient local decisions is crucial;
3. Top-down and bottom-up knowledge must be combined;
4. Metrics on shared benchmarks matter.


“Learning is better than programming” –

…but many aspects of early 2000s HLT systems were still “programmed”
via “feature engineering” at both ends
and many structural and algorithmic choices in the middle…

1. Top-down language models rely on combinations of characters into “words” and “phrases”
   with pronunciations given by a dictionary
   and/or by letter-to-sound rules
2. Bottom-up acoustic models rely on MFCCs or similar
3. In the middle, there are many structural and algorithmic choices


SO…“Deep Learning” to the rescue!

F(x) = L(N(L(N(…L(x)))))

• where x is an arbitrary vector input
• L(x) is an affine function ax+b
• N(x) is a non-linear function applied to each vector element separately

…plus some other goodies around the edges…
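
The composition F(x) = L(N(L(x))) can be written out directly. In the sketch below the slide's "ax+b" is read as a matrix A times the vector x plus a bias vector b, and N is taken to be a ReLU; both are assumptions. The weights are set by hand to compute XOR, purely to show that stacking L and N yields a function that no single affine map can represent.

```python
def affine(weights, bias):
    """L(x) = A x + b, with A given as a list of rows."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return layer

def relu(x):
    """N(x): a non-linearity applied to each vector element separately."""
    return [max(0.0, xi) for xi in x]

def compose(*layers):
    """Apply layers left to right: compose(L1, N, L2)(x) == L2(N(L1(x)))."""
    def network(x):
        for layer in layers:
            x = layer(x)
        return x
    return network

# Hand-set (not learned) weights: y = relu(x1 + x2) - 2 * relu(x1 + x2 - 1)
xor_net = compose(
    affine([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),
    relu,
    affine([[1.0, -2.0]], [0.0]),
)

for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, xor_net(x))
```

In practice the weights would come from an optimizer and training data rather than being written by hand.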


This is a universal computing model, in the sense that such a system can be programmed to compute any finite function.

And even better, general optimization techniques can learn model parameters from training data.

(…in the limit, sort of, sometimes..)


So we can do away with “feature engineering” and design a “sequence-to-sequence” model whose inputs are audio waveform samples and whose outputs are text characters.

After all, MFCC analysis is just a bunch of inner products, so why not learn the basis functions and band definitions rather than programming them?
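
To make the "inner products" remark concrete, the cosine transform step can be written as a fixed matrix whose rows are the basis functions, so that each output coefficient is one inner product with the input. This is a sketch assuming an unnormalised DCT-II of length 8 (an illustrative size). A learned front end would replace the hand-built rows with trained weights.

```python
import math

def dct_matrix(n):
    """Rows are the cosine basis functions of an unnormalised DCT-II."""
    return [[math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
            for k in range(n)]

def apply_matrix(matrix, vector):
    """Each output element is the inner product of one matrix row with the input."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

W = dct_matrix(8)
flat = [1.0] * 8                      # a perfectly flat "spectrum"
coeffs = apply_matrix(W, flat)
print([round(c, 6) + 0.0 for c in coeffs])  # only coefficient 0 is non-zero
```

Seen this way, the programmed transform is just one particular setting of a linear layer's weights.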

And a more complicated version of the same story applies to text analysis/synthesis.


Deep Learning solutions do work better –

…but at a cost.


Deep Learning “programs” are increasingly complicated -- CNNs, RNNs, LSTMs, “transformers”: …


Avoiding “feature engineering” increases the number of parameters that need to be learned and the amount of training data and training time needed to learn them.
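
A rough sense of the parameter growth, for the first layer only. Every number here is a hypothetical assumption rather than a figure from the talk: a 512-unit fully connected layer fed either 13 engineered features per frame or 200 raw waveform samples (25 ms at 8000 samples per second).

```python
def dense_parameters(n_inputs, n_outputs):
    """Weights plus biases in one fully connected layer."""
    return n_inputs * n_outputs + n_outputs

HIDDEN_UNITS = 512   # hypothetical layer width
MFCC_INPUTS = 13     # hypothetical engineered feature vector size
RAW_INPUTS = 200     # hypothetical: 25 ms of audio at 8000 samples per second

print(dense_parameters(MFCC_INPUTS, HIDDEN_UNITS))  # 7168
print(dense_parameters(RAW_INPUTS, HIDDEN_UNITS))   # 102912
```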


So people are starting to ask…

Why should our systems have to re-learn everything -- logic, mathematics, physics, acoustics, chemistry, dictionaries, etc. -- all over again for every new problem?

And we’re beginning to see a return to an old idea: systems that have pieces of relevant science “baked in”.

In other words, the old epistemological pendulum is starting to swing back from empiricism towards rationalism.

Metaphor: AI programming should be like video game programming.


Anumanchipalli, Gopala K., Josh Chartier, and Edward F. Chang. "Speech synthesis from neural decoding of spoken sentences." Nature 568, no. 7753 (2019):

Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.


?