Deep Learning in Alexa
AI Frontiers
Santa Clara, CA
January 11, 2017
Nikko Strom, Sr. Principal Scientist, Alexa Machine Learning
Longer-form talk at AWS re:Invent 2016
https://www.youtube.com/watch?v=TYRckcVm4WE
Deep Learning in Alexa (MAC202)
Large-scale distributed training
Up to 80 EC2 g2.2xlarge GPU instances working in sync to train a model
Thousands of hours of speech training data stored in S3
Large-scale distributed training
All nodes must communicate updates to the model to all other nodes.
GPUs compute model updates fast: think updates per second.
A model update is hundreds of MB.
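A rough back-of-the-envelope sketch of the communication cost described above. The update size (300 MB) and rate (2 updates per second) are illustrative assumptions chosen to match "hundreds of MB" and "updates per second", and the naive scheme in which every worker sends each full update to every other worker is assumed only to show the scale of the problem:

```python
def naive_broadcast_traffic(num_workers, update_mb, updates_per_sec):
    """Return (per-worker outbound MB/s, cluster-wide MB/s) when every
    worker sends each full model update to every other worker."""
    per_worker = update_mb * updates_per_sec * (num_workers - 1)
    total = per_worker * num_workers
    return per_worker, total

# 80 workers is the figure from the talk; 300 MB and 2 updates/s are
# assumed values for illustration only.
per_worker, total = naive_broadcast_traffic(80, 300, 2)
print(f"per worker: {per_worker / 1000:.1f} GB/s, cluster: {total / 1000:.1f} GB/s")
```

Under these assumptions each worker would have to send tens of GB per second, which is why uncompressed all-to-all updates are not practical on commodity networking.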
[Figure: DNN training speed. Frames per second (0 to 600,000) versus number of GPU workers (0 to 80).]
Strom, Nikko. "Scalable Distributed DNN Training using Commodity GPU Cloud Computing." INTERSPEECH. Vol. 7. 2015.
Speech recognition
Sound
→ Signal processing → Feature vectors: [4.7, 2.3, -1.4, …]
→ Acoustic model → Phonetic probabilities: [0.1, 0.1, 0.4, …]
→ Decoder (inference) → Words: increase to 70 degrees
→ Post processing → Text: Increase to 70°
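As a minimal sketch of the last stage only, the post-processing step can be pictured as a set of rewrite rules applied to the decoder's words. The function name `post_process` and its two rules are hypothetical and cover just this one example; a production system would use far richer rules:

```python
import re

def post_process(words):
    """Turn decoder output words into display text (toy rules only)."""
    # Hypothetical rule: "<digits> degrees" becomes "<digits>°".
    text = re.sub(r"\b(\d+) degrees\b", r"\1°", words)
    # Capitalize the first character for display.
    return text[:1].upper() + text[1:]

print(post_process("increase to 70 degrees"))  # prints: Increase to 70°
```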
Transfer learning from English to German
Hidden layer 1
Hidden layer 2
Last hidden layer
æI ɑɜ ʊ … eæI ɑɜ u: … œ
Output layer
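One common way to set up this kind of transfer, assumed here for illustration, is to copy the hidden layers of the trained English network and attach a freshly initialized output layer sized for the German phone set before training continues on German data. The layer sizes, the dictionary layout, and the helper names below are all hypothetical:

```python
import random

def init_layer(n_in, n_out, rng):
    """Return an n_out x n_in weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def transfer(source_model, n_target_phones, rng):
    """Copy all hidden layers; replace the output layer with a fresh one."""
    hidden = [[row[:] for row in layer] for layer in source_model["hidden"]]
    last_hidden_size = len(hidden[-1])  # number of units in the last hidden layer
    output = init_layer(last_hidden_size, n_target_phones, rng)
    return {"hidden": hidden, "output": output}

rng = random.Random(0)
# Made-up sizes: 4 inputs, two hidden layers of 5 units, 3 English phone classes.
english = {
    "hidden": [init_layer(4, 5, rng), init_layer(5, 5, rng)],
    "output": init_layer(5, 3, rng),
}
# 6 German phone classes is an assumed number for the example.
german = transfer(english, n_target_phones=6, rng=rng)
print(german["hidden"] == english["hidden"], len(german["output"]))
```

The sketch covers initialization only; the German model would still need to be trained on German speech.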
The cocktail party problem
“…play some jazz!” spoken amid background chatter (“…blah, blah, blah, blah…”)
Anchored speech detection
Roland Maas, Sree Hari Krishnan Parthasarathi, Brian King, Ruitong Huang, Björn Hoffmeister. “Anchored Speech Detection.” INTERSPEECH. 2016.
Alexa, play some jazz!
Wake word (“Alexa”): the “anchor”, handled by the encoder
Request (“play some jazz!”): speech consistent with the anchor, handled by the decoder
Anchored speech detection
Alexa, play some jazz!
LSTM encoder: speech features from wake word → anchor embedding
LSTM decoder: speech features from request + anchor embedding → endpoint decision
(t: time axis)
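To make the data flow concrete, here is a deliberately simplified sketch. It is not the LSTM encoder-decoder from the cited paper: the encoder is replaced by a plain average of the wake-word feature vectors, and the decoder by a per-frame cosine-similarity threshold. The feature values, the threshold, and all function names are made up for illustration:

```python
import math

def encode_anchor(wake_word_frames):
    """Stand-in encoder: average the wake-word feature vectors."""
    n = len(wake_word_frames)
    dim = len(wake_word_frames[0])
    return [sum(f[d] for f in wake_word_frames) / n for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def decode(request_frames, anchor, threshold=0.8):
    """Stand-in decoder: flag each frame that is consistent with the anchor."""
    return [cosine(f, anchor) >= threshold for f in request_frames]

# Made-up 2-dimensional "speech features".
wake_word = [[1.0, 0.1], [0.9, 0.2]]
request = [[1.0, 0.2], [0.1, 1.0], [0.8, 0.1]]
anchor = encode_anchor(wake_word)
print(decode(request, anchor))  # prints: [True, False, True]
```

The middle frame is rejected because it points in a different direction from the anchor, which is the role the anchor embedding plays when separating the requesting speaker from background speech.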
Speech synthesis
Text: She has 20$ in her pocket.
→ Text normalization: she has twenty dollars in her pocket
→ Grapheme-to-phoneme conversion: ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t
→ Waveform generation
→ Speech
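A minimal sketch of the text normalization step for this one example. The rules (expand a number followed by "$", lowercase, drop punctuation) and the helper names are assumptions for illustration; real normalizers handle dates, abbreviations, singular and plural currency, and much more:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0..99 (toy range only)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize(text):
    """Toy normalizer: expand '<number>$', lowercase, drop punctuation."""
    text = re.sub(r"(\d+)\$",
                  lambda m: number_to_words(int(m.group(1))) + " dollars",
                  text)
    text = re.sub(r"[^\w\s-]", "", text.lower())
    return " ".join(text.split())

print(normalize("She has 20$ in her pocket."))
# prints: she has twenty dollars in her pocket
```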
Concatenative synthesis
Input: ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t
→ Di-phone unit selection (from a di-phone segment database)
→ Speech
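A di-phone spans the transition from one phone into the next, so the first step before unit selection is to turn the phone sequence into overlapping pairs. The sketch below shows only that pairing; the `sil` boundary symbol and the function name are assumptions, and choosing the best recorded segment for each pair from the database is not shown:

```python
def to_diphones(phones, boundary="sil"):
    """Pair each phone with its successor, padding both ends with a
    boundary (silence) symbol."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# Phones for "she has" from the transcription above, stress marks dropped.
print(to_diphones(["ʃ", "i", "h", "æ", "z"]))
# prints: ['sil-ʃ', 'ʃ-i', 'i-h', 'h-æ', 'æ-z', 'z-sil']
```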
Prosody for natural sounding reading
Bi-directional recurrent network
Inputs:
• Phonetic features
• Linguistic features
• Semantic word vectors
Outputs: targets for segment pitch, duration, and intensity
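The point of a bi-directional recurrent network is that the prediction for each segment can draw on context from both earlier and later segments. The following is a tiny sketch of that idea with a plain tanh recurrence. The weights, the two-feature inputs, and the simplified per-unit recurrence are all made up for illustration and are not the architecture used in the talk:

```python
import math

def rnn_pass(inputs, w_in, w_rec):
    """Run a simple tanh RNN (one recurrent weight per unit) over the
    inputs and return the hidden state after every step."""
    state = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        state = [
            math.tanh(sum(w * xi for w, xi in zip(w_in[u], x)) + w_rec[u] * state[u])
            for u in range(len(w_rec))
        ]
        states.append(state)
    return states

def bidirectional(inputs, fwd_in, fwd_rec, bwd_in, bwd_rec):
    """Concatenate forward states with time-aligned backward states."""
    fwd = rnn_pass(inputs, fwd_in, fwd_rec)
    bwd = rnn_pass(inputs[::-1], bwd_in, bwd_rec)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def predict_targets(features, w_out):
    """Linear readout: one (pitch, duration, intensity) triple per segment."""
    return [[sum(w * h for w, h in zip(row, feat)) for row in w_out]
            for feat in features]

# Three segments with two made-up input features each.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_in = [[0.5, -0.3], [0.2, 0.4]]   # 2 hidden units per direction
w_rec = [0.1, -0.2]
hidden = bidirectional(inputs, w_in, w_rec, w_in, w_rec)
w_out = [[0.3, 0.1, -0.2, 0.4],    # pitch
         [0.2, -0.1, 0.5, 0.1],    # duration
         [-0.4, 0.3, 0.2, 0.2]]    # intensity
targets = predict_targets(hidden, w_out)
print(len(targets), len(targets[0]))  # prints: 3 3
```

For brevity the same made-up weights are reused for both directions; a trained network would learn separate forward and backward weights.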
Long-form example
“Over a lunch of diet cokes and lobster salad one
balmy fall day in Boston, Joseph Martin, the
genial, white-haired, former dean of Harvard
medical school, told me how many hours of pain
education Harvard med students get during four
years of medical school.”
Before After