Nikko Ström at AI Frontiers: Deep Learning in Alexa

Deep Learning in Alexa

AI Frontiers

Santa Clara, CA

January 11, 2017

Nikko Strom, Sr. Principal Scientist, Alexa Machine Learning

Alexa’s growing family

Alexa in the wild

Alexa’s friends

68°

Alexa’s skills

Deep Learning at scale

Longer-form talk at AWS re:Invent 2016

https://www.youtube.com/watch?v=TYRckcVm4WE

Deep Learning in Alexa (MAC202)

Speech data

16 years = 140,160 hours

≈ 14,016 hours of speech
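The two figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes 365-day years. The ratio it computes suggests the slide counts about one tenth of elapsed time as speech; that reading is an inference from the numbers, not something the slide states.

```python
# Check the slide's arithmetic: 16 years expressed in hours.
HOURS_PER_YEAR = 365 * 24          # assumes non-leap years
total_hours = 16 * HOURS_PER_YEAR  # 140160

# The slide's estimate of speech heard in that time.
speech_hours = 14_016

# Fraction of elapsed time the slide appears to count as speech.
speech_fraction = speech_hours / total_hours

print(total_hours)
print(speech_fraction)
```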

Large-scale distributed training

Up to 80 EC2 g2.2xlarge GPU instances working in sync to train a model

Thousands of hours of speech training data stored in S3

Large-scale distributed training

All nodes must communicate updates to the model to all other nodes.

GPUs compute model updates fast: think updates per second.

A model update is hundreds of MB.
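A back-of-the-envelope estimate shows why this is a bottleneck. The sketch below assumes a naive scheme in which every worker sends its full update directly to every other worker. The 200 MB update size and the rate of one update per second are illustrative assumptions chosen to fit "hundreds of MB" and "updates per second"; only the worker count of 80 comes from the slides.

```python
def naive_outbound_mb_per_s(workers, update_mb, updates_per_s):
    """Outbound traffic per worker if each update is sent, uncompressed,
    to every other worker (a hypothetical naive all-to-all scheme)."""
    return (workers - 1) * update_mb * updates_per_s

# 80 workers is from the slides; 200 MB and 1 update/s are assumptions.
per_worker = naive_outbound_mb_per_s(workers=80, update_mb=200, updates_per_s=1)
print(per_worker, "MB/s per worker")
```

Under these assumptions each worker would need to send 15,800 MB every second, which is why the updates have to be made much smaller before they are exchanged.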

[Figure: DNN training speed. Frames per second (0 to 600,000) versus number of GPU workers (0 to 80).]

Strom, Nikko. "Scalable Distributed DNN Training using Commodity GPU Cloud Computing." INTERSPEECH. Vol. 7. 2015.
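The cited paper reduces communication by sending only gradient elements whose magnitude exceeds a threshold, quantized to a single sign, while keeping the unsent remainder locally for later updates. The following is a minimal sketch of that idea in plain Python, not the paper's implementation. The function names, the dictionary encoding, and the threshold value are assumptions made for illustration.

```python
def compress_update(gradient, residual, tau):
    """Quantize a gradient to sparse +/-1 signs; keep the rest as residual.

    `sparse` maps element index -> +1 or -1 (meaning +tau or -tau).
    Elements below the threshold are not sent; they accumulate in the
    residual and may cross the threshold on a later step.
    """
    sparse = {}
    new_residual = []
    for i, (g, r) in enumerate(zip(gradient, residual)):
        total = g + r
        if total >= tau:
            sparse[i] = 1
            total -= tau
        elif total <= -tau:
            sparse[i] = -1
            total += tau
        new_residual.append(total)
    return sparse, new_residual


def decompress_update(sparse, size, tau):
    """Rebuild the dense update that the receiving worker applies."""
    dense = [0.0] * size
    for i, sign in sparse.items():
        dense[i] = sign * tau
    return dense


gradient = [0.5, -0.1, 2.5, -3.0]
sparse, residual = compress_update(gradient, [0.0] * 4, tau=2.0)
sent = decompress_update(sparse, size=4, tau=2.0)
print(sparse)    # only two of the four elements are transmitted
print(residual)  # what stays behind on the sending worker
```

Nothing is lost in this scheme: the transmitted update plus the residual always equals the original gradient, so small contributions are delayed rather than discarded.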

Speech Recognition

Sound

Signal processing → Feature vectors: [4.7, 2.3, -1.4, …]

Acoustic model → Phonetic probabilities: [0.1, 0.1, 0.4, …]

Decoder (inference) → Words: increase to 70 degrees

Post processing → Text: Increase to 70°
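The last stage of this pipeline, post processing, turns the decoder's word sequence into display text. A minimal sketch of that step for the slide's example follows. It handles only capitalization and the "degrees" pattern, and the function name `post_process` is hypothetical.

```python
import re


def post_process(words):
    """Convert decoder output into display text (illustrative rules only)."""
    # Replace "<number> degrees" with the number followed by a degree sign.
    text = re.sub(r"(\d+) degrees\b", r"\1°", words)
    # Capitalize the first character of the utterance.
    return text[:1].upper() + text[1:]


print(post_process("increase to 70 degrees"))
```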

Speech recognition

Transfer learning from English to German

Hidden layer 1

Hidden layer 2

Last hidden layer

Output layer (English): æ I ɑ ɜ ʊ … e

Output layer (German): æ I ɑ ɜ u: … œ
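One common way to realize this kind of transfer is to copy the hidden layers of the trained English network and attach a freshly initialized output layer sized for the German phoneme set. The sketch below illustrates that under stated assumptions: the model is a plain dictionary of weight matrices, and the layer sizes (including 40 English and 45 German output units) are made up for the example. The slides do not say which layers are copied or retrained.

```python
import random


def init_layer(n_in, n_out, rng):
    """Small random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def transfer(source_model, n_target_outputs, rng):
    """Copy every hidden layer; replace the output layer with a new one."""
    target = {
        name: [row[:] for row in weights]
        for name, weights in source_model.items()
        if name != "output"
    }
    n_hidden = len(source_model["output"][0])
    target["output"] = init_layer(n_hidden, n_target_outputs, rng)
    return target


rng = random.Random(0)
english = {
    "hidden_1": init_layer(4, 8, rng),
    "hidden_2": init_layer(8, 8, rng),
    "output": init_layer(8, 40, rng),   # 40 English units: an assumption
}
german = transfer(english, n_target_outputs=45, rng=rng)  # 45: an assumption
print(len(german["output"]), "German output units on top of copied hidden layers")
```

After this initialization the German model would still be trained on German data; the copied layers only provide a better starting point than random weights.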

The cocktail party problem

Alexa!

[Background speakers: blah, blah, blah…]

The cocktail party problem

… play some jazz!

[Background speakers: …blah, blah, blah, blah…]

Anchored speech detection

Roland Maas, Sree Hari Krishnan Parthasarathi, Brian King, Ruitong Huang, Björn Hoffmeister. “Anchored Speech Detection.” INTERSPEECH. 2016.

Alexa, play some jazz!

Wake word: “Anchor” (Encoder)

Request: Speech consistent with anchor (Decoder)

Anchored speech detection

Alexa, play some jazz!

LSTM Encoder: speech features from wake word → anchor embedding

LSTM Decoder: anchor embedding + speech features from request → endpoint decision (over time t)
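The idea of anchoring can be illustrated without a neural network. The sketch below is a deliberate simplification and not the method in the cited paper: it replaces the LSTM encoder with an average of the wake-word feature vectors and replaces the LSTM decoder with a cosine-similarity threshold. All names, feature values, and the threshold are hypothetical.

```python
import math


def mean_vector(frames):
    """Stand-in for the encoder: average the wake-word feature frames."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def anchored_frames(wake_word_frames, request_frames, threshold=0.8):
    """Stand-in for the decoder: flag frames that resemble the anchor."""
    anchor = mean_vector(wake_word_frames)
    return [cosine(anchor, frame) >= threshold for frame in request_frames]


wake = [[1.0, 0.1], [0.9, 0.2]]                  # toy wake-word features
request = [[1.0, 0.0], [0.1, 1.0], [0.8, 0.3]]   # second frame: another talker
flags = anchored_frames(wake, request)
print(flags)
```

Frames flagged `True` are treated as coming from the speaker who said the wake word; a run of `False` frames would indicate that the desired speaker has stopped, which is the information an endpoint decision needs.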

Speech synthesis

Speech synthesis

Text: She has 20$ in her pocket.

Text normalization: she has twenty dollars in her pocket

Grapheme-to-phoneme conversion: ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t

Waveform generation: Speech
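The text normalization step in this example can be sketched with a small rule-based function. This is an illustration only: it covers dollar amounts from 0 to 99, lowercasing, and punctuation removal, and the names `number_to_words` and `normalize` are hypothetical. A production normalizer handles far more cases.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand dollar amounts, drop punctuation, and lowercase."""
    def expand(match):
        amount = int(match.group(1) or match.group(2))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Accept both "20$" (as written on the slide) and "$20".
    text = re.sub(r"(\d+)\$|\$(\d+)", expand, text)
    text = re.sub(r"[^\w\s-]", "", text)
    return " ".join(text.lower().split())


print(normalize("She has 20$ in her pocket."))
```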

Concatenative synthesis

Input: ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t

Di-phone unit selection (from the di-phone segment database)

Speech
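A di-phone spans the transition from one phone to the next, so a phone sequence is first rewritten as overlapping pairs and each pair is then looked up in the segment database. The sketch below shows that bookkeeping under simplifying assumptions: the database is a plain dictionary, unit IDs are made-up strings, and the first candidate is always chosen. Real unit selection scores candidates by how well they match the target and join with their neighbors, which this sketch does not attempt.

```python
def to_diphones(phones, boundary="sil"):
    """Turn a phone sequence into overlapping pairs, padded with silence."""
    padded = [boundary] + list(phones) + [boundary]
    return list(zip(padded, padded[1:]))


def select_units(diphones, database):
    """Pick one recorded unit per di-phone (first candidate, for simplicity)."""
    units = []
    for diphone in diphones:
        candidates = database.get(diphone)
        if not candidates:
            raise KeyError(f"no recorded unit for di-phone {diphone}")
        units.append(candidates[0])
    return units


# Hypothetical database: di-phone -> list of recorded segment IDs.
database = {
    ("sil", "ʃ"): ["unit_017"],
    ("ʃ", "i"): ["unit_203", "unit_204"],
    ("i", "sil"): ["unit_088"],
}
diphones = to_diphones(["ʃ", "i"])
units = select_units(diphones, database)
print(diphones)
print(units)
```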

Prosody for natural sounding reading

Bi-directional recurrent network

Inputs:

• Phonetic features

• Linguistic features

• Semantic word vectors

Outputs: targets for segment pitch, duration, intensity
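What makes a recurrent network bi-directional is that each position sees context from both earlier and later in the sequence, which matters for prosody because how a word is read depends on what follows it. The sketch below shows only that mechanism, with scalar inputs and fixed, untrained weights. The weight values and function names are assumptions, and the output layer that would map the states to pitch, duration, and intensity targets is omitted.

```python
import math


def run_direction(inputs, w_in, w_rec):
    """Simple recurrent pass: each state depends on the input and prior state."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional(inputs, w_in=0.5, w_rec=0.5):
    """Pair a left-to-right state with a right-to-left state per position."""
    forward = run_direction(inputs, w_in, w_rec)
    backward = run_direction(inputs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))


a = bidirectional([1.0, 0.0, 0.0])
b = bidirectional([1.0, 0.0, 5.0])   # only the last input differs
print(a[0])
print(b[0])
```

Changing the final input leaves the forward state at the first position untouched but changes its backward state, showing that the first segment's representation takes later context into account.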

Long-form example

“Over a lunch of diet cokes and lobster salad one balmy fall day in Boston, Joseph Martin, the genial, white-haired, former dean of Harvard medical school, told me how many hours of pain education Harvard med students get during four years of medical school.”

[Audio samples: Before | After]

Thank you!
