Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Olivia Klose, Software Development Engineer, Microsoft
Dr. Marcel Tilly, Program Manager, Microsoft
https://www.technologyreview.com/lists/technologies/2013/
Deep Neural Networks
… is inspired by the neural network in the brain
# of Neurons in the brain (~100 billion)
= # of Trees in the Amazon Rainforest (~ 300 billion)
# of Synapses (~ 100 - 1000 trillion)
= # of Leaves in the Amazon Rainforest
[Figure, shown in three builds: speech recognition word error rate (WER %, axis 0% to 100%) by year, 1993 to 2012. Annotated phases: "Improving domain knowledge"; "stuck"; "Deep learning + Big Data + scalable tools".]
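The chart's metric, word error rate (WER), is commonly computed as the word-level edit distance (substitutions, insertions, deletions) between the recognizer output and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code behind the chart. The function name `word_error_rate` is hypothetical, and the reference transcript "this is big" is an assumption about what the speaker said in the Skype Translator example later in the deck.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)


# Assumed reference vs. the raw recognizer output from the deck:
# one insertion ("hum") and one substitution ("pig" for "big").
print(round(word_error_rate("this is big", "this is hum pig"), 3))  # 0.667
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.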
http://arxiv.org/abs/1609.03528
http://blogs.microsoft.com/next/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition
Speech Recognition Breakthrough for the Spoken, Translated Word
Skype Translator
Pipeline: Skype Translator Bots → Skype Service → Automatic Speech Recognition → Speech Correction → Translation → Text To Speech

Skype Translator Bots
• Software “robots”
• Separate and manage audio streams

Automatic Speech Recognition
• Machine Learning
• Deep Neural Network
• New language = new training
Example output: "this is hum pig"

Speech Correction
• Punctuation
• Capitalization
• Disfluency removal
• Lattice Rescoring
Example: "this is hum pig" → "this is hum pig." → "This is hum pig." → "This is pig." → "This is big."

Translation
• Microsoft Translator core API
• Statistical Machine Translation
• 45 supported languages
Example: "This is big." → "C’est grand."

Text To Speech
• Microsoft Translator TTS API
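The speech-correction example can be approximated with a few text transformations applied in the order the slide lists them. This is only a toy sketch under stated assumptions: the function names and the `FILLERS` word list are hypothetical, and the real service presumably uses trained models rather than rules. Lattice rescoring, the step that would turn "pig" into "big", needs the recognizer's alternative hypotheses and is left out.

```python
import string

# Hypothetical filler list for illustration only.
FILLERS = {"um", "uh", "hum", "hmm", "er"}


def add_punctuation(text):
    """Append a period if the utterance has no terminal punctuation."""
    text = text.strip()
    return text if text.endswith((".", "?", "!")) else text + "."


def capitalize(text):
    """Uppercase the first character of the utterance."""
    return text[:1].upper() + text[1:]


def remove_disfluencies(text):
    """Drop filler words, ignoring case and attached punctuation."""
    kept = [w for w in text.split()
            if w.strip(string.punctuation).lower() not in FILLERS]
    return " ".join(kept)


def correct(text):
    """Return the utterance after each correction step, in order."""
    stages = [text]
    for step in (add_punctuation, capitalize, remove_disfluencies):
        stages.append(step(stages[-1]))
    return stages


for stage in correct("this is hum pig"):
    print(stage)
# this is hum pig
# this is hum pig.
# This is hum pig.
# This is pig.
```

Keeping each step a separate function mirrors the staged pipeline on the slide. One limitation of this toy version: a filler word at the very end of the utterance takes its punctuation with it when removed.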
[Figure: input depth image and inferred body parts (no tracking or smoothing), shown in front view, side view, and top view.]
https://www.microsoft.com/en-us/research/video/real-time-human-pose-recognition-in-parts-from-single-depth-images-2/
https://www.microsoft.com/en-us/research/video/handpose-fully-articulated-hand-tracking/
[Figure: semantic segmentation examples with per-region labels: bicycle, road, building, cat, car, grass, water, cow.]
https://www.microsoft.com/en-us/research/publication/semantic-segmentation-as-image-representation-for-scene-recognition/
ImageNet Classification top-5 error (%)
ILSVRC 2010, NEC America: 28.2
ILSVRC 2011, Xerox: 25.8
ILSVRC 2012, AlexNet: 16.4
ILSVRC 2013, Clarifai: 11.7
ILSVRC 2014, VGG: 7.3
ILSVRC 2014, GoogLeNet: 6.7
Human Performance: 5.1
ILSVRC 2015, ResNet: 3.5
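For readers unfamiliar with the metric: a prediction counts as correct under top-5 error if the true class is among the five highest-scoring classes. Below is a minimal sketch of that definition. The function name `top5_error` and the toy scores are made up for illustration; this is not the official ILSVRC evaluation script.

```python
def top5_error(scores, labels, k=5):
    """Percentage of samples whose true label is not among the k
    highest-scoring classes."""
    errors = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            errors += 1
    return 100.0 * errors / len(labels)


# Toy data with 6 classes and 2 samples (made-up numbers).
scores = [
    [0.10, 0.20, 0.30, 0.15, 0.05, 0.20],  # true class 0 ranks 5th: correct
    [0.30, 0.25, 0.20, 0.15, 0.09, 0.01],  # true class 5 ranks 6th: error
]
labels = [0, 5]
print(top5_error(scores, labels))  # 50.0
```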
Microsoft researchers win ImageNet computer vision challenge
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000
AlexNet, 8 layers (ILSVRC 2012)
3x3 conv, 64
3x3 conv, 64, pool/2
3x3 conv, 128
3x3 conv, 128, pool/2
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
fc, 4096
fc, 4096
fc, 1000
VGG, 19 layers (ILSVRC 2014)
GoogLeNet, 22 layers (ILSVRC 2014)
[Figure: GoogLeNet architecture diagram. Stem: Conv 7x7+2(S), MaxPool 3x3+2(S), LocalRespNorm, Conv 1x1+1(V), Conv 3x3+1(S), LocalRespNorm, MaxPool 3x3+2(S). Nine modules follow, each with parallel Conv 1x1+1(S), Conv 3x3+1(S), Conv 5x5+1(S), and MaxPool 3x3+1(S) branches joined by DepthConcat, with a MaxPool 3x3+2(S) after the second and after the seventh module. Two side branches (AveragePool 5x5+3(V), Conv 1x1+1(S), FC, FC, SoftmaxActivation) produce softmax0 and softmax1; the main branch (AveragePool 7x7+1(V), FC, SoftmaxActivation) produces softmax2.]
ResNet, 152 layers (ILSVRC 2015)
7x7 conv, 64, /2, pool/2
3 × [1x1 conv, 64 | 3x3 conv, 64 | 1x1 conv, 256]
8 × [1x1 conv, 128 | 3x3 conv, 128 | 1x1 conv, 512], first conv /2
36 × [1x1 conv, 256 | 3x3 conv, 256 | 1x1 conv, 1024], first conv /2
3 × [1x1 conv, 512 | 3x3 conv, 512 | 1x1 conv, 2048], first conv /2
ave pool, fc 1000
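The layer counts in the captions can be checked against the layer lists above by counting only the weighted layers (convolutional and fully connected) and ignoring pooling. The short sketch below does that arithmetic; the helper name `depth` and the dictionary layout are illustrative choices.

```python
def depth(conv_layers, fc_layers):
    """Depth as the number of weighted (conv + fully connected) layers."""
    return conv_layers + fc_layers


networks = {
    # AlexNet: 5 conv layers + 3 fc layers.
    "AlexNet": depth(5, 3),
    # VGG: conv groups of 2, 2, 4, 4, 4 layers + 3 fc layers.
    "VGG": depth(sum([2, 2, 4, 4, 4]), 3),
    # ResNet: one 7x7 stem conv, then 3 + 8 + 36 + 3 blocks of
    # 3 conv layers each, + 1 fc layer.
    "ResNet": depth(1 + 3 * sum([3, 8, 36, 3]), 1),
}

for name, layers in networks.items():
    print(name, layers)
# AlexNet 8
# VGG 19
# ResNet 152
```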
CNTK: open-source, cross-platform toolkit for learning and evaluating deep neural networks.
Expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks.
Production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.
http://cntk.ai
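# SDim, HDim, LDim: input, hidden, and label dimensions (defined elsewhere)
# Hidden layer parameters and network inputs
B1=Parameter(HDim)
W1=Parameter(HDim, SDim)
X=Input(SDim)
labels=Input(LDim)
# Hidden layer: S1 = Sigmoid(W1 * X + B1)
T1=Times(W1, X)
P1=Plus(T1, B1)
S1=Sigmoid(P1)
# Output layer: P2 = W2 * S1 + B2
B2=Parameter(LDim, 1)
W2=Parameter(LDim, HDim)
T2=Times(W2, S1)
P2=Plus(T2, B2)
# Training criterion and evaluation metric
CrossEntropy=CrossEntropyWithSoftmax(labels, P2)
ErrPredict=ErrorPrediction(labels, P2)
# Node roles used by the trainer
FeatureNodes=(X)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(P2)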
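To make the network description concrete, here is a plain-Python sketch of the forward computation it describes: one sigmoid hidden layer, a linear output layer, softmax cross-entropy as the criterion, and a 0/1 prediction error. It is an illustrative re-implementation, not CNTK code. The helper names are hypothetical, the output layer uses its own bias `B2`, and the dimensions and zero weights in the usage lines are made up.

```python
import math


def matvec(W, x):
    """Matrix-vector product, mirroring Times(W, x)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def plus(a, b):
    """Element-wise sum, mirroring Plus(a, b)."""
    return [u + v for u, v in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-u)) for u in v]


def cross_entropy_with_softmax(labels, z):
    """-sum(labels * log_softmax(z)), computed stably."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(u - m) for u in z))
    return -sum(y * (u - log_sum) for y, u in zip(labels, z))


def error_prediction(labels, z):
    """1.0 if the arg-max class differs from the labelled class, else 0.0."""
    return 0.0 if z.index(max(z)) == labels.index(max(labels)) else 1.0


def forward(X, labels, W1, B1, W2, B2):
    S1 = sigmoid(plus(matvec(W1, X), B1))  # hidden layer
    P2 = plus(matvec(W2, S1), B2)          # output layer (pre-softmax)
    return P2, cross_entropy_with_softmax(labels, P2), error_prediction(labels, P2)


# Made-up sizes: SDim=3, HDim=2, LDim=2, all weights zero.
W1, B1 = [[0.0] * 3 for _ in range(2)], [0.0, 0.0]
W2, B2 = [[0.0] * 2 for _ in range(2)], [0.0, 0.0]
P2, ce, err = forward([1.0, 2.0, 3.0], [1.0, 0.0], W1, B1, W2, B2)
print(P2, ce, err)
```

With all-zero weights both classes get the same score, so the cross-entropy equals log 2, a quick sanity check that the criterion is wired correctly.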
Cognitive Services
Give your solutions a human side
Vision: Computer Vision | Emotion | Face | Video
Speech: Computer Recognition | Speaker Recognition | Speech | Translator
Language: Bing Spell Check | Language Understanding | Linguistic Analysis | Text Analytics | Web Language Model
Knowledge: Academic Knowledge | Entity Linking | Knowledge Exploration | Recommendations
Search: Bing Auto Suggest | Bing Image Search | Bing News Search | Bing Video Search | Bing Web Search
http://microsoft.com/cognitive
Computer Vision API
Content of Image:
Categories v0: [{ "name": "animal", "score": 0.9765625 }]
V1: [{ "name": "grass", "confidence": 0.9999992847442627 },
{ "name": "outdoor", "confidence": 0.9999072551727295 },
{ "name": "cow", "confidence": 0.99954754114151 },
{ "name": "field", "confidence": 0.9976195693016052 },
{ "name": "brown", "confidence": 0.988935649394989 },
{ "name": "animal", "confidence": 0.97904372215271 },
{ "name": "standing", "confidence": 0.9632768630981445 },
{ "name": "mammal", "confidence": 0.9366017580032349, "hint": "animal" },
{ "name": "wire", "confidence": 0.8946959376335144 },
{ "name": "green", "confidence": 0.8844101428985596 },
{ "name": "pasture", "confidence": 0.8332059383392334 },
{ "name": "bovine", "confidence": 0.5618471503257751, "hint": "animal" },
{ "name": "grassy", "confidence": 0.48627158999443054 },
{ "name": "lush", "confidence": 0.1874018907546997 },
{ "name": "staring", "confidence": 0.165890634059906 }]
Describe:
0.975 "a brown cow standing on top of a lush green field"
0.974 "a cow standing on top of a lush green field"
0.965 "a large brown cow standing on top of a lush green field"
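A client consuming tag output like the list above will typically keep only the tags whose confidence clears some threshold. The sketch below parses a subset of the tags shown on the slide and filters them. The function name `confident_tags` and the 0.9 threshold are arbitrary choices for illustration, and no call to the service itself is made.

```python
import json

# A subset of the tag list shown on the slide, as a JSON array.
tags_json = """
[{ "name": "grass", "confidence": 0.9999992847442627 },
 { "name": "cow", "confidence": 0.99954754114151 },
 { "name": "mammal", "confidence": 0.9366017580032349, "hint": "animal" },
 { "name": "bovine", "confidence": 0.5618471503257751, "hint": "animal" },
 { "name": "lush", "confidence": 0.1874018907546997 }]
"""


def confident_tags(raw, threshold=0.9):
    """Return the names of tags whose confidence is at least `threshold`."""
    return [tag["name"] for tag in json.loads(raw)
            if tag["confidence"] >= threshold]


print(confident_tags(tags_json))       # ['grass', 'cow', 'mammal']
print(confident_tags(tags_json, 0.5))  # ['grass', 'cow', 'mammal', 'bovine']
```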