The slides accompanying the "Becoming a Kinect Hacker^h Innovator" talk from SAPO Codebits 2011. https://codebits.eu/intra/s/session/223
Becoming a Kinect Hacker Innovator
Who We Are
Jeff Sipko
– SDE on KinectShare / KinectHack
– Part of Good Science, working on Fun Labs
KinectShare
Overview
History
Capabilities
Code
Demo
Atari 2600
Best video game system…ever…
Started as a $30,000 prototype
Vision: Shift the world from thinking “We need to understand technology” to “Technology needs to understand us”
Option A:
Why Kinect?
Why Kinect?
Option You:
The Challenge
• Find the people in the scene, ignore background
• Find their limbs and joints, which person is which
• Find and track their gestures
• Map the gestures to meaning and commands
• Also: recognize faces
• Also: recognize voices and commands
• Also: reduce hardware costs to consumer levels
P.S. And play the game!
“What are those things?”
Multi-array Microphone
RGB Camera
IR Camera
IR Camera: Depth Computation
IR Camera: Depth Map
IR Camera: Provided Data
Depth and segmentation map
Skeletal - Provided Data
Vision Algorithm (Summary)
• Quickly and accurately predict 3D positions of body joints
• From a single depth image, using no temporal information
• Object recognition approach
• Intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem
• Large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc.
• Generate confidence-scored 3D proposals of several body joints by re-projecting the classification result and finding local modes
• System runs at 200 frames per second on consumer hardware
• Evaluation shows high accuracy on both synthetic and real test sets
• State-of-the-art accuracy in comparison with related work and improved generalization over exact whole-skeleton nearest neighbor matching
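The "finding local modes" step can be illustrated with a small sketch. Mean shift with a Gaussian kernel is one common way to find a local mode in a weighted point cloud; the slides do not say which method the shipping pipeline uses, so treat this as an assumption. The names ModeFinder and FindMode, the bandwidth parameter, and the idea of using per-pixel class probabilities as weights are all illustrative, not part of any Kinect API.

```csharp
using System;

public static class ModeFinder
{
    // Runs weighted mean shift with a Gaussian kernel from a starting guess.
    // points:  3D positions of pixels classified as one body part (hypothetical input)
    // weights: per-point confidence, e.g. the classifier's probability (assumption)
    // Returns the position the iteration settles on, i.e. a local mode estimate.
    public static double[] FindMode(double[][] points, double[] weights,
                                    double[] start, double bandwidth,
                                    int maxIterations = 50)
    {
        double[] current = (double[])start.Clone();
        for (int iter = 0; iter < maxIterations; iter++)
        {
            double[] sum = new double[current.Length];
            double total = 0.0;
            for (int i = 0; i < points.Length; i++)
            {
                // Squared distance from the current estimate to point i.
                double sq = 0.0;
                for (int d = 0; d < current.Length; d++)
                {
                    double diff = points[i][d] - current[d];
                    sq += diff * diff;
                }
                // Gaussian kernel weight, scaled by the point's confidence.
                double k = weights[i] * Math.Exp(-sq / (2.0 * bandwidth * bandwidth));
                total += k;
                for (int d = 0; d < current.Length; d++)
                {
                    sum[d] += k * points[i][d];
                }
            }
            if (total <= 0.0)
            {
                break; // every point is too far away to contribute
            }
            // Move to the kernel-weighted mean and measure how far we moved.
            double moved = 0.0;
            for (int d = 0; d < current.Length; d++)
            {
                double next = sum[d] / total;
                moved += (next - current[d]) * (next - current[d]);
                current[d] = next;
            }
            if (moved < 1e-12)
            {
                break; // converged
            }
        }
        return current;
    }
}
```

Starting the search from several different points and keeping the distinct results would yield several proposals per body part; summing the kernel weights at each result is one plausible way to attach a confidence score.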
In Practice
• Collect training data – thousands of visits to global households, filming real users; the Hollywood motion capture studio generated billions of images
• Apply state-of-the-art object recognition research
• Apply state-of-the-art real-time semantic segmentation
• Build a training set – classify each pixel’s probability of being in any of 32 body segments, determine the probabilistic cluster of body configurations consistent with those, present the most probable
• Millions of training images, millions of classifier parameters
• Hard to parallelize: new algorithm for distributed decision-tree training
• Fun Fact: Major use of DryadLINQ (large-scale distributed cluster computing)
Motorized Tilt
±28° up / down
The Audio System
Input Stream (What the mic array hears)
Post-MEC (What APIs present)
MEC
Demo: Multichannel Echo Cancellation
Beam Forming / Source Localization
Automatically points to loudest sound source
Manually steer the direction of the listening beam in 10° increments
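As a rough illustration of steering in 10° increments, the sketch below snaps a requested direction to the nearest step. The ±50° limit, the class name BeamSteering, and the method SnapBeamAngle are assumptions made for this example; the slide only states the 10° step size.

```csharp
using System;

public static class BeamSteering
{
    public const int StepDegrees = 10;   // step size stated on the slide
    public const int MaxAbsDegrees = 50; // assumed steering limit, not from the slide

    // Clamps the request to the assumed range, then rounds to the nearest step.
    // Exact midpoints (e.g. 5 degrees) round away from zero.
    public static int SnapBeamAngle(double requestedDegrees)
    {
        double clamped = Math.Max(-MaxAbsDegrees, Math.Min(MaxAbsDegrees, requestedDegrees));
        return (int)Math.Round(clamped / StepDegrees, MidpointRounding.AwayFromZero) * StepDegrees;
    }
}
```

The snapped value is what an application would hand to whatever beam-angle setting the audio API exposes.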
Speech Recognition
Acoustic model   Language   Countries/Regions
de-DE            German     Germany
en-AU            English    Australia, New Zealand
en-GB            English    Ireland, United Kingdom
en-US            English    Canada, United States
es-ES            Spanish    Spain
es-MX            Spanish    Mexico
fr-CA            French     Canada
fr-FR            French     France
it-IT            Italian    Italy
ja-JP            Japanese   Japan
Kinect Speech Recognition Languages (as of June 2011 SDK)
Sample Grammar (from Simple Speech Recognition sample)
<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals"
         sapi:alphabet="x-microsoft-ups"
         xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
         xml:lang="en-US" root="rootrule" version="1.0">
  <rule id="rootrule" scope="public">
    <one-of>
      <item>view inventory <tag>view inventory</tag> </item>
      <item>show quests <tag>show quests</tag> </item>
      <item>pause game <tag>pause game</tag> </item>
      <item>open <token sapi:pron="S P EH L B UH K">spellbook</token> <tag>open spellbook</tag> </item>
      […]
    </one-of>
  </rule>
</grammar>
Custom pronunciation
Property tag
Sample Grammar: Localized (from Simple Speech Recognition sample)
<?xml version="1.0" encoding="utf-16"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals"
         sapi:alphabet="x-microsoft-ups"
         xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
         xml:lang="es-ES" root="rootrule" version="1.0">
  <rule id="rootrule" scope="public">
    <one-of>
      <item>ver inventario <tag>view inventory</tag> </item>
      <item>muestra aventuras <tag>show quests</tag> </item>
      <item>pausa el juego <tag>pause game</tag> </item>
      <item>abre libros de hechizos <tag>open spellbook</tag> </item>
      […]
    </one-of>
  </rule>
</grammar>
Property tag
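Because the English and Spanish grammars emit the same property tags, application code can key off the tag rather than the spoken text. The sketch below shows one way to do that; GameCommand, CommandMap, and TryResolve are hypothetical names, and it assumes the tag string reaches the application as the recognition result's semantic value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical command set matching the four tags shown in the sample grammars.
public enum GameCommand { ViewInventory, ShowQuests, PauseGame, OpenSpellbook }

public static class CommandMap
{
    // Keys are the <tag> literals from the grammars, not the spoken phrases.
    private static readonly Dictionary<string, GameCommand> ByTag =
        new Dictionary<string, GameCommand>(StringComparer.OrdinalIgnoreCase)
        {
            { "view inventory", GameCommand.ViewInventory },
            { "show quests", GameCommand.ShowQuests },
            { "pause game", GameCommand.PauseGame },
            { "open spellbook", GameCommand.OpenSpellbook }
        };

    // Returns true and sets command when the tag is known; false otherwise.
    public static bool TryResolve(string semanticTag, out GameCommand command)
    {
        command = default(GameCommand);
        return semanticTag != null && ByTag.TryGetValue(semanticTag, out command);
    }
}
```

With this in place, "abre libros de hechizos" and "open spellbook" both resolve to the same command, and adding a language means adding a grammar file rather than touching the dispatch code.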
What does this allow?
• Watch, track, and render people’s motion
• Recognize faces and facial expressions
• Recognize voices, words, and tone
What else? What is coming?
• Background Removal
• Seated Skeletal Tracking
• Finger Tracking
• Head Tracking
• Digital Object Creation
• ???? = You
Emotionally Aware
The Kinect SDK
Provides both Unmanaged and Managed API
• Unmanaged API – Concepts work in C++
• Managed API – Concepts work in both VB/C#
Samples & documentation to get you started
Assumes some programming experience
http://kinectforwindows.org
What’s in the SDK?
Raw sensor streams
Access to raw data streams from the depth sensor, color camera sensor, and four-element microphone array enables developers to build upon the low-level streams that are generated by the Kinect sensor.
Skeletal tracking
The capability to track the skeleton image of one or two people moving within the Kinect field of view makes it easy to create gesture-driven applications.
Advanced audio capabilities
Audio processing capabilities include sophisticated acoustic noise suppression and echo cancellation, beam formation to identify the current sound source, and integration with the Windows speech recognition API.
Sample code and documentation
The SDK includes more than 100 pages of technical documentation. In addition to built-in help files, the documentation includes detailed walkthroughs for most samples provided with the SDK.
Easy installation
The SDK installs quickly, requires no complex configuration, and the complete installer size is less than 100 MB. Developers can get up and running in just a few minutes with a standard standalone Kinect sensor unit (widely available at retail outlets).
Designed for non-commercial purposes; a commercial version is expected later.
Windows 7 – C++, C#, or Visual Basic in Microsoft Visual Studio 2010.
Windows SDK: Architecture
The Tools
• Visual Studio 2010 (Express or other)
• .NET Framework 4.0
• Kinect for Windows SDK
• Microsoft Speech Framework
• Coding4Fun Kinect Toolkit
• XNA Game Studio
Let’s Get Started!
… Literally!
Runtime nui = new Runtime();
nui.Initialize(…);
And
nui.VideoStream.Open(…);
nui.DepthStream.Open(…);
… and then later:
nui.Uninitialize();
Gimme Data
Polling Method
nui.DepthStream.GetNextFrame(timeout)
nui.VideoStream.GetNextFrame(timeout)
nui.SkeletonEngine.GetNextFrame(timeout)
Event Method
nui.DepthFrameReady += new EventHandler<ImageFrameReadyEventArgs>(fn)
nui.SkeletonFrameReady += new EventHandler<SkeletonFrameReadyEventArgs>(fn)
nui.VideoFrameReady += new EventHandler<ImageFrameReadyEventArgs>(fn)
What you get
Color Image:
• Byte array in B8G8R8 format by default
• Also supports YUV
Depth Image:
• 13 high-order bits contain the distance in mm
• 3 low-order bits contain the player index
• Only valid values are 0, 1, 2
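The depth layout above (13 high-order bits of distance, 3 low-order bits of player index) can be unpacked with a shift and a mask. This is a minimal sketch: DepthPixel, Unpack, and ReadPixel are illustrative names, and the two-bytes-per-pixel little-endian frame layout in ReadPixel is an assumption rather than something the slide states.

```csharp
using System;

public static class DepthPixel
{
    // Splits one 16-bit depth value into its two fields:
    // the low 3 bits are the player index, the upper 13 bits are the distance in mm.
    public static void Unpack(ushort raw, out int distanceMm, out int playerIndex)
    {
        playerIndex = raw & 0x7;
        distanceMm = raw >> 3;
    }

    // Assumption: the frame is a byte array with two bytes per pixel,
    // least significant byte first.
    public static ushort ReadPixel(byte[] frame, int pixelIndex)
    {
        int offset = pixelIndex * 2;
        return (ushort)(frame[offset] | (frame[offset + 1] << 8));
    }
}
```

For example, a raw value of 8002 is (1000 << 3) | 2, so it unpacks to a distance of 1000 mm and player index 2.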
Speech
Start the Recognition Engine
var sre = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));
Create a grammar
Choices colors = new Choices();
colors.Add(new string[] {"red"});
colors.Add(new string[] {"green"});
colors.Add(new string[] {"blue"});
GrammarBuilder gb = new GrammarBuilder();
gb.Append(colors);
// Create the Grammar instance.
Grammar g = new Grammar(gb);
sre.LoadGrammar(g);
Speech – Grammar From File
// Compile the .grxml source into a binary .cfg grammar, then load it.
FileStream fs = new FileStream(grammarPath + "foobar.cfg", FileMode.Create);
SrgsGrammarCompiler.Compile(grammarPath + "foobar.grxml", (Stream)fs);
fs.Close();
Grammar g = new Grammar(grammarPath + "foobar.cfg", "rootrule");
Speech – cont
Register for speech events
sre.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(sre_SpeechRecognized);
and
void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    MessageBox.Show(e.Result.Text);
}
Show me the demo!
Kinect Fun Labs…
Q & A
Bueller?…Bueller?