Becoming a Kinect Hacker Innovator v2


The slides accompanying the "Becoming a Kinect Hacker^h Innovator" talk from SAPO Codebits 2011. https://codebits.eu/intra/s/session/223


Becoming a Kinect Hacker Innovator

Who We Are

Jeff Sipko
– SDE on KinectShare / KinectHack
– Part of Good Science, working on Fun Labs

KinectShare

Overview

• History
• Capabilities
• Code
• Demo

Atari 2600
Best video game system… ever…

Started as a $30,000 prototype

Vision: Shift the world from thinking "We need to understand technology" to "Technology needs to understand us"

Option A:

Why Kinect?

Why Kinect?

Option You:

The Challenge

• Find the people in the scene, ignore background
• Find their limbs and joints, which person is which
• Find and track their gestures
• Map the gestures to meaning and commands

• Also: recognize faces
• Also: recognize voices and commands
• Also: reduce hardware costs to consumer levels

P.S. And play the game!

“What are those things?”

Multi-array Microphone

RGB Camera

IR Camera

IR Camera: Depth Computation

IR Camera: Depth Map

IR Camera: Provided Data
Depth and segmentation map

Skeletal - Provided Data

Vision Algorithm (Summary)

• Quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information
• Object recognition approach: an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem
• Large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc.
• Generate confidence-scored 3D proposals of several body joints by re-projecting the classification result and finding local modes
• System runs at 200 frames per second on consumer hardware
• Evaluation shows high accuracy on both synthetic and real test sets
• State-of-the-art accuracy in comparison with related work and improved generalization over exact whole-skeleton nearest neighbor matching
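The "finding local modes" step can be illustrated with a small mean-shift sketch. This is not the Kinect implementation: it is a minimal Python illustration that assumes each pixel classified as a given body part has already been re-projected to a 3D point with a confidence weight. The function name, the Gaussian kernel, and the bandwidth value are all assumptions made for the example.

```python
import math

def mean_shift_mode(points, weights, start, bandwidth=0.1, iters=50, tol=1e-6):
    """Climb to a local density mode of weighted 3D points (hypothetical helper).

    points  -- list of (x, y, z) positions of pixels labelled as one body part
    weights -- per-point confidence from the classifier
    start   -- initial guess for the joint position
    """
    mode = tuple(start)
    for _ in range(iters):
        num = [0.0, 0.0, 0.0]
        den = 0.0
        for p, w in zip(points, weights):
            d2 = sum((a - b) ** 2 for a, b in zip(p, mode))
            # Gaussian kernel: nearby points pull hard, far outliers barely at all.
            k = w * math.exp(-d2 / (2 * bandwidth ** 2))
            den += k
            for i in range(3):
                num[i] += k * p[i]
        if den == 0.0:
            break
        new_mode = tuple(n / den for n in num)
        shift = math.dist(new_mode, mode)
        mode = new_mode
        if shift < tol:
            break
    return mode
```

Starting the climb from several labelled points and keeping the distinct results would give several proposals per joint; the accumulated kernel weight at each mode could then serve as its confidence score.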

In Practice

• Collect training data – thousands of visits to global households, filming real users; the Hollywood motion capture studio generated billions of images
• Apply state-of-the-art object recognition research
• Apply state-of-the-art real-time semantic segmentation
• Build a training set – classify each pixel’s probability of being in any of 32 body segments, determine the probabilistic cluster of body configurations consistent with those, present the most probable
• Millions of training images, millions of classifier parameters
• Hard to parallelize: new algorithm for distributed decision-tree training
• Fun Fact: Major use of DryadLINQ (large-scale distributed cluster computing)

Motorized Tilt

±28° up / down

The Audio System

Input Stream (what the mic array hears)

Post-MEC (what APIs present)

MEC

Demo: Multichannel Echo Cancellation

Beam Forming / Source Localization

Automatically points to loudest sound source

Manually steer the direction of the listening beam in 10° increments
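Steering in fixed increments means that any requested direction has to be snapped to the nearest allowed beam. The sketch below shows that arithmetic only; it does not call the Kinect audio API. The function name is hypothetical, and the ±50° limit is an assumption for the example (the slide states only the 10° step).

```python
import math

def snap_beam_angle(angle_deg, step=10, limit=50):
    """Clamp a requested angle to [-limit, +limit] and snap it to a multiple of step.

    Exact ties (e.g. 25 with step 10) round toward the larger angle.
    """
    clamped = max(-limit, min(limit, angle_deg))
    return int(math.floor(clamped / step + 0.5)) * step
```

For example, a request for 23° would be served by the 20° beam, and anything beyond the assumed limit is pinned to the outermost beam.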

Speech Recognition

Acoustic model   Language   Countries/Regions
de-DE            German     Germany
en-AU            English    Australia, New Zealand
en-GB            English    Ireland, United Kingdom
en-US            English    Canada, United States
es-ES            Spanish    Spain
es-MX            Spanish    Mexico
fr-CA            French     Canada
fr-FR            French     France
it-IT            Italian    Italy
ja-JP            Japanese   Japan

Kinect Speech Recognition Languages (as of June 2011 SDK)

Sample Grammar (from Simple Speech Recognition sample)

<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals"
         sapi:alphabet="x-microsoft-ups"
         xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
         xml:lang="en-US" root="rootrule" version="1.0">
  <rule id="rootrule" scope="public">
    <one-of>
      <item>view inventory <tag>view inventory</tag> </item>
      <item>show quests <tag>show quests</tag> </item>
      <item>pause game <tag>pause game</tag> </item>
      <item>open <token sapi:pron="S P EH L B UH K">spellbook</token> <tag>open spellbook</tag> </item>
      […]
    </one-of>
  </rule>
</grammar>

Custom pronunciation: the sapi:pron attribute on the <token> element

Property tag: the <tag> element inside each <item>

Sample Grammar: Localized (from Simple Speech Recognition sample)

<?xml version="1.0" encoding="utf-16"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals"
         sapi:alphabet="x-microsoft-ups"
         xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions"
         xml:lang="es-ES" root="rootrule" version="1.0">
  <rule id="rootrule" scope="public">
    <one-of>
      <item>ver inventario <tag>view inventory</tag> </item>
      <item>muestra aventuras <tag>show quests</tag> </item>
      <item>pausa el juego <tag>pause game</tag> </item>
      <item>abre libros de hechizos <tag>open spellbook</tag> </item>
      […]
    </one-of>
  </rule>
</grammar>

Property tag: the <tag> values are the same as in the en-US grammar; only the spoken phrases are localized
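To see how the two sample grammars relate, the sketch below parses a trimmed SRGS grammar and builds a phrase-to-tag table. It uses only Python's standard XML parser, not the Microsoft Speech API; the function name and the shortened two-item grammars are made up for the illustration, and the XML declaration and elided items from the slides are left out so the text parses.

```python
import xml.etree.ElementTree as ET

SRGS = "{http://www.w3.org/2001/06/grammar}"  # SRGS namespace in ElementTree form

HEAD = ('<grammar xmlns="http://www.w3.org/2001/06/grammar" '
        'xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions" '
        'tag-format="semantics/1.0-literals" root="rootrule" version="1.0" '
        'xml:lang="%s"><rule id="rootrule" scope="public"><one-of>')
TAIL = '</one-of></rule></grammar>'

# Trimmed, illustrative copies of the two grammars (two items each).
EN = (HEAD % "en-US"
      + '<item>view inventory <tag>view inventory</tag> </item>'
      + '<item>open <token sapi:pron="S P EH L B UH K">spellbook</token> '
        '<tag>open spellbook</tag> </item>'
      + TAIL)
ES = (HEAD % "es-ES"
      + '<item>ver inventario <tag>view inventory</tag> </item>'
      + '<item>abre libros de hechizos <tag>open spellbook</tag> </item>'
      + TAIL)

def phrase_to_tag(grammar_xml):
    """Return {spoken phrase: semantic tag} for every <item> (hypothetical helper)."""
    mapping = {}
    for item in ET.fromstring(grammar_xml).iter(SRGS + "item"):
        spoken, tag = [item.text or ""], None
        for child in item:
            if child.tag == SRGS + "tag":
                tag = (child.text or "").strip()
            else:
                # e.g. <token>: its text is part of what the user says
                spoken.append("".join(child.itertext()))
            spoken.append(child.tail or "")
        phrase = " ".join("".join(spoken).split())
        mapping[phrase] = tag if tag is not None else phrase
    return mapping
```

Both grammars yield the same set of tag values, which is why application code that switches on the tag can stay unchanged when a localized grammar is swapped in.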

What does this allow?
• Watch, track, and render people’s motion
• Recognize faces and facial expressions
• Recognize voices, words, and tone

What else? What is coming?
• Background Removal
• Seated Skeletal Tracking
• Finger Tracking
• Head Tracking
• Digital Object Creation
• ???? = You

Emotionally Aware

The Kinect SDK

Provides both Unmanaged and Managed API
• Unmanaged API – Concepts work in C++
• Managed API – Concepts work in both VB/C#

Samples & documentation to get you started

Assumes some programming experience

http://kinectforwindows.org

What’s in the SDK?

Raw sensor streams
Access to raw data streams from the depth sensor, color camera sensor, and four-element microphone array enables developers to build upon the low-level streams that are generated by the Kinect sensor.

Skeletal tracking
The capability to track the skeleton image of one or two people moving within the Kinect field of view makes it easy to create gesture-driven applications.

Advanced audio capabilities
Audio processing capabilities include sophisticated acoustic noise suppression and echo cancellation, beam formation to identify the current sound source, and integration with the Windows speech recognition API.

Sample code and documentation
The SDK includes more than 100 pages of technical documentation. In addition to built-in help files, the documentation includes detailed walkthroughs for most samples provided with the SDK.

Easy installation
The SDK installs quickly, requires no complex configuration, and the complete installer size is less than 100 MB. Developers can get up and running in just a few minutes with a standard standalone Kinect sensor unit (widely available at retail outlets).

Designed for non-commercial purposes; a commercial version is expected later.

Windows 7 – C++, C#, or Visual Basic in Microsoft Visual Studio 2010.

Windows SDK: Architecture

The Tools

• Visual Studio 2010 (Express or other)
• .NET Framework 4.0
• Kinect for Windows SDK
• Microsoft Speech Framework
• Coding4Fun Kinect Toolkit
• XNA Game Studio

Let’s Get Started!

… Literally!

Runtime nui = new Runtime();
Runtime.Initialize();

And

nui.VideoStream.Open(…);
nui.DepthStream.Open(…);

… and then later:

Runtime.Shutdown();

Gimme Data

Polling Method
nui.DepthStream.GetNextFrame(timeout)
nui.VideoStream.GetNextFrame(timeout)
nui.SkeletonEngine.GetNextFrame(timeout)

Event Method
nui.DepthFrameReady += new EventHandler<ImageFrameReadyEventArgs>(fn)
nui.SkeletonFrameReady += new EventHandler<SkeletonFrameReadyEventArgs>(fn)
nui.VideoFrameReady += new EventHandler<ImageFrameReadyEventArgs>(fn)

What you get

Color Image:
• Byte array in B8G8R8 format by default
• Also supports YUV

Depth Image:
• 13 high-order bits contain the distance in mm
• 3 low-order bits contain the player index
• Only valid values are 0, 1, 2
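The depth layout described above (13 high-order bits of distance, 3 low-order bits of player index) can be unpacked with a shift and a mask. The sketch below is plain Python rather than SDK code. It assumes each depth pixel arrives as two bytes in little-endian order, and the function names are made up for the example.

```python
def unpack_depth_pixel(lo_byte, hi_byte):
    """Split one 16-bit depth pixel into (distance_mm, player_index).

    Assumes little-endian byte order: lo_byte comes first in the frame buffer.
    """
    value = lo_byte | (hi_byte << 8)
    distance_mm = value >> 3        # 13 high-order bits
    player_index = value & 0b111    # 3 low-order bits
    return distance_mm, player_index

def unpack_depth_frame(frame_bytes):
    """Unpack a whole frame buffer (two bytes per pixel) into a list of pairs."""
    return [unpack_depth_pixel(frame_bytes[i], frame_bytes[i + 1])
            for i in range(0, len(frame_bytes) - 1, 2)]
```

A pixel 1500 mm away that belongs to player 2 would be stored as (1500 << 3) | 2, and unpacking it returns the pair (1500, 2).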

Speech

Start the Recognition Engine
var sre = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));

Create a grammar
Choices colors = new Choices();
colors.Add(new string[] {"red"});
colors.Add(new string[] {"green"});
colors.Add(new string[] {"blue"});
GrammarBuilder gb = new GrammarBuilder();
gb.Append(colors);
// Create the Grammar instance.
Grammar g = new Grammar(gb);
sre.LoadGrammar(g);

Speech – Grammar From File

FileStream fs = new FileStream(grammarPath + "foobar.cfg", FileMode.Create);
SrgsGrammarCompiler.Compile(grammarPath + "foobar.grxml", (Stream)fs);
fs.Close();
Grammar g = new Grammar(grammarPath + "foobar.cfg", "rootrule");

Speech – cont

Register for speech events
sre.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(sre_SpeechRecognized);

and

void sre_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    MessageBox.Show(e.Result.Text);
}

Show me the demo!

Kinect Fun Labs…


Q & A

Bueller?…Bueller?

Thank You! Obrigado!

Forums: http://kinectforwindows.org/resources

Email me jeff.sipko@microsoft.com
