Acquiring 3D Indoor Environments with Variability and Repetition




Acquiring 3D Indoor Environments with Variability and Repetition
Young Min Kim, Stanford University
Niloy J. Mitra, UCL / KAUST
Dong-Ming Yan, KAUST
Leonidas Guibas, Stanford University

Thank you for the introduction. My name is Young Min Kim, and it is a great honor to be here. I am going to talk about acquiring 3D indoor environments with variability and repetition. This is joint work with Niloy Mitra, Dong-Ming Yan, and Leonidas Guibas.

Data Acquisition via Microsoft Kinect
Raw data:
  • Noisy point clouds
  • Unsegmented
  • Occlusion issues

Our tool: Microsoft Kinect
  • Real-time
  • Provides depth and color
  • Small and inexpensive

Our work uses data collected from the Microsoft Kinect, a revolutionary device that provides real-time depth and color information at an affordable price. However, the raw data collected from the device is noisy, unsegmented, and suffers from occlusion.

Dealing with Point Cloud Data
Object-level reconstruction

Scene-level reconstruction

[Chang and Zwicker 2011]

[Xiao et al. 2012]

We use data from the Microsoft Kinect to reconstruct indoor environments. There has been a lot of work in the community on reconstruction from point cloud data; only a few representative works are shown here. At the object level, detailed models with joints can be reconstructed. At the scene level, the faces of buildings can be reconstructed by focusing on flat walls. We propose a system that combines object-level and scene-level reconstruction.

Mapping Indoor Environments
Mapping outdoor environments:
  • Roads to drive vehicles
  • Flat surfaces
General indoor environments contain both objects and flat surfaces:
  • Diversity of objects of interest
  • Objects are often cluttered
  • Objects deform and move

Solution: Utilize semantic information

Scene-level reconstruction is similar to outdoor mapping technology: there are roads to drive vehicles on, and the reconstruction is largely based on flat surfaces.

However, we would like to map general indoor environments, where objects exist in addition to flat surfaces. Multiple objects are present at the same time, often in cluttered settings. In addition, objects deform and move around as people interact with them every day.

This is a very challenging situation. Instead of trying to reconstruct the environment as accurately as possible, we propose a lightweight system that utilizes semantic information.

Nature of Indoor Environments
  • Man-made objects can often be well approximated by simple building blocks: geometric primitives and low-DOF joints

  • Many repeating elements: chairs, desks, tables, etc.

  • Relations between objects give good recognition cues

To obtain the semantic information we utilize, we made a few observations. First, most man-made objects can be explained by simple geometric primitives and joints with low degrees of freedom. Second, an indoor environment is often composed of repeating elements: the same chairs, the same desks, and so on. Third, the relationship of objects to the basic elements of the environment gives a powerful prior for locating the objects in noisy and sparse data.

Our approach builds a model of the repeating elements and uses it to quickly locate the objects of interest.

Indoor Scene Understanding with Point Cloud Data
Patch-based approach

Object-level understanding

[Silberman et al. 2012]

[Koppula et al. 2011]

[Shao et al. 2012]

[Nan et al. 2012]

There have been a few other recent attempts at indoor scene understanding with point cloud data. The first group of approaches uses patches to train graphical models that find either object labels or support relationships. The second group aims for object-level understanding with the help of 3D models, reasoning at the object level; these approaches were presented earlier in this session.

Comparisons
[1] An Interactive Approach to Semantic Modeling of Indoor Scenes with an RGBD Camera
[2] A Search-Classify Approach for Cluttered Indoor Scene Understanding

               [1]                [2]                  Ours
Prior model    3D database        3D database          Learned
Deformation    Scaling            Part-based scaling   Learned
Matching       Classifier        Classifier           Geometric
Segmentation   User-assisted      Iteration            Iteration
Data           Microsoft Kinect   Mantis Vision        Microsoft Kinect

Since very similar papers are being presented during this session, especially the previous two, here is my understanding of the three approaches. Our learning stage builds a proxy model of the specific environment instead of using a 3D database, as the previous two papers do. While the learned models are limited to the specific environment, we have the advantage of knowing the exact dimensions and the deformation parameters of the models; the other papers cover only scaling. By knowing the correct dimensions of the parts, we can use a purely geometric approach instead of a machine-learning-based one, quickly recognize the objects, and automatically refine the segmentation even with noisy data.

Contributions
  • A novel approach based on a learning stage
      • The learning stage builds a model specific to the environment
      • Builds an abstract model composed of simple parts and the relationships between parts
      • Uniquely explains possible low-DOF deformations
  • The recognition stage can quickly acquire large-scale environments
      • About 200 ms per object

In short, we present a novel approach based on a learning stage that builds a model specific to the environment. During the learning stage, we build an abstract model composed of simple parts and the relationships between them. Using these relationships, we keep the model simple while still uniquely explaining the low-degree-of-freedom deformations. After the model is built, the recognition stage can quickly acquire a large-scale environment, at a speed of about 200 ms per object.


Approach
  • Learning: Build a high-level model of the repeating elements
  • Recognition: Use the model and relationships to recognize the objects
(joint types: translational, rotational)

Now let's take a look at the approach in more detail. We take an approach comprised of two stages. First is the learning stage, where we build a high-level model of the repeating elements. Then, during the recognition stage, we use the model and relationships to recognize objects.
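The translational and rotational joints mentioned above each carry a single degree of freedom. As a rough sketch (the hinge convention and function name are mine, not the authors'), applying a 1-DOF rotational joint to a point of a part might look like:

```python
import math

# Illustrative only: a rotational joint parameterized by one angle.
# The child part (e.g. a chair back) rotates about a hinge axis that
# we assume here runs along x through a pivot point.

def rotate_about_x(point, pivot, angle):
    """Rotate `point` by `angle` radians about the x-axis line through `pivot`."""
    x, y, z = (point[0] - pivot[0], point[1] - pivot[1], point[2] - pivot[2])
    c, s = math.cos(angle), math.sin(angle)
    y2, z2 = c * y - s * z, s * y + c * z
    return (x + pivot[0], y2 + pivot[1], z2 + pivot[2])
```

A translational joint is even simpler: a single offset along a fixed direction. In both cases, one scalar per joint describes the deformation.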

(Intuitively, we utilize a prior based on primitives and joint information to overcome the limitations of the low-quality data.)

Approach
  • Learning: Build a high-level model of the repeating elements

Let's start with the learning stage. The input is a set of registered point sets of the object of interest, which we process to build a model. The models are simple and lightweight, and we will show that they are a useful abstraction for dealing robustly with noisy data.

Output Model: A Simple, Lightweight Abstraction
  • Primitives
      • Observable faces
  • Connectivity
      • Rigid
      • Rotational
      • Translational
      • Attachment
  • Relationships
      • Placement information


The output model is composed of primitives and the connectivity between them. For the connectivity information, we choose among rigid, rotational, translational, and attachment. We also incorporate possible placement information, such as the repeating direction and contact information.
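As a rough illustration of this abstraction (the class and field names are my own, not from the paper), the learned model might be represented as:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the output model: primitives with observable
# faces, joints between them, and placement information.

@dataclass
class Primitive:
    name: str
    size: tuple             # (width, depth, height) of the primitive
    observable_faces: list  # faces likely visible to the scanner

@dataclass
class Joint:
    kind: str               # "rigid" | "rotational" | "translational" | "attachment"
    parent: str
    child: str
    value: float = 0.0      # the single scalar of a 1-DOF deformation

@dataclass
class ObjectModel:
    primitives: list = field(default_factory=list)
    joints: list = field(default_factory=list)
    placement: dict = field(default_factory=dict)  # e.g. repeating direction, contacts

# Example: a chair with a hinged back
chair = ObjectModel(
    primitives=[Primitive("seat", (0.45, 0.45, 0.05), ["top"]),
                Primitive("back", (0.45, 0.05, 0.50), ["front"])],
    joints=[Joint("rotational", "seat", "back", value=0.1)],
    placement={"contact": "ground"},
)
```

The point of the abstraction is that such a model is tiny compared to raw scans, yet it pins down part dimensions and the allowed deformations.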

Joint Matching and Fitting
  • Individual segmentation
      • Group by similar normals
  • Initial matching
      • Focus on large parts
      • Use size, height, and relative positions
      • Keep consistent matches
  • Joint primitive fitting
      • Add joints if necessary
      • Incrementally complete the model

During the learning stage, we jointly match parts and fit primitives across multiple instances of the same object. We first roughly segment the individual inputs by grouping neighboring points with similar normals.

We then look at the large parts and match them across multiple measurements using the size, height, and relative positions of the parts. We keep only the matches that are consistent across multiple measurements.

Then we fit primitives jointly to the matched parts. When jointly fitting the primitives, we also consider relative size and position, and introduce 1-DOF translations or 1-DOF rotations as deformation parameters. We incrementally add the remaining primitives to the model until it is complete.
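The rough segmentation step above (grouping neighbors with similar normals) can be sketched as simple region growing. The thresholds and the brute-force neighbor search are illustrative only; a real implementation would use a spatial index:

```python
import math

# Greedily grow regions over points whose normals are nearly parallel.
# points: list of (x, y, z); normals: list of unit (nx, ny, nz).

def segment_by_normals(points, normals, dist_thresh=0.1, dot_thresh=0.95):
    """Group neighboring points with similar normals into parts."""
    n = len(points)
    labels = [-1] * n
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] != -1:
                    continue
                close = math.dist(points[i], points[j]) < dist_thresh
                dot = sum(a * b for a, b in zip(normals[i], normals[j]))
                if close and dot > dot_thresh:  # nearby and nearly parallel
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

For example, points on a horizontal seat and points on a vertical back panel end up in different regions even when they touch, because their normals disagree.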

Approach
  • Learning: Build a high-level model of the repeating elements

After the learning stage, we have a high-level model of the repeating elements.

Approach
  • Learning: Build a high-level model of the repeating elements
  • Recognition: Use the model and relationships to recognize the objects

During the recognition stage, we recognize the modeled objects in the scene. The process is challenging due to measurement noise, object deformation, and ambiguous segmentation, but we can nevertheless arrive at a high-level understanding of the environment.

Hierarchy
  • Ground plane and desk
  • Objects: isolated clusters
  • Parts: grouped by normals

The segmentation is approximate and is corrected later.

The input to the recognition stage is collected from the scene where the modeled objects reside. We first extract the ground plane, which is the most dominant plane, and the desktop, which is the second most dominant plane parallel to the ground.

The remaining points are clustered as candidate objects o1, o2, etc.

Individual candidate objects are segmented further by grouping points with similar normals, shown as p's.
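Assuming the ground and desktop are horizontal, the first two levels of this hierarchy can be sketched with a toy height histogram. The actual system extracts dominant planes; the function and bin size below are illustrative only:

```python
from collections import Counter

# Quantize point heights, take the most populated height as the ground
# and the second as the desktop, and keep the rest as candidate objects.

def extract_planes(points, bin_size=0.05):
    bins = Counter(round(p[2] / bin_size) for p in points)
    (ground_bin, _), (desk_bin, _) = bins.most_common(2)
    ground, desk, objects = [], [], []
    for p in points:
        b = round(p[2] / bin_size)
        if b == ground_bin:
            ground.append(p)
        elif b == desk_bin:
            desk.append(p)
        else:
            objects.append(p)  # clustered into o1, o2, ... downstream
    return ground, desk, objects
```

The leftover `objects` points would then be split into isolated clusters, and each cluster into parts by normal grouping, giving the three-level hierarchy above.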

Note that the segmentation at the different levels of the hierarchy is approximate and is corrected later.

Bottom-Up Approach
  • Initial assignment: parts vs. primitives
      • Simple comparison of height, normal, and size
      • Robust to deformation
      • Low false negatives
  • Refined assignment: objects vs. models
      • Iteratively solve for position, deformation, and segmentation
      • Low false positives
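A minimal sketch of the initial assignment, assuming each part and each model primitive is summarized by a height, a normal, and a size. All names and thresholds are illustrative guesses, chosen loosely so that true matches are rarely rejected (low false negatives):

```python
# Compare a segmented part against a model primitive using only
# height, normal direction, and size, with loose tolerances.

def matches(part, primitive, height_tol=0.15, size_tol=0.25, dot_thresh=0.9):
    """part/primitive: dicts with 'height', 'normal', 'size' entries."""
    dot = abs(sum(a * b for a, b in zip(part["normal"], primitive["normal"])))
    return (abs(part["height"] - primitive["height"]) < height_tol
            and dot > dot_thresh
            and abs(part["size"] - primitive["size"]) / primitive["size"] < size_tol)

# Toy example: an observed seat-like part against two primitives.
seat_obs  = {"height": 0.44, "normal": (0, 0, 1), "size": 0.20}
seat_prim = {"height": 0.45, "normal": (0, 0, 1), "size": 0.20}
back_prim = {"height": 0.80, "normal": (0, 1, 0), "size": 0.22}
```

The surviving candidates are then refined at the object level, where iterating over position, deformation, and segmentation weeds out the false positives.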

To quickly locate the