A Lightweight Approach for On-the-Fly Reflectance Estimation

Kihwan Kim¹  Jinwei Gu¹  Stephen Tyree¹  Pavlo Molchanov¹  Matthias Nießner²  Jan Kautz¹

¹NVIDIA  ²Technical University of Munich

http://research.nvidia.com/publication/reflectance-estimation-fly

Abstract

Estimating surface reflectance (BRDF) is one key component for complete 3D scene capture, with wide applications in virtual reality, augmented reality, and human-computer interaction. Prior work is either limited to controlled environments (e.g., gonioreflectometers, light stages, or multi-camera domes), or requires the joint optimization of shape, illumination, and reflectance, which is often computationally too expensive (e.g., hours of running time) for real-time applications. Moreover, most prior work requires HDR images as input, which further complicates the capture process. In this paper, we propose a lightweight approach for surface reflectance estimation directly from 8-bit RGB images in real time, which can be easily plugged into any 3D scanning-and-fusion system with a commodity RGBD sensor. Our method is learning-based, with an inference time of less than 90 ms per scene and a model size of less than 340 KB. We propose two novel network architectures, HemiCNN and Grouplet, to deal with the unstructured input data from multiple viewpoints under unknown illumination. We further design a loss function to resolve the color-constancy and scale ambiguities. In addition, we have created a large synthetic dataset, SynBRDF, which comprises a total of 500K RGBD images rendered with a physically based ray tracer under a variety of natural illumination, covering 5000 materials and 5000 shapes. SynBRDF is the first large-scale benchmark dataset for reflectance estimation. Experiments on both synthetic and real data show that the proposed method effectively recovers surface reflectance and outperforms prior work for reflectance estimation in uncontrolled environments.

1. Introduction

Capturing scene properties in the wild, including 3D geometry and surface reflectance, is one of the ultimate goals of computer vision, with wide applications in virtual reality, augmented reality, and human-computer interaction. While 3D geometry recovery has achieved high accuracy, especially with recent RGBD-based scanning-and-fusion approaches [7, 22, 33], surface reflectance estimation still remains challenging. At one extreme, most fusion methods assume Lambertian reflectance and recover surface texture only. At the other extreme, most prior work on surface BRDF (bidirectional reflectance distribution function) estimation [14, 20, 19] aims to recover the full 4D BRDF function, but is often limited to controlled, studio-like environments (e.g., gonioreflectometers, light stages, multi-camera domes, planar samples).

Figure 1. Overview of our method: we take RGBD image sequences as input (a). During the reconstruction process, each view contributes voxels as observations. In (b), we visualize the color of visible voxel samples from a specific view (the red circle among red dots indicating the locations of other views). These samples from all views are evaluated with either the HemiCNN (Sec. 3.3.1) or the Grouplet network (Sec. 3.3.2) to estimate the BRDF in real time. In (c), we show a rendered image with the predicted BRDF and the captured lighting and shape. More examples are shown in Sec. 4.

Recently, a few methods [35, 18, 27, 26, 30, 17, 1] were proposed to recover surface reflectance in uncontrolled environments (e.g., under unknown illumination or shape) by utilizing statistical priors on natural illumination and/or BRDF. These methods formulate the inverse rendering problem as a joint, alternating optimization over shape, reflectance, and/or lighting. Despite their accuracy, these methods are computationally quite expensive (e.g., hours or days of running time and tens of gigabytes of memory consumption) and are often run as a post-process rather than in a real-time setting. Moreover, these methods often require high-resolution HDR images as input, which further complicates the capture process for real-time or interactive applications on consumer-grade mobile devices.


In this paper, we propose a lightweight and practical approach for surface reflectance estimation directly from 8-bit RGB images in real time. The method can be easily plugged into any 3D scanning-and-fusion system with a commodity RGBD sensor and enables physically plausible renderings under novel lighting and viewing conditions, as shown in Fig. 1. Similar to prior work [35, 18], we use a simplified BRDF representation and focus on estimating surface albedo and gloss rather than the full 4D BRDF function. Our method is learning-based, with an inference time of less than 90 ms per scene and a model size of less than 340 KB.

In order to deal with unstructured input data (e.g., each surface point can have a different number of observations due to occlusions) in the context of neural networks, we propose two novel network architectures, HemiCNN and Grouplet. HemiCNN projects and interpolates the sparse observations onto a 2D image, which enables the use of standard convolutional neural networks. Grouplet learns directly from random samples of observations and uses multilayer perceptron networks. Both networks are designed to be lightweight in both inference time and model size for real-time applications on consumer-grade mobile devices. In addition, since the illumination is unknown, we have designed a novel loss function to resolve the color-constancy and scale ambiguities (i.e., given only the input images, we cannot tell whether the surfaces are reddish or the lighting is reddish, or whether the surfaces are dark or the lighting is dim). To the best of our knowledge, this is the first lightweight, real-time approach for surface reflectance estimation in the wild.

We also created a large-scale synthetic dataset, SynBRDF, for reflectance estimation. SynBRDF covers 5000 materials randomly sampled from OpenSurfaces [3], 5000 shapes from ScanNet [6], and a total of 500K RGBD images (both HDR and LDR) rendered from multiple viewpoints with a physically based ray tracer under 20 natural environmental illumination conditions, making it an ideal benchmark dataset for complete image-based 3D scene reconstruction.

Finally, we incorporated the proposed approach into an RGBD scanning-and-fusion pipeline for complete 3D scene capture (see Sec. 4 and Fig. 8). We trained our networks with SynBRDF and directly applied the trained models to real data captured with a commodity RGBD sensor. Experiments on both synthetic and real data show that the proposed method effectively recovers surface reflectance and outperforms prior work for surface reflectance estimation in uncontrolled environments.

2. Related Work

Surface Reflectance Estimation in Uncontrolled Environments. Most prior work in this direction formulates the inverse rendering problem as a joint optimization over the three radiometric ingredients (lighting, geometry, and reflectance) from observed images. Barron et al. [1] assume Lambertian surfaces and optimize all three components. Others [30, 8, 17, 9] optimize reflectance and illumination with known 3D geometry, either from motion or based on statistical priors on natural illumination and materials. Recently, Wu et al. [35] and Lombardi et al. [18] proposed to jointly estimate lighting, reflectance, and 3D shape from an RGBD sensor, even in the presence of interreflection. Chandraker et al. [5] investigated the theoretical limits of material estimation from a single image. Despite their effectiveness, these methods solve complicated optimization problems iteratively, which is computationally too expensive for real-time applications (e.g., hours of running time). As the optimization relies heavily on the parametric forms of statistical priors, these methods generally require good initialization and HDR images as input. In contrast, our proposed method is a lightweight and practical approach that can estimate surface reflectance directly from 8-bit RGB images on the fly, which is suitable for real-time applications.

Material Perception and Recognition. Our work is also inspired by prior work on material perception and recognition from images. Pellacini et al. [28] designed a perceptually uniform representation of the isotropic Ward BRDF model [32], and Wills et al. [34] extended it to data-driven models with measured reflectance. Fores et al. [11] studied the metrics used for BRDF fitting [23]. Fleming et al. [10] found that natural illumination is key for the perception of surface reflectance. Bell et al. [3] released a large dataset, OpenSurfaces, with annotated surface appearance from real-world materials. These prior works inspired us in designing the regression loss and in creating a synthetic dataset for training. For learning-based material recognition, Liu et al. [16] proposed a Bayesian approach based on a bag of visual features. Bell et al. [4] used CNNs (convolutional neural networks) for material recognition from material context input. Recently, Wang et al. [31] proposed a CNN-based method for material recognition from light-field images. These prior works show that neural networks are capable of learning discriminative features for material perception from images.

Reflectance Map Estimation and Intrinsic Image Decomposition. Intrinsic image decomposition aims to factor an input image into a shading-only image and a reflectance-only image. Recently, CNNs have been successfully employed for intrinsic image decomposition [15, 21] from a single image. Bell et al. [2] proposed a dense CRF-based method and released a large intrinsic image dataset generated by crowdsourcing. Zhou et al. [36] used deep learning to infer data-driven priors from sparse human annotations. Rematas et al. [29] used CNNs to estimate reflectance maps (defined as 2D reflectance under fixed, unknown illumination) from a single image. These methods recover only the illumination-dependent reflectance map, while our method estimates the full BRDF, which enables rendering under novel illumination and viewing conditions.

Figure 2. Overview of our framework: (a) BRDF examples from OpenSurfaces [3]; (b) input image Ii and depth Di streams. The integrated volume is shown in (c), where the colors seen from the i-th view Ii (the red circle with pose Ti) are visualized; small red dots mark the locations of other views. (d) shows the data that we extract from each voxel vk for training: the normal nk, the observation vector oi (the green arrow), and the color value Cik from observation Ii at voxel vk. In (e), these measurements, together with the color statistics (Fi and Bi), are fed into one of the two networks, HemiCNN (Sec. 3.3.1) or Grouplet (Sec. 3.3.2), for BRDF estimation.

3. Method

Our goal is to develop a module for real-time surface reflectance estimation that can be plugged into any 3D scanning-and-fusion method for 3D scene capture, with potential applications in VR/AR. In this paper, we make a significant step towards this goal and propose two novel networks for homogeneous surface reflectance estimation from RGBD image input.

3.1. Framework and Reflectance Model

Our framework takes as input RGBD image/depth sequences from a commodity depth sensor (Fig. 1). We denote an RGB color observation as C : Ω_C → R³, images with I : Ω_I → R³, and depth maps with D : Ω_D → R. The N acquired RGBD frames consist of RGB color images Ii and depth maps Di (with frame index i ∈ 1...N). We also denote the absolute camera poses Ti = (R, t) ∈ SE(3), with t ∈ R³ and R ∈ SO(3), of the respective frames, which are computed with a standard volume-based pose-estimation algorithm [7]. (During training, we randomly generated poses for rendering scenes.) As shown in Fig. 2, the inputs Ii and Di are aligned and integrated into a 3D volume A with signed-distance fusion [24], from which we extract voxels vk ∈ A (k ∈ 1...M) that contain the observed color Cik from the corresponding view Ii, the surface normal nk, and the camera orientation oi; see Fig. 2(d). Additionally, we compute color statistics for each view by simply taking the average of the foreground and background pixels in Ii, denoted Fi and Bi, respectively. We will discuss these further in Sec. 3.2.
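To make the data preparation concrete, the following sketch shows how the per-voxel observations (Cik, oi) and the per-view statistics (Fi, Bi) could be gathered from a reconstructed volume. It is an illustrative approximation, not the paper's fusion code; the pinhole projection, the 5 cm occlusion tolerance, and the helper name are our own assumptions.

```python
import numpy as np

def gather_voxel_observation(voxel_pos, voxel_normal, image, depth, K, T_cw, fg_mask):
    """Collect one view's observation for one voxel.

    voxel_pos, voxel_normal : (3,) world-space position and surface normal n_k
    image  : (H, W, 3) 8-bit RGB frame I_i
    depth  : (H, W) depth map D_i (meters)
    K      : (3, 3) pinhole intrinsics (assumed known)
    T_cw   : (4, 4) world-to-camera pose of frame i
    fg_mask: (H, W) boolean foreground mask of the scanned object
    Returns (C_ik, o_i, F_i, B_i), or None if the voxel is not visible.
    """
    # Observation direction o_i: from the voxel towards the camera center.
    cam_center = -T_cw[:3, :3].T @ T_cw[:3, 3]
    o_i = cam_center - voxel_pos
    o_i /= np.linalg.norm(o_i)

    # Back-facing voxels cannot be observed from this view.
    if np.dot(o_i, voxel_normal) <= 0.0:
        return None

    # Project the voxel into the image (simple pinhole model).
    p_cam = T_cw[:3, :3] @ voxel_pos + T_cw[:3, 3]
    if p_cam[2] <= 0:
        return None
    uvw = K @ p_cam
    u, v = int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))
    H, W = depth.shape
    if not (0 <= u < W and 0 <= v < H):
        return None

    # Crude occlusion test against the depth map (5 cm tolerance).
    if abs(depth[v, u] - p_cam[2]) > 0.05:
        return None

    C_ik = image[v, u].astype(np.float32) / 255.0         # observed color
    F_i = image[fg_mask].reshape(-1, 3).mean(0) / 255.0    # foreground mean
    B_i = image[~fg_mask].reshape(-1, 3).mean(0) / 255.0   # background mean
    return C_ik, o_i, F_i, B_i
```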

For the representation of surface reflectance, similar to prior work [35, 18], we choose a parametric BRDF model, the isotropic Ward BRDF model [32], for two reasons: (1) the Ward BRDF model has a simple form but is representative of a wide variety of real-world materials [23], and (2) prior studies [28, 34] on BRDF perception are based on the Ward BRDF model. Specifically, the isotropic Ward BRDF model is given by:

    f(ωi, ωo; Θ) = ρd/π + ρs · exp(−tan²θh / α²) / (4πα² √(cos θi cos θo)),    (1)

where ωi = (θi, φi) and ωo = (θo, φo) are the incident and viewing directions, θh is the half angle, and Θ = (ρd, ρs, α) are the parameters to be estimated.
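As a reference for Eq. (1), here is a minimal NumPy sketch of the isotropic Ward BRDF evaluated for given incident and viewing angles; it is our own illustration of the formula, not code from the paper.

```python
import numpy as np

def ward_brdf(theta_i, theta_o, theta_h, rho_d, rho_s, alpha):
    """Isotropic Ward BRDF (Eq. 1), evaluated per color channel.

    theta_i, theta_o : incident / viewing polar angles (radians)
    theta_h          : half angle between incident and viewing directions
    rho_d, rho_s     : diffuse and specular albedo (shape (3,) for RGB)
    alpha            : surface roughness
    """
    rho_d = np.asarray(rho_d, dtype=np.float64)
    rho_s = np.asarray(rho_s, dtype=np.float64)
    diffuse = rho_d / np.pi
    specular = rho_s * np.exp(-np.tan(theta_h) ** 2 / alpha ** 2) / (
        4.0 * np.pi * alpha ** 2 * np.sqrt(np.cos(theta_i) * np.cos(theta_o))
    )
    return diffuse + specular

# Example: a slightly glossy red material viewed near the mirror direction.
f = ward_brdf(theta_i=0.4, theta_o=0.4, theta_h=0.02,
              rho_d=[0.6, 0.1, 0.1], rho_s=[0.05, 0.05, 0.05], alpha=0.1)
```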

An equivalent, but perceptually uniform, representation of the Ward BRDF model was proposed in [28], where the diffuse albedo ρd is converted from RGB to the CIE Lab color space, (L, a, b), and the gloss is described by the variables c, the contrast of gloss, and d, the distinctness of gloss. The variables c and d are related to the BRDF parameters by [28]:

    c = ∛(ρs + ρd/2) − ∛(ρd/2),    d = 1 − α.    (2)

Thus, an alternative representation for the BRDF parameters is Θ = (L, a, b, c, d).
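The gloss variables of Eq. (2) are straightforward to compute; the short sketch below is our own illustration (the Lab conversion of ρd is assumed to come from any standard sRGB-to-CIELab routine and is not shown).

```python
import numpy as np

def gloss_contrast_distinctness(rho_d, rho_s, alpha):
    """Perceptual gloss terms of Pellacini et al. [28] (Eq. 2).

    rho_d, rho_s : diffuse / specular albedo (scalars or per-channel arrays)
    alpha        : Ward roughness parameter
    Returns (c, d): contrast of gloss and distinctness of gloss.
    """
    rho_d = np.asarray(rho_d, dtype=np.float64)
    rho_s = np.asarray(rho_s, dtype=np.float64)
    c = np.cbrt(rho_s + rho_d / 2.0) - np.cbrt(rho_d / 2.0)
    d = 1.0 - alpha
    return c, d

c, d = gloss_contrast_distinctness(rho_d=0.3, rho_s=0.05, alpha=0.1)
# c ≈ 0.053 and d = 0.9 for this mildly glossy example.
```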

Our problem is thus formulated as follows. Given a set of voxels from any 3D scanning-and-fusion pipeline, vk = {Cik, oi, nk}, we estimate the optimal BRDF parameters Θ with neural networks. Two problems need to be solved for learning. First, what is a good loss function that can resolve the color-constancy and scale ambiguities due to unknown illumination? For example, from the input images alone, we cannot tell whether the material is reddish or the illumination is reddish, or whether the material is dark or the illumination is dim. Second, the input data is unstructured: different voxels have different numbers of observations due to occlusion. What is a good network architecture for such unstructured input data? We address these two problems in the following sections.

Table 1. Three options for the distance function Ed(Θ, Θ̂) for BRDF estimation. RMSE1 and RMSE2 are root mean squared errors using Θ = (ρd, ρs, α) and Θ = (L, a, b, c, d), respectively, where the latter is the sum of the perceptual color difference and the perceptual gloss difference (λg = 1) [28]. Cubic Root is the cosine-weighted ℓ2-norm of the difference of two 4D BRDF functions, inspired by BRDF fitting [23, 11].

    Name        Ed(Θ, Θ̂)
    RMSE1       ‖ρd − ρ̂d‖² + ‖ρs − ρ̂s‖² + ‖α − α̂‖²
    RMSE2       ‖(L, a, b) − (L̂, â, b̂)‖² + λg ‖(c, d) − (ĉ, d̂)‖²
    Cubic Root  ∛( ∫ωi,ωo ‖f(ωi, ωo; Θ) − f(ωi, ωo; Θ̂)‖ cos θi dωi dωo )

3.2. Design of the Loss Function

A key part of network training is an appropriate loss function. Prior work [23, 11] has shown that the commonly used ℓ2 norm (i.e., MSE) is not optimal for BRDF fitting. We design the following loss:

    J = Ed(Θ, Θ̂) + λ Ec(Θ̂, {Cik}),    (3)

where Ed(·, ·) measures the discrepancy between the estimated BRDF parameters and the ground truth, and Ec(·, ·) is a regularization term that relates the estimated reflectance Θ̂ to the observed image intensities Cik. Ec aims to resolve the aforementioned scale and color-constancy ambiguities and is weighted by λ = 0.01 in all our experiments.

Table 1 lists the three options for Ed(Θ, Θ̂) implemented in this paper. RMSE1 and RMSE2 are the root mean squared errors with Θ = (ρd, ρs, α) and Θ = (L, a, b, c, d), respectively, where the latter is the sum of the perceptual color difference and the perceptual gloss difference (λg = 1) [28]. Cubic Root is inspired by BRDF fitting [23, 11] and is a cosine-weighted ℓ2-norm of the difference between two 4D BRDF functions.
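For illustration, here is a minimal sketch of the two parameter-space distances from Table 1 (our own code, written against the parameterizations defined above; the Cubic Root option would additionally require integrating the Ward BRDF over incident and viewing directions and is omitted).

```python
import numpy as np

def rmse1(theta_pred, theta_gt):
    """RMSE1: squared distance in the (rho_d, rho_s, alpha) parameterization.

    theta_* : dicts with 'rho_d' (3,), 'rho_s' (3,), 'alpha' (scalar).
    """
    return (np.sum((theta_pred["rho_d"] - theta_gt["rho_d"]) ** 2)
            + np.sum((theta_pred["rho_s"] - theta_gt["rho_s"]) ** 2)
            + (theta_pred["alpha"] - theta_gt["alpha"]) ** 2)

def rmse2(theta_pred, theta_gt, lambda_g=1.0):
    """RMSE2: perceptual color difference in CIE Lab plus gloss difference (c, d).

    theta_* : dicts with 'Lab' (3,) and 'cd' (2,).
    """
    color = np.sum((theta_pred["Lab"] - theta_gt["Lab"]) ** 2)
    gloss = np.sum((theta_pred["cd"] - theta_gt["cd"]) ** 2)
    return color + lambda_g * gloss
```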

For Ec, we use the color statistics computed for each view, Fi and Bi, to approximately constrain the estimation of ρd and ρs. Specifically, Ec is derived from the rendering equation [13] as

    Ec = Σi ‖(ρd + ρs) · Bi^γ − Fi^γ‖²,    (4)

where Fi and Bi are the average image intensities of the foreground and background regions of the i-th input image Ii, and γ = 2.4 is used to convert the input 8-bit RGB images to linear images; see the Appendix for a detailed derivation. Even though Eq. (4) is only an approximation of the rendering equation, it imposes a soft constraint on the scale and color cues for the BRDF estimation. We found it quite effective when working with real data (see Fig. 8).

Figure 3. Details of HemiCNN. Top row: generating a voxel hemisphere image, from the sparse 3D set of observations Cik of voxel vk to a dense 2D hemisphere representation. Bottom row: the HemiCNN siamese convolutional neural network architecture.
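A small sketch of the regularizer in Eq. (4), assuming per-view foreground/background means Fi, Bi in [0, 1] and predicted albedos from the network (illustrative only):

```python
import numpy as np

def color_scale_regularizer(rho_d, rho_s, fg_means, bg_means, gamma=2.4):
    """E_c of Eq. (4): soft scale / color-constancy constraint.

    rho_d, rho_s : (3,) predicted diffuse / specular albedo
    fg_means     : (N, 3) per-view average foreground intensities F_i in [0, 1]
    bg_means     : (N, 3) per-view average background intensities B_i in [0, 1]
    gamma        : gamma used to approximately linearize the 8-bit inputs
    """
    fg_lin = fg_means ** gamma                      # F_i^gamma, close to object radiance L_o
    bg_lin = bg_means ** gamma                      # B_i^gamma, close to illumination L
    residual = (rho_d + rho_s) * bg_lin - fg_lin    # per-view residual of Eq. (7)
    return np.sum(residual ** 2)
```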

3.3. Network Architectures

As shown in Fig. 2, our input data is unstructured because the observations Cik for each voxel vk are irregular, sparse samples on a 2D slice of the 4D BRDF. Different voxels may have different numbers of observations due to occlusion. In order to feed this unstructured input into networks for learning, we propose two new neural network architectures. One, called the Hemisphere-based CNN (HemiCNN), projects and interpolates the sparse observations onto a 2D image, enabling the use of standard convolutional neural networks. The other, called Grouplet, learns directly from randomly sampled observations and uses a multilayer perceptron network. Both networks are also designed to be lightweight in both inference time per scene (≤ 90 ms) and model size (≤ 340 KB).

3.3.1 Hemisphere-based CNN (HemiCNN)

For HemiCNN, as shown in the top row of Fig. 3, the RGB observations Cik of voxel vk are projected onto a unit sphere centered at the sample voxel vk. The unit sphere is rotated so that the positive z-axis is aligned with the voxel's surface normal nk. Observations Cik on the positive hemisphere (i.e., z > 0) are projected onto the 2D x-y plane. Finally, a dense 2D image, denoted a sample hemisphere image, is generated using nearest-neighbor interpolation among the projected observations.
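A possible implementation of this hemisphere projection is sketched below, under our own assumptions about data layout; `scipy.interpolate.griddata` with nearest-neighbor interpolation stands in for the paper's interpolation step.

```python
import numpy as np
from scipy.interpolate import griddata

def make_hemisphere_image(obs_dirs, obs_colors, normal, res=32):
    """Rasterize sparse per-voxel observations into a dense 2D hemisphere image.

    obs_dirs   : (M, 3) unit observation vectors o_i (voxel -> camera), world frame
    obs_colors : (M, 3) observed RGB colors C_ik in [0, 1]
    normal     : (3,) voxel surface normal n_k
    res        : output image resolution
    """
    # Build a rotation taking the surface normal to the +z axis.
    n = normal / np.linalg.norm(normal)
    up = np.array([1.0, 0.0, 0.0]) if abs(n[2]) > 0.9 else np.array([0.0, 0.0, 1.0])
    x_axis = np.cross(up, n); x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(n, x_axis)
    R = np.stack([x_axis, y_axis, n])            # rows: local x, y, z axes

    local = obs_dirs @ R.T                       # directions in the normal-aligned frame
    keep = local[:, 2] > 0                       # only the positive hemisphere is used
    xy, colors = local[keep, :2], obs_colors[keep]

    # Dense grid over the unit square; nearest-neighbor fill of the sparse samples.
    u = np.linspace(-1.0, 1.0, res)
    grid_x, grid_y = np.meshgrid(u, u)
    img = np.zeros((res, res, 3))
    for c in range(3):
        img[..., c] = griddata(xy, colors[:, c], (grid_x, grid_y), method="nearest")
    return img
```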

A siamese convolutional neural network is used to predict BRDF parameters from a collection of N sample hemisphere images, one for each of a representative set of voxels, e.g., chosen by clustering on voxel positions or surface normals. As shown in Fig. 3, in the first of two stages the siamese convolutional network operates on each sample hemisphere image individually to produce a vector representation, after which the representations are merged across the N samples (i.e., voxels) by computing an element-wise maximum. The network includes two convolutional layers, each with 16 sets of 3×3 filters with ReLU activations, a single 2×2 max-pooling layer, and a single fully connected layer with 64 neurons. After aggregating the N feature vectors with an element-wise maximum, we use a fully connected layer with 32 neurons, followed by a tanh activation, and a final fully connected layer to produce the BRDF prediction. In most experiments with HemiCNN, we set N = 25. The model size is 56 KB and the average inference time is 16 ms per scene.
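The description above maps to a compact network; the following PyTorch sketch is our reading of it (the padding, input resolution, and output dimension of 5 for Θ = (L, a, b, c, d) are assumptions, not details confirmed by the paper).

```python
import torch
import torch.nn as nn

class HemiCNN(nn.Module):
    """Siamese CNN over N sample hemisphere images, merged by element-wise max."""

    def __init__(self, out_dim=5, img_res=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * (img_res // 2) ** 2, 64),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, out_dim),
        )

    def forward(self, hemi_images):
        # hemi_images: (B, N, 3, H, W), one hemisphere image per sampled voxel.
        B, N, C, H, W = hemi_images.shape
        feats = self.encoder(hemi_images.view(B * N, C, H, W)).view(B, N, -1)
        merged, _ = feats.max(dim=1)   # element-wise maximum across the N voxels
        return self.head(merged)

# Usage: 25 hemisphere images per scene, batch of 4 scenes.
pred = HemiCNN()(torch.rand(4, 25, 3, 32, 32))   # -> (4, 5) BRDF parameters
```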

3.3.2 Sampling-based Network (Grouplet)

Figure 4. The Grouplet model for BRDF estimation relies on aggregating results from a set of weak regressors (nodes). Each node operates on a randomly sampled voxel from the object. M branches form the input to a node; each samples randomly from a set of observations. Intermediate representations from multiple nodes are combined by a moment pooling layer. BRDF parameters are regressed from the output of the moment pooling layer.

The second proposed network architecture is called Grouplet (Fig. 4). Unlike HemiCNN, where we transform sparse observations into 2D images to use standard convolutional layers, Grouplet operates directly on each observation Cik of each voxel vk. Grouplet relies on aggregating the results of a set of weak regressors called nodes. Each node estimates an intermediate representation of the BRDF parameters of a single voxel vk from M randomly sampled observations Γk = {C1k, ..., CMk}, as shown in Fig. 4. A different subset of observations is sampled for each voxel. Each observation from Γk is processed by a two-layer multilayer perceptron (MLP) with 128 neurons per layer, called a branch. The inputs to each branch are the observed color (Cik), the viewing direction (oi), the averaged foreground color (Fi), and the averaged background color (Bi).

Next, the outputs of the M branches are concatenated together with the voxel's surface normal (nk). This vector is processed by another two-layer MLP with 256 and 128 neurons in its layers, the output of which is the intermediate representation of the BRDF parameters. During BRDF estimation, we operate on N voxels, each of which is processed by a different node with shared weights. To combine the intermediate representations computed from several voxels, we use a moment pooling operator that is invariant to the number of nodes. We pool with the first and second central moments, which represent the expected value and variance of the intermediate representation across nodes. The output of the pooling operator is a 256-dimensional pooled representation.

The final part of the network estimates the BRDF parameters from the pooled representation with another MLP with two hidden layers of 128 neurons each and one final output layer. All layers throughout the model use hyperbolic tangent activation functions except for the last output layer.
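Putting the pieces together, here is our PyTorch sketch of a Grouplet-style node with moment pooling; the layer widths follow the text, while the exact input packing (12 values per observation) and the output dimension of 5 are our assumptions.

```python
import torch
import torch.nn as nn

class Grouplet(nn.Module):
    """Per-voxel nodes over M sampled observations, combined by moment pooling."""

    def __init__(self, obs_dim=12, out_dim=5, M=5):
        super().__init__()
        # Branch: one observation = (C_ik, o_i, F_i, B_i) -> 12-d input.
        self.branch = nn.Sequential(nn.Linear(obs_dim, 128), nn.Tanh(),
                                    nn.Linear(128, 128), nn.Tanh())
        # Node: M branch outputs concatenated with the voxel normal n_k.
        self.node = nn.Sequential(nn.Linear(M * 128 + 3, 256), nn.Tanh(),
                                  nn.Linear(256, 128), nn.Tanh())
        # Head: regress BRDF parameters from the 256-d pooled representation.
        self.head = nn.Sequential(nn.Linear(256, 128), nn.Tanh(),
                                  nn.Linear(128, 128), nn.Tanh(),
                                  nn.Linear(128, out_dim))

    def forward(self, obs, normals):
        # obs: (B, N, M, obs_dim) sampled observations, normals: (B, N, 3)
        B, N, M, D = obs.shape
        b = self.branch(obs.view(B * N * M, D)).view(B, N, M * 128)
        inter = self.node(torch.cat([b, normals], dim=-1))   # (B, N, 128)
        mean = inter.mean(dim=1)                              # first central moment
        var = inter.var(dim=1, unbiased=False)                # second central moment
        pooled = torch.cat([mean, var], dim=-1)               # (B, 256)
        return self.head(pooled)

# Usage: a Grouplet-fast-like setting with N = 20 voxels and M = 5 observations each.
pred = Grouplet()(torch.rand(2, 20, 5, 12), torch.rand(2, 20, 3))  # -> (2, 5)
```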

Grouplet is able to work with any number of nodes due to the use of pooling operators. It also does not require that the number of nodes be the same during training and testing. However, the order of the M observations in the branch networks is important. We found the best results by sorting observations by the cosine distance between the observation vector (oi) and the voxel's surface normal (nk). For BRDF estimation, Grouplet is applied in two forms, Grouplet-fast and Grouplet-slow, with N = 20 and N = 354 voxels, respectively. Each constructs M = 5 nodes per voxel. The average inference time is 5 ms for Grouplet-fast and 90 ms for Grouplet-slow. The model size for both Grouplet-fast and Grouplet-slow is 339 KB. Unless otherwise noted, Grouplet refers to Grouplet-slow.
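The observation ordering described above could be implemented as a simple sort key, e.g. (a trivial illustrative helper, not from the paper):

```python
import numpy as np

def sort_observations_by_cosine(obs_dirs, obs_features, normal):
    """Order a voxel's observations by cosine distance between o_i and n_k."""
    cos_sim = obs_dirs @ (normal / np.linalg.norm(normal))
    order = np.argsort(1.0 - cos_sim)   # smallest cosine distance first
    return obs_dirs[order], obs_features[order]
```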

For both HemiCNN and Grouplet, we set λg = 1 and explore a range of λ, finding 0.1 ≤ λ ≤ 1 to be a reasonable range. We train HemiCNN using RMSProp with a learning rate of 0.0001 for 100K minibatches. For Grouplet training, we use stochastic gradient descent with a fixed learning rate of 0.01 and momentum 0.9 for 13K minibatches.
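In PyTorch, for example, these optimizer settings would correspond to the following (a sketch; the linear models are stand-ins for the networks above, and the data loading is omitted):

```python
import torch
import torch.nn as nn

model_hemicnn = nn.Linear(64, 5)    # stand-in for the HemiCNN sketched above
model_grouplet = nn.Linear(64, 5)   # stand-in for the Grouplet sketched above

# HemiCNN: RMSProp with learning rate 1e-4, trained for 100K minibatches.
opt_hemi = torch.optim.RMSprop(model_hemicnn.parameters(), lr=1e-4)

# Grouplet: SGD with a fixed learning rate of 0.01 and momentum 0.9, 13K minibatches.
opt_grouplet = torch.optim.SGD(model_grouplet.parameters(), lr=0.01, momentum=0.9)
```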

3.4. SynBRDF: A Large Benchmark Dataset

Deep learning requires a large amount of data. Yet, for BRDF estimation, it is extremely challenging to obtain a large dataset with measured BRDF data due to the complex settings required for BRDF acquisition [19, 17]. Moreover, while there are quite a few recent works on 3D shape recovery and reflectance estimation in the wild [17, 27, 35], we are not aware of a large-scale benchmark dataset with ground-truth shape, reflectance, and illumination.

With these motivations, we created SynBRDF, which is, to our knowledge, the first large-scale synthetic benchmark dataset for BRDF estimation. SynBRDF covers 5000 materials randomly sampled from OpenSurfaces [3], 5000 shapes randomly sampled from ScanNet [6], and a total of 500K RGB and depth images (both HDR and LDR) rendered from multiple viewpoints with a physically based ray tracer [12], under variants of 20 natural environmental illumination maps. SynBRDF thus has ground truth for 3D shape, BRDF, illumination, and camera pose, making it an ideal benchmark dataset for evaluating image-based 3D scene reconstruction and BRDF estimation algorithms. As shown in Fig. 5, each scene is labeled with ground-truth Ward BRDF parameters (Eqs. 1 and 2). For more flexible evaluation that allows other types of rendering (e.g., global illumination) for the same scenes, we will also provide the 90K XML files that specify the original OpenSurfaces materials and contain the variations of environment and object model settings. We believe this dataset will be valuable for further study in this direction.

Figure 5. SynBRDF: (Left) Thumbnails of the first frame of each example (each contains 100 different observations). (Right) Some examples with depth maps (insets).

In our experiments, we used SynBRDF for training and evaluation. We randomly chose 400 of the 5000 scenes as a holdout test set and used the remaining 4600 scenes for training. For the real-data experiments, we directly applied the trained models without any domain adaptation.

Table 2. Average RMSE (w.r.t. the ground-truth BRDF parameters) on the test set of SynBRDF and the rank of user preferences from a perceptual study of rendered materials. Among the variants of our proposed method, we list the top seven methods based on the results of the user study. In general these provide the most plausible results among all testing data (see results in Fig. 7 and Fig. 9). Note that the RMSE ranking is not always consistent with the ranking of the user study. Further, as shown in Fig. 8, CubeRoot+Ec provides the most plausible results for the real data. Additional evaluations are found in the Supplementary Material.

    Network   Loss          RMSE   User Rank
    Grouplet  RMSE1+Ec      0.455  1
    Grouplet  RMSE1         0.432  2
    HemiCNN   RMSE2         0.564  3
    HemiCNN   CubeRoot      0.439  4
    HemiCNN   RMSE1         0.419  5
    HemiCNN   CubeRoot+Ec   0.583  6
    Grouplet  RMSE2         0.457  7

Figure 6. Error versus the number of observations for three methods: HemiCNN (blue), Grouplet-fast (20 voxel samples, red), and Grouplet-slow (354 voxel samples, orange). A larger set of observations yields noticeably improved predictions for all three methods, but the improvement begins to saturate around 30 observations. The common scan-and-fuse procedure does not always guarantee rich coverage of observations.

4. Experimental Results

We evaluated multiple variants of the proposed networks, changing the loss function Ed and the BRDF representation as described in Sec. 3.1. In Sec. 4.1, we evaluate these settings on SynBRDF, showing quantitatively and qualitatively that several combinations give accurate predictions on our synthetic dataset. In Sec. 4.2, we compare with prior work [17]. Finally, in Sec. 4.3, we demonstrate the proposed methods within the KinectFusion pipeline for complete 3D scene capture with real data.

4.1. Results on Synthetic Data

We evaluated all variants of the proposed methods on the test set of SynBRDF. Evaluating the quality of BRDF estimation is challenging [11]: perceived quality often varies with the illumination, 3D shape, and even display settings. Estimates with the lowest RMSE on the BRDF parameters are not necessarily the best for visual perception. Thus, in addition to computing the RMSE with respect to the ground-truth BRDF parameters, we also conducted a user study. We randomly chose 10 materials from the test set and rendered the BRDF predictions for each material under (a) natural illumination and (b) moving point light sources. The rendered images are similar to those in Fig. 7. We then asked 10 users to rank the methods on each material based on the perceptual similarity between the ground truth and the images rendered from each method.

Table 2 lists the top seven methods based on average user score, together with the RMSE w.r.t. the ground-truth BRDF parameters. (Additional evaluation results are provided in the Supplementary Material.) We found that the RMSE ranking is not always consistent with the ranking of the user study. Adding the regularization term Ec can improve the ranking (e.g., Grouplet-RMSE1-Ec), while the choice of Ed has a mixed effect on performance. We find that the Θ = (L, a, b, c, d) BRDF representation provides more accurate estimation of gloss (e.g., HemiCNN-RMSE2). Also, HemiCNN seems to obtain better estimates of the gloss, while Grouplet estimates the diffuse albedo better.

(RMSE is computed after normalization of the BRDF parameters to zero mean and unit standard deviation, based on the mean and standard deviation of the training set. Thus a random prediction with the same statistics will have RMSE ≈ 1.0.)
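As noted above, the reported RMSE is computed on normalized parameters; a minimal sketch of that normalization (our own illustration):

```python
import numpy as np

def normalized_rmse(pred, gt, train_params):
    """RMSE after normalizing BRDF parameters by training-set statistics.

    pred, gt     : (K, P) predicted and ground-truth parameter vectors
    train_params : (T, P) parameter vectors of the training set
    """
    mu, sigma = train_params.mean(0), train_params.std(0)
    z_pred = (pred - mu) / sigma
    z_gt = (gt - mu) / sigma
    return np.sqrt(np.mean((z_pred - z_gt) ** 2))
```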

Fig. 9 shows three random examples of BRDF estimation. We show the rendered images under natural illumination with the ground-truth BRDF, as well as with the estimated BRDF from two variants of our proposed method. Qualitatively, the BRDF estimates accurately reproduce the color and gloss of the surface materials.

4.2. Comparison with [17]

Figure 7. Comparison with Lombardi et al. [17]: ours (middle row of each example) is closer to the ground-truth example (top row), even under varying lighting. Material 947: top, GT; middle, ours (RMSE1+Ec); bottom, Lombardi et al. [17]. Material 3331: top, GT; middle, ours (RMSE1); bottom, Lombardi et al. [17].

As mentioned previously, it is difficult to compare with prior work on BRDF estimation in the wild [27, 35], given the lack of code and common datasets for comparison. Lombardi et al. [17] is the only method with released code. Strictly speaking, it is not a direct apples-to-apples comparison, because Lombardi et al. [17] requires a single image and a precise surface normal map as input and estimates both the DSBRDF and the lighting, while our methods take multiple RGBD images as input and estimate the Ward BRDF. Moreover, Lombardi et al. [17] takes about 3 minutes to run, while our methods are real-time (≤ 90 ms). Nevertheless, Lombardi et al. [17] is the only available option for comparison, and both its input requirements and running time are the most similar to ours. For the comparison, we randomly chose two materials from SynBRDF, rendered a sphere image under natural illumination, and used it (together with the sphere normal map) as the input for [17]. Fig. 7 shows the comparison. Our proposed method closely matches the ground truth and outperforms [17].

Figure 8. Real data evaluation on two examples, head and pumpkin. For each example, top-left is the input (real) scene, top-middle shows the rendered scene with the estimated BRDF parameters, top-right shows a different rendered view of the same scene, bottom-left shows a rendered sphere with the estimated BRDF, and bottom-right shows rendered spheres under varying point lighting. Grouplet and HemiCNN with CubeRoot+Ec were used for the head and pumpkin examples, respectively. The bottom section shows three rendered views from methods trained without the regularization Ec (Eq. 4).

4.3. Results on Real Data

Previously, in Sec. 1 and Sec. 3.2, we discussed the potential issues of scale ambiguity present in real-world data, due in part to our use of commodity RGBD camera output rather than HDR video. As expected, the regularization (Eq. 4) plays an important role in achieving correct results, as illustrated in Fig. 8. Notice that the result with the regularization better captures the brightness as well as a plausible gloss. Additional views of the real examples are shown in the supplementary video.


5. Conclusions and Limitations

In this paper, we proposed a lightweight and practical approach for surface reflectance estimation directly from 8-bit RGB images in real time. The method can be plugged into 3D scanning-and-fusion systems with a commodity RGBD sensor for scene capture. Our approach is learning-based, with an inference time of less than 90 ms per material and a model size of less than 340 KB. Compared to prior work, our method is a more feasible solution for real-time applications (VR/AR) on mobile devices. We proposed two novel network architectures, HemiCNN and Grouplet, to handle the unstructured measured data from the input images. We also designed a novel loss function that is both perceptually based and able to resolve the scale and color-constancy ambiguities in reflectance estimation. In addition, we provided the first large-scale synthetic dataset (SynBRDF) as a benchmark for training and evaluating surface reflectance estimation in uncontrolled environments.

Our method has several limitations that we plan to address in future work. First, our method estimates homogeneous reflectance. While Grouplet and HemiCNN can in theory operate on each voxel separately and thus could estimate spatially varying reflectance, in practice we found that using more voxels as input results in more robust estimation. One future direction is to jointly learn several basis reflectance functions and weight maps to estimate spatially varying BRDFs. Second, we use the isotropic Ward model for the BRDF representation. In the future, we plan to investigate more general, data-driven models such as the DSBRDF [25] and the related perceptually based loss [34]. Finally, we are interested in using neural networks to jointly refine both 3D geometry and reflectance estimation, and in leveraging domain adaptation techniques to further improve the performance on real data.

Appendix

Derivation of Eq. (4). For a viewing direction ωo, the observed scene radiance Lo is given by

    Lo = ∫ωi f(ωi, ωo; Θ) · Li · max(cos θi, 0) dωi,    (5)

where Li is the environmental illumination from direction ωi. We simplify the above rendering equation so that all terms can be computed from the input fed into the networks. Suppose the environment illumination is uniform, i.e., Li = L. By integrating the reflected radiance over the entire hemisphere, the measured radiance is:

    Lo ≈ (ρd + ρs) L.    (6)

Both Lo and L can be approximated from the input images, where the average intensity of the foreground object is close to Lo and the average intensity of the background is close to L. Since the input images are 8-bit images in the sRGB color space rather than linear HDR images, we need to apply an additional gamma transformation between pixel intensities and scene radiance (γ = 2.4 for sRGB). Thus, we have Lo ≈ F^γ and L ≈ B^γ, where F and B are the average image intensities of the foreground and background. Putting it all together, we have

    Fi^γ ≈ (ρd + ρs) · Bi^γ,    (7)

which gives the Ec term in Eq. (4).

Figure 9. Qualitative results for three randomly selected materials (Material 947, estimated with Grouplet RMSE1+Ec; Material 3331, estimated with HemiCNN RMSE2; Material 3905, estimated with HemiCNN RMSE1). For each material, the images in the first row are rendered from the ground-truth BRDF, while the images in the second row are rendered from the estimated BRDF using one of the methods from the list in Table 2. The ground-truth image indicated with a red border is sampled from the image sequence used for inference (the inputs). To demonstrate how different objects and environmental lights can change the appearance of the scene even with the same BRDF, the images in the second to fourth columns are rendered with the same BRDF but different models and lighting. More examples are included in the Supplementary Material.


References

[1] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell., 2015.

[2] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Trans. Graph., 33(4):159:1–159:12, July 2014.

[3] S. Bell, P. Upchurch, N. Snavely, and K. Bala. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Trans. Graph. (SIGGRAPH), 2013.

[4] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the Materials in Context Database. In CVPR, 2015.

[5] M. Chandraker and R. Ramamoorthi. What an image reveals about material reflectance. In ICCV, pages 1–8, 2011.

[6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. arXiv, 2017.

[7] A. Dai, M. Nießner, M. Zollhofer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. arXiv, 2016.

[8] Y. Dong, G. Chen, P. Peers, J. Zhang, and X. Tong. Appearance-from-motion: Recovering spatially varying surface reflectance under unknown lighting. ACM Trans. Graph., 2014.

[9] R. O. Dror, E. Adelson, and A. Willsky. Estimating surface reflectance properties from images under unknown illumination. In SPIE Human Vision and Electronic Imaging, 2001.

[10] R. W. Fleming, R. O. Dror, and E. Adelson. Real-world illumination and the perception of surface reflectance properties. Journal of Vision, 2003.

[11] A. Fores, J. Ferwerda, and J. Gu. Toward a perceptually based metric for BRDF modeling. In Twentieth Color and Imaging Conference, pages 142–148, Los Angeles, California, USA, November 2012.

[12] W. Jakob. Mitsuba Renderer, 2010. http://www.mitsuba-renderer.org.

[13] J. T. Kajiya. The rendering equation. In SIGGRAPH, 1986.

[14] H. P. A. Lensch, J. Kautz, M. Goesele, W. Heidrich, and H.-P. Seidel. Image-based reconstruction of spatially varying materials. In S. J. Gortle and K. Myszkowski, editors, Eurographics Workshop on Rendering, 2001.

[15] L. Lettry, K. Vanhoey, and L. V. Gool. DARN: A deep adversarial residual network for intrinsic image decomposition. arXiv.

[16] C. Liu, L. Sharan, R. Rosenholtz, and E. H. Adelson. Exploring features in a Bayesian framework for material recognition. In CVPR, 2010.

[17] S. Lombardi and K. Nishino. Reflectance and natural illumination from a single image. In ECCV, pages 582–595, 2012.

[18] S. Lombardi and K. Nishino. Radiometric scene decomposition: Scene reflectance, illumination, and geometry from RGB-D images. In 3DV, 2016.

[19] W. Matusik, H. Pfister, M. Brand, and L. McMillan. A data-driven reflectance model. SIGGRAPH, 2003.

[20] D. McAllister. A Generalized Surface Appearance Representation for Computer Graphics. PhD thesis, University of North Carolina at Chapel Hill, 2002.

[21] T. Narihira, M. Maire, and S. X. Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In ICCV, 2015.

[22] R. Newcombe, A. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, 2011.

[23] A. Ngan, F. Durand, and W. Matusik. Experimental analysis of BRDF models. In Proceedings of the Eurographics Symposium on Rendering, pages 117–226. Eurographics Association, 2005.

[24] M. Nießner, M. Zollhofer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG), 32(6):169, 2013.

[25] K. Nishino and S. Lombardi. Directional statistics-based reflectance model for isotropic bidirectional reflectance distribution functions. Journal of the Optical Society of America A, 28(1):8–18, 2011.

[26] K. Nishino and S. Lombardi. Single image multimaterial estimation. In CVPR, pages 238–245, 2012.

[27] G. Oxholm and K. Nishino. Shape and reflectance estimation in the wild. TPAMI, 38(2):376–389, 2016.

[28] F. Pellacini, J. A. Ferwerda, and D. P. Greenberg. Toward a psychophysically-based light reflection model for image synthesis. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '00), pages 55–64, 2000.

[29] K. Rematas, T. Ritschel, M. Fritz, E. Gavves, and T. Tuytelaars. Deep reflectance maps. In CVPR, 2016.

[30] F. Romeiro and T. Zickler. Blind reflectometry. In ECCV, 2010.

[31] T. Wang, J. Zhu, E. Hiroaki, M. Chandraker, A. Efros, and R. Ramamoorthi. A 4D light-field dataset and CNN architectures for material recognition. In ECCV, 2016.

[32] G. J. Ward. Measuring and modeling anisotropic reflection. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '92), pages 265–272, 1992.

[33] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison. ElasticFusion: Dense SLAM without a pose graph. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.

[34] J. Wills, S. Agarwal, D. Kriegman, and S. Belongie. Toward a perceptual space for gloss. ACM Trans. Graph., 2009.

[35] H. Wu, Z. Wang, and K. Zhou. Simultaneous localization and appearance estimation with a consumer RGB-D camera. IEEE Trans. Visualization and Computer Graphics, 2016.

[36] T. Zhou, P. Krahenbuhl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In ICCV, 2015.
