Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
A Lightweight Approach for On-the-Fly Reflectance Estimation
Kihwan Kim1 Jinwei Gu1 Stephen Tyree1 Pavlo Molchanov1 Matthias Nießner2 Jan Kautz1
1NVIDIA 2Technical University of Munich
http://research.nvidia.com/publication/reflectance-estimation-fly
Abstract
Estimating surface reflectance (BRDF) is one key com-
ponent for complete 3D scene capture, with wide appli-
cations in virtual reality, augmented reality, and human
computer interaction. Prior work is either limited to con-
trolled environments (e.g., gonioreflectometers, light stages,
or multi-camera domes), or requires the joint optimization
of shape, illumination, and reflectance, which is often com-
putationally too expensive (e.g., hours of running time) for
real-time applications. Moreover, most prior work requires
HDR images as input which further complicates the cap-
ture process. In this paper, we propose a lightweight ap-
proach for surface reflectance estimation directly from 8-
bit RGB images in real-time, which can be easily plugged
into any 3D scanning-and-fusion system with a commod-
ity RGBD sensor. Our method is learning-based, with an
inference time of less than 90ms per scene and a model
size of less than 340K bytes. We propose two novel net-
work architectures, HemiCNN and Grouplet, to deal with
the unstructured input data from multiple viewpoints under
unknown illumination. We further design a loss function to
resolve the color-constancy and scale ambiguity. In addi-
tion, we have created a large synthetic dataset, SynBRDF,
which comprises a total of 500K RGBD images rendered
with a physically-based ray tracer under a variety of natu-
ral illumination, covering 5000 materials and 5000 shapes.
SynBRDF is the first large-scale benchmark dataset for re-
flectance estimation. Experiments on both synthetic data
and real data show that the proposed method effectively re-
covers surface reflectance, and outperforms prior work for
reflectance estimation in uncontrolled environments.
1. Introduction
Capturing scene properties in the wild, including its 3D
geometry and surface reflectance, is one of the ultimate
goals of computer vision, with wide applications in vir-
tual reality, augmented reality, and human computer inter-
action. While 3D geometry recovery has achieved high ac-
curacy, especially with recent RGBD-based scanning-and-
fusion approaches [7, 22, 33], surface reflectance estima-
(a) (b) (c)
Figure 1. Overview of our method: we take RGBD image se-
quences as inputs (a). During a reconstruction process, each view
contributes voxels as an observation. In (b), we visualize the color
of visible voxel samples from a specific view (red circle among
red dots indicating locations of other views). These samples from
all views are evaluated through either HemiCNN (Sec. 3.3.1) or
Grouplet networks (Sec. 3.3.2) to estimate the BRDF in real time.
In (c), we show a rendered image with predicted BRDF and cap-
tured lighting and shape. More examples are shown in Sec. 4.
tion still remains challenging. At one extreme, most of
the fusion methods assume Lambertian reflectance and re-
cover surface texture only. At the other extreme, most of the
prior work on surface BRDF (bidirectional reflectance dis-
tribution function) estimation [14, 20, 19] aims to recover
the full 4D BRDF function, but are often limited to con-
trolled, studio-like environments (e.g., gonioreflectometers,
light stages, multi-camera domes, planar samples).
Recently, a few methods [35, 18, 27, 26, 30, 17, 1] were
proposed to recover surface reflectance in uncontrolled en-
vironments (e.g., unknown illumination or shape) by utiliz-
ing statistical priors on natural illumination and/or BRDF.
These methods formulate the inverse rendering problem as
a joint, alternative optimization among shape, reflectance,
and/or lighting. Despite their accuracy, these methods are
computationally quite expensive (e.g., hours or days of
running time and tens of gigabytes of memory consump-
tion) and are often run in a post-process rather than a real-
time setting. Moreover, these methods often require high-
resolution HDR images as input, which further complicates
the capturing process for real-time or interactive applica-
tions on consumer-grade mobile devices.
1 20
In this paper, we propose a lightweight and practical
approach for surface reflectance estimation directly from
8-bit RGB images in real-time. The method can be eas-
ily plugged into any 3D scanning-and-fusion system with a
commodity RGBD sensor and enables physically-plausible
renderings at novel lighting/viewing conditions, as shown
in Fig. 1. Similar to prior work [35, 18], we use a simpli-
fied BRDF representation and focus on estimating surface
albedo and gloss rather than full 4D BRDF function. Our
method is learning-based, with an inference time of less
than 90ms per scene and a model size of less than 340K
bytes.
In order to deal with unstructured input data (e.g., each
surface point can have a different number of observations
due to occlusions) in the context of neural networks, we
propose two novel network architectures – HemiCNN and
Grouplet. HemiCNN projects and interpolates the sparse
observations onto a 2D image, which enables the use of
standard convolutional neural networks. Grouplet learns di-
rectly from random samples of observations and uses multi-
layer perception networks. Both networks are also designed
to be lightweight in both inference time and model size for
real-time applications on consumer-grade mobile devices.
In addition, since the illumination is unknown, we have de-
signed a novel loss function to resolve the color-constancy
and scale ambiguity (i.e., only given input images, we do
not know whether surfaces are reddish or the lighting is red-
dish, or whether surfaces are dark or the lighting is dim). To
the best of our knowledge, this is the first lightweight, real-
time approach for surface reflectance estimation in the wild.
We also created a large-scale synthetic dataset —
SynBRDF— for reflectance estimation. SynBRDF covers
5000 materials randomly sampled from OpenSurfaces [3],
5000 shapes from ScanNet [6], and a total of 500K RGBD
images (both HDR and LDR) rendered from multiple view-
points with a physically-based ray tracer under 20 natural
environmental illumination conditions, making it an ideal
benchmark dataset for complete image-based 3D scene re-
construction.
Finally, we incorporated the proposed approach with
RGBD scanning-and-fusion for complete 3D scene capture
(see Sec. 4 and Fig. 8). We trained our networks with Syn-
BRDF and directly applied the trained models on real data
captured with a commodity RGBD sensor. Experiments on
both synthetic data and real data show that the proposed
method effectively recovers surface reflectance and outper-
forms prior work for surface reflectance estimation in un-
controlled environments.
2. Related Work
Surface Reflectance Estimation in Uncontrolled Envi-
ronments Most prior work in this direction formulate the
inverse rendering problem as a joint optimization among
the three radiometric ingredients—lighting, geometry, and
reflectance—from observed images. Barron et al. [1] as-
sume Lambertian surfaces and optimize all the three com-
ponents. Others [30, 8, 17, 9] optimize reflectance and illu-
mination with known 3D geometry, either from motion or
based on statistical priors on natural illumination and ma-
terials. Recently, Wu et al. [35] and Lombardi et al. [18]
proposed to jointly estimate lighting, reflectance, and 3D
shape from a RGBD sensor, even in the presence of inter-
reflection. Chandraker et al. [5] investigated the theoretical
limits of material estimation from a single image. Despite
their effectiveness, these methods solve complicated opti-
mization problems iteratively, which is computationally too
expensive for real-time applications (e.g., hours of running
time). As the optimization relies heavily on the paramet-
ric forms of statistical priors, these methods generally re-
quire good initialization and HDR images as input. In con-
trast, our proposed method is a lightweight and practical
approach that can estimate surface reflectance directly from
8-bit RGB images on-the-fly, which is suitable for real-time
applications.
Material Perception and Recognition Our work is also
inspired from prior work on material perception and recog-
nition from images. Pellacini et al. [28] designed a
perceptually-uniform representation of the isotropic Ward
BRDF model [32], and Wills et al. [34] extend to data-
driven models with measured reflectance. Fores et al. [11]
studied the metrics used for BRDF fitting [23]. Flem-
ing et al. [10] found that natural illumination is key for the
perception of surface reflectance. Bell et al. [3] released a
large dataset—OpenSurfaces—with annotated surface ap-
pearance from real-world materials. These prior works in-
spired us in designing the regression loss and creating a
synthetic dataset for training. For learning-based material
recognition, Liu et al. [16] proposed a Bayesian approach
based on a bag of visual features. Bell et al. [4] used CNNs
(convolutional neural networks) for material recognition
from material context input. Recently, Wang et al. [31] pro-
posed a CNN-based method for material recognition from
light field images. These prior work shows neural networks
are capable of learning discriminative features for material
perception from images.
Reflectance Maps Estimation and Intrinsic Image De-
composition Intrinsic image decomposition aims to fac-
tor an input image into a shading-only image and a
reflectance-only image. Recently, CNNs has been success-
fully employed for intrinsic image decomposition [15, 21]
from a single image. Bell et al. [2] proposed a dense CRF-
based method and released a large intrinsic image dataset
generated by crowdsourcing. Zhou et al. [36] used deep
learning to infer data-driven priors from sparse human an-
notations. Rematas et al. [29] used CNNs to estimate re-
21
(a) (b) (c) (d) (e) (f)Figure 2. Overview of our framework: (a) BRDF examples from OpenSurface [3], (b) Input image Ii and depth Di streams. Integrated
volume is shown in (c), where the colors shown from ith view Ii (the red circle with pose Ti) are visualized. Small red dots refer to the
locations of other views. (d) shows the data that we extract from each voxel vk for training: normal nk, observation vector oi (the green
arrow), and color values Cik from the observation Ii at the voxel vk. In (e), these measurements together with color statistics (Fi and Bi)
are fed into one of the two networks, HemiCNN (Sec. 3.3.1) and Grouplet (Sec. 3.3.2) for BRDF estimation.
flectance maps (defined as 2D reflectance under fixed, un-
known illumination) from a single image. These methods
recover only the illumination-dependent reflectance map,
while our method estimates the full BRDF that enables ren-
dering under novel illumination and viewing conditions.
3. Method
Our goal is to develop a module for real-time surface
reflectance estimation that can be plugged into any 3D
scanning-and-fusion methods for 3D scene capture, with
potential applications in VR/AR. In this paper, we make a
significant step towards this goal, and propose two novel
networks for homogeneous surface reflectance estimation
from RGBD image input.
3.1. Framework and Reflectance Model
Our framework takes as input RGBD image/depth se-
quences from a commodity depth sensor (Fig. 1). We
denote RGB color observation as C : ΩC → R3, images
with I : ΩI → R3 and depth maps with D : ΩD → R.
The N acquired RGBD frames consist of RGB color
images Ii, and depth maps Di (with frame index
i ∈ 1 . . . N ). We also denote the absolute camera poses
Ti = (R, t) ∈ SE(3), t ∈ R3 and R ∈ SO(3) of the re-
spective frames, which is computed from standard volume-
based pose-estimation algorithm [7].1 As shown in Fig. 2,
the input Ii, and Di are aligned and integrated into a 3D
volume
A
with signed-distance fusion [24], from which we
extract voxels vk ∈ A
(k ∈ 1 . . .M ) that contain observed
color Cik from the corresponding view Ii, its surface normal
nk, and camera orientation oi, see Fig. 2(d). Additionally,
we compute the color statistics for each view by simply tak-
ing the average of foreground and background pixels in Ii,1During training, we randomly generated poses for rendering scenes.
denoted as Fi and Bi, respectively. We will discuss these
further in Sec. 3.2.
For the representation of surface reflectance, similar to
prior work [35, 18], we choose a parametric BRDF model—
the isotropic Ward BRDF model [32]—for two reasons: (1)
the Ward BRDF model has a simple form but is representa-
tive for a wide variety of real-world materials [23], and (2)
prior studies [28, 34] on BRDF perception are based on the
Ward BRDF model. Specifically, the isotropic Ward BRDF
model is given by:
f(ωi, ωo; Θ) =ρdπ
+ ρs ·exp
(
− tan2 θh/α2)
4πα2√cos θi cos θo
, (1)
where ωi = (θi, φi) and ωo = (θo, φo) are the incident and
viewing directions, θh is the half angle, and Θ = (ρd, ρs, α)is the parameter to be estimated.
An equivalent, but perceptually-uniform representation
of the Ward BRDF model was proposed in [28], where the
diffuse albedo ρd is converted from RGB to CIE Lab col-
orspace, (L, a, b), and the gloss is described by variables c,the contrast of gloss, and d, the distinctness of gloss. Vari-
ables c and d are related to the BRDF parameters by [28]:
c = 3
√
ρs + ρd/2− 3
√
ρd/2, d = 1− α. (2)
Thus, an alternative representation for the BRDF parame-
ters is Θ = (L, a, b, c, d).Our problem is thus formulated as follows. Given a
set of voxels from any 3D scanning-and-fusion pipeline,
vk = Cik,oi ,nk, we estimate the optimal BRDF
parameters Θ with neural networks. Two problems need
to be solved for learning. First, what is a good loss func-
tion that can resolve the color constancy and scale ambigu-
ities due to unknown illumination? For example, just from
input images, we cannot tell whether the material is red-
dish or the illumination is reddish, or whether the material
22
Name Ed(Θ, Θ)
RMSE1 ||ρd − ρd||2 + ||ρs − ρs||
2 + ||α− α||2
RMSE2 ||Lab − ˆLab||2 + λg||cd − cd||2
Cubic Root 3
√
∫
ωi,ωo
||f(ωi, ωo; Θ)− f(ωi, ωo; Θ)|| cos θidωodωo
Table 1. Three options for the distance function Ed(Θ, Θ) for
BRDF estimation. RMSE1 and RMSE2 are root mean squared
error using Θ = (ρd, ρs, α) and Θ = (L, a, b, c, d), respectively,
where the latter is the sum of the perceptual color difference and
the perceptual gloss difference (λg = 1) [28]. Cubic Root is the
cosine-weighted ℓ2-norm of the difference of two 4D BRDF func-
tions, inspired by BRDF fitting [23, 11].
is dark or the illumination is dim. Second, the input data
is unstructured — different voxels have different numbers
of observations due to occlusion. What is a good network
architecture for such unstructured input data? We address
these two problems in the following sections.
3.2. Design of the Loss Function
A key part for network training is an appropriate loss
function. Prior work [23, 11] has shown that the commonly-
used ℓ2 norm (i.e., MSE) is not optimal for BRDF fitting.
We design the following loss:
J = Ed(Θ, Θ) + λEc(Θ, Cik), (3)
where Ed(·, ·) measures the discrepancy between the esti-
mated BRDF parameters and the ground truth and Ec(·, ·) is
a regularization term which relates the estimated reflectance
Θ with observed image intensities Cik. Ec aims to resolve
the aforementioned scale and color constancy ambiguities
and is weighted by λ = 0.01 in all our experiments
Table (1) lists three options for Ed(Θ, Θ) implemented
in this paper. RMSE1 and RMSE2 are the root mean
squared error with Θ = (ρd, ρs, α) and Θ = (L, a, b, c, d),respectively, where the latter is the sum of the perceptual
color difference and the perceptual gloss difference (λg =1) [28]. Cubic Root is inspired from BRDF fitting [23, 11],
which is a cosine weighted ℓ2-norm of the difference be-
tween two 4D BRDF functions.
For Ec, we use the color statistics computed for each
view, Fi and Bi, to approximately constrain the estimation
of ρd and ρs. Specifically, Ec is derived based on the ren-
dering equation [13] as
Ec =∑
i
||(ρd + ρs) · Bγi − F
γi ||
2, (4)
where Fi and Bi are the average image intensity of the fore-
ground and background regions of the i-th input image Ii,
γ = 2.4 is used to convert the input 8-bit RGB images to
linear images; see the Appendix for a detailed derivation.
Even though Eq. (4) is only an approximation of the render-
ing equation, it imposes a soft constraint on the scale and
Figure 3. Details of HemiCNN. Top row: generating a voxel hemi-
sphere image, from the sparse 3D set of the observations Cik of
voxel vk to a dense 2D hemisphere representation. Bottom row:
the HemiCNN siamese convolutional neural network architecture.
color cues for the BRDF estimation. We found it quite ef-
fective for working with real data (see Fig. 8).
3.3. Network Architectures
As shown in Fig. 2, our input data is unstructured be-
cause the observations Cik for each voxel vk are irregular,
sparse samples on the 2D slice of the 4D BRDF. Differ-
ent voxels may have different numbers of observations due
to occlusion. In order to feed the unstructured input data
into networks for learning, we propose two new neural net-
work architectures. One is called Hemisphere-based CNN
(HemiCNN) which projects and interpolates the sparse ob-
servations onto a 2D image, enabling the use of standard
convolutional neural networks. The other architectures,
called Grouplet, learns directly from randomly sampled ob-
servations and uses a multilayer perceptron network. Both
networks are also designed to be lightweight in both infer-
ence time per scene (≤90ms) and model size (≤340KB).
3.3.1 Hemisphere-based CNN (HemiCNN)
For HemiCNN, as shown in the top row of Fig. 3, the RGB
observations Cik of voxel vk are projected onto a unit
sphere centered at the sample voxel vk. The unit sphere
is rotated so the positive z-axis is aligned with the voxel’s
surface normal nk. Observations Cik on the positive
hemisphere (i.e., z > 0) are projected onto the 2D x-y
plane. Finally, a dense 2D image, denoted a sample hemi-
sphere image, is generated using nearest-neighbor interpo-
lation among the projected observations.
A siamese convolutional neural network is used to pre-
dict BRDF parameters from a collection of N sample hemi-
sphere images, one for each of a representative set of vox-
els, e.g., chosen by clustering on voxel positions or surface
normals. As shown in Fig. 3, in the first of two stages
the siamese convolutional network operates on each sample
hemisphere image individually to produce a vector repre-
sentation, after which the representations are merged across
23
!"
!"
##
$
"%&
'() **(+&(,*%(,* -
*.
!"
Figure 4. The Grouplet model for BRDF estimation relies on ag-
gregating results from a set of weak regressors (nodes). Each
node operates on a randomly sampled voxel from the object. M
branches form the input to a node; each samples randomly from
a set of observations. Intermediate representation from multiple
nodes are combined by a moment pooling layer. BRDF parame-
ters are regressed from the output of the moment pooling layer.
the N samples (i.e., voxels) by computing an element-wise
maximum. The network includes two convolutional lay-
ers, each with 16 sets of 3× 3 filters with ReLU activa-
tions, a single 2×2 max-pooling layer, and a single fully-
connected layer with 64 neurons. After aggregating the N
feature vectors with an elementwise-maximum, we use a
fully-connected layer with 32 neurons, followed by tanh ac-
tivation, and a final fully-connected layer to produce the
BRDF prediction. In most experiments with HemiCNN,
we set N = 25. The model size is 56KB and the average
inference time is 16ms per scene.
3.3.2 Sampling-based Network (Grouplet)
The second proposed network architecture is called Grou-
plet (Fig. 4). Unlike HemiCNN, where we transform sparse
observations to 2D images to use standard convolution lay-
ers, Grouplet directly operates on each observation Cik for
each voxel vk. Grouplet relies on aggregating results from
a set of weak regressors called nodes. Each node estimates
an intermediate representation of the BRDF parameters of a
single voxel (vk) from M randomly sampled observations
Γk = C1k, · · · , CMk, as shown in Fig. 4. A different sub-
sets of observations is sampled for each voxel. Each obser-
vation from Γk is processed by a two-layer multilayer per-
ceptron (MLP) with 128 neurons per layer, called a branch.
Inputs to each branch are observed color (Cik), viewing di-
rection (oi), averaged foreground color (Fi) and averaged
background color (Bi).
Next, the output of M branches are concatenating to-
gether with the voxel’s surface normal (nk). This vector
is processed by another two-layer MLP with 256 and 128neurons in the layers, the output of which is the intermediate
representation of the BRDF parameters. During BRDF esti-
mation, we operate on N voxels, each of which is processed
by different nodes with shared weights. To combine the in-
termediate representations computed from several voxels,
we use a moment pooling operator that is invariant to the
number of nodes. We pool with the first and second cen-
tral moments which represent expected value and variance
of the intermediate representation across nodes. The output
of the pooling operator is a 256-dimensional pooled repre-
sentation.
The final part of the network estimates the BRDF param-
eters from the pooled representation by another MLP with
two hidden layers of 128 neurons each, and one final out-
put layer. All layers throughout the model use hyperbolic
tangent activation functions except for the last output layer.
Grouplet is able to work with any number of nodes due
to the use of pooling operators. It also does not require that
the number of nodes be the same during training and test-
ing. However, the order of the M observations in the branch
networks is important. We found the best results by sorting
observations by the cosine distance between the observation
vector (oi) and the voxel’s surface normal (nk). For BRDF
estimation, Grouplet is applied in two forms, Grouplet-fast
and Grouplet-slow, with N = 20 and N = 354 voxels, re-
spectively. Each constructs M=5 nodes per voxel. The av-
erage inference time is 5ms for Grouplet-fast and 90ms for
Grouplet-slow. The model size for both Grouplet-fast and
Grouplet-slow is 339KB. Unless otherwise noted, Grouplet
refers to Grouplet-slow.
For both HemiCNN and GroupLet, we set λg = 1 and
explore a range of λ, finding 0.1 ≤ λ ≤ 1 to be a reasonable
range. We train HemiCNN using RMSProp with learning
rate 0.0001 and 100K minibatches. For Grouplet training,
we use stochastic gradient descent with fixed learning rate
0.01 and momentum 0.9 for 13K minibatches.
3.4. SynBRDF: A Large Benchmark Dataset
Deep learning requires a large amount of data. Yet, for
BRDF estimation, it is extremely challenging to obtain a
large dataset with measured BRDF data due to the complex
settings required for BRDF acquisition [19, 17]. Moreover,
while there are quite a few recent works for 3D shape re-
covery and reflectance estimation in the wild [17, 27, 35],
we are not aware of a large-scale, benchmark dataset with
ground-truth shape, reflectance, and illumination.
With these motivations, we created SynBRDF which is,
to our knowledge, the first large-scale, synthetic bench-
mark dataset for BRDF estimation. SynBRDF covers 5000materials randomly sampled from OpenSurfaces [3], 5000shapes randomly sampled from ScanNet [6], and a total
24
Figure 5. SynBRDF: (Left) Thumbnails of the first frame of each
example (each contains 100 different observations), (Right) Some
examples with depth map (insets)
of 500K RGB and depth images (both HDR and LDR)
rendered from multiple viewpoints with a physically-based
raytracer [12], under variants of 20 natural environmental
illumination maps. SynBRDF thus has ground truth for
3D shape, BRDF, illumination, and camera pose, making
it an ideal benchmark dataset for evaluating image-based
3D scene reconstruction and BRDF estimation algorithms.
As shown in Fig. 5, each scene is labeled with ground truth
Ward BRDF parameters (Eq. 1 and 2). For more flexible
evaluation that allows other types of rendering (e.g., global
illumination) for the same scene, we will also provide the
90K XML files that indicate the original OpenSurface ma-
terials and contain the variations of environmental and ob-
ject model settings. We believe this dataset will be valuable
for further study in this direction.
In our experiments, we used SynBRDF for training and
evaluation. We randomly chose 400 from the total 5000scenes as a holdout test set, and the remaining 4600 scenes
for training. For real data experiments, we directly applied
the trained models without any domain adaptation.
Network Loss RMSE User Rank
Grouplet RMSE1+Ec 0.455 1Grouplet RMSE1 0.432 2
HemiCNN RMSE2 0.564 3HemiCNN CubeRoot 0.439 4HemiCNN RMSE1 0.419 5HemiCNN CubeRoot+Ec 0.583 6
Grouplet RMSE2 0.457 7
Table 2. Average RMSE (w.r.t ground truth BRDF parameters) on
the test set of SynBRDF and the rank of user preferences from
a perceptual study of rendered materials. Among the variants of
our proposed method, we list the top seven methods based on the
results of the user study. In general these provide most plausible
results among all testing data (see results in Fig. 7 and Fig. 9).
Note that RMSE ranking is not always consistent with the ranking
of user study. Further, as shown in Fig. 8, CubeRoot+Ec provides
most plausible results for the real data. Additional evaluations are
found in the Supplementary Material.
Figure 6. Error versus the number of observations for three meth-
ods: HemiCNN (blue), Grouplet-fast (20 voxel samples, red), and
Grouplet-slow (354 voxel samples, orange). A larger set of obser-
vations yields noticeably improved predictions for all three meth-
ods, but begins to saturate around 30. The common scan-and-fuse
method does not always guarantee a rich coverage of observations.
4. Experimental Results
We evaluated multiple variants of the proposed net-
works, changing the loss function Ed and BRDF representa-
tions as described in Sec 3.1. In Sec. 4.1, we evaluate these
settings on SynBRDF, showing quantitatively and qualita-
tively that several combinations give accurate predictions
on our synthetic dataset. In Sec. 4.2, we compare with prior
work [17]. Finally in Sec. 4.3, we demonstrate the proposed
methods within the KinectFusion pipeline for complete 3D
scene capture with real data.
4.1. Results on Synthetic Data
We evaluated all variants of the proposed methods on the
test set of SynBRDF. Evaluating the quality of BRDF esti-
mation is challenging [11]—perceived quality often varies
with the illumination, 3D shape, and even display settings.
Estimates with the lowest RMSE error on the BRDF pa-
rameters are not necessarily the best for visual perception.
Thus, in addition to computing the RMSE with respect to
the ground truth BRDF parameters, we also conducted a
user study. We randomly chose 10 materials from the test
set, and rendered the BRDF predictions for each material
under (a) natural illumination and (b) moving point light
sources. The rendered images are similar to Fig. 7. We then
asked 10 users to rank the methods on each material based
on the perceptual similarity between the ground truth and
the images rendered from each method.
Table 2 lists the top seven methods based on average user
score, together with the RMSE w.r.t ground truth BRDF pa-
rameters.2 (Additional evaluation results are provided in
the Supplementary Material.) We found the RMSE rank-
ing is not always consistent with the ranking of the user
study. Adding the regularization term Ec can improve the
ranking (e.g., Grouplet-RMSE1-Ec), while the choice of
2RMSE is computed after normalization of the BRDF parameters to
zero mean and unit standard deviation, based on the mean and standard
deviation of the training set. Thus a random prediction (with the same
statistics) will have RMSE ≈ 1.0.
25
Material 947: top: GT, middle: ours (RMSE1+Ec), bottom: Lombardi et al. [17]
Material 3331: top: GT, middle: ours (RMSE1), bottom: Lombardi et al. [17]
Figure 7. Comparison with Lombardi et al. [17]: ours (middle row
for each example) is closer to the ground truth example (top row)
even under varying lighting.
Ed has mixed effect on performance. We find that the
Θ = (L, a, b, c, d) BRDF representation provides more ac-
curate estimation of gloss (e.g., HemiCNN-RMSE2). Also,
HemiCNN seems able to obtain better estimate for the
gloss, while Grouplet estimates the diffuse albedo better.
Fig. 9 shows three random examples of BRDF estima-
tion. We show the rendered images under natural illumina-
tion with the ground truth BRDF, as well as the estimated
BRDF from two variants of our proposed method. Qualita-
tively, the BRDF estimations accurately reproduce the color
and gloss of the surface materials.
4.2. Comparison with [17]
As mentioned previously, it is difficult to compare with
prior work on BRDF estimation in the wild [27, 35], given
the lack of code and common datasets for comparison.
Lombardi et al. [17] is the only method with released codes.
Strictly speaking, it is not a direct apples-to-apples compar-
ison, because Lombardi et al. [17] requires a single image
and a precise surface normal map as input and estimates
both DSBRDF and lighting, while our methods take multi-
ple RGB-D images as input and estimate the Ward BRDF.
Moreover, Lombardi et al. [17] takes about 3 minutes to run,
while our methods are real-time (≤ 90ms). Nevertheless,
Lombardi et al. [17] is the only available option for compar-
ison, and both its input requirements and running time are
similar to ours. For comparison, we randomly chose two
materials from SynBRDF, rendered a sphere image under
Head model
Pumpkin model
BRDF estimation without regularization Ec
Figure 8. Real data evaluation: For both examples in the top two
sections, head and pumpkin, top-left is the input (real) scene, top-
middle shows the rendered scene with estimated BRDF parame-
ters, top-right shows a different rendered view of the same scene,
bottom-left shows a rendered sphere with the estimated BRDF, and
bottom-right shows rendered spheres with varying point lighting.
Grouplet and HemiCNN with CubeRoot+Ec were used for the
head and pumpkin examples, respectively. The bottom section
shows three rendered views from methods trained without regu-
larization Ec (Eq. 4).
natural illumination, and used it (together with the sphere
normal map) as the input for [17]. Fig. 7 shows the com-
parison. Our proposed method closely matches the ground
truth and outperforms [17].
4.3. Results on Real Data
Previously, in Sec. 1 and Sec. 3.2, we discussed the po-
tential issues of scale ambiguity present in real-world data,
due in part to our use of commodity RGBD camera output
rather than HDR videos. As expected, the regularization
(Eq. 4) plays an important role in achieving correct results,
as illustrated in Fig. 8. Notice that the result with the reg-
ularization better captures brightness as well as plausible
gloss. Additional views of the real examples are shown in
the supplementary video.
26
5. Conclusions and Limitations
In this paper, we proposed a lightweight and practical ap-
proach for surface reflectance estimation directly from 8-bit
RGB images in real-time. The method can be plugged into
3D scanning-and-fusion systems with a commodity RGBD
sensor for scene capture. Our approach is learning-based,
with the inference time less than 90ms per material and
model size less than 340K bytes. Compared to prior work,
our method is a more feasible solution for real-time applica-
tions (VR/AR) on mobile devices. We proposed two novel
network architectures, HemiCNN and Grouplet, to handle
the unstructured measured data from input images. We also
designed a novel loss function that is both perceptually-
based and able to resolve the scale ambiguity and color-
constancy ambiguity for reflectance estimation. In addi-
tion, we also provided the first large-scale synthetic data set
(SynBRDF) as a benchmark dataset for training and evalu-
ation for surface reflectance estimation in uncontrolled en-
vironments.
Our method has several limitations that we plan to ad-
dress in future work. First, our method estimates homo-
geneous reflectance. While GroupLet and HemiCNN can
in theory operate for each voxel separately and thus could
estimate spatially-varying reflectance, in practice we found
using more voxels as input results in more robust estima-
tion. One future direction is to jointly learn several basis
reflectance functions and weight maps to estimate spatially-
varying BRDF. Second, we use the isotropic Ward model
for BRDF representation. In the future, we plan to in-
vestigate more general, data-driven models such as DS-
BRDF [25] and the related perceptually-based loss [34]. Fi-
nally, we are interested in using neural networks to jointly
refine both 3D geometry and reflectance estimation, and
leveraging domain adaption techniques to further improve
the performance on real data.
Appendix
Derivation of Eq.(4) For viewing direction ωo, the ob-
served scene radiance Lo is given by
Lo =
∫
ωi
f(ωi, ωo; Θ) · Li ·max(cos θi, 0)dωi, (5)
where Li is the environmental illumination in the direc-
tion ωi. We simplified the above rendering equation so that
all terms can be computed from the input fed into the net-
works. Suppose the environment illumination is uniform,
i.e., Li = L, by integrating the reflected radiance from the
entire hemisphere, the measured radiance is:
Lo ≈ (ρd + ρs)L. (6)
Both Lo and L can be approximated from input images,
where the average intensity of the foreground object is close
Material 947
Gro
und
truth
BR
DF
Gro
uple
t
RM
SE1+E
c
Material 3331
Gro
und
truth
BR
DF
Hem
iCN
N
RM
SE2
Material 3905G
round
truth
BR
DF
Hem
iCN
N
RM
SE1
Figure 9. Qualitative results for three randomly selected materials.
For each material, images in first row are rendered from ground
truth BRDF, while images in the second row are rendered from the
estimated BRDF using one of the methods from the list in Table 2.
The ground truth image indicated with a red border is sampled
from the image sequence used for inference (inputs). To demon-
strate how different objects and environmental lights can change
the appearance of the scene even with the same BRDF, the three
images from second to fourth columns are rendered with the same
BRDF but different models and lighting. More examples from dif-
ferent are included in the Supplementary Material.
to Lo, and the average intensity of the background is close
to L. Since the input images are 8-bit images in the sRGB
color space rather than linear HDR images, we need to ap-
ply an additional gamma transformation between pixel in-
tensities and scene radiance (γ = 2.4 for sRGB). Thus, we
have Lo ≈ F γ and L ≈ Bγ , where F and B are the av-
erage image intensities for the foreground and background.
Putting all together, we have
F γi ≈ (ρd + ρs) · Bγ
i , (7)
and thus we have the Ec term in Eq.(4).
27
References
[1] J. T. Barron and J. Malik. Shape, illumination, and re-
flectance from shading. IEEE Trans. Pattern Anal. Mach.
Intell., 2015. 1, 2
[2] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild.
ACM Trans. Graph., 33(4):159:1–159:12, July 2014. 2
[3] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Opensurfaces:
A richly annotated catalog of surface appearance. ACM
Trans. Graph. (SIGGRAPH), 2013. 2, 3, 5
[4] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material
recognition in the wild with the materials in context database.
Computer Vision and Pattern Recognition (CVPR), 2015. 2
[5] M. Chandraker and R. Ramamoorthi. What an image reveals
about material reflectance. In ICCV, pages 1–8, 2011. 2
[6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser,
and M. Nießner. Scannet: Richly-annotated 3d reconstruc-
tions of indoor scenes. http://arxiv.org/, 2017. 2, 5
[7] A. Dai, M. Nießner, M. Zollhofer, S. Izadi, and C. Theobalt.
Bundlefusion: Real-time globally consistent 3D reconstruc-
tion using on-the-fly surface re-integration. arXiv, 2016. 1,
3
[8] Y. Dong, G. Chen, P. Peers, J. Zhang, and X. Tong.
Appearance-from-motion: Recovering spatially varying sur-
face reflectance under unknown lighting. ACM Trans.
Graph., 2014. 2
[9] R. O. Dror, E. Adelson, and A. Willsky. Estimating surface
reflectance properties from images under unknown illumina-
tion. In SPIE, Human Vision and Electronic Imaging, 2001.
2
[10] R. W. Fleming, R. O.Dror, and E. Adelson. Real-world illu-
mination and the perception of surface reflectance properties.
Journal of Vision, 2003. 2
[11] A. Fores, J. Ferwerda, and J. Gu. Toward a perceptually
based metric for brdf modeling. In Twentieth Color and
Imaging Conference. Los Angeles, California, USA, pages
142–148, November 2012. 2, 4, 6
[12] W. Jakob. Mitsuba Renderer, 2010. http://www.mitsuba-
renderer.org. 6
[13] J. T. Kajiya. The rendering equation. In SIGGRAPH, 1986.
4
[14] H. P. A. Lensch, J. Kautz, M. Goesele, W. Heidrich, and H.-
P. Seidel. Image-Based Reconstruction of Spatially Varying
Materials. In S. J. Gortle and K. Myszkowski, editors, Euro-
graphics Workshop on Rendering, 2001. 1
[15] L. Lettry, K. Vanhoey, and L. V. Gool. Darn: A deep ad-
versial residual network for iintrinsic image decomposition:.
Arxiv. 2
[16] C. Liu, L. Sharan, R. Rosenholtz, and E. H. Adelson. Ex-
ploring features in a bayesian framework for material recog-
nition. In CVPR, 2010. 2
[17] S. Lombardi and K. Nishino. Reflectance and natural illumi-
nation from a single image. In ECCV, pages 582–595, 2012.
1, 2, 5, 6, 7
[18] S. Lombardi and K. Nishino. Radiometric scene decompo-
sition: Scene reflectance, illumination, and geometry from
rgb-d images. In 3DV, 2016. 1, 2, 3
[19] W. Matusik, H. Pfister, M. Brand, and L. McMillan. A data-
driven reflectance model. SIGGRAPH, 2003. 1, 5
[20] D. McAllister. A Generalized Surface Appearance Repre-
sentation for Computer Graphics. PhD thesis, University of
North Carolina at Chapel Hill, 2002. 1
[21] T. Narihira, M. Maire, and S. X. Yu. Direct iintrinsic: Learn-
ing albedo-shading decomposition by convolutional regres-
sion. In ICCV, 2015. 2
[22] R. Newcombe, A. Davison, S. Izadi, P. Kohli, O. Hilliges,
J. Shotton, D. Molyneaux, S. Hodges, D. Kim, and
A. Fitzgibbon. KinectFusion: Real-time dense surface map-
ping and tracking. In ISMAR, 2011. 1
[23] A. Ngan, F. Durand, and W. Matusik. Experimental analysis
of brdf models. In Proceedings of the Eurographics Sympo-
sium on Rendering, pages 117–226. Eurographics Associa-
tion, 2005. 2, 3, 4
[24] M. Nießner, M. Zollhofer, S. Izadi, and M. Stamminger.
Real-time 3D reconstruction at scale using voxel hashing.
ACM Transactions on Graphics (TOG), 32(6):169, 2013. 3
[25] K. Nishino and S. Lombardi. Directional statistics-based re-
flectance model for isotropic bidirectional reflectance distri-
bution functions. OSA Journal of Optical Society of America
A, 28(1):8–18, 2011. 8
[26] K. Nishino and S. Lombardi. Single image multimaterial
estimation. CVPR, pages 238–245, 2012. 1
[27] G. Oxholm and K. Nishino. Shape and reflectance estimation
in the wild. TPAMI, 38(2):376–389, 2016. 1, 5, 7
[28] F. Pellacini, J. A. Ferwerda, and D. P. Greenberg. Toward
a psychophysically-based light reflection model for image
synthesis. In Proceedings of the 27th Annual Conference
on Computer Graphics and Interactive Techniques, SIG-
GRAPH ’00, pages 55–64, 2000. 2, 3, 4
[29] K. Rematas, T. Ritschel, M. Fritz, E. Gavves, and T. Tuyte-
laars. Deep reflectance maps. In CVPR, 2016. 2
[30] F. Romeiro and T. Zickler. Blind reflectometry. In ECCV,
2010. 1, 2
[31] T. Wang, J. Zhu, E. Hiroaki, M. Chandraker, A. Efros, and
R. Ramamoorthi. A 4d light-field dataset and cnn architec-
tures for material recognition. In ECCV, 2016. 2
[32] G. J. Ward. Measuring and modeling anisotropic reflec-
tion. In Proceedings of the 19th Annual Conference on Com-
puter Graphics and Interactive Techniques, SIGGRAPH ’92,
pages 265–272, 1992. 2, 3
[33] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and
A. Davison. Elasticfusion: Dense slam without a pose graph.
In Proceedings of Robotics: Science and Systems, Rome,
Italy, July 2015. 1
[34] J. Wills, S. Agarwal, D. Kriegman, and S. Belongie. Toward
a perceptual space for gloss. ACM Trans. Graph., 2009. 2,
3, 8
[35] H. Wu, Z. Wang, and K. Zhou. Simultaneous localization
and appearance estimation with a consumer rgb-d camera.
IEEE Trans. Visualization and Computer Graphics, 2016. 1,
2, 3, 5, 7
[36] T. Zhou, P. Krahenbuhl, and A. A. Efros. Learning data-
driven reflectance priors for intrinsic image decomposition.
In ICCV, 2015. 2
28