

Spatially-Varying Outdoor Lighting Estimation from Intrinsics: Supplementary Material

Yongjie Zhu1 † Yinda Zhang2 Si Li1 ∗ Boxin Shi3, 4 ∗

1 School of Artificial Intelligence, Beijing University of Posts and Telecommunications
2 Google
3 NELVT, Department of Computer Science and Technology, Peking University
4 Institute for Artificial Intelligence, Peking University

In this supplementary material, we provide more details about our implementation, synthetic data generation, and network architectures, as well as additional results on synthetic data and real images in the wild.

A. Implementation Details

To train SOLID-Net, we use SOLID-Img augmented with random flip and crop. Our framework is implemented in PyTorch [3], and the Adam [2] optimizer is used with default parameters. We first train I-Net with a batch size of 8 for 20 epochs until convergence, and then train P-Net with a batch size of 4 for 60 epochs on an RTX 2080 GPU. We find that end-to-end fine-tuning does not improve the performance. The learning rate is initially set to 5 × 10−4 and halved every 5 epochs for both networks. Training convergence takes roughly 24 hours.
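For concreteness, the optimization schedule above can be written as the following PyTorch sketch. The two-stage driver, the model objects, and the compute_loss call are illustrative placeholders; only the optimizer choice, initial learning rate, and halving schedule come from the text above.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train_stage(model, loader, num_epochs, device="cuda"):
    # Adam with default parameters except the initial learning rate of 5e-4.
    optimizer = Adam(model.parameters(), lr=5e-4)
    # Halve the learning rate every 5 epochs, as described above.
    scheduler = StepLR(optimizer, step_size=5, gamma=0.5)

    model.to(device).train()
    for _ in range(num_epochs):
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = model.compute_loss(inputs, targets)  # placeholder for the stage-specific losses
            loss.backward()
            optimizer.step()
        scheduler.step()

# Hypothetical two-stage schedule matching the text:
# Stage 1: train_stage(i_net, loader_bs8, num_epochs=20)   # I-Net, batch size 8
# Stage 2: train_stage(p_net, loader_bs4, num_epochs=60)   # P-Net, batch size 4
```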

B. Additional Details in Data Generation

The data generation pipeline for the SOLID-Img dataset has been introduced in Section 3.1 of the main paper. Here, we introduce additional details about the camera setting step.

For each road block, we select a set of cameras with diverse views seeing most objects in the context to provide comprehensive information for lighting estimation, as shown in Figure 2(a) of the main paper. Our process starts by selecting the "best" camera [6] for each of the six horizontal view direction sectors in every road block. For each horizontal view direction, we sample a dense set of cameras on a 2D grid with 1 m resolution, choosing a random camera viewpoint within each grid cell, a random horizontal view direction within the 60° sector, a random height of 1.55 ± 0.05 m above the floor (sampled uniformly), and a pitch angle within 10° of the horizontal direction. Then, for each camera, we render an item buffer and count the number of visible pixels according to the Z-buffer and the number of objects. For each horizontal view direction in each road block, we select the view with the highest percentage of pixel coverage, as long as it sees more than three object categories.
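The camera-selection step can be summarized by the sketch below. The block bounds, the item-buffer renderer, and its accessors are hypothetical placeholders; only the grid resolution, sector size, height range, pitch range, coverage criterion, and the more-than-three-categories rule follow the description above.

```python
import random

def sample_candidate_cameras(block_bounds, sector_index, grid_res=1.0):
    """Sample one candidate camera per 1 m x 1 m grid cell for one 60-degree sector."""
    (x_min, y_min), (x_max, y_max) = block_bounds
    sector_start = sector_index * 60.0  # six horizontal sectors of 60 degrees each
    cameras = []
    x = x_min
    while x < x_max:
        y = y_min
        while y < y_max:
            cameras.append({
                "position": (x + random.uniform(0.0, grid_res),       # random point in the cell
                             y + random.uniform(0.0, grid_res),
                             random.uniform(1.50, 1.60)),             # height 1.55 +/- 0.05 m
                "yaw": sector_start + random.uniform(0.0, 60.0),      # heading inside the sector
                "pitch": random.uniform(-10.0, 10.0),                 # within 10 deg of horizontal
            })
            y += grid_res
        x += grid_res
    return cameras

def select_best_camera(cameras, render_item_buffer, min_categories=3):
    """Pick the candidate with the highest visible-pixel coverage among those seeing
    more than min_categories object categories."""
    best, best_coverage = None, -1.0
    for cam in cameras:
        item_buffer = render_item_buffer(cam)             # assumed: per-pixel object IDs via Z-buffer
        coverage = item_buffer.visible_pixel_ratio()      # assumed accessor
        categories = item_buffer.num_object_categories()  # assumed accessor
        if categories > min_categories and coverage > best_coverage:
            best, best_coverage = cam, coverage
    return best
```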

Table A: Numerical results and MAE errors on the estimated sky environment map.

Methods            ξazimuth   ξelevation   ξHDR
LENetreg           37.0°      16.2°        0.508
LENetae [5]        34.0°      16.0°        0.542
LENetsky [1]       22.3°      11.0°        0.609
Ours (w/o Ldif)    19.1°      11.4°        0.491
Ours               12.6°       8.5°        0.478
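For reference, the three metrics reported in Table A (sun azimuth and elevation angular errors in degrees, and the per-pixel MAE of the normalized sky environment map) can be computed as in the sketch below. The exact normalization of the sky maps follows the main paper, so this is only an illustrative implementation, not the evaluation code.

```python
import numpy as np

def azimuth_error_deg(pred_az, gt_az):
    """Sun azimuth angular error in degrees, wrapped to [0, 180]."""
    diff = abs(pred_az - gt_az) % 360.0
    return min(diff, 360.0 - diff)

def elevation_error_deg(pred_el, gt_el):
    """Sun elevation angular error in degrees."""
    return abs(pred_el - gt_el)

def hdr_mae(pred_env, gt_env):
    """Mean absolute error between two normalized HDR sky maps (H x W x 3 arrays)."""
    return float(np.mean(np.abs(pred_env - gt_env)))
```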

C. Details about Network Architectures

In this section, we introduce the detailed network architectures of the baseline models, as shown in Figure A.

The first baseline model, denoted as LENetreg, is a regression model that directly regresses the global sky environment map from the input limited-FOV image. The second baseline model, denoted as LENetae, is a two-stream convolutional network originally used to estimate the sun position and a normalized HDR panorama from an LDR panorama [5]; we modify the input to a single limited-FOV LDR image to adapt it to our task. The last baseline model, denoted as LENetsky, learns to estimate both the sun azimuth and the sky parameters with two image encoders, and uses an autoencoder to learn the space of outdoor lighting by compressing an HDR sky image to a 64-dimensional latent vector and reconstructing it into an HDR sky environment map [1]. In particular, LENetae learns azimuth estimation as a regression task, while LENetsky treats it as a classification task.
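To illustrate the lighting-code idea behind LENetsky, the sketch below compresses an HDR sky map to a 64-dimensional latent vector and decodes it back. The layer sizes and the sky resolution are assumptions made for illustration only; the actual architecture follows [1] and Figure A.

```python
import torch.nn as nn

class SkyAutoencoder(nn.Module):
    """Illustrative sky autoencoder: encode an HDR sky panorama into a 64-D latent
    lighting code and decode it back. Layer sizes are placeholders, not those of [1]."""

    def __init__(self, latent_dim=64, sky_hw=(32, 128)):
        super().__init__()
        h, w = sky_hw
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(64 * (h // 4) * (w // 4), latent_dim),
        )
        self.decoder_fc = nn.Linear(latent_dim, 64 * (h // 4) * (w // 4))
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (64, h // 4, w // 4)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),  # HDR radiance is non-negative
        )

    def forward(self, sky):
        z = self.encoder(sky)  # 64-D latent lighting code
        return self.decoder(self.decoder_fc(z)), z

# A limited-FOV image encoder can then regress this 64-D code (plus the sun azimuth,
# treated as a classification over angular bins) instead of the full environment map.
```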

Numerical results for these baseline models compared with ours are shown in Table A (ξazimuth is the sun azimuth angular error, ξelevation is the sun elevation angular error, and ξHDR is the MAE of the normalized sky environment map). Training a single model to estimate a sky environment map with azimuth rotation proves to be difficult (see Row 1 in Table A). By separating this task into the estimation of a normalized sky environment map (sun in the middle position along the horizontal direction) and the sun azimuth angle, the results of single-model lighting estimation improve (see Row 2 in Table A). Further using three sub-models to solve the whole problem results in more accurate estimation than using a single model (see Row 3 in Table A). Moreover, since the sun position and the sky environment map are closely related, we show that the estimation accuracy of both can be improved by our joint training with intrinsic constraints integrated into the deep models (see Rows 4-5 in Table A).
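As a concrete note on this decomposition, once the normalized sky environment map (sun at the horizontal center) and the sun azimuth are estimated, the final map can be recovered by rotating the panorama about the vertical axis. Assuming an equirectangular sky map, this amounts to a circular shift along the width axis, as in the sketch below (not the paper's exact post-processing).

```python
import numpy as np

def rotate_sky_to_azimuth(normalized_sky, sun_azimuth_deg):
    """Rotate a normalized equirectangular sky map (H x W x 3, sun at the horizontal
    center) about the vertical axis so that the sun sits at the estimated azimuth."""
    height, width, _ = normalized_sky.shape
    shift = int(round(sun_azimuth_deg / 360.0 * width))  # degrees -> pixel columns
    return np.roll(normalized_sky, shift, axis=1)
```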


Figure A: Structures of baseline models.

D. Qualitative Results on Synthetic Data

More results of image decomposition and lighting estimation on the SOLID-Img test dataset are shown in Figure B, Figure C, and Figure D. Given an input image, our estimated albedo, normal, plane distance, shadow, and shading are close in appearance to the ground truth (shown as insets), as shown in Figure B. In Figure C, we can see that the global lighting estimation results of SOLID-Net are closer to the ground truth than those of the three baseline models (we rotate the lighting of LENetae and LENetsky according to the sun azimuth for better comparison). In addition, the relighted bunnies using our estimated lighting display accurate cast shadows, while the other models fail to render them.

As shown in Figure D, our method can recover more accurate local lighting than NeurIllum [4], even for the reflection of the ground and some unseen parts (typical examples can be found in local lighting 1 in row 1, local lighting 3 in row 4, and local lighting 4 in row 5). We conjecture this is because our multi-input module (global lighting for lighting information, shadow for occlusion information) provides more useful information.

E. Qualitative Results on Real Data

More results of local lighting estimation on the real dataset are shown in Figure E. Compared with NeurIllum [4], the lighting estimation results of our method are more similar to the ground truth in terms of overall structure, and our relighted bunnies show more realistic rendering appearances.

References

[1] Yannick Hold-Geoffroy, Akshaya Athawale, and Jean-Francois Lalonde. Deep Sky Modeling for Single Image Outdoor Lighting Estimation. In Proc. of Computer Vision and Pattern Recognition, 2019.

[2] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proc. of International Conference on Learning Representations, 2015.

[3] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proc. of Neural Information Processing Systems, 2019.

[4] Shuran Song and Thomas Funkhouser. Neural Illumination: Lighting Prediction for Indoor Environments. In Proc. of Computer Vision and Pattern Recognition, 2019.

[5] Jinsong Zhang and Jean-Francois Lalonde. Learning High Dynamic Range from Outdoor Panoramas. In Proc. of International Conference on Computer Vision, 2017.

[6] Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks. In Proc. of Computer Vision and Pattern Recognition, 2017.


Figure B: Intrinsic decomposition results (input image, albedo, normal, plane distance, shadow, shading, and SH lighting) and ground truth (shown as insets) on the SOLID-Img test dataset.


Figure C: Global lighting estimation and relighting results on the SOLID-Img test dataset.


Figure D: Local lighting estimation and relighting results (ours, NeurIllum [4], and ground truth) at marked locations on the SOLID-Img test dataset.


Figure E: Spatially-varying lighting estimation and relighting results (ours, NeurIllum [4], and ground truth) at marked locations on the real dataset.
