
PetroSurf3D – A high-resolution 3D Dataset of Rock Art for Surface Segmentation

Georg Poier1, Markus Seidl2, Matthias Zeppelzauer2, Christian Reinbacher1, Martin Schaich3, Giovanna Bellando4, Alberto Marretta5, and Horst Bischof1

1Graz University of Technology
2St. Pölten University of Applied Sciences
3ArcTron 3D
4University of Cambridge
5Parco Archeologico Comunale di Seradina-Bedolina

Abstract

Ancient rock engravings (so-called petroglyphs) represent one of the earliest surviving artifacts describing the life of our ancestors. Recently, modern 3D scanning techniques have found their application in the domain of rock art documentation by providing high-resolution reconstructions of rock surfaces. Reconstruction results demonstrate the strengths of novel 3D techniques and have the potential to replace the traditional (manual) documentation techniques of archaeologists.

An important analysis task in rock art documentation is the segmentation of petroglyphs. To foster automation of this tedious step, we present a high-resolution 3D surface dataset of natural rock surfaces which exhibit different petroglyphs together with accurate expert ground-truth annotations. To our knowledge, this dataset is the first public 3D surface dataset which allows for surface segmentation at sub-millimeter scale. We conduct experiments with state-of-the-art methods to generate a baseline for the dataset and verify that the size and variability of the data is sufficient to successfully adopt even recent data-hungry Convolutional Neural Networks (CNNs). Furthermore, we experimentally demonstrate that the provided geometric information is key to successful automatic segmentation and strongly outperforms color-based segmentation. The introduced dataset represents a novel benchmark for 3D surface segmentation methods in general and is intended to foster comparability among different approaches in the future.

1 Introduction

Petroglyphs are a type of rock art that has been pecked, scratched, or carved into rocks with different types of tools in many places all over the world. They represent figures and symbols with different artistic expressions and are one of the rare sources of information for the investigation of the development of human life and culture. Considerable effort is devoted to analyzing these historic documents [1, 2, 3].


Figure 1: Traditional contact tracing. The rock surface is covered by transparent foil and petroglyphs are traced manually.

The primary tool for documentation and analysis of petroglyphs by archaeologists is contact tracing. In contact tracing, the rock surface is covered with transparent foil and the archaeologist annotates the individual peck-marks that make up the petroglyphs on the foil (see Fig. 1). However, considering the tremendous number of petroglyphs that have been discovered so far [1, 4], contact tracing becomes infeasible. Additionally, contact tracing captures only 2D information (the shape of the figures) – not their inherent 3D attributes. Hence, automated 3D scanning and analysis methods are required to perform analyses at such large scales.

A basic step to increase the level of automation for the reconstruction and documentation of petroglyphs is to replace the time-consuming, manual tracing procedure. This requires, firstly, the precise 3D scanning of the rock surface and, secondly, the segmentation of petroglyphs from the scanned surfaces. The segmentation problem can be stated as: separate pecked regions from the unworked rock surface. In general, this problem is an instance of 3D segmentation and is closely related to texture segmentation, with one main difference being that in the case of petroglyph segmentation 3D surface texture is analyzed instead of image texture. The case of petroglyph segmentation is particularly hard due to the large variety of different figures (shapes), different pecking styles (due to different tools and artistic styles), as well as different types of rock surfaces (materials) and different levels of erosion and plant cover.

A crucial requirement for the development of automatic surface segmentation algorithms is the availability of public datasets with precise manual annotations (ground truth). Such datasets are not only needed for the evaluation and objective comparison of different approaches but also for training machine learning methods, which are among the most successful methods today.

A large number of datasets have been published for 2D and 3D texture analysis. They were often created for tasks like material or texture classification and segmentation, or to analyze specific surface properties [5, 6, 7].


Usually, no geometric information is provided with these datasets, i.e., the datasets contain only images of the surfaces (potentially with different lighting directions). Petroglyphs, however, are three-dimensional objects by nature, and often even humans are only able to estimate the extent of these figures by investigating the third dimension, e.g., by illuminating them with oblique lighting in the dark or by touching the petroglyphs with their fingers. Similarly, automatic segmentation methods are expected to benefit strongly from full 3D geometric information compared to only 2D (RGB) information. Other datasets, employed for semantic segmentation, indeed provide 3D information [8, 9, 10, 11, 12]. However, these datasets, which are usually captured using off-the-shelf depth cameras, have primarily been developed for scene understanding and object recognition. Thus they most often show isolated objects as well as synthetic or real indoor scenes. They often cover volumes of several cubic meters and provide sampling rates in the range of centimeters. These datasets focus mostly on entire objects and not on the representation of different types of surfaces and thus address a completely different task. Furthermore, the employed capturing hardware is not suitable for the task of rock art segmentation, where the interesting discriminative features are in the range of millimeters. Recently, some work has been devoted to 3D analysis of rock art [2, 13, 14, 15, 16]. Most often the captured data in these works has not been published and, more importantly, these works do not aim at the task of surface segmentation and, hence, do not provide the necessary ground-truth labeling.

To the best of our knowledge, there is no publicly available 3D petroglyph dataset, which indeed hampers progress in the field. Hence, most existing approaches towards automatic segmentation of petroglyphs rely solely on 2D data [17, 18]. While these approaches can handle simple cases where the petroglyphs are clearly visible, in practice – due to often hundreds or thousands of years of weathering and erosion – the petroglyphs are virtually indistinguishable from the natural rock surface in 2D imagery.

To fill this gap, we created a fully labeled 3D dataset of rock surfaces with petroglyphs and make it publicly available1. In a large effort, we scanned petroglyphs on several different rocks at sub-millimeter accuracy. From the scans we additionally generated orthophotos and corresponding depth maps to enable the application of image-based approaches on the data. Note that since there are usually no self-occlusions in pecked rock surfaces, most of the 3D information is preserved in the depth maps. For all depth maps and orthophotos we provide pixel-wise ground-truth labels (overall about 232 million labeled pixels).

In sum, the contributions of this paper are as follows:

• A novel publicly available benchmark dataset for surface segmentation of high-resolution 3D surfaces.

• Precise expert annotations for the evaluation of surface segmentation algorithms.

• Baseline experiments with state-of-the-art approaches.

• A comprehensive evaluation that investigates the generalization ability of prominent segmentation methods as well as the additional benefit of 3D information for petroglyph segmentation compared to pure 2D texture processing.

1 http://lrs.icg.tugraz.at/research/petroglyphsegmentation/

The remainder of this paper is organized as follows. In Section 2 we present the dataset, describe the acquisition process and the available data, and provide basic statistics of the dataset. Section 3 specifies evaluation measures and protocols to enable reproducible experiments, describes the performed experiments with different state-of-the-art segmentation methods, and presents first baseline results for the dataset. We conclude the paper in Section 4.

2 Dataset

2.1 Dataset Acquisition

The surface data was acquired in summer 2013 at the UNESCO World Heritage site in Valcamonica, Italy, which provides one of the largest collections of rock art in the world2. The surfaces to be scanned were carefully selected by archaeologists with the intention to maximize diversity across different styles, shapes, scenes, and locations. Furthermore, regions which had never been scanned in 3D before were given preference. The data was scanned by experts using two different scanning techniques: (i) structured light scanning (SLS) with the Polymetric PTM1280 scanner in combination with the associated software QTSculptor and (ii) structure from motion (SfM). For SfM, photos were acquired with a high-quality Nikkor 60 mm macro lens mounted on a Nikon D800. For bundle adjustment the SfM engine of the software package Aspect3D3 was used, and SURE4 was used for the densification of the point clouds. The point clouds were denoised by removing outliers which stand out significantly from the surface [19] and smoothed by a moving least squares filter5. The resulting point clouds have a sampling distance of at least 0.1 mm and provide RGB color information for each 3D point. The point coordinates are in metric units relative to a base station. We provide the point clouds in XYZRGB format, an ASCII format where every line contains one point of the cloud: <X, Y, Z, R, G, B>. Additionally, the point clouds were meshed by Poisson triangulation. Meshes were textured and are provided in WRL format.
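As a simple illustration, the XYZRGB point clouds can be parsed with a few lines of NumPy. The following sketch assumes whitespace-separated values and uses a hypothetical file name; it is not part of the official dataset tools.

```python
import numpy as np

def load_xyzrgb(path):
    """Load an ASCII XYZRGB point cloud: one point per line, 'X Y Z R G B'."""
    data = np.loadtxt(path)                 # shape: (num_points, 6)
    xyz = data[:, :3]                       # metric coordinates relative to the base station
    rgb = data[:, 3:6].astype(np.uint8)     # per-point RGB color
    return xyz, rgb

# Example (hypothetical file name):
# xyz, rgb = load_xyzrgb("seradina_12c_rock1_area1.xyz")
```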

Depth images and orthophotos. For the derivation of orthophotos and depth maps we estimate a support plane for the input mesh by estimating a median plane from a subset of its points. Next, we estimate the location of each 3D point on the support plane by projecting the point along the normal direction of the plane. We map the signed distances between the 3D points and the plane to the respective projected location on the plane. The result is a 2D depth map of the 3D surface. Similarly, the orthophoto is generated by mapping the RGB colors to the support plane. For the rasterization of the projected images a resolution of 300 dpi (i.e., 0.08 mm pixel side length) has been chosen to avoid losing resolution compared to the point cloud. We used the meshes to generate the projections since, this way, a dense projection without holes is possible. The depth maps are stored as 32-bit TIFF files.

2 http://whc.unesco.org/en/list/94, last visited June 2016
3 http://aspect.arctron.de, last visited June 2016
4 http://www.ifp.uni-stuttgart.de/publications/software/sure/index.en.html, last visited June 2016
5 Both filters are implemented in the Point Cloud Library (PCL), http://pointclouds.org, last visited June 2016
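A minimal sketch of this projection is given below. It is an illustration under two assumptions that differ from the actual pipeline: it works directly on the point cloud rather than the mesh (so pixels without a projected point remain empty), and it approximates the median support plane by a least-squares plane.

```python
import numpy as np

def depth_map_from_points(xyz, pixel_size=8e-5):
    """Project points onto a fitted support plane and rasterize signed distances.
    pixel_size is in the same unit as the coordinates (8e-5 = 0.08 mm if the
    coordinates are in meters, an assumption of this sketch)."""
    centroid = xyz.mean(axis=0)
    # Plane normal = direction of least variance (smallest right singular vector).
    _, _, vt = np.linalg.svd(xyz - centroid, full_matrices=False)
    u, v, n = vt[0], vt[1], vt[2]            # in-plane axes and plane normal

    rel = xyz - centroid
    px, py = rel @ u, rel @ v                # 2D coordinates on the support plane
    depth = rel @ n                          # signed distance to the plane

    # Rasterize: one depth value per pixel (last point wins in this sketch;
    # pixels without a projected point stay NaN, i.e., holes remain).
    ix = ((px - px.min()) / pixel_size).astype(int)
    iy = ((py - py.min()) / pixel_size).astype(int)
    img = np.full((iy.max() + 1, ix.max() + 1), np.nan, dtype=np.float32)
    img[iy, ix] = depth
    return img
```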

For each surface, a pixel-accurate ground truth has been generated by archaeologists who labeled all pecked regions on the surface. Since the surfaces contain no self-occlusions, the annotators worked directly on the 2D orthophotos and depth maps. For this purpose, an image processing program with different brush tools was used to produce the ground-truth annotation. The annotators spent several hours on each surface, depending on the size and type (anthropomorph, inscription, symbol, etc.) of the figures. Anthropogenically altered, i.e., pecked, areas were annotated in white, whereas the natural rock surface remained black and regions outside the scan were colored red. The archaeologists reported that – besides the tedious procedure – they sometimes experienced difficulties in annotating pecked regions from the orthophotos due to their similar visual appearance to the natural rock surface.
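Given this color coding, an annotation image can be converted into a label mask as sketched below. This is our own illustration; the exact threshold values are assumptions, not part of the dataset specification.

```python
import numpy as np

def annotation_to_labels(anno_rgb):
    """Map an annotation image to labels: 1 = pecked (white), 0 = natural rock
    (black), -1 = outside the scan (red)."""
    r, g, b = anno_rgb[..., 0], anno_rgb[..., 1], anno_rgb[..., 2]
    labels = np.zeros(anno_rgb.shape[:2], dtype=np.int8)
    labels[(r > 200) & (g > 200) & (b > 200)] = 1      # white -> pecked
    labels[(r > 200) & (g < 64) & (b < 64)] = -1       # red -> outside the scan
    return labels
```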

2.2 Dataset Overview

The final dataset contains 26 high-resolution surface reconstructions of natural rock surfaces with a large number of petroglyphs. The petroglyphs have been captured at various locations at three different sites in the valley: "Foppe di Nadro", "Naquane", and "Seradina". We list the surface reconstructions contained in the dataset in Tab. 1. Each of the three sites is partitioned into different rocks, and larger rocks are further subdivided into multiple areas. Tab. 2 provides some basic measures for each reconstruction, such as the number of points, covered area, percentage of pecked surface area, etc. The point clouds of all surfaces together sum up to 115 million points overall. They cover in total an area of around 1.6 m2. After generation of a mesh and projection to orthophotos and depth images, this area corresponds to around 232 million pixels. Note that there are more pixels than 3D points due to the interpolation that takes place during projection of the mesh. The scans show scenes of isolated figures as well as scenes with several petroglyphs depicted in interaction. The pecked regions in all reconstructions are disconnected and on average consist of about 40 segments. The pecked regions make up around 19% of the entire scanned area. Thus, the class priors are strongly imbalanced.

Four example surfaces of the dataset are shown in Fig. 2. We depict the orthophoto, the corresponding depth map, and the ground truth labels. Note that the peckings are sometimes virtually unrecognizable from the orthophoto and can hardly be discovered without taking the ground truth labels into account. Further note the strong variation in depth ranges, which stems from the different shapes and curvatures of the rock surfaces.

3 Experiments

In this section we present experiments on our dataset that should serve as first baselines for surface segmentation. Note that we published some complementary results on the dataset previously [20]. In that work, we focused on interactive segmentation and different types of hand-crafted surface features.


Table 1: Summary of the reconstructed surfaces in the dataset. For each surface we provide an ID, a description, its location (in terms of site, rock, and area), and the reconstruction method. We further provide the assignment to evaluation folds for cross-validation in the last column.

ID | Description | Site | Rock | Area | Reconstr. Method | Eval. Fold-ID
1 | Rosa Camuna | Foppe di Nadro | 24 | 7 | SLS | 4
2 | Warrior | Foppe di Nadro | 24 | 7 | SLS | 2
3 | Deer | Foppe di Nadro | 24 | 9 | SLS | 2
4 | Stele (part 1) | Naquane | Ossimo 8 | 11 | SLS | 2
5 | Stele (part 2) | Naquane | Ossimo 8 | 11 | SLS | 1
6 | Stele (part 3) | Naquane | Ossimo 8 | 11 | SLS | 1
7 | Stele (part 4) | Naquane | Ossimo 8 | 11 | SLS | 1
8 | Standing Rider (part 1) | Naquane | 50 | 14 | SfM | 1
9 | Standing Rider (part 2) | Naquane | 50 | 14 | SfM | 1
10 | Sun-Shape Superimposition | Naquane | 50 | 15 | SfM | 1
11 | Warrior Scene (part 1-1) | Seradina | 12C | 1 | SLS | 4
12 | Warrior Scene (part 1-2) | Seradina | 12C | 1 | SLS | 3
13 | Warrior Scene (part 1-3) | Seradina | 12C | 1 | SLS | 3
14 | Warrior Scene (part 1-4) | Seradina | 12C | 1 | SLS | 4
15 | Warrior Scene (part 2-2) | Seradina | 12C | 1 | SLS | 3
16 | Warrior Scene (part 2-3) | Seradina | 12C | 1 | SLS | 3
17 | Warrior Scene (part 2-4) | Seradina | 12C | 1 | SLS | 3
18 | Warrior Scene (part 3-1) | Seradina | 12C | 1 | SLS | 3
19 | Warrior Scene (part 3-2) | Seradina | 12C | 1 | SLS | 3
20 | Warrior Scene (part 3-3) | Seradina | 12C | 1 | SLS | 1
21 | Hunter With Bow | Seradina | 12C | 4 | SLS | 4
22 | Hunter With Spear (part 1) | Seradina | 12C | 5 | SLS | 4
23 | Hunter With Spear (part 2) | Seradina | 12C | 5 | SLS | 2
24 | Hunting Scene (part 1) | Seradina | 12C | 10 | SLS | 2
25 | Hunting Scene (part 2) | Seradina | 12C | 10 | SLS | 2
26 | Hunting Scene (part 3) | Seradina | 12C | 10 | SLS | 2

In contrast, here we focus on fully automatic segmentation and learned features. Aside from providing an evaluation protocol and first baselines of state-of-the-art approaches, we investigate the following questions related to our dataset in detail: (i) What is the benefit of using full 3D information compared to pure texture information (RGB) for surface segmentation of petroglyphs? (ii) Can our learned models generalize from rock surfaces of one location to surfaces of another location (generalization ability)?

3.1 Evaluation Protocol

To enable reproducible and comparable experiments, we propose the following two evaluation protocols on the dataset. Firstly, to obtain results for the whole dataset, we perform a k-fold cross-validation, with the number of folds being k = 4. We randomly assigned the surface reconstructions to the folds and show the assignment in the last column of Tab. 1.

For the second protocol we separate the dataset into two sets according to the geographical locations the scans were acquired at. We employ one of the two sets as training set and the other one as test set, and vice versa. In this way, we can obtain insights into the generalization ability of a given approach across data from different sites.

The latter protocol is especially interesting since, on the one hand, the rock surface varies between sites and, on the other hand, the petroglyphs at different sites sometimes exhibit vastly different shapes and pecking styles, e.g., due to the different tools that were used for engraving.


Table 2: Overview of basic measures of the digitized surfaces: the covered area (in pixels at 300 dpi and in cm2), the number of 3D points in the point cloud, the percentage of pecked regions, the number of disconnected pecked regions, and the range of depth values.

ID | Covered Area in px | Covered Area in cm2 | Num. 3D Pts. | Percentage Pecked | Num. Seg. | Depth Range in mm
1 | 5 143 296 | 368.69 | 3 264 005 | 14.61 | 48 | 2.89
2 | 15 638 394 | 1 121.03 | 10 280 976 | 10.56 | 21 | 4.83
3 | 8 846 214 | 634.14 | 5 503 742 | 47.63 | 18 | 9.11
4 | 15 507 622 | 1 111.66 | 3 782 381 | 14.96 | 17 | 62.52
5 | 16 994 561 | 1 218.25 | 2 658 330 | 17.27 | 44 | 70.60
6 | 13 102 254 | 939.23 | 1 260 401 | 12.67 | 13 | 49.32
7 | 12 035 386 | 862.75 | 810 312 | 34.02 | 26 | 15.17
8 | 12 834 446 | 920.03 | 8 677 163 | 26.17 | 45 | 6.74
9 | 12 835 586 | 920.11 | 8 386 259 | 32.83 | 29 | 3.82
10 | 5 901 454 | 423.04 | 2 096 476 | 21.59 | 9 | 5.41
11 | 5 632 144 | 403.74 | 3 541 799 | 9.26 | 23 | 10.23
12 | 7 103 936 | 509.24 | 4 432 013 | 5.09 | 6 | 10.22
13 | 6 155 628 | 441.26 | 3 810 000 | 8.26 | 63 | 19.85
14 | 5 855 280 | 419.73 | 4 417 779 | 6.47 | 17 | 10.50
15 | 4 855 764 | 348.08 | 2 981 570 | 4.44 | 24 | 9.39
16 | 4 029 231 | 288.83 | 2 523 543 | 6.58 | 29 | 4.27
17 | 4 838 487 | 346.84 | 3 022 433 | 3.15 | 27 | 21.75
18 | 6 396 152 | 458.50 | 4 007 232 | 19.41 | 25 | 9.45
19 | 7 141 253 | 511.92 | 4 472 845 | 18.20 | 32 | 17.32
20 | 6 864 476 | 492.08 | 4 238 990 | 12.02 | 15 | 21.39
21 | 3 909 579 | 280.26 | 2 255 030 | 20.40 | 61 | 5.32
22 | 4 073 804 | 292.03 | 2 395 125 | 16.34 | 65 | 3.99
23 | 3 612 131 | 258.93 | 2 113 670 | 24.23 | 54 | 5.33
24 | 19 104 798 | 1 369.52 | 10 685 564 | 26.61 | 152 | 27.35
25 | 14 920 005 | 1 069.53 | 8 188 025 | 15.55 | 63 | 17.49
26 | 8 921 684 | 639.55 | 5 515 973 | 15.59 | 99 | 16.62
Overall | 232 253 565 | 16 648.97 | 115 321 636 | 18.68 | 1025 | 70.60

We separate the dataset into one set containing the scans from Seradina and the other one containing the scans from Foppe di Nadro and Naquane. Foppe di Nadro and Naquane were joined because these sites are situated next to each other and the corresponding petroglyphs show more similarities. Tab. 1 provides the association of surface reconstructions to locations and, thus, shows the proposed split. For evaluation we use one of the two sets as training set and the other one as test set, and vice versa. This results in the following three experiments:

• Training on data from Foppe di Nadro and Naquane; testing on Seradina.

• Training on data from Seradina; testing on Foppe di Nadro.

• Training on data from Seradina; testing on Naquane.

In this way, each surface reconstruction appears exactly once in the test set.
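For illustration, both protocols reduce to simple splits over the metadata in Tab. 1. The sketch below assumes a hypothetical list of per-scan records (only two of the 26 entries shown); it merely mirrors the protocol description above.

```python
# Hypothetical per-scan metadata mirroring Tab. 1 (only two of the 26 entries shown).
scans = [
    {"id": 1,  "site": "Foppe di Nadro", "fold": 4},
    {"id": 11, "site": "Seradina",       "fold": 4},
    # ...
]

def kfold_splits(scans, k=4):
    """Protocol 1: 4-fold cross-validation using the fold IDs from Tab. 1."""
    for fold in range(1, k + 1):
        test = [s["id"] for s in scans if s["fold"] == fold]
        train = [s["id"] for s in scans if s["fold"] != fold]
        yield train, test

def cross_site_splits(scans):
    """Protocol 2: the three cross-site experiments listed above."""
    seradina = [s["id"] for s in scans if s["site"] == "Seradina"]
    foppe = [s["id"] for s in scans if s["site"] == "Foppe di Nadro"]
    naquane = [s["id"] for s in scans if s["site"] == "Naquane"]
    yield foppe + naquane, seradina   # train: Foppe di Nadro + Naquane, test: Seradina
    yield seradina, foppe             # train: Seradina, test: Foppe di Nadro
    yield seradina, naquane           # train: Seradina, test: Naquane
```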

Metrics

For quantitative evaluations on our dataset we propose a number of metrics commonly used for semantic segmentation to enable reproducible experiments6.

6 We will provide the evaluation source code with the dataset.


Figure 2: Example orthophotos (left), corresponding depth maps (center), and ground truth labels (right). For visualization of the depth, we normalized and cropped the distance ranges per scan and show the resulting values in false color. Best viewed in color on screen with zoom.

The measures include the Jaccard index [21], also often termed region-based intersection over union (IU) [22, 23, 24], the pixel accuracy (PA), the Dice similarity coefficient (DSC), the hit rate (HR), and the false acceptance rate (FAR).

IU is defined as the intersection of the predicted segmentation mask P and the ground truth mask G over their union:

IU = \frac{|P \cap G|}{|P \cup G|} = \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}},   (1)

where the operator |S| denotes the number of pixels in a set S, n_{ij} is the number of pixels of class i predicted to belong to class j, and t_i is the total number of pixels of class i, i.e., t_i = \sum_j n_{ij}. Commonly, the mean intersection over union score (mIU) is considered. The mIU is simply the average IU score over all classes (i.e., foreground and background in our case):

mIU = \frac{1}{N} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}},   (2)

where N is the number of classes. The pixel accuracy, on the other hand, is the ratio between correctly classified pixels and the overall number of pixels:

PA = \frac{\sum_i n_{ii}}{\sum_i t_i}.   (3)

The DSC is similar to the IU and measures the mutual overlap between the predicted and the ground truth mask. However, we do not compute the mean over classes in this case, but instead use the standard definition and, thus, focus on the foreground class:

DSC = \frac{2\,|P \cap G|}{|P| + |G|}.   (4)

While the above metrics are indicators for the overall segmentation quality, the hit rate (HR) and the false acceptance rate (FAR) reveal some further details for the given binary classification task. The HR measures the fraction of correctly classified foreground pixels:

HR = \frac{|P \cap G|}{|P \cap G| + |G \setminus P|} = \frac{|P \cap G|}{|G|},   (5)

which is in principle similar to the pixel accuracy measure but considers only the foreground class. The FAR, on the other hand, measures the fraction of predicted foreground pixels that were incorrectly predicted to be foreground:

FAR = \frac{|P \setminus G|}{|P \cap G| + |P \setminus G|} = \frac{|P \setminus G|}{|P|}.   (6)

In general, higher values represent better segmentations for all of the above metrics, except for FAR, which should be minimized. The metrics are computed over the whole dataset and additionally for each individual scan.
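A straightforward NumPy implementation of these metrics for binary masks could look as follows. This is only an illustrative sketch following Eqs. (1)-(6), not the official evaluation code announced above.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Compute HR, FAR, DSC, mIU, and PA from binary masks (True = pecked foreground)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # |P intersect G|
    fp = np.logical_and(pred, ~gt).sum()      # |P \ G|
    fn = np.logical_and(~pred, gt).sum()      # |G \ P|
    tn = np.logical_and(~pred, ~gt).sum()

    iu_fg = tp / (tp + fp + fn)               # IU of the foreground class
    iu_bg = tn / (tn + fp + fn)               # IU of the background class
    return {
        "HR":  tp / (tp + fn),                # Eq. (5)
        "FAR": fp / (tp + fp),                # Eq. (6)
        "DSC": 2.0 * tp / (2.0 * tp + fp + fn),  # Eq. (4)
        "mIU": (iu_fg + iu_bg) / 2.0,         # Eq. (2) for two classes
        "PA":  (tp + tn) / (tp + tn + fp + fn),  # Eq. (3)
    }
```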

3.2 Methods

We evaluate the performance of prominent general approaches for semantic segmentation on our dataset. First, we perform experiments with a segmentation method based on Random Forests (RFs). Next, we apply Convolutional Neural Networks (CNNs) [25, 26], which currently show the best performance for semantic segmentation [22, 23, 24, 27, 28], and compare them with the RF-based approach. For our experiments we mainly employ the depth maps generated from the point clouds as input for segmentation, as well as the orthophotos.

For Random Forests (RFs) we employed a straightforward approach, which has also been used as a baseline in many other RF-based works on semantic segmentation [29, 30, 31]. That is, we trained a classification forest [32] to compute a pixel-wise labeling of the scans. The Random Forest is trained on patches representing the spatial neighborhood of the corresponding pixel. To this end, we downscaled the scans by a factor of 5 and extracted patches of size 17 × 17, corresponding to a side length of 6.8 mm. We randomly sampled 8000 patches – balanced over the classes – from each training image. The used features are specified with the respective experiments. For all experiments we trained 10 trees, for which we stopped training when a maximum depth of 18 was reached or less than a minimum number of 5 samples arrived in a node.
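A minimal sketch of this patch-based RF baseline, assuming scikit-learn's RandomForestClassifier as the forest implementation and raw depth-patch values as features (the mapping of the stopping criteria to scikit-learn parameters is approximate):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

PATCH = 17  # patch side length in pixels (about 6.8 mm after 5x downscaling)

def sample_patches(depth, labels, n=8000, rng=np.random.default_rng(0)):
    """Sample class-balanced patches centered on labeled pixels (sketch)."""
    half = PATCH // 2
    feats, targets = [], []
    for cls in (0, 1):                       # 0 = natural rock, 1 = pecked
        ys, xs = np.where(labels == cls)
        valid = (ys >= half) & (ys < depth.shape[0] - half) & \
                (xs >= half) & (xs < depth.shape[1] - half)
        idx = rng.choice(np.flatnonzero(valid), size=n // 2, replace=False)
        for i in idx:
            y, x = ys[i], xs[i]
            feats.append(depth[y - half:y + half + 1, x - half:x + half + 1].ravel())
            targets.append(cls)
    return np.asarray(feats), np.asarray(targets)

# Training with the criteria described above (10 trees, maximum depth 18,
# nodes with fewer than 5 samples are not split further):
# X, y = sample_patches(depth_map, label_map)
# rf = RandomForestClassifier(n_estimators=10, max_depth=18, min_samples_split=5)
# rf.fit(X, y)
```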


Table 3: Quantitative results for different setups, comparing the capabilities of color (2D) and depth (3D) information. 3D segmentation strongly outperforms color-based 2D segmentation.

Representation | HR | FAR | DSC | mIU | PA
Color | 0.493 | 0.675 | 0.392 | 0.465 | 0.715
Depth | 0.779 | 0.553 | 0.568 | 0.569 | 0.779
Depth – Cross-Sites | 0.777 | 0.574 | 0.550 | 0.551 | 0.763


In the CNN-based approach we employ fully convolutional neural networks as proposed in [22], since this work has been very influential for several subsequent CNN-based methods for semantic segmentation [27, 24, 28]. To perform petroglyph segmentation on our dataset we finetune a model, which was pre-trained for semantic segmentation on PASCAL-Context [33], for our task. To create training data for finetuning we again downscaled the depth maps by a factor of 5 and randomly sampled 224 × 224 pixel crops. To increase the variation in the training set, we augment it with randomly rotated versions of the depth maps (r ∈ {0, 45, 90, ..., 315} degrees) prior to sampling patches. Similarly, we flip the depth maps with a probability of 0.5. Note that rotating the images randomly is reasonable since we are unable to define a unique orientation for the petroglyphs. This comes from the fact that the petroglyphs have often been pecked with arbitrary orientation on the rock surfaces. In this way we sampled about 5000 crops, while ensuring that each crop contains pixel labels from both classes. We finetuned for a maximum of 30 epochs. For finetuning we employ Caffe [34] and set the learning rate to 5 × 10^-9. Due to GPU memory limitations (3 GB) we were only able to use a batch size of one (i.e., one depth map at a time). We thus follow [35] and use a high momentum of 0.98, which approximates a higher batch size and might also yield better accuracy due to the more frequent weight updates [35].
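The augmentation and crop sampling described above could be sketched as follows; this is our own illustration with SciPy, not the actual Caffe data preparation used for the experiments.

```python
import numpy as np
from scipy.ndimage import rotate

def augment_and_crop(depth, labels, crop=224, rng=np.random.default_rng(0)):
    """Produce one training crop: random rotation from {0, 45, ..., 315} degrees,
    flip with probability 0.5, then a random crop containing both classes (sketch)."""
    angle = rng.choice(np.arange(0, 360, 45))
    d = rotate(depth, angle, order=1, reshape=False, mode="nearest")
    l = rotate(labels.astype(float), angle, order=0, reshape=False, mode="nearest")
    if rng.random() < 0.5:                        # random flip
        d, l = d[:, ::-1], l[:, ::-1]
    for _ in range(100):                          # rejection-sample a valid crop
        y = rng.integers(0, d.shape[0] - crop + 1)
        x = rng.integers(0, d.shape[1] - crop + 1)
        lc = l[y:y + crop, x:x + crop]
        if (lc == 0).any() and (lc == 1).any():   # keep crops with both classes
            return d[y:y + crop, x:x + crop], lc
    return None                                   # no valid crop found
```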

3.3 2D vs. 3D Information

In a first experiment we investigate the importance of the 3D information provided by our dataset compared to pure color-based surface segmentation. To this end, we train a Random Forest (RF) only with color information from the orthophotos and compare the results to an RF trained only on depth information. For this experiment we follow the 4-fold cross-validation protocol specified in Sec. 3.1 and Tab. 1. The results in Tab. 3 clearly show the necessity of 3D information to obtain good results. This is further underlined in Fig. 3, where the results are compared for each individual scan. We observe that depth information improves the results for nearly every scan by a large margin. This can be explained by the fact that pecked surface regions often resemble the visual appearance of the neighboring unpecked rock surface due to the influence of weathering.


Figure 3: Dice Similarity Coefficient (DSC) per scan (scans 1–26) for color-based and depth-based segmentation.

Table 4: Results for cross-validation over different sites. Quantitative results obtained for scans from Seradina when an RF classifier is trained on scans of only Foppe di Nadro and Naquane, as well as results for scans from Foppe di Nadro and Naquane when the classifier is trained only on scans of Seradina.

Training Set: | Foppe di Nadro + Naquane | Seradina | Seradina
Test Set: | Seradina | Foppe di Nadro | Naquane
HR | 0.843 | 0.706 | 0.744
FAR | 0.544 | 0.274 | 0.644
DSC | 0.592 | 0.716 | 0.482
mIU | 0.612 | 0.704 | 0.446
PA | 0.827 | 0.875 | 0.645

Note that we also experimented with combining color and depth information, as well as with computing different features like image gradients and texture features such as LBP [36] and Haralick features [37] on top of the color as well as the depth information. However, these had no or only insignificant impact on the final segmentation performance and, hence, the results are omitted for brevity.

3.4 Baseline Results

In this section we present the results of the baseline methods for the two proposed evaluation protocols.

3.4.1 Cross-Site Generalization

The results for Random Forests for the proposed cross-site evaluation protocol (see Sec. 3.1) are listed in Tab. 4. Here, we give the detailed results for each of the three splits. Overall results averaged over all three experiments are shown in Tab. 3 for comparison with the experiments in Sec. 3.3. Interestingly, the overall results are in the same range as the results of the 4-fold cross-validation with randomly selected folds. This suggests that – using 3D information – methods are able to generalize from one site of the valley to another.


Table 5: 4-fold cross-validation results for Random Forests (RFs) and Convolutional Neural Networks (CNNs).

Method | HR | FAR | DSC | mIU | PA
RF | 0.779 | 0.553 | 0.568 | 0.569 | 0.779
CNN | 0.693 | 0.357 | 0.667 | 0.676 | 0.871

3.4.2 4-fold Cross-Validation

To provide a more comprehensive baseline for the performance of state-of-the-art methods, we compare the results obtained with Random Forests (RFs) and Convolutional Neural Networks (CNNs), both evaluated on depth information. For the CNN, which was pre-trained on color images (see Section 3.2), we simply fill all three input channels with the same depth channel to obtain a compatible input format. Additionally, we subtract the local average depth value from each pixel in the depth map to normalize the depth data, which was necessary to exploit the CNN pre-trained on RGB data. This normalization can be performed efficiently in a pre-processing step by subtracting a smoothed version of the depth map (Gaussian filter with σ = 12.5 mm) from the depth map. This operation results in a local contrast equalization across the depth map [38] that better enhances the fine geometric details of the surface texture.
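A sketch of this normalization step is given below; the conversion of σ from millimeters to pixels assumes the 5× downscaled maps (0.08 mm × 5 = 0.4 mm per pixel) and is our own assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_depth(depth, sigma_mm=12.5, pixel_size_mm=0.4):
    """Subtract the local average depth (Gaussian-smoothed map) from each pixel."""
    sigma_px = sigma_mm / pixel_size_mm
    low_freq = gaussian_filter(depth.astype(np.float32), sigma=sigma_px)
    normalized = depth - low_freq
    # Replicate the single normalized depth channel to the three CNN input channels.
    return np.stack([normalized] * 3, axis=-1)
```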

Quantitative results for the whole dataset are shown in Tab. 5. The quantitative results in terms of mIU for each surface are visualized in Fig. 4. In Fig. 5 we show some example results for each method. From the results we observe that the Random Forest (RF) yields more cluttered results, whereas the CNN yields more consistent but coarser segmentations. The RF correctly detects small and thin pecked regions, which the CNN misses, whereas the CNN usually captures the overall shape of the petroglyphs more accurately but misses details. Note that for none of the results did we apply an MRF, CRF, or similar model, since we want to focus the baseline comparison on the differences between the base methods. We assume that the reasons for the differences between RF and CNN are (i) that the RF makes independent pixel-wise decisions whereas the CNN implicitly considers the spatial context through its learned feature hierarchy, and (ii) that the receptive field of the RF is smaller than the receptive field of the CNN. This is because the CNN is able to exploit the additional spatial information while the RF was unable to effectively exploit larger receptive fields in our experiments.

The different abilities of RF and CNN are further reflected in the quantitative results in Tab. 5. The more consistent and coarser segmentation of the CNN yields a better overall segmentation result, which is reflected by the higher DSC, mIU, and PA values. For the foreground class in particular, the HR of the RF outperforms that of the CNN, which means that a higher percentage of foreground pixels is labeled correctly. The reason for this is that the CNN often misses larger portions of the pecked regions.


Figure 4: Mean intersection over union (mIU) per scan (scans 1–26) for RF and CNN.

4 Conclusions

In this paper, we describe a 3D dataset with high-resolution scans of rock surfaces containing petroglyphs. The main motivation for contributing the dataset to the community is to foster, in general, research on the automated semantic segmentation of 3D surfaces and, in particular, the segmentation of petroglyphs. We complement the dataset with accurate expert-annotated ground truth and the results of prominent segmentation methods that serve as baselines for comparison with future approaches. The central lessons learned from our experiments are: (i) Depth information – as provided by our dataset – is imperative for the generalization ability of segmentation methods. (ii) In most cases, CNN classification outperforms RFs in terms of quantitative measures, and, qualitatively, the CNN yields coarser but more consistent segmentations than the RFs.

Our experiments show that 3D information is essential for the segmentation of different surface textures. By contributing the dataset to the public, we hope to stimulate research on 3D segmentation methods in general and thereby also on the segmentation of petroglyphs, as a contribution to the conservation of our cultural heritage.


Figure 5: Input images (orthophotos and depth maps), ground truth labelings, and results for CNNs and RFs. Panels: (a, f) RGB, (b, g) depth map, (c, h) ground truth, (d, i) RF result, (e, j) CNN result. The CNN segmentation is smoother but fails to capture fine details compared to the RF.


References

[1] Anati, E.: Evoluzione e stile nell'arte rupestre Camuna. Archivi 6 (1975)

[2] Williams, K., Twohig, E.S.: From sketchbook to structure from motion: Recording prehistoric carvings in Ireland. Digital Applications in Archaeology and Cultural Heritage 2(2–3) (2015) 120–131

[3] Noya, N.C., García, A.L., Ramírez, F.C.: Combining photogrammetry and photographic enhancement techniques for the recording of megalithic art in north-west Iberia. Digital Applications in Archaeology and Cultural Heritage 2(2–3) (2015) 89–101

[4] Arca, A.: Mount Bego and Valcamonica, most ancient engraving phases comparison. From Neolithic to Early Bronze Age, parallels and differences between marvegie (marvels) and pitoti (puppets) of the two main Alpine rock art poles. Rivista di Scienze Preistoriche 59 (2009) 265–306

[5] Dana, K.J., van Ginneken, B., Nayar, S.K., Koenderink, J.J.: Reflectance and texture of real-world surfaces. ACM Trans. on Graphics 18(1) (1999) 1–34

[6] Ojala, T., Mäenpää, T., Pietikäinen, M., Viertola, J., Kyllönen, J., Huovinen, S.: Outex – new framework for empirical evaluation of texture analysis algorithms. In: Proc. Int'l Conf. on Pattern Recognition. (2002)

[7] Haindl, M., Mikes, S.: Texture segmentation benchmark. In: Proc. Int'l Conf. on Pattern Recognition. (2008)

[8] Firman, M.: RGBD datasets: Past, present and future. In: CVPR Workshop on Large Scale 3D Data: Acquisition, Modelling and Analysis. (2016)

[9] Song, S., Lichtenberg, S.P., Xiao, J.: SUN RGB-D: A RGB-D scene understanding benchmark suite. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. (2015)

[10] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Proc. European Conf. on Computer Vision. (2012)

[11] Janoch, A., Karayev, S., Jia, Y., Barron, J.T., Fritz, M., Saenko, K., Darrell, T.: A category-level 3-D object dataset: Putting the Kinect to work. In: ICCV Workshop on Consumer Depth Cameras for Computer Vision. (2011)

[12] Silberman, N., Fergus, R.: Indoor scene segmentation using a structured light sensor. In: ICCV Workshop on 3D Representation and Recognition. (2011)

[13] Alexander, C., Pinz, A., Reinbacher, C.: Multi-scale 3D rock-art recording. Digital Applications in Archaeology and Cultural Heritage 2(2–3) (2015) 181–195


[14] Plisson, H., Zotkina, L.V.: From 2D to 3D at macro- and microscopic scale in rock art studies. Digital Applications in Archaeology and Cultural Heritage 2(2–3) (2015) 102–119

[15] Lymer, K.: Image processing and visualisation of rock art laser scans from Loups's Hill, County Durham. Digital Applications in Archaeology and Cultural Heritage 2(2–3) (2015) 155–165

[16] Robinson, D., Baker, M.J., Bedford, C., Perry, J., Wienhold, M., Bernard, J., Reeves, D., Kotoula, E., Gandy, D., Miles, J.: Methodological considerations of integrating portable digital technologies in the analysis and management of complex superimposed Californian pictographs: From spectroscopy and spectral imaging to 3-D scanning. Digital Applications in Archaeology and Cultural Heritage 2(2–3) (2015) 166–180

[17] Seidl, M., Breiteneder, C.: Automated petroglyph image segmentation with interactive classifier fusion. In: Proc. of the Indian Conference on Computer Vision, Graphics and Image Processing. (2012)

[18] Deufemia, V., Paolino, L.: Segmentation and recognition of petroglyphs using generic Fourier descriptors. In: Image and Signal Processing. (2014) 487–494

[19] Rusu, R.B., Marton, Z.C., Blodow, N., Dolha, M., Beetz, M.: Towards 3D point cloud based object maps for household environments. Robotics and Autonomous Systems 56(11) (2008) 927–941

[20] Zeppelzauer, M., Poier, G., Seidl, M., Reinbacher, C., Breiteneder, C., Bischof, H., Schulter, S.: Interactive segmentation of rock-art in high-resolution 3D reconstructions. In: Proc. Int'l Conf. on Digital Heritage. (2015)

[21] Jaccard, P.: The distribution of the flora in the alpine zone. New Phytologist 11(2) (1912) 37–50

[22] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. (2015)

[23] Hariharan, B., Arbeláez, P.A., Girshick, R.B., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. (2015)

[24] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: Proc. IEEE Int'l Conf. on Computer Vision. (2015)

[25] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (1998) 2278–2324

[26] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012)


[27] Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: Proc. Int'l Conf. on Learning Representations. (2015)

[28] Lin, G., Shen, C., van den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. (2016)

[29] Laptev, D., Buhmann, J.M.: Convolutional decision trees for feature learning and segmentation. In: Proc. German Conf. on Pattern Recognition. (2014)

[30] Kontschieder, P., Kohli, P., Shotton, J., Criminisi, A.: GeoF: Geodesic forests for learning coupled predictors. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. (2013)

[31] Bulò, S.R., Kontschieder, P.: Neural decision forests for semantic image labelling. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. (2014)

[32] Breiman, L.: Random forests. Machine Learning 45(1) (2001) 5–32

[33] Mottaghi, R., Chen, X., Liu, X., Cho, N., Lee, S., Fidler, S., Urtasun, R., Yuille, A.L.: The role of context for object detection and semantic segmentation in the wild. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. (2014)

[34] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)

[35] Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence (2016) 1–12 (to be published)

[36] Ojala, T., Pietikäinen, M.: Unsupervised texture segmentation using feature distributions. Pattern Recognition 32(3) (1999) 477–486

[37] Haralick, R.M., Shanmugam, K.S., Dinstein, I.: Textural features for image classification. IEEE Trans. Systems, Man, and Cybernetics 3(6) (1973) 610–621

[38] Zeppelzauer, M., Seidl, M.: Efficient image-space extraction and representation of 3D surface topography. In: Proc. Int'l Conf. on Image Processing. (2015)
