8
Feature Regression for Multimodal Image Analysis Michael Ying Yang, Xuanzi Yong, Bodo Rosenhahn Institute for Information Processing (TNT), Leibniz University Hannover Appelstr. 9A, 30167 Hannover, Germany {yang,yong,rosenhahn}@tnt.uni-hannover.de Abstract In this paper, we analyze the relationship between the corresponding descriptors computed from multimodal im- ages with focus on visual and infrared images. First the de- scriptors are regressed by means of linear regression as well as Gaussian process. We apply different covariance func- tions and inference methods for Gaussian process. Then the descriptors detected from visual images are mapped to infrared images through the regression results. Predictions are assessed in two ways: the statistics of absolute error between true values and actual values, and the precision score of matching the predicted descriptors to the original infrared descriptors. Experimental results show that regres- sion methods achieve a well-assessed relationship between corresponding descriptors from multiple modalities. 1. Introduction In recent years multisensor data fusion receives more and more attention [10]. The integration of images from multi- ple modalities can provide complementary information and therefore the accuracy increases with an observed and char- acterized quantity. In the domain of computer vision, the detected points can be represented by some descriptors. In this way a point with its surroundings is described by a vec- tor. Therefore, we represent an image with a set of vectors, which get rid of the noises and some unnecessary informa- tion. Moreover the computational costs are also reduced as well as the memory costs, which shows to be more efficient. Given multi-modal images, a series of applications are provided such as matching objects and scene registration. It is easy to obtain the interest points from the given images and reform them by some feature descriptors. However, it comes to a question if there is some relationship between the two corresponding feature vectors. Or can we get the feature descriptor of a point in infrared image by the corre- sponding point in visual image. Moreover, with a mapping function, a descriptor is mapped to an infrared image as a new vector. So how would this new vector looks like, where might this new vector locate in the infrared images. In this paper, we address the aforementioned problems using re- gression methods. To the best of our knowledge, there is no existed work on analyzing the relationship among descrip- tors in multimodal images. In this work, we analyze the behavior of features in mul- timodal images with focus on visual and infrared images. In order to get a reliable result, we construct the datasets with different types of images from different categories. We present regression results of feature descriptors from visual images to infrared modality, which indicates the existence of the relations between the descriptors. The regression is worked in two ways: linear regression and Gaussian pro- cess for regression (GPR). The former has a computational advantage that it runs faster and costs lower than other com- mon regression methods. And the latter can obliquely rep- resent the underlying regression function without claiming, but rigorously. As a result, the descriptors of points de- tected in visual images are mapped as the descriptors from infrared images. We evaluate the performances of linear regression mainly by the value of coefficient of determina- tion and the results are evaluated by the mean and variance of error between the descriptor vectors. In order to assess the performance of Gaussian process regression, we apply the regression result to the application of matching. The re- sults are evaluated by the precision of matching. Moreover the regression error is considered as the criterion as well. From the results, we can find that, based on specific covari- ance functions and inference methods, the regression pro- cess performs well that the predicted descriptors are similar to the actual descriptor vectors. Example results of GPR are shown in Fig. 1. The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 presents our two re- gression methods for multimodal image analysis: linear re- gression and Gaussian process for regression. In Section 4, we provide a short review of related descriptors. Section 5 shows the results of two regression methods for the feature descriptors in multimodal images. Finally, this work is con- cluded in Section 6. 756

Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

Feature Regression for Multimodal Image Analysis

Michael Ying Yang, Xuanzi Yong, Bodo RosenhahnInstitute for Information Processing (TNT), Leibniz University Hannover

Appelstr. 9A, 30167 Hannover, Germany{yang,yong,rosenhahn}@tnt.uni-hannover.de

Abstract

In this paper, we analyze the relationship between thecorresponding descriptors computed from multimodal im-ages with focus on visual and infrared images. First the de-scriptors are regressed by means of linear regression as wellas Gaussian process. We apply different covariance func-tions and inference methods for Gaussian process. Thenthe descriptors detected from visual images are mapped toinfrared images through the regression results. Predictionsare assessed in two ways: the statistics of absolute errorbetween true values and actual values, and the precisionscore of matching the predicted descriptors to the originalinfrared descriptors. Experimental results show that regres-sion methods achieve a well-assessed relationship betweencorresponding descriptors from multiple modalities.

1. IntroductionIn recent years multisensor data fusion receives more and

more attention [10]. The integration of images from multi-ple modalities can provide complementary information andtherefore the accuracy increases with an observed and char-acterized quantity. In the domain of computer vision, thedetected points can be represented by some descriptors. Inthis way a point with its surroundings is described by a vec-tor. Therefore, we represent an image with a set of vectors,which get rid of the noises and some unnecessary informa-tion. Moreover the computational costs are also reduced aswell as the memory costs, which shows to be more efficient.

Given multi-modal images, a series of applications areprovided such as matching objects and scene registration. Itis easy to obtain the interest points from the given imagesand reform them by some feature descriptors. However, itcomes to a question if there is some relationship betweenthe two corresponding feature vectors. Or can we get thefeature descriptor of a point in infrared image by the corre-sponding point in visual image. Moreover, with a mappingfunction, a descriptor is mapped to an infrared image as anew vector. So how would this new vector looks like, where

might this new vector locate in the infrared images. In thispaper, we address the aforementioned problems using re-gression methods. To the best of our knowledge, there is noexisted work on analyzing the relationship among descrip-tors in multimodal images.

In this work, we analyze the behavior of features in mul-timodal images with focus on visual and infrared images.In order to get a reliable result, we construct the datasetswith different types of images from different categories. Wepresent regression results of feature descriptors from visualimages to infrared modality, which indicates the existenceof the relations between the descriptors. The regression isworked in two ways: linear regression and Gaussian pro-cess for regression (GPR). The former has a computationaladvantage that it runs faster and costs lower than other com-mon regression methods. And the latter can obliquely rep-resent the underlying regression function without claiming,but rigorously. As a result, the descriptors of points de-tected in visual images are mapped as the descriptors frominfrared images. We evaluate the performances of linearregression mainly by the value of coefficient of determina-tion and the results are evaluated by the mean and varianceof error between the descriptor vectors. In order to assessthe performance of Gaussian process regression, we applythe regression result to the application of matching. The re-sults are evaluated by the precision of matching. Moreoverthe regression error is considered as the criterion as well.From the results, we can find that, based on specific covari-ance functions and inference methods, the regression pro-cess performs well that the predicted descriptors are similarto the actual descriptor vectors. Example results of GPR areshown in Fig. 1.

The rest of the paper is organized as follows: Section 2presents the related work. Section 3 presents our two re-gression methods for multimodal image analysis: linear re-gression and Gaussian process for regression. In Section 4,we provide a short review of related descriptors. Section 5shows the results of two regression methods for the featuredescriptors in multimodal images. Finally, this work is con-cluded in Section 6.

1756

Page 2: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

Figure 1. Example results of GPR using SIFT. The left column areoriginal visual images with detected points and on the right sideare infrared images with relocated descriptors that are achieved byGaussian process regression.

2. Related Work

On account of the development of sensor fusion tech-nique, many applications were developed under multiplemodality. [13] explored a statistic research on analyzingthe significant characters on infrared images. Compared tocorresponding visual images, the infrared images had no-ticeably less texture indoors because of the homogeneoustemperature. Further, the joint wavelet statistics presentedstrong correlation between object boundaries in visual andinfrared images, which could be used in vision applica-tions with the combined statistical model. Moreover anoverview of registering different types of sensors was pro-vided by [17]. [8] studied an approach to multimodal imageregistration based on corners and Hausdorff distance. Theapproaches using the mutual information as the matchingcriterion are the state-of-the-art technique in multispectralmatching [12, 9]. Due to the points and the contours of in-frared images are different enough relative to visual imagesof the same scene, this region-based technique performs rel-atively well. [7] implemented a line-based global transfor-mation using the edge properties for the image registrationbetween visual images and infrared images. Besides, [6]provided a feature based matching and multimodal RGB toNIR registration with multispectral interest points. Further-more an experiment for multimodal 2D and 3D face recog-nition was presented by [4]. [1] pursued on the problemof matching images with disparate appearance arising fromfactors such as dramatic illumination (day vs. night), timeperiod (historic vs. new) and rendering style differences.By using the eigen-spectrum of the joint image graph, thepersistent features were detected and matched into pairs.

3. Descriptor Regression

Given sets of descriptors DA detected from visual im-ages and DB from infrared images, regression analysis isto estimate the relationships between DA and DB. Bymeans of the relationship, which might be implicit or ex-plicit, the predictions DA′ come out that the descriptor ismapped from visual image to infrared modality. The re-gression process includes two techniques for modeling andanalyzing variables. Since linear regression is a commonand light method used in stochastic problems, it is consid-ered in this work at first. Then Gaussian process is used asan advanced method, which can especially solve non-linearproblems.

3.1. Linear Regression

In statistics, linear regression is an approach to model therelationship between a scalar dependent variable and one ormore explanatory variables, in which data are modeled bylinear functions and unknown model parameters are esti-mated from the data [3].

Upon to the principle of linear regression, a descriptorDa with n dimensions from visual image is mapped to adescriptor as Db in the same dimension in infrared imagethrough a matrix as linear transformation, which is givenas:

Db = Da×H (1)

where H is a n× n matrix. H is calculated by training andthen used to predict the new input Da forward to a Db. Theprocess of linear regression is implemented by the methodof Least squares. It is a technique for mathematical optimiz-ing that the sum of the squares of the errors is minimized byequating its gradient to zero and then the regressors are ob-tained through the mean value.

3.1.1 Regression estimating

The regression procedure is estimated by a set of statistics,such as R2, F , p and the estimate of the error varianceerr.var. R2 is the coefficient of determination defined as

R2 ≡ 1− SSres

SStot, (2)

and SStot is the total sum of squared errors in the model thatdoes not use the independent variable, and SSres is the sumof squared errors in the linear model. It is a very importantindicator to state if the regression is efficient while it in-forms the goodness of fit of a model. In regression, R2 rep-resents the percent of the data that is the closest to the lineof best fit, in other words, it informs how well the regres-sion line approximates the real data points. The F statisticis the test statistic of the F-test on the regression model, for

757

Page 3: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

a significant linear regression relationship between the re-sponse variable and the predictor variables. P-value p is theprobability of obtaining a test statistic at least as extreme asthe one that was actually observed, assuming that the nullhypothesis is true. When the p-value is less than the givensignificant level, in usual case as 0.05, the null hypothesiswill be rejected. By using these arguments, the performanceof linear regression is evaluated.

3.1.2 Predictions evaluating

After the regression procedure, the linear transformationmatrix H is obtained and then used to predict the new de-scriptor Da forward to a Da′. For sake of predictions eval-uating, we compare the prediction Da′ and the true de-scriptor vector Db by the absolute difference between cor-responding components in vectors as ε = |Da′ −Db|. Twoparameters are used to evaluate, that is mean, which is theaverage value of ε and the variance of ε.

However, the method of linear regression can not solvenon-linear problems. Hence we use Gaussian process for re-gression as an advanced method, in which a specific modelneed not to be claimed at first.

3.2. Gaussian Process for Regression

Given some noisy observations of a dependent variable,the estimate of a new value x comes out easily by using afunction f(x), which can describe the distribution of the ob-servations. Rather than a specific model which the claimedfunction f(x) relates to, a Gaussian process can representf(x) obliquely, but rigorously [15]. That is so-called Gaus-sian Process Regression (GPR).

Taking account of the noise on the observed target valuesfrom measurement errors and so on, which are given by

tn = yn + εn (3)

where yn = f(xn), and εn is a random noise variable whosevalue is chosen independently for each observation n.

The conditional distribution of tN+1 given target valuest = (t1, . . . , tN )T is itself Gaussian-distributed as the form:

tN+1|t ∼ N (kTC−1N t, c− kTC−1N k). (4)

The mean, kTC−1N t, is known as the matrix of regressioncoefficients, and the variance, c − kTC−1N k, is the Schurcomplement of CN in CN+1. These are the key results thatdefine Gaussian process regression. While the vector k isa function with respect to the test input value xN+1, thepredictive distribution is a Gaussian depended on xN+1.

As a crucial component of a Gaussian process predic-tor, covariance function controls how much the data aresmoothed in estimating the unknown function [15]. Two

functions are considered: the squared exponential (SE) co-variance function has the form

kSE(r) = exp(− r2

2`2), (5)

with parameter ` defined as characteristic length-scale.This covariance function has sample functions with in-finitely many derivatives and thus is very smooth. Anotheris rational quadratic (RQ) covariance function

kRQ(r) = (1 +r2

2α`2)−α (6)

with α, ` > 0, which can be regarded as a scale mixture (aninfinite sum) of squared exponential (SE) covariance func-tions with different characteristic length-scales (sums of co-variance functions are also a valid covariance).

The descriptors are treated in two ways:

3.2.1 Global descriptors

Assuming a particular structure, where the covariance func-tion is set as the squared exponential function and the meanof Gaussian process is defined as zero like the assumptionsin most cases. In addition, the Expectation propagation (EP)is applied as the inference function and the likelihood func-tion is in the form of Laplace. The parameters of covariancefunction are initialized with zero at first and later they areoptimized by minimizing their negative log marginal likeli-hood.

3.2.2 Local descriptors

The goal of this part is to check the potential location re-lationship of the descriptors. Based on the training data,the descriptor vectors DA from visual images are mappedto the infrared images as DA′. For each descriptor Da inthe set DA, we are looking for the most similar descriptorsamong all the descriptors DB in infrared images. In otherwords, a vector is predicted by a descriptor from visual im-age. And the task is to check the location of this vector inthe infrared image.

The processing procedure is as following: first the inter-est points are detected from infrared and visual images andthen represented by feature descriptors. Hence we obtain aset of vector pairs. Each vector consists of two parts, the de-scriptor of the interest points and its location. The next stepis to obtain the predictions by Gaussian process. The initialhyperparameter of covariance function is set with 0.7. Andthen for one prediction vector, find the closest vector amongall the original descriptor vectors in infrared images by us-ing the Euclidean distance between two vectors. Moreovermin-pooling approach is used to avoid too many incorrect

758

Page 4: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

matchings. In practice, assuming the infrared image andvisual image display completely the same scene. Five can-didates are chosen with most similar vectors. And then theprediction is determined to locate in the position of its near-est candidate.

4. Feature DescriptorsDescriptors are used to represent the image structure in

spatial neighborhoods at a set of feature points. There arevarious kinds of descriptors, and we can choose an appro-priate one based on the application. In this section, wewill present four descriptors used in this work, that is SIFT,SURF, LBP and HOG.

4.1. Scale-invariant Feature Transform (SIFT)

SIFT (Scale-invariant feature transform) is based on theinterest points detected by Difference of Gaussian [11]. Thedescriptor records the direction for each interest point, thusit has good scale and rotational invariance. A key point ischaracterized with location, scale and direction. The orien-tations of 16×16 neighbors of each keypoint are calculatedand then projected into one of eight directions with 4×4 re-gion. Subsequently, a histogram is built with 8 bins, whichindicate 8 directions. As a result, the descriptor is in theform of vector with 128 dimensions. With the help of thisdescriptor, we can match key points between images.

4.2. Speeded Up Robust Feature (SURF)

Speeded Up Robust Feature (SURF) is an improvementof SIFT, which is first presented by [2]. It is claimed that itperforms excellent on repeatability, distinctiveness and ro-bustness. The interest points are detected using Hessian ma-trix, that is named as Fast Hessian detector, which is calcu-lated for each point. To solve it, SURF makes efficient useof integral images. Then by comparing each point with its26 neighbors on the same octave and the octave above andbelow, the points with maximum or minimum responses areconsidered as interest points after filtering by given thresh-old. The descriptor is based on sum of Haar wavelet re-sponses within the region in the size of 4 × 4, instead ofhistogram in SIFT, which is in the form as:∑

dx∑dy

∑|dx|

∑|dy|,

dx and dy are the filter responses to the Haar wavelets. Thusthe output of SURF is a feature vector with 64 dimensions.

4.3. Local Binary Pattern (LBP)

Local Binary Pattern (LBP) is a type of texture spectrummodel proposed in [16] and first described by [14]. In thisapproach, an examined window is first divided into 16× 16cells. And then for each pixel in a cell, comparing the gray-value with other eight neighbors. It is assigned as 1, when

the neighbor is greater than center pixel. Thus, an 8 bitbinary pattern comes, i.e LBP. Compute the histogram ofthe frequency of each binary number occurring over the celland normalize. The feature vector for the window shouldbe the concatenate normalized histograms of all cells.

4.4. Histogram of Oriented Gradients (HOG)

Histogram of Oriented Gradients (HOG) is first repre-sented by [5], which focuses on pedestrian detection at thattime. And the essential idea behind the Histogram of Ori-ented Gradient descriptors is that local object appearanceand shape within an image can be described by the distribu-tion of intensity gradients or edge directions. To implementit, the image need to be divided into small connected re-gions, called cells. And then compute the gradient for eachpixel in the region of a cell. The histogram of gradient ineach cell is the descriptor for the cell and the combination ofthese histograms present the descriptor. In some advancedprocess, the cells are grouped into larger spatial blocks andthese blocks are normalized separately. As a result, the finaldescriptor is exact the vector composed of all the compo-nents of the normalized cells by the blocks in the detectionwindow.

5. ExperimentsWe construct the experiments to regress the descriptors

by using linear regression and Gaussian process. And theresults are assessed by the criteria of error and the preci-sion of matching.

5.1. Datasets

Three datasets are used in this paper, namely RGB-NIR scene dataset from EPFL-IC-IVRG1, OutdoorUrbanby DGP-UofT2 and MoCap from LUH-TNT3.

These three datasets contain different image types withdifferent viewpoints. The dataset RGB-NIR consists ofnumbers of images captured in RGB and Near-infrared(NIR) by visible and NIR filters using separate exposuresfrom modified SLR cameras. There are totally 9 categoriessuch as field, forest, mountain and water. And the images inthe dataset OutdoorUrban are fully about the views aroundthe city, such as cars, buildings and some other city scenes.Compared to the natural scenery, there are greater thermalvariations in urban environments [13]. In this dataset, thereare total 290 outdoor urban daytime image pairs as well asabout 30 urban night-time image pairs. The data are cap-tured by a single axis, multiparameter camera which com-bines an infrared camera and a visible light camera. Andthe content of dataset MoCap is the human motions such as

1Image and Visual Representation Group at EPFL2Dynamic Graphics Project at University of Toronto3Institute for Information Processing at Leibniz University Hannover

759

Page 5: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

Figure 2. Sample images from the datasets RGB-NIR, OutdoorUrban and MoCap respectively. The images in the first row are visual RGBimages and the images in the second row are the corresponding infrared (or near-infrared) images with respect to the images above.

Table 1. Information of datasetsDataset Type Data number Image size ContentsRGB-NIR near-infrared 370 640×480 nature viewsOutdoorUrban infrared 330 384×288 city viewsMoCap infrared 1300 640×480 human motions

Table 4. The test result of linear regression on HOG.HOG size mean varRGB-NIR 70 0.0356 0.1111Urban 27 0.2710 0.0656MoCap 500 0.0243 5.1046e-04

Table 5. The test result of linear regression on LBP.LBP size mean varRGB-NIR 70 449.8296 5.1483e+06Urban 27 499.2019 4.8877e+06MoCap 500 308.3616 2.0000e+06

waving, boxing and jogging indoors. Since the two camerasare set with a small baseline, the taken images are not iden-tical in view, but nor too far away from each other. Somesamples of the datasets are shown in Fig. 2 and the basicinformation is summarized in Table 1.

5.2. Results of Linear Regression

Considering the regression, 300 images in RGB-NIR areselected in the training set and the rest 37 images are keptfor testing. For dataset OutdoorUrban, the size of trainingdata is 100 and it is 27 of testing data. In dataset MoCap,there are 800 images in the training dataset and 500 imagesare regarded as testing data.

The regression is assessed by the results in Table 2 and

Table 3. In the tables, the value of R2 is around 0.9, whichmeans that the regression function is much closer to the truevalues, and it understands the information of the data verywell. Also most of the p-value in two tables are greaterless than 0.05, so the null hypothesis is rejected, namely thelinear model is correct for the data. But HOG in datasetOutdoorUrban with the value 0.0670 is an exception. In aword, the two descriptors are both regressed well with thetraining data, and they draw linear lines perfectly fitting tothe points.

For testing, the mean and var in Table 4 and Table 5refer to the average value and variance of the error betweenthe actual value and the true value. Since the componentsof HOG descriptor vectors are in the range of 0 to 1, thatthe mean error is only about 0.03 on RGB-NIR and MoCapindicates an excellent result. Meanwhile, the error of LBPseems much greater, but if they were normalized in the in-terval from 0 to 1, the average value of error is 0.0026 forexample of RGB-NIR.

5.3. Results of GPR

5.3.1 GPR for HOG and LBP

By using these hyperparameters, the new feature descriptorsare predicted. First set with 10 test data, the two criteria,mean and variance are computed as shown in Table 6.On datasets RGB-NIR and MoCap, the averages of absoluteerror are both under 0.05. Meanwhile the variances on thetwo dataset are in a great level as well.

Further, for the sake of analyzing the effect on trainingand testing dataset, we enlarge the size of training data to100 images and the size of testing data is amplified to 50

760

Page 6: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

Table 2. The statistics of linear regression on HOG.HOG size R2 F p err.varRGB-NIR 300 0.9206 35.1212 0 0.0011Urban 100 0.9072 2.5281 0.0670 0.0055MoCap 1300 0.8482 101.2218 1.9102e-276 2.4714e-04

Table 3. The statistics of linear regression on LBP.LBP size R2 F p err.varRGB-NIR 300 0.9735 8.3986 6.8146e-06 5.5635e+05Urban 100 1 NaN NaN NaNMoCap 1300 0.8842 49.4688 5.4552e-16 1.0459e+06

Table 6. The result of GP regression for HOG.HOG RGB-NIR OutdoorUrban MoCapmean 0.0342 0.1376 0.0146variance 5.8757e-04 0.0070 8.3017e-06

Table 7. The result of GP regression for HOG with 100 trainingand 10 testing data.

HOG RGB-NIR OutdoorUrban MoCapmean 0.0310 0.1416 0.0042variance 4.0629e-4 0.0036 4.5617e-06

Table 8. The result of GP regression for HOG with 100 trainingand 50 testing data.

HOG RGB-NIR OutdoorUrban MoCapmean err 0.0238 0.0280 0.0502var err 0.0004 0.0010 0.0024fron actual 21.1325 21.2427 20.8977fron true 21.4242 21.4240 21.2131

Table 9. The result of GP regression with exact inference methodand Gaussian likelihood function.

HOG RGB-NIR OutdoorUrban MoCapmean err 0.0251 0.0287 0.0500corr2 0.8909 0.9609 0.8983frob actual 21.1551 21.2685 20.8614frob true 21.4242 21.4240 21.2131

respectively. Comparison the data in Tables 6, 7 and 8 invertical direction, the result indicates that the size of neithertraining data nor testing dataset can effect the performanceof Gaussian process heavily. Therefore, Gaussian processis robust and efficient with fewer training data.

Applying exact inference method and Gaussian as likeli-hood function, the sizes of training data and testing data areset with 100 and 50 respectively. Based on this setting, theprocess runs much faster than using EP inference method.From Table 9, we can see that the performance of evalua-tion is excellent, the average value of error is less than 0.03.According to Frobenius norm, the ratio between the actualvalue and true value is over 99%, which shows the similarity

completely.Also we consider the role of the initial value of the hy-

perparameters of the covariance functions. The values areset to change with step of 0.1. Since the procedure of pa-rameter optimization is applied, the initial values make nosense to effect the result and performance.

For the purpose of LBP , the same process contents havebeen executed as HOG. However, the result turns that wecan not obtain an answer to the regression for LBP.

5.3.2 GPR for SIFT and SURF

We consider squared exponential function (SE) and rationalquadratic function as the covariance function for Gaussianprocess regression. And also the two inference methods:EP inference method and exact inference method, are ap-plied in this work. In addition the mean value of Gaussianprocess is set as zero and 100. Thus, based on these threeconditions, where each has two values, there are totally 23

combinations.From the results of SIFT, the process with RQ covari-

ance function by exact inference method outperforms thanthe others with an optimal result, especially in RGB-NIRwith a precision over 90%. And the precisions on other setsare also acceptable with about 50%. But the SE covariancefunction is not fitted in this model. In addition, we can findthat the value of mean has little effect on the results. Onthe aspect of SURF, both RQ and SE covariance functionsperform well in this work. Based on the value of preci-sion as well as the illustration of the example result imagesin Fig. 3, the regression results from EP method are totallyfailed in each situation despite of higher computational cost.In a word, the best performance is the process by exact in-ference method with RQ covariance function.

5.4. Linear regression vs. Gaussian process regres-sion

For descriptor HOG, both linear regression and Gaussianprocess can predict reasonable mapping models from visualimages to infrared images. Comparing the two methods,

761

Page 7: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

Figure 3. Example of descriptor matching. The descriptors are regressed by Gaussian process with exact inference method. First twocolumns refer to the results with RQ covariance function with mean value zero and 100, and the last two columns show the results withSE covariance function with mean value zero and 100. The squares refer to the detected points in visual images and the stars refer to therelocated descriptors in infrared images.

the results are shown in Table 10 depending on the condi-tion of error introduced before. Both approaches performwell with a low error. And it is obvious that Gaussian pro-

cess performs better than linear regression, where error byGaussian process is extraordinarily small. Another advan-tage of Gaussian process appears when the training set is

762

Page 8: Feature Regression for Multimodal Image Analysis...The rest of the paper is organized as follows: Section2 presents the related work. Section3presents our two re-gression methods for

Table 10. Comparison of linear regression and Gaussian processfor HOG.

mean error Linear Regression Gaussian ProcessNIR 0.1074 0.0553OutdoorUrban 0.7019 0.0505MoCap 0.0243 0.0475

small. It means that in practice, the requisite prior knowl-edge is much less for GP than linear regression. However,for this instance, linear regression also performs well, so itis a good choice as well because the complexity and cost oflinear regression is much lower than GP. Notice that actu-ally for the dataset OutdoorUrban, it does not fit into a lin-ear model. But Gaussian process can deal with linear modeland also non-linear model problems. We can see that com-paring to the error in OutdoorUrban by linear regression,Gaussian process is much better than it in this case.

6. ConclusionIn this paper, we have focused on the relationships

among descriptors from infrared and visual image pairs.Three extensive datasets of infrared and visual image pairsare considered to explore the regressions. Between corre-sponding HOG and LBP, linear relations have been pro-vided by least squares method with good regression qual-ities. This indicated the possibility to map a descriptor fromvisual image to infrared modality by a linear transformation.Furthermore, we have used Gaussian process for regressionon HOG and LBP. The optimal regression results have beenshown with small error by using squared exponential co-variance function. The GPR results of SIFT and SURF havebeen evaluated by the application of matching. The pro-cess of SIFT with rational quadratic function as covariancefunction has a good performance by evaluating the precisionscore of matching. The results have presented not only therelationships of SIFT and SURF corresponding descriptors,but also the possibility of obtaining the relationship of de-scriptors in multi-modal images by means of Gaussian pro-cess. In addition, comparing the results of linear regressionand GPR of HOG, Gaussian process performs better thanlinear regression but with a higher computational costs. Forthe future work, we will perform some regression analysisfor other multimodal data, such as visual and depth images.

AcknowledgmentsThe work is funded by the ERC-Starting Grant (DYNAMICMINVIP). The authors gratefully acknowledge the support.

References[1] M. Bansal and K. Daniilidis. Joint spectral correspon-

dence for disparate image matching. In CVPR, pages

2802–2809, 2013. 2[2] H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded

up robust features. In ECCV (1), pages 404–417,2006. 4

[3] N. H. Bingham, N. Bingham, and J. M. Fry. Regres-sion: Linear models in statistics. Springer, 2010. 2

[4] K. I. Chang, K. W. Bowyer, and P. J. Flynn. An eval-uation of multimodal 2d+ 3d face biometrics. PAMI,27(4):619–624, 2005. 2

[5] N. Dalal and B. Triggs. Histograms of oriented gradi-ents for human detection. In CVPR, pages 886–893,2005. 4

[6] D. Firmenichy, M. Brown, and S. Susstrunk. Multi-spectral interest points for rgb-nir image registration.In ICIP, pages 181–184, 2011. 2

[7] J. Han, E. J. Pauwels, and P. M. de Zeeuw. Visible andinfrared image registration in man-made environmentsemploying hybrid visual features. Pattern RecognitionLetters, 34(1):42–51, 2013. 2

[8] T. Hrkac, Z. Kalafatic, and J. Krapac. Infrared-visualimage registration based on corners and hausdorff dis-tance. In Scandinavian Conference on Image Analysis,pages 383–392, 2007. 2

[9] J. P. Kern and M. S. Pattichis. Robust multispectralimage registration using mutual-information models.IEEE Transactions on Geoscience and Remote Sens-ing, 45(5):1494–1505, 2007. 2

[10] B. Khaleghi, A. Khamis, F. O. Karray, and S. N.Razavi. Multisensor data fusion: A review of the state-of-the-art. Information Fusion, 14(1):28–44, 2013. 1

[11] D. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999. 4

[12] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal,and P. Suetens. Multimodality image registration bymaximization of mutual information. IEEE Transac-tions on Medical Imaging, 16:187–198, 1997. 2

[13] N. J. W. Morris, S. Avidan, W. Matusik, and H. Pfister.Statistics of infrared images. In CVPR, 2007. 2, 4

[14] T. Ojala, M. Pietikainen, and D. Harwood. A com-parative study of texture measures with classificationbased on featured distributions. Pattern Recognition,29(1):51–59, 1996. 4

[15] C. Rasmussen and C. Williams. Gaussian processesfor machine learning. 2006. 3

[16] L. Wang and D.-C. He. Texture classification usingtexture spectrum. Pattern Recognition, 23(8):905–910, 1990. 4

[17] B. Zitova and J. Flusser. Image registration methods:a survey. Image and Vision Computing, 21(11):977–1000, 2003. 2

763