Fast Keypoint Detection in Video Sequences
Luca Baroffio, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, Stefano Tubaro
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Italy
March 2016 ICASSP 2016
Local Visual Features
- Starting point for many computer vision tasks
  - Object recognition
  - Content-based retrieval
  - Image registration
- Two-step approach:
  - First step: keypoint detection (corners, blobs, etc.)
  - Second step: descriptor extraction (SIFT, SURF, BRISK, etc.)
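As a rough illustration of the two-step approach, here is a minimal Python sketch with toy stand-ins: a hypothetical detect() that thresholds a Laplacian response (in place of a real corner or blob detector) and a describe() that builds a binary descriptor from pairwise intensity comparisons, the idea used by BRISK-like descriptors. The function names and sampling pattern are illustrative, not from the paper.

```python
import numpy as np

def detect(frame, thresh=30.0):
    """Toy keypoint detector (stand-in for FAST/BRISK detection):
    flags pixels whose Laplacian response exceeds a threshold."""
    resp = np.abs(
        -4.0 * frame[1:-1, 1:-1]
        + frame[:-2, 1:-1] + frame[2:, 1:-1]
        + frame[1:-1, :-2] + frame[1:-1, 2:]
    )
    ys, xs = np.nonzero(resp > thresh)
    return list(zip(xs + 1, ys + 1))  # (x, y) keypoint coordinates

def describe(frame, keypoints, pairs):
    """Toy binary descriptor: one bit per intensity comparison between
    a pair of pixels sampled around the keypoint."""
    descs = []
    for x, y in keypoints:
        bits = [int(frame[y + dy1, x + dx1] < frame[y + dy2, x + dx2])
                for (dx1, dy1), (dx2, dy2) in pairs]
        descs.append(bits)
    return descs
```

A real pipeline would simply chain the two: features = describe(frame, detect(frame), pairs).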
Local features detection in video
- Most algorithms are tailored to still images
- For video, past literature targets the identification of keypoints that are stable across time
  - Stable features are key to object tracking, event identification and video calibration (main goal: application accuracy)
  - Stable features improve the efficiency of coding architectures exploiting the temporal redundancy (main goal: minimize bandwidth)
- We target computational complexity
  - Low-power devices (smartphones, embedded systems, Visual Sensor Networks) require the feature detection process to be both fast and accurate
Fast extraction from video
- Baseline approach: apply a feature detector on each frame of a video sequence
  - Inefficient from a computational point of view!
  - Temporal redundancy is not exploited!
- Our approach: apply the feature detector only in regions of the current frame that are sufficiently different from the previous one
  - Compute a detection mask to identify such regions
  - Reuse keypoints from outside those regions (keypoint propagation from the previous frame to the current one)
2. BINARY ROBUST INVARIANT SCALABLE KEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant Scalable Keypoints (BRISK) algorithm as a computationally efficient alternative to traditional local feature detectors and descriptors. The algorithm consists of two main steps: i) a keypoint detector, which identifies salient points in a scale-space, and ii) a keypoint descriptor, which assigns each keypoint a rotation- and scale-invariant binary descriptor. Each element of such a descriptor is obtained by comparing the intensities of a given pair of pixels sampled within the neighborhood of the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweight FAST [3] corner detector, based on the Accelerated Segment Test (AST). Such a test classifies a candidate point p (with intensity Ip) as a keypoint if n contiguous pixels in the Bresenham circle of radius 3 around p are all brighter than Ip + t, or all darker than Ip − t, with t a predefined threshold. Thus, the higher the threshold, the lower the number of keypoints that are detected, and vice versa.
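The segment test just described can be sketched directly. The 16-pixel Bresenham circle of radius 3 and the contiguous-run check follow the FAST formulation; this is an illustrative, unoptimized version, not the authors' implementation.

```python
import numpy as np

# Offsets of the 16 pixels on a Bresenham circle of radius 3.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def segment_test(img, x, y, t=20, n=9):
    """Accelerated Segment Test: p = (x, y) is a corner if n contiguous
    circle pixels are all brighter than Ip + t or all darker than Ip - t."""
    ip = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    for sign in (+1, -1):  # check "all brighter", then "all darker"
        run = 0
        # Walk the circle twice so contiguous runs can wrap around.
        for v in ring + ring:
            run = run + 1 if sign * (v - ip) > t else 0
            if run >= n:
                return True
    return False
```

Raising t shrinks the set of pixels that can sustain a run of n comparisons, which is exactly the threshold/keypoint-count trade-off noted above.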
Scale-invariance is achieved in BRISK by building a scale-space pyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original image. The FAST detector is applied separately to each layer of the scale-space pyramid, in order to identify potential regions of interest having different sizes. Then, non-maxima suppression is applied in a 3×3 scale-space neighborhood, retaining only features corresponding to local maxima. Finally, a three-step interpolation process is applied in order to refine the position of the keypoint with sub-pixel and sub-scale precision.
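The octave structure can be illustrated with a minimal pyramid builder; a 2×2 box filter stands in for BRISK's actual smoothing, so this is only a sketch of the progressive half-sampling scheme.

```python
import numpy as np

def half_sample(img):
    """Downsample by 2 with a light 2x2 box smoothing (a crude stand-in
    for BRISK's smoothing + half-sampling)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w].astype(np.float64)
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def build_pyramid(img, octaves=4):
    """Layers L_o, o = 1..O, each half the size of the previous one."""
    layers = [img.astype(np.float64)]
    for _ in range(octaves - 1):
        layers.append(half_sample(layers[-1]))
    return layers
```

A detector would then run on every layer and suppress non-maxima across neighbouring positions and scales.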
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx × Ny, which is processed to extract a set of local features Dn. First, a keypoint detector is applied to identify a set of interest points. Then, a descriptor is applied on the (rotated) patches surrounding each keypoint. Hence, each element dn,i ∈ Dn is a visual feature, which consists of two components: i) a 4-dimensional vector pn,i = [x, y, σ, θ]^T, indicating the position (x, y), the scale σ of the detected keypoint, and the orientation angle θ of the image patch; ii) a P-dimensional binary vector dn,i ∈ {0, 1}^P, which represents the descriptor associated to the keypoint pn,i.
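The feature representation described above could be held in a small record; the field names below are hypothetical, chosen only to mirror the components of pn,i and the binary descriptor.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Feature:
    """One visual feature: keypoint pn,i = [x, y, sigma, theta]^T
    plus a P-dimensional binary descriptor dn,i."""
    x: float                     # keypoint position
    y: float
    sigma: float                 # detected scale
    theta: float                 # orientation angle of the patch (radians)
    descriptor: Tuple[int, ...]  # P entries, each in {0, 1}

# Example: a feature with a 4-bit descriptor (P = 4).
f = Feature(120.0, 64.0, 1.6, 0.3, (0, 1, 1, 0))
```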
Traditionally, local feature extraction algorithms have been designed to efficiently extract and describe salient points within a single frame. Considering video sequences, a straightforward approach consists in applying a feature extraction algorithm separately to each frame of the sequence at hand. However, such a method is inefficient from a computational point of view, as the temporal redundancy between contiguous frames is not taken into consideration. The main idea behind our approach is to apply a keypoint detection algorithm only on some regions of each frame. To this end, for each frame In, a binary Detection Mask Mn ∈ {0, 1}^(Nx×Ny), having the same size as the input image, is computed, exploiting the information extracted from previous frames. Such a mask defines the regions of the frame where a keypoint detector has to be applied. That is, considering an image pixel In(x, y), a keypoint detector is applied to such a pixel if the corresponding mask element Mn(x, y) is equal to 1. Furthermore, we assume that if a region of the n-th frame is not subject to keypoint detection, the keypoints that are present in such an area in the previous frame, i.e. In−1, are still valid. Hence, such keypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1} ∪ {dn−1,j : Mn(pn−1,j) = 0}.   (1)
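Eq. (1) amounts to a simple merge of newly detected and propagated features, sketched below under the assumption that each feature carries its (x, y) keypoint location; the ((x, y), descriptor) pair representation is illustrative.

```python
def update_features(mask, new_feats, prev_feats):
    """Eq. (1): keep features detected where the mask is 1, and propagate
    previous-frame features whose locations fall where the mask is 0.
    Each feature is a ((x, y), descriptor) pair; mask is indexed [y][x]."""
    kept = [f for f in new_feats if mask[f[0][1]][f[0][0]] == 1]
    propagated = [f for f in prev_feats if mask[f[0][1]][f[0][0]] == 0]
    return kept + propagated
```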
Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achievable by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two efficient algorithms for obtaining a Detection Mask are proposed: Intensity Difference Detection Mask and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions that change significantly across the frames of the video. In order to identify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring no extra cost. Considering frame In and O detection octaves, pyramid layers Ln,o, o = 1, ..., O, are obtained by progressively smoothing and half-sampling the original image, as explained in Section 2. Then, considering two contiguous frames In−1 and In and octave o, a subsampled version of the Detection Mask is obtained as follows:
M′n,o(k, l) = 1 if |Ln,o(k, l) − Ln−1,o(k, l)| > TI, and 0 otherwise,   (2)

where TI is an arbitrarily chosen threshold and (k, l) are the coordinates of the pixels in the intermediate representation M′n,o. Finally, the intermediate representation M′n,o resulting from the previous operation needs to be upsampled in order to obtain the final mask Mn ∈ {0, 1}^(Nx×Ny). Masks can then be applied to detection in different fashions: i) exploiting the mask obtained from each scale-space layer o = 1, ..., O in order to detect keypoints at the corresponding layer o; ii) using a single detection mask for all the scale-space layers.
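A sketch of the per-octave mask of Eq. (2), with nearest-neighbour upsampling to full resolution. The mask is set to 1 where the layers differ by more than TI, following the stated tenet of detecting only in changed regions; the function names and upsampling scheme are assumptions.

```python
import numpy as np

def intensity_diff_mask(layer_prev, layer_cur, t_i=10.0):
    """Eq. (2): subsampled mask is 1 where the pyramid layers of two
    contiguous frames differ by more than T_I at (k, l)."""
    diff = np.abs(layer_cur.astype(np.float64) - layer_prev.astype(np.float64))
    return (diff > t_i).astype(np.uint8)

def upsample_mask(mask, shape):
    """Nearest-neighbour upsampling of the subsampled mask to Nx x Ny."""
    ys = (np.arange(shape[0]) * mask.shape[0]) // shape[0]
    xs = (np.arange(shape[1]) * mask.shape[1]) // shape[1]
    return mask[np.ix_(ys, xs)]
```

Computing the mask this way reuses the pyramid layers that the detector builds anyway, so the only extra work is an absolute difference and a comparison per subsampled pixel.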
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the numbers of features identified in a given area are often correlated [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region in the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the n-th frame, a spatial binning process is applied to the features extracted from frame In−1. To this end, we define a grid consisting of Nr × Nc spatial bins Bi,j, i = 0, ..., Nr, j = 0, ..., Nc. Thus, each bin refers to a rectangular area of Sx × Sy pixels, where Sx = Nx/Nc and Sy = Ny/Nr. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:

M″n(k, l) = |{dn−1,i ∈ Dn−1 : ⌊xn−1,i/Sx⌋ = k, ⌊yn−1,i/Sy⌋ = l}|,   (3)

where (xn−1,i, yn−1,i) represents the location of feature dn−1,i and |·| the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding such a histogram, employing a tunable threshold TH:
M′n(k, l) = 1 if M″n(k, l) ≥ TH, and 0 if M″n(k, l) < TH.   (4)
Finally, the Detection Mask Mn, having size Nx × Ny pixels, is obtained by upsampling the intermediate representation M′n. Such a detection mask is applied to all scale-space octaves.
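The binning mask of Eqs. (3)-(4) can be sketched as a thresholded 2-D histogram of the previous frame's keypoint locations; clamping border keypoints to the last bin is an assumption, as is the function name.

```python
import numpy as np

def binning_mask(prev_locations, nx, ny, n_rows, n_cols, t_h=1):
    """Eqs. (3)-(4): histogram previous-frame keypoints into an
    n_rows x n_cols grid of S_x x S_y bins, then threshold with T_H."""
    sx, sy = nx // n_cols, ny // n_rows   # bin size S_x, S_y
    hist = np.zeros((n_rows, n_cols), dtype=int)
    for x, y in prev_locations:
        # floor(x / S_x), floor(y / S_y), clamped to the grid.
        hist[min(y // sy, n_rows - 1), min(x // sx, n_cols - 1)] += 1
    return (hist >= t_h).astype(np.uint8)
```

The resulting subsampled mask would then be upsampled to Nx × Ny, like the intensity-difference mask, and applied to all octaves.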
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)
Note that the algorithm used to compute the Detection Mask
needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask
and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:
M0n,o(k, l) =
(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI
0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)
where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0
n,o. Finally,the intermediate representation M0
n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx
⇥Ny . Masks can then be applied to detection in
different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N
x
/Nc
and Sy = Ny
/Nr
.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:
M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i
/Sx
c = k, byn�1,i/S
y
c = l,
(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i
and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :
M0n(k, l) =
(1 if M00
n(k, l) � TH
0 if M00n(k, l) < TH ,
(4)
Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0
n. Sucha detection mask is applied to all scale-space octaves.
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)
Note that the algorithm used to compute the Detection Mask
needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask
and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:
M0n,o(k, l) =
(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI
0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)
where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0
n,o. Finally,the intermediate representation M0
n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx
⇥Ny . Masks can then be applied to detection in
different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N
x
/Nc
and Sy = Ny
/Nr
.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:
M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i
/Sx
c = k, byn�1,i/S
y
c = l,
(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i
and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :
M0n(k, l) =
(1 if M00
n(k, l) � TH
0 if M00n(k, l) < TH ,
(4)
Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0
n. Sucha detection mask is applied to all scale-space octaves.
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)
Note that the algorithm used to compute the Detection Mask
needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask
and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:
M0n,o(k, l) =
(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI
0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)
where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0
n,o. Finally,the intermediate representation M0
n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx
⇥Ny . Masks can then be applied to detection in
different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N
x
/Nc
and Sy = Ny
/Nr
.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:
M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i
/Sx
c = k, byn�1,i/S
y
c = l,
(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i
and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :
M0n(k, l) =
(1 if M00
n(k, l) � TH
0 if M00n(k, l) < TH ,
(4)
Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0
n. Sucha detection mask is applied to all scale-space octaves.
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1} ∪ {dn−1,j : Mn(pn−1,j) = 0}.   (1)
Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achievable by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two efficient algorithms for obtaining a Detection Mask are proposed: the Intensity Difference Detection Mask and the Keypoint Binning Detection Mask.
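Equation (1) amounts to a simple filter-and-merge step: keep fresh detections where the mask is 1 and propagate previous-frame features where it is 0. The function below is an illustrative sketch, assuming features are stored as (x, y, payload) tuples; it is not the authors' implementation.

```python
import numpy as np

def update_features(mask, new_feats, prev_feats):
    """Eq. (1): keep newly detected features whose location falls where the
    detection mask is 1, and propagate previous-frame features whose location
    falls where it is 0. Features are (x, y, payload) tuples (an assumption)."""
    kept = [f for f in new_feats if mask[int(f[1]), int(f[0])] == 1]
    propagated = [f for f in prev_feats if mask[int(f[1]), int(f[0])] == 0]
    return kept + propagated
```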
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions that change significantly across the frames of the video. In order to identify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring no extra cost. Considering frame In and O detection octaves, pyramid layers Ln,o, o = 1, . . . , O, are obtained by progressively smoothing and half-sampling the original image, as explained in Section 2. Then, considering two contiguous frames In−1 and In and octave o, a subsampled version of the Detection Mask is obtained as follows:
M′n,o(k, l) = 1, if |Ln,o(k, l) − Ln−1,o(k, l)| > TI;
M′n,o(k, l) = 0, if |Ln,o(k, l) − Ln−1,o(k, l)| ≤ TI,   (2)

where TI is an arbitrarily chosen threshold and (k, l) are the coordinates of the pixels in the intermediate representation M′n,o. Finally, the intermediate representation M′n,o resulting from the previous operation needs to be upsampled in order to obtain the final mask Mn ∈ {0, 1}^(Nx×Ny). Masks can then be applied to detection in different fashions: i) exploiting the mask obtained from each scale-space layer o = 1, . . . , O in order to detect keypoints at the corresponding layer o; ii) using a single detection mask for all the scale-space layers.
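Under these definitions, the Intensity Difference Detection Mask of Eq. (2) reduces to thresholding an absolute layer difference and upsampling the result to frame size. A minimal NumPy sketch, assuming nearest-neighbour upsampling (the paper does not specify the upsampling filter) and layer dimensions that divide the frame dimensions:

```python
import numpy as np

def intensity_diff_mask(layer_prev, layer_cur, t_i, full_shape):
    """Eq. (2): mark pyramid pixels whose intensity changed by more than T_I
    between contiguous frames, then upsample the low-resolution mask to frame
    size by pixel replication (nearest neighbour). Sketch, not the paper's code."""
    # Cast to int so uint8 subtraction cannot wrap around.
    diff = np.abs(layer_cur.astype(int) - layer_prev.astype(int))
    m = (diff > t_i).astype(np.uint8)
    fy = full_shape[0] // m.shape[0]
    fx = full_shape[1] // m.shape[1]
    return np.kron(m, np.ones((fy, fx), dtype=np.uint8))
```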
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the number of features identified in a given area is often correlated [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region of the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the n-th frame, a spatial binning process is applied to the features extracted from frame In−1. To this end, we define a grid consisting of Nr × Nc spatial bins Bi,j, i = 1, . . . , Nr, j = 1, . . . , Nc. Thus, each bin refers to a rectangular area of Sx × Sy pixels, where Sx = Nx/Nc and Sy = Ny/Nr. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:
M″n(k, l) = |{dn−1,i ∈ Dn−1 : ⌊xn−1,i/Sx⌋ = k, ⌊yn−1,i/Sy⌋ = l}|,   (3)

where (xn−1,i, yn−1,i) represents the location of feature dn−1,i and |·| denotes the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding such a histogram, employing a tunable threshold TH:
M′n(k, l) = 1, if M″n(k, l) ≥ TH;
M′n(k, l) = 0, if M″n(k, l) < TH.   (4)
Finally, the Detection Mask Mn having size Nx × Ny pixels is obtained by upsampling the intermediate representation M′n. Such a detection mask is applied to all scale-space octaves.
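The binning mask of Eqs. (3)-(4) can be sketched as a 2-D keypoint histogram followed by thresholding and upsampling. The sketch below assumes keypoints given as (x, y) pairs, nearest-neighbour upsampling, and frame dimensions divisible by the grid size; out-of-range bin indices are clamped to the last bin.

```python
import numpy as np

def binning_mask(prev_keypoints, frame_shape, n_rows, n_cols, t_h):
    """Eqs. (3)-(4): histogram previous-frame keypoint locations into an
    Nr x Nc grid, threshold the counts with T_H, then upsample the binary
    grid to frame size by pixel replication. Illustrative sketch only."""
    ny, nx = frame_shape
    sy, sx = ny // n_rows, nx // n_cols          # bin size S_y x S_x
    hist = np.zeros((n_rows, n_cols), dtype=int)
    for x, y in prev_keypoints:
        r = min(int(y // sy), n_rows - 1)        # clamp boundary keypoints
        c = min(int(x // sx), n_cols - 1)
        hist[r, c] += 1
    mask = (hist >= t_h).astype(np.uint8)        # Eq. (4)
    return np.kron(mask, np.ones((sy, sx), dtype=np.uint8))
```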
Fast extraction from video
n Formally:
n Let Dn be the set of features extracted from frame In (of size Nx × Ny)
n Let dn,i be the i-th feature of the set, computed at keypoint location pn,i
n Let Mn be a binary detection mask defining the regions of the frame where the detector should be applied
March 2016 ICASSP 2016
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)
Note that the algorithm used to compute the Detection Mask
needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask
and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:
M0n,o(k, l) =
(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI
0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)
where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0
n,o. Finally,the intermediate representation M0
n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx
⇥Ny . Masks can then be applied to detection in
different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N
x
/Nc
and Sy = Ny
/Nr
.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:
M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i
/Sx
c = k, byn�1,i/S
y
c = l,
(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i
and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :
M0n(k, l) =
(1 if M00
n(k, l) � TH
0 if M00n(k, l) < TH ,
(4)
Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0
n. Sucha detection mask is applied to all scale-space octaves.
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)
Note that the algorithm used to compute the Detection Mask
needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask
and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:
M0n,o(k, l) =
(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI
0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)
where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0
n,o. Finally,the intermediate representation M0
n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx
⇥Ny . Masks can then be applied to detection in
different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N
x
/Nc
and Sy = Ny
/Nr
.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:
M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i
/Sx
c = k, byn�1,i/S
y
c = l,
(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i
and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :
M0n(k, l) =
(1 if M00
n(k, l) � TH
0 if M00n(k, l) < TH ,
(4)
Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0
n. Sucha detection mask is applied to all scale-space octaves.
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)
Note that the algorithm used to compute the Detection Mask
needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask
and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:
M0n,o(k, l) =
(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI
0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)
where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0
n,o. Finally,the intermediate representation M0
n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx
⇥Ny . Masks can then be applied to detection in
different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N
x
/Nc
and Sy = Ny
/Nr
.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:
M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i
/Sx
c = k, byn�1,i/S
y
c = l,
(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i
and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :
M0n(k, l) =
(1 if M00
n(k, l) � TH
0 if M00n(k, l) < TH ,
(4)
Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0
n. Sucha detection mask is applied to all scale-space octaves.
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let $I_n$ denote the $n$-th frame of a video sequence of size $N_x \times N_y$, which is processed to extract a set of local features $D_n$. First, a keypoint detector is applied to identify a set of interest points. Then, a descriptor is applied on the (rotated) patches surrounding each keypoint. Hence, each element $d_{n,i} \in D_n$ is a visual feature, which consists of two components: i) a 4-dimensional vector $p_{n,i} = [x, y, \sigma, \theta]^T$, indicating the position $(x, y)$, the scale $\sigma$ of the detected keypoint, and the orientation angle $\theta$ of the image patch; ii) a $P$-dimensional binary vector $d_{n,i} \in \{0, 1\}^P$, which represents the descriptor associated to the keypoint $p_{n,i}$.
Traditionally, local feature extraction algorithms have been designed to efficiently extract and describe salient points within a single frame. For video sequences, a straightforward approach consists in applying a feature extraction algorithm separately to each frame of the sequence at hand. However, such a method is inefficient from a computational point of view, as the temporal redundancy between contiguous frames is not taken into account. The main idea behind our approach is to apply a keypoint detection algorithm only to some regions of each frame. To this end, for each frame $I_n$, a binary Detection Mask $M_n \in \{0, 1\}^{N_x \times N_y}$ having the same size as the input image is computed, exploiting the information extracted from previous frames. Such a mask defines the regions of the frame where a keypoint detector has to be applied. That is, considering an image pixel $I_n(x, y)$, a keypoint detector is applied to that pixel if the corresponding mask element $M_n(x, y)$ is equal to 1. Furthermore, we assume that if a region of the $n$-th frame is not subject to keypoint detection, the keypoints present in that area in the previous frame $I_{n-1}$ are still valid. Hence, such keypoints are propagated to the current set of features. That is,
$$D_n = \{d_{n,i} : M_n(p_{n,i}) = 1\} \cup \{d_{n-1,j} : M_n(p_{n-1,j}) = 0\} \qquad (1)$$
Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achieved by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two efficient algorithms for obtaining a Detection Mask are proposed: the Intensity Difference Detection Mask and the Keypoint Binning Detection Mask.
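The detection-and-propagation rule of Eq. (1) can be sketched as follows. The function name, the `(x, y, descriptor)` tuples, and the `detect` callable are hypothetical stand-ins for the actual BRISK pipeline:

```python
def extract_features_masked(frame, prev_features, mask, detect):
    """Eq. (1): run the detector only where mask == 1 and propagate
    previous-frame keypoints from regions where mask == 0.

    frame: 2-D intensity image; mask: binary 2-D array indexed [y][x];
    prev_features: list of (x, y, descriptor) tuples from frame n-1;
    detect: callable returning (x, y, descriptor) tuples (hypothetical).
    """
    # New detections, restricted to active mask regions.
    new = [(x, y, d) for (x, y, d) in detect(frame)
           if mask[int(y)][int(x)] == 1]
    # Propagated keypoints from inactive regions of the previous frame.
    kept = [(x, y, d) for (x, y, d) in prev_features
            if mask[int(y)][int(x)] == 0]
    return new + kept
```

A production version would restrict the detector itself to the masked pixels rather than filtering its output, which is where the computational saving actually comes from.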
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions that change significantly across the frames of the video. In order to identify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring no extra cost. Considering frame $I_n$ and $O$ detection octaves, pyramid layers $L_{n,o}$, $o = 1, \dots, O$, are obtained by progressively smoothing and half-sampling the original image, as explained in Section 2. Then, considering two contiguous frames $I_{n-1}$ and $I_n$ and octave $o$, a subsampled version of the Detection Mask is obtained as follows:
$$M'_{n,o}(k, l) = \begin{cases} 1 & \text{if } |L_{n,o}(k, l) - L_{n-1,o}(k, l)| \ge T_I \\ 0 & \text{if } |L_{n,o}(k, l) - L_{n-1,o}(k, l)| < T_I \end{cases} \qquad (2)$$
where $T_I$ is an arbitrarily chosen threshold and $(k, l)$ are the coordinates of the pixels in the intermediate representation $M'_{n,o}$. Finally, the intermediate representation $M'_{n,o}$ resulting from the previous operation needs to be upsampled in order to obtain the final mask $M_n \in \{0, 1\}^{N_x \times N_y}$. Masks can then be applied to detection in different fashions: i) exploiting the mask obtained from each scale-space layer $o = 1, \dots, O$ in order to detect keypoints at the corresponding layer $o$; ii) using a single detection mask for all the scale-space layers.
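A minimal sketch of this mask construction, assuming nearest-neighbour upsampling (the text does not prescribe a specific upsampling filter) and hypothetical names:

```python
import numpy as np

def intensity_difference_mask(layer_prev, layer_cur, t_i, out_shape):
    """Eq. (2): mark a low-resolution cell as active when the absolute
    difference between co-located pyramid-layer pixels reaches T_I,
    then upsample to full frame resolution (nearest neighbour).
    layer_prev, layer_cur: equal-shape pyramid layers L_{n-1,o}, L_{n,o}.
    """
    diff = np.abs(layer_cur.astype(np.int32) - layer_prev.astype(np.int32))
    coarse = (diff >= t_i).astype(np.uint8)            # subsampled mask M'_{n,o}
    # Nearest-neighbour upsampling to N_y x N_x.
    ys = (np.arange(out_shape[0]) * coarse.shape[0]) // out_shape[0]
    xs = (np.arange(out_shape[1]) * coarse.shape[1]) // out_shape[1]
    return coarse[np.ix_(ys, xs)]                      # full-size mask M_n
```

Since the pyramid layers are already computed by the detector, the only added work is one absolute difference and one threshold per coarse pixel.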
3.2. Keypoint Binning Detection Mask
Considering two contiguous frames of a video sequence, the numbers of features identified in a given area are often correlated [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region of the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the $n$-th frame, a spatial binning process is applied to the features extracted from frame $I_{n-1}$. To this end, we define a grid consisting of $N_r \times N_c$ spatial bins $B_{i,j}$, $i = 0, \dots, N_r - 1$, $j = 0, \dots, N_c - 1$. Thus, each bin refers to a rectangular area of $S_x \times S_y$ pixels, where $S_x = N_x/N_c$ and $S_y = N_y/N_r$. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:
$$M''_n(k, l) = \left|\{ d_{n-1,i} \in D_{n-1} : \lfloor x_{n-1,i}/S_x \rfloor = k,\; \lfloor y_{n-1,i}/S_y \rfloor = l \}\right| \qquad (3)$$
where $(x_{n-1,i}, y_{n-1,i})$ represents the location of feature $d_{n-1,i}$
and $|\cdot|$ the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding such histogram, employing a tunable threshold $T_H$:
$$M'_n(k, l) = \begin{cases} 1 & \text{if } M''_n(k, l) \ge T_H \\ 0 & \text{if } M''_n(k, l) < T_H \end{cases} \qquad (4)$$
Finally, the Detection Mask $M_n$, having size $N_x \times N_y$ pixels, is obtained by upsampling the intermediate representation $M'_n$. Such a detection mask is applied to all scale-space octaves.
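Equations (3) and (4) together can be sketched as follows, again assuming nearest-neighbour upsampling and hypothetical names:

```python
import numpy as np

def keypoint_binning_mask(prev_keypoints, frame_shape, n_rows, n_cols, t_h):
    """Eqs. (3)-(4): histogram previous-frame keypoints over an
    N_r x N_c grid, threshold with T_H, then upsample to frame size.
    prev_keypoints: iterable of (x, y) locations from frame n-1.
    """
    ny, nx = frame_shape
    sy, sx = ny / n_rows, nx / n_cols                  # bin size S_y, S_x
    hist = np.zeros((n_rows, n_cols), dtype=np.int32)  # M''_n, eq. (3)
    for x, y in prev_keypoints:
        hist[min(int(y // sy), n_rows - 1), min(int(x // sx), n_cols - 1)] += 1
    coarse = (hist >= t_h).astype(np.uint8)            # M'_n, eq. (4)
    # Nearest-neighbour upsampling to the full N_y x N_x frame.
    ys = (np.arange(ny) * n_rows) // ny
    xs = (np.arange(nx) * n_cols) // nx
    return coarse[np.ix_(ys, xs)]                      # M_n
```

The cost is linear in the number of previous-frame keypoints plus the (coarse) grid size, so it is negligible compared to running the detector itself.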
Detection Mask

[Figure: detection example distinguishing new features from propagated features]
- How to compute the detection mask?
  - Need for a computationally efficient algorithm!
- We propose two alternatives:
  - Intensity Difference Detection Mask
  - Keypoint Binning Detection Mask
2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)
Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.
The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.
Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.
3. FAST VIDEO FEATURE EXTRACTION
Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.
Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx
⇥Ny having the
same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,
Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)
Note that the algorithm used to compute the Detection Mask
needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask
and Keypoint Binning Detection Mask.
3.1. Intensity Difference Detection Mask
The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:
M0n,o(k, l) =
(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI
0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)
where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0
n,o. Finally,the intermediate representation M0
n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx
⇥Ny . Masks can then be applied to detection in
different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.
3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the number of features identified in a given area is often correlated [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region of the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the n-th frame, a spatial binning process is applied to the features extracted from frame I_{n-1}. To this end, we define a grid consisting of N_r × N_c spatial bins B_{i,j}, i = 0, ..., N_r - 1, j = 0, ..., N_c - 1. Thus, each bin refers to a rectangular area of S_x × S_y pixels, where S_x = N_x / N_c and S_y = N_y / N_r. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:

M''_n(k, l) = |{d_{n-1,i} ∈ D_{n-1} : ⌊x_{n-1,i} / S_x⌋ = k, ⌊y_{n-1,i} / S_y⌋ = l}|,   (3)

where (x_{n-1,i}, y_{n-1,i}) represents the location of feature d_{n-1,i} and |·| the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding such a histogram with a tunable threshold T_H:

M'_n(k, l) = 1 if M''_n(k, l) >= T_H, 0 otherwise.   (4)

Finally, the Detection Mask M_n, having size N_x × N_y pixels, is obtained by upsampling the intermediate representation M'_n. Such a detection mask is applied to all scale-space octaves.
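The binning procedure of Eqs. (3)-(4) can be sketched as follows. The keypoint layout (a list of (x, y) positions), the function name, and the nearest-neighbour upsampling are our own assumptions for illustration.

```python
import numpy as np

def keypoint_binning_mask(keypoints, frame_size, n_rows, n_cols, t_h=1):
    """Eqs. (3)-(4): build an N_r x N_c spatial histogram of the previous
    frame's keypoint locations, threshold it with T_H, and upsample the
    binary bin map to full frame resolution (nearest-neighbour)."""
    ny, nx = frame_size
    sy, sx = ny / n_rows, nx / n_cols  # bin size S_y x S_x
    hist = np.zeros((n_rows, n_cols), dtype=np.int32)
    for x, y in keypoints:
        # Eq. (3): assign each keypoint to its bin via floor division
        hist[min(int(y // sy), n_rows - 1), min(int(x // sx), n_cols - 1)] += 1
    binary = (hist >= t_h).astype(np.uint8)  # Eq. (4)
    # replicate each bin over its S_x x S_y pixel area
    rep = np.ones((int(np.ceil(sy)), int(np.ceil(sx))), dtype=np.uint8)
    return np.kron(binary, rep)[:ny, :nx]
```

With t_h = 1 the detector re-runs wherever at least one feature was found in the previous frame; raising T_H trades accuracy for speed.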
Intensity Difference Detection Mask

- Idea: apply detection only to regions that vary sufficiently across contiguous frames
- To this end, compute the absolute difference between downsampled representations of two consecutive frames
  - Already computed by the scale-space pyramid!
- If the difference in a given region is greater than a threshold, perform detection in that region
- Final mask obtained through upsampling
The BRISK detector is a scale-invariant version of the lightweight FAST corner detector, based on the Accelerated Segment Test (AST). Such a test classifies a candidate point p (with intensity I_p) as a keypoint if n contiguous pixels in the Bresenham circle of radius 3 around p are all brighter than I_p + t, or all darker than I_p - t, with t a predefined threshold. Thus, the higher the threshold, the lower the number of detected keypoints, and vice versa. Each keypoint is finally assigned a score s, defined as the largest threshold for which p is still classified as a keypoint.

Scale-invariance is achieved in BRISK by building a scale-space pyramid consisting of a pre-determined number of octaves and intra-octaves. Each octave is formed by progressively half-sampling the original image, while the intra-octaves are obtained by downsampling the original image by a factor of 1.5 and by successive half-samplings. The FAST detector is applied separately to each layer of the scale-space pyramid, in order to identify potential regions of interest having different sizes. Then, non-maxima suppression is applied in a 3×3 scale-space neighborhood, retaining only features corresponding to local maxima. Finally, a three-step interpolation process is applied in order to identify the correct position of the keypoint with sub-pixel and sub-scale precision.
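The segment test can be sketched as follows; a minimal NumPy illustration in which the function names and default parameter values are our own, not taken from the paper.

```python
import numpy as np

# Offsets of the 16 pixels on the Bresenham circle of radius 3,
# listed in circular order around the candidate point.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def _max_circular_run(mask):
    """Length of the longest circular run of True values."""
    if mask.all():
        return len(mask)
    best = run = 0
    for v in np.concatenate([mask, mask]):  # doubling handles wrap-around
        run = run + 1 if v else 0
        best = max(best, run)
    return min(best, len(mask))

def is_fast_corner(img, y, x, t=20, n=9):
    """Accelerated Segment Test: True if n contiguous circle pixels are
    all brighter than I_p + t or all darker than I_p - t."""
    ip = int(img[y, x])
    vals = np.array([int(img[y + dy, x + dx]) for dx, dy in CIRCLE])
    brighter = vals > ip + t
    darker = vals < ip - t
    return _max_circular_run(brighter) >= n or _max_circular_run(darker) >= n
```

The keypoint score s would then be found by the largest t for which `is_fast_corner` still returns True (e.g., by bisection over t).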
3. FAST VIDEO FEATURE EXTRACTION
Let I_n denote the n-th frame of a video sequence of size N_x × N_y, which is processed to extract a set of local features D_n. First, a keypoint detector is applied to identify a set of interest points. Then, a descriptor is computed on the (rotated) patch surrounding each keypoint. Hence, each element d_{n,i} ∈ D_n is a visual feature, which consists of two components: i) a 4-dimensional vector p_{n,i} = [x, y, σ, θ]^T, indicating the position (x, y), the scale σ of the detected keypoint, and the orientation angle θ of the image patch; ii) a P-dimensional binary vector d_{n,i} ∈ {0, 1}^P, which represents the descriptor associated to the keypoint p_{n,i}.
Traditionally, local feature extraction algorithms have been designed to efficiently extract and describe salient points within a single frame. For video sequences, a straightforward approach consists in applying a feature extraction algorithm separately to each frame of the sequence at hand. However, such a method is inefficient from a computational point of view, as the temporal redundancy between contiguous frames is not taken into consideration. The main idea behind our approach is to apply a keypoint detection algorithm only on some regions of each frame. To this end, for each frame I_n, a binary Detection Mask M_n ∈ {0, 1}^{N_x × N_y}, having the same size as the input image, is computed by exploiting the information extracted from previous frames. Such a mask defines the regions of the frame where a keypoint detector has to be applied. That is, considering an image pixel I_n(x, y), a keypoint detector is applied to that pixel if the corresponding mask element M_n(x, y) is equal to 1. Furthermore, we assume that if a region of the n-th frame is not subject to keypoint detection, the keypoints present in that area in the previous frame, i.e. I_{n-1}, are still valid. Hence, such keypoints are propagated to the current set of features. That is,
D_n = {d_{n,i} : M_n(p_{n,i}) = 1} ∪ {d_{n-1,j} : M_n(p_{n-1,j}) = 0}.   (1)
Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achieved by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two such algorithms for obtaining a Detection Mask are proposed: the Intensity Difference Detection Mask and the Keypoint Binning Detection Mask.
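The propagation rule of Eq. (1) can be sketched as follows; the feature record layout (a dict with a 'pos' field) is our own assumption for illustration.

```python
def propagate_features(mask, detected_curr, features_prev):
    """Eq. (1): keep keypoints newly detected inside active mask regions,
    and propagate previous-frame keypoints lying where the mask is 0."""
    def active(feat):
        x, y = feat['pos']  # keypoint position p_{n,i}
        return mask[int(y)][int(x)] == 1
    return ([f for f in detected_curr if active(f)]
            + [f for f in features_prev if not active(f)])
```

Keypoints carried over from I_{n-1} keep their descriptors as well, so the descriptor stage is also skipped for masked-out regions.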
Intensity Difference Detection Mask

Idea: apply a detector only to regions of the image that sufficiently vary across contiguous frames.

[Figure: absolute difference between consecutive frames, thresholded to obtain the detection mask.]
Keypoint Binning Detection Mask

- Idea: apply detection only to regions where features have been found in previous frames
- To this end, compute a 2D spatial histogram of keypoint locations
- If the number of keypoints in a spatial bin (of the previous frame) is greater than a threshold, perform detection in that region
Keypoint Binning Detection Mask

Idea: apply a detector only to regions of the image where at least N features have been found in previous frames.
Experiments

- Datasets:
  - Stanford MAR dataset (4 sequences of CD covers under different imaging conditions)
  - Rome Landmark dataset (10 sequences of different landmarks in Rome)
  - Stanford MAR multiple object (4 sequences of different objects)
- Selected local features: BRISK (but our method is generally applicable)
- Depending on the dataset, different accuracy measures:
  - Matches-post-RANSAC (MPR) for the Stanford MAR dataset
  - Mean of Average Precision (MAP) for the Rome Landmark dataset
  - Combined detection and tracking accuracy for Stanford MAR multiple objects
- Complexity is measured by means of the required CPU time
Comparison with baselines

[Figure: per-frame computational time (ms, log scale) and accuracy (MPR) over 200 frames, comparing full detection, ID detection mask, KB detection mask, and patch propagation with Δ = 10.]
Results – Stanford MAR

[Figure: accuracy (MPR) vs. computational time (ms) for full detection, ID detection mask, and KB detection mask: 30% reduction in computational time.]
Results – Rome Landmark Dataset

[Figure: MAP vs. computational time (ms) for full detection, ID detection mask, and KB detection mask: 35% reduction in computational time, 0.03% loss in MAP.]
Results – Stanford Multiple Object

[Figure: average tracking precision (pixels) vs. feature extraction time per frame (ms) for full detection, ID detection mask, and KB detection mask: 40% reduction in extraction time.]
Conclusions

- Up to 35-40% reduction in computational complexity without significantly reducing visual task accuracy
- Higher frame rates / lower power consumption on low-power devices (smartphones, embedded systems)
Thank you!