Fast Keypoint Detection in Video Sequences
Luca Baroffio, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, Stefano Tubaro
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Italy
March 2016, ICASSP 2016


Local Visual Features

- Starting point for many computer vision tasks:
  - Object recognition
  - Content-based retrieval
  - Image registration
- Two-step approach:
  - First step: keypoint detection (corners, blobs, etc.)
  - Second step: descriptor extraction (SIFT, SURF, BRISK, etc.)


Local feature detection in video

- Most algorithms are tailored to still images.
- For video, past literature targets the identification of keypoints that are stable across time:
  - Stable features are key to object tracking, event identification and video calibration (main goal: application accuracy).
  - Stable features improve the efficiency of coding architectures exploiting temporal redundancy (main goal: minimize bandwidth).
- We target computational complexity:
  - Low-power devices (smartphones, embedded systems, Visual Sensor Networks) require the feature detection process to be both fast and accurate.


Fast extraction from video

- Baseline approach: apply a feature detector to each frame of a video sequence.
  - Inefficient from a computational point of view!
  - Temporal redundancy is not exploited!
- Our approach: apply the feature detector only in regions of the current frame that are sufficiently different from the previous one.
  - Compute a detection mask to identify such regions.
  - Reuse keypoints from outside those regions (keypoint propagation from the previous frame to the current one).

2. BINARY ROBUST INVARIANT SCALABLE KEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant Scalable Keypoints (BRISK) algorithm as a computationally efficient alternative to traditional local feature detectors and descriptors. The algorithm consists of two main steps: i) a keypoint detector, which identifies salient points in a scale-space, and ii) a keypoint descriptor, which assigns each keypoint a rotation- and scale-invariant binary descriptor. Each element of such a descriptor is obtained by comparing the intensities of a given pair of pixels sampled within the neighborhood of the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweight FAST [3] corner detector, based on the Accelerated Segment Test (AST). Such a test classifies a candidate point p (with intensity Ip) as a keypoint if n contiguous pixels in the Bresenham circle of radius 3 around p are all brighter than Ip + t, or all darker than Ip − t, with t a predefined threshold. Thus, the higher the threshold, the lower the number of detected keypoints, and vice versa.
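The segment test described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code; the offset table lists the 16 pixels of the radius-3 Bresenham circle, and the function name is hypothetical.

```python
import numpy as np

# 16-pixel Bresenham circle of radius 3, as (row, col) offsets from the candidate.
BRESENHAM_CIRCLE = [
    (-3, 0), (-3, 1), (-2, 2), (-1, 3), (0, 3), (1, 3), (2, 2), (3, 1),
    (3, 0), (3, -1), (2, -2), (1, -3), (0, -3), (-1, -3), (-2, -2), (-3, -1),
]

def is_fast_keypoint(img, y, x, t, n=9):
    """Accelerated Segment Test: True if n contiguous circle pixels are all
    brighter than Ip + t or all darker than Ip - t."""
    ip = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dy, dx in BRESENHAM_CIRCLE]
    ring = ring + ring  # duplicate the ring so wrap-around runs are counted
    run_bright = run_dark = 0
    for v in ring:
        run_bright = run_bright + 1 if v > ip + t else 0
        run_dark = run_dark + 1 if v < ip - t else 0
        if run_bright >= n or run_dark >= n:
            return True
    return False
```

A bright (or dark) spot surrounded by a uniform background passes the test, while a flat region does not.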

Scale-invariance is achieved in BRISK by building a scale-space pyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original image. The FAST detector is applied separately to each layer of the scale-space pyramid, in order to identify potential regions of interest having different sizes. Then, non-maxima suppression is applied in a 3×3 scale-space neighborhood, retaining only features corresponding to local maxima. Finally, a three-step interpolation process is applied in order to refine the position of the keypoint with sub-pixel and sub-scale precision.
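The octave structure can be sketched as plain half-sampling. This is illustrative only: BRISK additionally smooths the image and interleaves intra-octave layers, which this naive decimation omits.

```python
import numpy as np

def build_pyramid(img, octaves=4):
    """One layer per octave, each half the size of the previous one.
    (Naive decimation; no anti-aliasing filter, no intra-octaves.)"""
    layers = [img.astype(np.float32)]
    for _ in range(1, octaves):
        layers.append(layers[-1][::2, ::2].copy())  # keep every other row/column
    return layers
```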

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx × Ny, which is processed to extract a set of local features Dn. First, a keypoint detector is applied to identify a set of interest points. Then, a descriptor is applied on the (rotated) patches surrounding each keypoint. Hence, each element dn,i ∈ Dn is a visual feature, which consists of two components: i) a 4-dimensional vector pn,i = [x, y, σ, θ]^T, indicating the position (x, y), the scale σ of the detected keypoint, and the orientation angle θ of the image patch; ii) a P-dimensional binary vector dn,i ∈ {0, 1}^P, which represents the descriptor associated to the keypoint pn,i.
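The two-component feature structure just described might be represented as follows (a sketch; the class and field names are illustrative, not from the paper):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class VisualFeature:
    """One element d_{n,i} of D_n: a keypoint p = (x, y, sigma, theta)
    plus a P-dimensional binary descriptor."""
    x: float
    y: float
    sigma: float            # scale of the detected keypoint
    theta: float            # orientation angle of the image patch
    descriptor: np.ndarray  # dtype=bool, shape (P,)
```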

Traditionally, local feature extraction algorithms have been designed to efficiently extract and describe salient points within a single frame. Considering video sequences, a straightforward approach consists in applying a feature extraction algorithm separately to each frame of the video sequence at hand. However, such a method is inefficient from a computational point of view, as the temporal redundancy between contiguous frames is not taken into consideration. The main idea behind our approach is to apply a keypoint detection algorithm only on some regions of each frame. To this end, for each frame In, a binary Detection Mask Mn ∈ {0, 1}^(Nx×Ny) having the same size as the input image is computed, exploiting the information extracted from previous frames. Such a mask defines the regions of the frame where a keypoint detector has to be applied. That is, considering an image pixel In(x, y), a keypoint detector is applied to such a pixel if the corresponding mask element Mn(x, y) is equal to 1. Furthermore, we assume that if a region of the n-th frame is not subject to keypoint detection, the keypoints that are present in such an area in the previous frame, i.e. In−1, are still valid. Hence, such keypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1} ∪ {dn−1,j : Mn(pn−1,j) = 0}    (1)
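The propagation rule of Eq. (1) can be sketched as follows (hypothetical helper names; the mask is assumed indexable as mask[y][x], and features are (x, y, descriptor) tuples):

```python
def propagate_features(mask, detected_now, features_prev):
    """Eq. (1): keep newly detected keypoints where the mask is 1, and
    propagate previous-frame keypoints where the mask is 0."""
    current = [f for f in detected_now if mask[f[1]][f[0]] == 1]
    current += [f for f in features_prev if mask[f[1]][f[0]] == 0]
    return current
```

A previous-frame keypoint falling in a re-detected region (mask = 1) is discarded, since the detector run on that region replaces it.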

Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achievable by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two efficient algorithms for obtaining a Detection Mask are proposed: Intensity Difference Detection Mask and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions that change significantly across the frames of the video. In order to identify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring no extra cost. Considering frame In and O detection octaves, pyramid layers Ln,o, o = 1, ..., O are obtained by progressively smoothing and half-sampling the original image, as explained in Section 2. Then, considering two contiguous frames In−1 and In and octave o, a subsampled version of the Detection Mask is obtained as follows:

M′n,o(k, l) = { 1, if |Ln,o(k, l) − Ln−1,o(k, l)| > TI
             { 0, if |Ln,o(k, l) − Ln−1,o(k, l)| ≤ TI,    (2)

where TI is an arbitrarily chosen threshold and (k, l) the coordinates of the pixels in the intermediate representation M′n,o. Finally, the intermediate representation M′n,o resulting from the previous operation needs to be upsampled in order to obtain the final mask Mn ∈ {0, 1}^(Nx×Ny). Masks can then be applied to detection in different fashions: i) exploiting the mask obtained from each scale-space layer o = 1, ..., O in order to detect keypoints at the corresponding layer o; ii) using a single detection mask for all the scale-space layers.
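Reading Eq. (2) so that the mask is 1 where the inter-frame difference exceeds TI (matching the stated tenet of detecting only in changed regions), the mask computation and its upsampling might be sketched as follows. The nearest-neighbour upsampling is an assumption; the paper does not specify the upsampling filter.

```python
import numpy as np

def intensity_difference_mask(layer_now, layer_prev, t_i):
    """Subsampled detection mask (Eq. 2): 1 where the pyramid layer changed
    by more than t_i between contiguous frames, 0 elsewhere."""
    diff = np.abs(layer_now.astype(np.int32) - layer_prev.astype(np.int32))
    return (diff > t_i).astype(np.uint8)

def upsample_mask(mask, shape):
    """Nearest-neighbour upsampling of the subsampled mask to frame size
    (assumption: replicate each mask element over its pixel block)."""
    reps_y = -(-shape[0] // mask.shape[0])  # ceil division
    reps_x = -(-shape[1] // mask.shape[1])
    big = np.kron(mask, np.ones((reps_y, reps_x), dtype=np.uint8))
    return big[:shape[0], :shape[1]]
```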

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the number of features identified in a given area is often correlated [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region in the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the n-th frame, a spatial binning process is applied to the features extracted from frame In−1. To this end, we define a grid consisting of Nr × Nc spatial bins Bi,j, i = 0, ..., Nr − 1, j = 0, ..., Nc − 1. Thus, each bin refers to a rectangular area of Sx × Sy pixels, where Sx = Nx/Nc and Sy = Ny/Nr. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:

M″n(k, l) = |{dn−1,i ∈ Dn−1 : ⌊xn−1,i/Sx⌋ = k, ⌊yn−1,i/Sy⌋ = l}|,    (3)

where (xn−1,i, yn−1,i) represents the location of feature dn−1,i and | · | denotes the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding such a histogram, employing a tunable threshold TH:

M′n(k, l) = { 1, if M″n(k, l) ≥ TH
            { 0, if M″n(k, l) < TH,    (4)

Finally, the Detection Mask Mn having size Nx × Ny pixels is obtained by upsampling the intermediate representation M′n. Such a detection mask is applied to all scale-space octaves.
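Eqs. (3) and (4) can be sketched as follows (illustrative names; previous-frame keypoints are taken as (x, y) pairs, and the bin index k runs over columns, l over rows, as in Eq. (3)):

```python
import numpy as np

def keypoint_binning_mask(prev_keypoints, frame_shape, n_r, n_c, t_h):
    """Bin previous-frame keypoints into an Nr x Nc grid (Eq. 3) and
    threshold the counts to obtain a subsampled detection mask (Eq. 4)."""
    n_y, n_x = frame_shape
    s_x, s_y = n_x / n_c, n_y / n_r              # bin size in pixels
    hist = np.zeros((n_c, n_r), dtype=np.int32)  # M''_n, indexed as (k, l)
    for x, y in prev_keypoints:
        k, l = int(x // s_x), int(y // s_y)
        hist[k, l] += 1
    return (hist >= t_h).astype(np.uint8)        # M'_n
```

As with the intensity difference mask, the resulting subsampled mask must then be upsampled to the full Nx × Ny frame size before use.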

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)

Note that the algorithm used to compute the Detection Mask

needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask

and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:

M0n,o(k, l) =

(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI

0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)

where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0

n,o. Finally,the intermediate representation M0

n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx

⇥Ny . Masks can then be applied to detection in

different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N

x

/Nc

and Sy = Ny

/Nr

.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:

M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i

/Sx

c = k, byn�1,i/S

y

c = l,

(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i

and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :

M0n(k, l) =

(1 if M00

n(k, l) � TH

0 if M00n(k, l) < TH ,

(4)

Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0

n. Sucha detection mask is applied to all scale-space octaves.

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)

Note that the algorithm used to compute the Detection Mask

needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask

and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:

M0n,o(k, l) =

(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI

0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)

where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0

n,o. Finally,the intermediate representation M0

n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx

⇥Ny . Masks can then be applied to detection in

different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N

x

/Nc

and Sy = Ny

/Nr

.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:

M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i

/Sx

c = k, byn�1,i/S

y

c = l,

(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i

and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :

M0n(k, l) =

(1 if M00

n(k, l) � TH

0 if M00n(k, l) < TH ,

(4)

Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0

n. Sucha detection mask is applied to all scale-space octaves.

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)

Note that the algorithm used to compute the Detection Mask

needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask

and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:

M0n,o(k, l) =

(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI

0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)

where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0

n,o. Finally,the intermediate representation M0

n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx

⇥Ny . Masks can then be applied to detection in

different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N

x

/Nc

and Sy = Ny

/Nr

.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:

M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i

/Sx

c = k, byn�1,i/S

y

c = l,

(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i

and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :

M0n(k, l) =

(1 if M00

n(k, l) � TH

0 if M00n(k, l) < TH ,

(4)

Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0

n. Sucha detection mask is applied to all scale-space octaves.

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let $I_n$ denote the $n$-th frame of a video sequence of size $N_x \times N_y$, which is processed to extract a set of local features $\mathcal{D}_n$. First, a keypoint detector is applied to identify a set of interest points. Then, a descriptor is computed on the (rotated) patch surrounding each keypoint. Hence, each element $d_{n,i} \in \mathcal{D}_n$ is a visual feature, which consists of two components: i) a 4-dimensional vector $p_{n,i} = [x, y, \sigma, \theta]^T$, indicating the position $(x, y)$ and scale $\sigma$ of the detected keypoint, and the orientation angle $\theta$ of the image patch; ii) a $P$-dimensional binary vector $d_{n,i} \in \{0, 1\}^P$, which represents the descriptor associated with the keypoint $p_{n,i}$.

Traditionally, local feature extraction algorithms have been designed to efficiently extract and describe salient points within a single frame. For video sequences, a straightforward approach consists in applying a feature extraction algorithm separately to each frame of the sequence at hand. However, such a method is computationally inefficient, as the temporal redundancy between contiguous frames is not taken into consideration. The main idea behind our approach is to apply a keypoint detection algorithm only to some regions of each frame. To this end, for each frame $I_n$, a binary Detection Mask $M_n \in \{0, 1\}^{N_x \times N_y}$ having the same size as the input image is computed, exploiting the information extracted from previous frames. Such a mask defines the regions of the frame where a keypoint detector has to be applied. That is, considering an image pixel $I_n(x, y)$, a keypoint detector is applied to that pixel only if the corresponding mask element $M_n(x, y)$ is equal to 1. Furthermore, we assume that if a region of the $n$-th frame is not subject to keypoint detection, the keypoints present in that area in the previous frame $I_{n-1}$ are still valid. Hence, such keypoints are propagated to the current set of features. That is,

\[
\mathcal{D}_n = \{\, d_{n,i} : M_n(p_{n,i}) = 1 \,\} \cup \{\, d_{n-1,j} : M_n(p_{n-1,j}) = 0 \,\}. \tag{1}
\]

Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achievable by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two efficient algorithms for obtaining a Detection Mask are proposed: the Intensity Difference Detection Mask and the Keypoint Binning Detection Mask.
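The masked detection and propagation rule of Eq. (1) can be sketched as follows, with `detect` and `describe` as placeholder callables rather than a real detector API:

```python
import numpy as np

def extract_features(frame, mask, prev_features, detect, describe):
    """Sketch of Eq. (1): detect only where the mask is 1, and propagate
    previous-frame features whose location falls in a masked-out region.
    `detect` and `describe` are placeholder callables, not a real API."""
    features = []
    for (x, y) in detect(frame):              # candidate keypoints
        if mask[y, x] == 1:                   # active region: keep new detection
            features.append(((x, y), describe(frame, (x, y))))
    for ((x, y), desc) in prev_features:      # previous-frame features
        if mask[y, x] == 0:                   # inactive region: propagate
            features.append(((x, y), desc))
    return features
```

Note that a previous-frame feature is dropped when its region is re-detected, so each image area contributes either fresh or propagated keypoints, never both.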

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions that change significantly across the frames of the video. In order to identify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring no extra cost. Considering frame $I_n$ and $O$ detection octaves, pyramid layers $L_{n,o}$, $o = 1, \ldots, O$, are obtained by progressively smoothing and half-sampling the original image, as explained in Section 2. Then, considering two contiguous frames $I_{n-1}$ and $I_n$ and octave $o$, a subsampled version of the Detection Mask is obtained as follows:

\[
M'_{n,o}(k, l) =
\begin{cases}
1 & \text{if } |L_{n,o}(k, l) - L_{n-1,o}(k, l)| > T_I \\
0 & \text{if } |L_{n,o}(k, l) - L_{n-1,o}(k, l)| \le T_I,
\end{cases} \tag{2}
\]

where $T_I$ is an arbitrarily chosen threshold and $(k, l)$ are the coordinates of the pixels in the intermediate representation $M'_{n,o}$. Finally, the intermediate representation $M'_{n,o}$ resulting from the previous operation needs to be upsampled in order to obtain the final mask $M_n \in \{0, 1\}^{N_x \times N_y}$. The masks can then be applied to detection in different fashions: i) exploiting the mask obtained from each scale-space layer $o = 1, \ldots, O$ in order to detect keypoints at the corresponding layer $o$; ii) using a single detection mask for all the scale-space layers.
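A minimal sketch of the Intensity Difference Detection Mask, assuming the mask is set to 1 wherever the absolute layer difference exceeds $T_I$, with nearest-neighbour upsampling to frame size (function names are ours):

```python
import numpy as np

def intensity_diff_mask(layer_prev, layer_cur, t_i):
    """Eq. (2): a subsampled-mask pixel is 1 when the pyramid-layer
    intensity changed by more than T_I between contiguous frames."""
    diff = np.abs(layer_cur.astype(np.int32) - layer_prev.astype(np.int32))
    return (diff > t_i).astype(np.uint8)

def upsample_mask(sub_mask, full_shape):
    """Nearest-neighbour upsampling of the subsampled mask to frame size."""
    ys = np.arange(full_shape[0]) * sub_mask.shape[0] // full_shape[0]
    xs = np.arange(full_shape[1]) * sub_mask.shape[1] // full_shape[1]
    return sub_mask[np.ix_(ys, xs)]
```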

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the number of features identified in a given area is often correlated [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region of the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the $n$-th frame, a spatial binning process is applied to the features extracted from frame $I_{n-1}$. To this end, we define a grid consisting of $N_r \times N_c$ spatial bins $B_{i,j}$, $i = 0, \ldots, N_r$, $j = 0, \ldots, N_c$. Thus, each bin refers to a rectangular area of $S_x \times S_y$ pixels, where $S_x = N_x / N_c$ and $S_y = N_y / N_r$. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:

\[
M''_n(k, l) = \bigl|\{\, d_{n-1,i} \in \mathcal{D}_{n-1} : \lfloor x_{n-1,i}/S_x \rfloor = k,\ \lfloor y_{n-1,i}/S_y \rfloor = l \,\}\bigr|, \tag{3}
\]

where $(x_{n-1,i}, y_{n-1,i})$ represents the location of feature $d_{n-1,i}$ and $|\cdot|$ denotes the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding this histogram with a tunable threshold $T_H$:

\[
M'_n(k, l) =
\begin{cases}
1 & \text{if } M''_n(k, l) \ge T_H \\
0 & \text{if } M''_n(k, l) < T_H.
\end{cases} \tag{4}
\]

Finally, the Detection Mask $M_n$, having size $N_x \times N_y$ pixels, is obtained by upsampling the intermediate representation $M'_n$. Such a detection mask is applied to all scale-space octaves.
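A minimal sketch of the Keypoint Binning Detection Mask of Eqs. (3) and (4), assuming integer bin sizes and nearest-neighbour upsampling (function and parameter names are ours):

```python
import numpy as np

def binning_mask(prev_keypoints, frame_shape, n_rows, n_cols, t_h):
    """Eqs. (3)-(4): histogram the previous frame's keypoints over an
    N_r x N_c grid of S_x x S_y bins, then threshold the counts with T_H."""
    ny, nx = frame_shape
    sx, sy = nx / n_cols, ny / n_rows              # bin sizes S_x, S_y
    hist = np.zeros((n_rows, n_cols), dtype=int)
    for (x, y) in prev_keypoints:                  # Eq. (3): spatial binning
        hist[int(y // sy), int(x // sx)] += 1
    sub_mask = (hist >= t_h).astype(np.uint8)      # Eq. (4): thresholding
    # Upsample the subsampled mask to the full N_x x N_y detection mask.
    return np.repeat(np.repeat(sub_mask, int(sy), axis=0), int(sx), axis=1)
```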

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)

Note that the algorithm used to compute the Detection Mask

needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask

and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:

M0n,o(k, l) =

(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI

0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)

where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0

n,o. Finally,the intermediate representation M0

n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx

⇥Ny . Masks can then be applied to detection in

different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N

x

/Nc

and Sy = Ny

/Nr

.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:

M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i

/Sx

c = k, byn�1,i/S

y

c = l,

(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i

and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :

M0n(k, l) =

(1 if M00

n(k, l) � TH

0 if M00n(k, l) < TH ,

(4)

Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0

n. Sucha detection mask is applied to all scale-space octaves.

Page 5: Fast Keypoint Detection in Video Sequences - SigPort · Fast Keypoint Detection in Video Sequences Luca Baroffio, Maeo Cesana ... as a keypoint if n contiguous pixels in the Bresenham

Fast extraction from video

n Formally:n  Letbethesetoffeaturesextractedfromframe(size)n  Letbethei-thfeaturesoftheset,computedinkeypointloca?on

n  Letbeabinarydetec?onmaskdefiningtheregionsoftheframewherethedetectorshouldbeapplied

March 2016 ICASSP 2016

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)

Note that the algorithm used to compute the Detection Mask

needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask

and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:

M0n,o(k, l) =

(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI

0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)

where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0

n,o. Finally,the intermediate representation M0

n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx

⇥Ny . Masks can then be applied to detection in

different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N

x

/Nc

and Sy = Ny

/Nr

.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:

M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i

/Sx

c = k, byn�1,i/S

y

c = l,

(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i

and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :

M0n(k, l) =

(1 if M00

n(k, l) � TH

0 if M00n(k, l) < TH ,

(4)

Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0

n. Sucha detection mask is applied to all scale-space octaves.

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)

Note that the algorithm used to compute the Detection Mask

needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask

and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:

M0n,o(k, l) =

(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI

0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)

where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0

n,o. Finally,the intermediate representation M0

n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx

⇥Ny . Masks can then be applied to detection in

different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the amountof features identified in a given area are often correlated [17]. To ex-ploit such information, the detector is applied to a region of the inputimage only if the number of features extracted in the co-located re-gion in the previous frame is greater than a threshold. Specifically, inorder to obtain a Detection Mask for the n�th frame, a spatial bin-ning process is applied to the features extracted from frame In�1.To this end, we define a grid consisting of Nr ⇥ Nc spatial binsBi,j , i = 0, . . . ,Nr, j = 0, . . . ,Nc. Thus, each bin refers to a rect-angular area of Sx⇥Sy pixels, where Sx = N

x

/Nc

and Sy = Ny

/Nr

.Then, a two-dimensional spatial histogram of keypoints is created byassigning each feature to the corresponding bin as follows:

M00n(k, l) = |dn�1,i 2 Dn�1| : bxn�1,i

/Sx

c = k, byn�1,i/S

y

c = l,

(3)where (xn�1,i, yn�1,i) represents the location of feature dn�1,i

and | · | the number of elements in a set. Then, a binary subsam-pled version of the Detection Mask is obtained by thresholding suchhistogram, employing a tunable threshold TH :

M0n(k, l) =

(1 if M00

n(k, l) � TH

0 if M00n(k, l) < TH ,

(4)

Finally, the Detection Mask Mn having size Nx ⇥Ny pixels isobtained by upsampling the intermediate representation M0

n. Sucha detection mask is applied to all scale-space octaves.

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

Dn = {dn,i : Mn(pn,i) = 1 [ dn�1,j : Mn(pn�1,j) = 0} (1)

Note that the algorithm used to compute the Detection Mask

needs to be computationally efficient, so that the savings achievableby skipping detection in some parts of the frame are not offset by thisextra cost. In the following, two efficient algorithms for obtaining aDetection Mask are proposed: Intensity Difference Detection Mask

and Keypoint Binning Detection Mask.

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions thatchange significantly across the frames of the video. In order to iden-tify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring in no ex-tra cost. Considering frame In and O detection octaves, pyramidlayers Ln,o, o = 1, . . . ,O are obtained by progressively smooth-ing and half-sampling the original image, as explained in Section 2.Then, considering two contiguous frames In�1 and In and octave o,a subsampled version of the Detection Mask is obtained as follows:

M0n,o(k, l) =

(1 if |Ln,o(k, l)� Ln�1,o(k, l)| TI

0 if |Ln,o(k, l)� Ln�1,o(k, l)| > TI ,(2)

where TI is an arbitrarily chosen threshold and (k, l) the coordi-nates of the pixels in the intermediate representation M0

n,o. Finally,the intermediate representation M0

n,o resulting from the previousoperation needs to be upsampled in order to obtain the final maskMn 2 {0, 1}Nx

⇥Ny . Masks can then be applied to detection in

different fashions: i) exploiting the mask obtained resorting to eachscale-space layer o = 1, . . . ,O in order to detect keypoint at thecorresponding layer o; ii) use a single detection mask for all thescale-space layers.

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the number of features identified in a given area is often correlated [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region of the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the $n$-th frame, a spatial binning process is applied to the features extracted from frame $I_{n-1}$. To this end, we define a grid consisting of $N_r \times N_c$ spatial bins $B_{i,j}$, $i = 0, \dots, N_r$, $j = 0, \dots, N_c$. Thus, each bin refers to a rectangular area of $S_x \times S_y$ pixels, where $S_x = N_x / N_c$ and $S_y = N_y / N_r$. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:

$$M''_n(k, l) = \left| \{ d_{n-1,i} \in D_{n-1} : \lfloor x_{n-1,i} / S_x \rfloor = k,\ \lfloor y_{n-1,i} / S_y \rfloor = l \} \right|, \quad (3)$$

where $(x_{n-1,i}, y_{n-1,i})$ represents the location of feature $d_{n-1,i}$ and $|\cdot|$ denotes the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding such a histogram, employing a tunable threshold $T_H$:

$$M'_n(k, l) = \begin{cases} 1 & \text{if } M''_n(k, l) \ge T_H \\ 0 & \text{if } M''_n(k, l) < T_H. \end{cases} \quad (4)$$

Finally, the Detection Mask $M_n$, having size $N_x \times N_y$ pixels, is obtained by upsampling the intermediate representation $M'_n$. Such a detection mask is applied to all scale-space octaves.
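Eqs. (3)-(4) can be sketched as follows; the function name, the grid size, and the threshold $T_H$ are illustrative, and for simplicity the sketch assumes the frame divides evenly into bins:

```python
import numpy as np

def binning_mask(keypoints_prev, frame_shape, n_r=8, n_c=8, t_h=2):
    """Spatial histogram of the previous frame's keypoints (Eq. 3),
    thresholded with T_H (Eq. 4) and upsampled to a full-size mask."""
    n_y, n_x = frame_shape
    assert n_x % n_c == 0 and n_y % n_r == 0
    s_x, s_y = n_x // n_c, n_y // n_r                 # bin sizes S_x, S_y
    hist = np.zeros((n_r, n_c), dtype=np.int32)       # M''_n
    for x, y in keypoints_prev:                       # Eq. (3): bin each feature
        hist[int(y) // s_y, int(x) // s_x] += 1
    m_sub = (hist >= t_h).astype(np.uint8)            # Eq. (4): M'_n
    # replicate each bin over its S_y x S_x pixel area to obtain M_n
    return np.kron(m_sub, np.ones((s_y, s_x), dtype=np.uint8))
```

Note that only the previous frame's keypoint list is needed, so the mask costs a single pass over $|D_{n-1}|$ features.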

2. BINARY ROBUST INVARIANT SCALABLE KEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant Scalable Keypoints (BRISK) algorithm as a computationally efficient alternative to traditional local feature detectors and descriptors. The algorithm consists of two main steps: i) a keypoint detector, which identifies salient points in a scale-space, and ii) a keypoint descriptor, which assigns each keypoint a rotation- and scale-invariant binary descriptor. Each element of such a descriptor is obtained by comparing the intensities of a given pair of pixels sampled within the neighborhood of the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweight FAST [3] corner detector, based on the Accelerated Segment Test (AST). Such a test classifies a candidate point $p$ (with intensity $I_p$) as a keypoint if $n$ contiguous pixels in the Bresenham circle of radius 3 around $p$ are all brighter than $I_p + t$, or all darker than $I_p - t$, with $t$ a predefined threshold. Thus, the higher the threshold, the lower the number of detected keypoints, and vice versa.
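The segment test can be sketched in plain Python. The circle offsets follow the standard radius-3 Bresenham ring; `is_ast_corner` and the default values of $t$ and $n$ are illustrative, not the authors' implementation:

```python
# Offsets of the 16 pixels on the Bresenham circle of radius 3 around (x, y).
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_ast_corner(img, x, y, t=20, n=9):
    """Accelerated Segment Test: p = (x, y) is a corner if n contiguous
    circle pixels are all brighter than Ip + t or all darker than Ip - t."""
    ip = int(img[y][x])
    ring = [int(img[y + dy][x + dx]) for dx, dy in CIRCLE]
    for sign in (1, -1):                 # check the brighter arc, then the darker one
        flags = [sign * (v - ip) > t for v in ring]
        flags += flags                   # duplicate to handle wrap-around runs
        run = 0
        for f in flags:
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False
```

Real implementations short-circuit this test with a few high-speed pixel comparisons before scanning the whole ring, which is what makes FAST fast.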

Scale-invariance is achieved in BRISK by building a scale-space pyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original image. The FAST detector is applied separately to each layer of the scale-space pyramid, in order to identify potential regions of interest having different sizes. Then, non-maxima suppression is applied in a 3×3 scale-space neighborhood, retaining only features corresponding to local maxima. Finally, a three-step interpolation process is applied in order to refine the position of each keypoint with sub-pixel and sub-scale precision.
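A toy sketch of the octave pyramid, assuming one layer per octave and 2×2 mean filtering as the smoothing step (BRISK additionally inserts intra-octaves and uses different filtering; this only illustrates the progressive half-sampling):

```python
import numpy as np

def build_pyramid(frame, n_octaves=4):
    """Scale-space pyramid by progressive smoothing and half-sampling;
    each octave halves the resolution of the previous one."""
    layers = [frame.astype(np.float32)]
    for _ in range(n_octaves - 1):
        f = layers[-1]
        h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2
        # average 2x2 blocks: smoothing and downsampling in one step
        layers.append(f[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return layers
```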

3. FAST VIDEO FEATURE EXTRACTION

Let $I_n$ denote the $n$-th frame of a video sequence of size $N_x \times N_y$, which is processed to extract a set of local features $D_n$. First, a keypoint detector is applied to identify a set of interest points. Then, a descriptor is applied on the (rotated) patches surrounding each keypoint. Hence, each element $d_{n,i} \in D_n$ is a visual feature consisting of two components: i) a 4-dimensional vector $p_{n,i} = [x, y, \sigma, \theta]^T$, indicating the position $(x, y)$, the scale $\sigma$ of the detected keypoint, and the orientation angle $\theta$ of the image patch; ii) a $P$-dimensional binary vector $d_{n,i} \in \{0, 1\}^P$, which represents the descriptor associated to the keypoint $p_{n,i}$.
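The feature structure above can be sketched as a small container; the class and field names are illustrative, and the choice of $P = 512$ bits matches the BRISK descriptor length only as an example:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    """One visual feature d_{n,i}: keypoint p = [x, y, sigma, theta]^T
    plus a P-dimensional binary descriptor packed into bytes."""
    x: float
    y: float
    sigma: float       # scale of the detected keypoint
    theta: float       # orientation angle of the image patch
    descriptor: bytes  # P bits, e.g. P = 512 -> 64 bytes
```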

Traditionally, local feature extraction algorithms have been designed to efficiently extract and describe salient points within a single frame. Considering video sequences, a straightforward approach consists in applying a feature extraction algorithm separately to each frame of the video sequence at hand. However, such a method is inefficient from a computational point of view, as the temporal redundancy between contiguous frames is not taken into consideration. The main idea behind our approach is to apply a keypoint detection algorithm only to some regions of each frame. To this end, for each frame $I_n$, a binary Detection Mask $M_n \in \{0, 1\}^{N_x \times N_y}$, having the same size as the input image, is computed, exploiting the information extracted from previous frames. Such a mask defines the regions of the frame where a keypoint detector has to be applied. That is, considering an image pixel $I_n(x, y)$, a keypoint detector is applied to such a pixel if the corresponding mask element $M_n(x, y)$ is equal to 1. Furthermore, we assume that if a region of the $n$-th frame is not subject to keypoint detection, the keypoints that are present in such an area in the previous frame, i.e. $I_{n-1}$, are still valid. Hence, such keypoints are propagated to the current set of features. That is,

$$D_n = \{ d_{n,i} : M_n(p_{n,i}) = 1 \} \cup \{ d_{n-1,j} : M_n(p_{n-1,j}) = 0 \}. \quad (1)$$

Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achievable by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two efficient algorithms for obtaining a Detection Mask are proposed: Intensity Difference Detection Mask and Keypoint Binning Detection Mask.
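The detect-and-propagate rule of Eq. (1) can be sketched as follows; `detect_fn` stands in for any masked keypoint detector and the plain `(x, y)` point representation is an assumption for brevity:

```python
def masked_detect(mask, detect_fn, frame, prev_features):
    """Eq. (1): keep newly detected keypoints where the mask is 1 and
    propagate the previous frame's keypoints where the mask is 0.
    detect_fn(frame, mask) is a hypothetical detector returning (x, y) points."""
    new_feats = [p for p in detect_fn(frame, mask) if mask[p[1]][p[0]] == 1]
    kept = [p for p in prev_features if mask[p[1]][p[0]] == 0]
    return new_feats + kept
```

The union in Eq. (1) is disjoint by construction: each pixel region contributes either fresh detections or propagated keypoints, never both.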


[Figure: detection mask example — new features vs. propagated features]


Detection Mask

March 2016 ICASSP 2016

- How to compute the detection mask? We need a computationally efficient algorithm!
- We propose two alternatives:
  - Intensity Difference Detection Mask
  - Keypoint Binning Detection Mask

2. BINARY ROBUST INVARIANT SCALABLEKEYPOINTS (BRISK)

Leutenegger et al. [6] propose the Binary Robust Invariant ScalableKeypoints (BRISK) algorithm as a computationally efficient alterna-tive to traditional local feature detectors and descriptors. The algo-rithm consists in two main steps: i) a keypoint detector, that identi-fies salient points in a scale-space and ii) a keypoint descriptor, thatassigns each keypoint a rotation- and scale- invariant binary descrip-tor. Each element of such descriptor is obtained by comparing theintensities of a given pair of pixels sampled within the neighborhoodof the keypoint at hand.

The BRISK detector is a scale-invariant version of the lightweightFAST [3] corner detector, based on the Accelerated Segment Test(AST). Such a test classifies a candidate point p (with intensity Ip)as a keypoint if n contiguous pixels in the Bresenham circle of ra-dius 3 around p are all brighter than Ip + t, or all darker than Ip � t,with t a predefined threshold. Thus, the highest the threshold, thelowest the number of keypoints which are detected and vice-versa.

Scale-invariance is achieved in BRISK by building a scale-spacepyramid consisting of a pre-determined number of octaves and intra-octaves, obtained by progressively downsampling the original im-age. The FAST detector is applied separately to each layer of thescale-space pyramid, in order to identify potential regions of interesthaving different sizes. Then, non-maxima suppression is applied ina 3x3 scale-space neighborhood, retaining only features correspond-ing to local maxima. Finally, a three-step interpolation process isapplied in order to refine the correct position of the keypoint withsub-pixel and sub-scale precision.

3. FAST VIDEO FEATURE EXTRACTION

Let In denote the n-th frame of a video sequence of size Nx ⇥Ny ,which is processed to extract a set of local features Dn. First, a key-point detector is applied to identify a set of interest points. Then,a descriptor is applied on the (rotated) patches surrounding eachkeypoint. Hence, each element of dn,i 2 Dn is a visual feature,which consists of two components: i) a 4-dimensional vector pn,i =[x, y,�, ✓]T , indicating the position (x, y), the scale � of the de-tected keypoint, and the orientation angle ✓ of the image patch; ii) aP -dimensional binary vector dn,i 2 {0, 1}P , which represents thedescriptor associated to the keypoint pn,i.

Traditionally, local feature extraction algorithms have been de-signed to efficiently extract and describe salient points within a sin-gle frame. Considering video sequences, a straightforward approachconsists in applying a feature extraction algorithm separately to eachframe of the video sequence at hand. However, such a method isinefficient from a computational point of view, as the temporal re-dundancy between contiguous frame is not taken into consideration.The main idea behind our approach is to apply a keypoint detectionalgorithm only on some regions of each frame. To this end, for eachframe In, a binary Detection Mask Mn 2 {0, 1}Nx

⇥Ny having the

same size of the input image is computed, exploiting the informa-tion extracted from previous frames. Such mask defines the regionsof the frame where a keypoint detector has to be applied. That is,considering an image pixel In(x, y), a keypoint detector is appliedto such a pixel if the corresponding mask element Mn(x, y) is equalto 1. Furthermore, we assume that if a region of the n-th frame is notsubject to keypoint detection , the keypoints that are present in suchan area in the previous frame, i.e. In�1, are still valid. Hence, suchkeypoints are propagated to the current set of features. That is,

D_n = { d_{n,i} : M_n(p_{n,i}) = 1 } ∪ { d_{n-1,j} : M_n(p_{n-1,j}) = 0 }.    (1)

Note that the algorithm used to compute the Detection Mask needs to be computationally efficient, so that the savings achievable by skipping detection in some parts of the frame are not offset by this extra cost. In the following, two efficient algorithms for obtaining a Detection Mask are proposed: the Intensity Difference Detection Mask and the Keypoint Binning Detection Mask.
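A minimal sketch of the masked detection rule of Eq. (1) is given below, under assumed interfaces: `detect(frame, x, y)` stands in for an arbitrary per-pixel keypoint test, and features are reduced to (x, y) positions, whereas the paper's features also carry scale, orientation, and a binary descriptor.

```python
import numpy as np

def extract_masked(frame, mask, prev_features, detect):
    """Detect keypoints only where mask == 1; elsewhere, propagate the
    previous frame's keypoints, which are assumed still valid there."""
    h, w = mask.shape
    # New detections, restricted to the masked-in pixels (a real
    # implementation would run the detector only over those regions).
    new = [(x, y) for y in range(h) for x in range(w)
           if mask[y, x] == 1 and detect(frame, x, y)]
    # Eq. (1): keep previous keypoints that fall in masked-out regions.
    kept = [(x, y) for (x, y) in prev_features if mask[y, x] == 0]
    return new + kept
```

For example, with a mask covering the left half of a frame, a keypoint detected there replaces any stale left-half keypoints, while right-half keypoints from the previous frame are carried over unchanged.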

3.1. Intensity Difference Detection Mask

The key tenet is to apply the detector only to those regions that change significantly across the frames of the video. In order to identify such regions and build the Detection Mask, we exploit the scale-space pyramid built by the BRISK detector, thus incurring no extra cost. Considering frame I_n and O detection octaves, pyramid layers L_{n,o}, o = 1, …, O, are obtained by progressively smoothing and half-sampling the original image, as explained in Section 2. Then, considering two contiguous frames I_{n-1} and I_n and octave o, a subsampled version of the Detection Mask is obtained as follows:

M′_{n,o}(k, l) = 1 if |L_{n,o}(k, l) − L_{n-1,o}(k, l)| ≥ T_I, and 0 otherwise,    (2)

where T_I is an arbitrarily chosen threshold and (k, l) are the coordinates of the pixels in the intermediate representation M′_{n,o}. Finally, the intermediate representation M′_{n,o} resulting from the previous operation is upsampled in order to obtain the final mask M_n ∈ {0, 1}^{N_x × N_y}. The masks can then be applied to detection in different fashions: i) exploit the mask obtained from each scale-space layer o = 1, …, O to detect keypoints at the corresponding layer o; or ii) use a single detection mask for all scale-space layers.
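As a concrete sketch of Eq. (2), the following computes the subsampled mask from two co-located pyramid layers and upsamples it by pixel replication. The function name and the nearest-neighbour upsampling via `np.kron` are our assumptions; the paper does not prescribe the interpolation method.

```python
import numpy as np

def intensity_diff_mask(layer_cur, layer_prev, t_i, scale):
    """Binary mask: 1 where the layer changed by at least t_i (Eq. 2),
    upsampled by `scale` in each dimension to full resolution."""
    # Cast to int before subtracting to avoid uint8 wraparound.
    m_sub = np.abs(layer_cur.astype(int) - layer_prev.astype(int)) >= t_i
    # Nearest-neighbour upsampling: each mask element becomes a scale x scale block.
    return np.kron(m_sub.astype(np.uint8), np.ones((scale, scale), np.uint8))
```

With a 2×2 layer pair differing only in one corner and scale = 2, the result is a 4×4 mask whose top-left 2×2 block is set.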

3.2. Keypoint Binning Detection Mask

Considering two contiguous frames of a video sequence, the number of features identified in a given area is often correlated across frames [17]. To exploit such information, the detector is applied to a region of the input image only if the number of features extracted in the co-located region of the previous frame is greater than a threshold. Specifically, in order to obtain a Detection Mask for the n-th frame, a spatial binning process is applied to the features extracted from frame I_{n-1}. To this end, we define a grid consisting of N_r × N_c spatial bins B_{i,j}, i = 0, …, N_r − 1, j = 0, …, N_c − 1. Thus, each bin refers to a rectangular area of S_x × S_y pixels, where S_x = N_x/N_c and S_y = N_y/N_r. Then, a two-dimensional spatial histogram of keypoints is created by assigning each feature to the corresponding bin as follows:

M″_n(k, l) = |{ d_{n-1,i} ∈ D_{n-1} : ⌊x_{n-1,i}/S_x⌋ = k, ⌊y_{n-1,i}/S_y⌋ = l }|,    (3)

where (x_{n-1,i}, y_{n-1,i}) represents the location of feature d_{n-1,i} and |·| denotes the number of elements in a set. Then, a binary subsampled version of the Detection Mask is obtained by thresholding such a histogram, employing a tunable threshold T_H:

M′_n(k, l) = 1 if M″_n(k, l) ≥ T_H, and 0 otherwise.    (4)

Finally, the Detection Mask M_n, having size N_x × N_y pixels, is obtained by upsampling the intermediate representation M′_n. Such a detection mask is applied to all scale-space octaves.
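The binning mask of Eqs. (3)–(4) can be sketched as follows; the function name and the clamping of keypoints that fall exactly on the last row or column of the grid are our additions.

```python
import numpy as np

def keypoint_binning_mask(prev_positions, nx, ny, n_rows, n_cols, t_h):
    """Bin the previous frame's keypoint positions into an n_rows x n_cols
    grid (Eq. 3), threshold the counts (Eq. 4), and upsample to ny x nx."""
    sx, sy = nx // n_cols, ny // n_rows          # bin size S_x, S_y in pixels
    hist = np.zeros((n_rows, n_cols), int)       # M''_n: per-bin keypoint counts
    for x, y in prev_positions:
        # Floor division assigns each keypoint to its bin; clamp edge cases.
        hist[min(int(y // sy), n_rows - 1), min(int(x // sx), n_cols - 1)] += 1
    m_sub = (hist >= t_h).astype(np.uint8)       # Eq. (4): threshold at T_H
    # Nearest-neighbour upsampling of the binary grid to full resolution.
    return np.kron(m_sub, np.ones((sy, sx), np.uint8))
```

For an 8×8 frame with a 2×2 grid and T_H = 2, two keypoints in the top-left bin switch on that quadrant of the mask, while a lone keypoint elsewhere does not.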


Intensity Difference Detection Mask

March 2016 ICASSP 2016

- Idea: apply detection only to regions that vary sufficiently across contiguous frames
- To this end, compute the absolute difference between downsampled representations of two consecutive frames (already computed by the scale-space pyramid!)
- If the difference in a given region is greater than a threshold, perform detection in such a region
- Final mask is obtained through upsampling




Intensity Difference Detection Mask

Idea: apply a detector only to regions of the image that sufficiently vary across contiguous frames.

[Figure: absolute difference of downsampled consecutive frames, thresholded to obtain the detection mask.]


Keypoint Binning Detection Mask

- Idea: apply detection only to regions where features have been found in previous frames
- To this end, compute a 2D spatial histogram of keypoint locations
- If the number of keypoints in a spatial bin (of the previous frame) is greater than a threshold, perform detection in such a region



Keypoint Binning Detection Mask

Idea: apply a detector only to regions of the image where features have been found in previous frames.



Experiments

- Datasets:
  - Stanford MAR dataset (4 sequences of CD covers under different imaging conditions)
  - Rome Landmark dataset (10 sequences of different landmarks in Rome)
  - Stanford MAR multiple objects (4 sequences of different objects)
- Selected local features: BRISK (but our method is generally applicable)
- Depending on the dataset, different accuracy measures:
  - Matches-post-RANSAC (MPR) for the Stanford MAR dataset
  - Mean of Average Precision (MAP) for the Rome Landmark dataset
  - Combined detection and tracking accuracy for Stanford MAR multiple objects
- Complexity is measured by means of the required CPU time


Comparison with baselines

[Figure: per-frame detection time (ms, log scale) and accuracy (MPR) over a 200-frame sequence, comparing full detection, ID detection mask, KB detection mask, and patch propagation with ∆ = 10.]


Results – Stanford MAR

[Figure: accuracy (MPR) vs. computational time (ms) for full detection, ID detection mask, and KB detection mask; the detection masks yield roughly a 30% reduction in computational time.]


Results – Rome Landmark Dataset

[Figure: MAP vs. computational time (ms) for full detection, ID detection mask, and KB detection mask; a 35% reduction in computational time with a 0.03% loss in MAP.]


Results – Stanford Multiple Object

[Figure: average tracking precision (pixels) vs. feature extraction time per frame (ms) for full detection, ID detection mask, and KB detection mask; roughly a 40% reduction in extraction time.]


Conclusions

- Up to 35-40% reduction in computational complexity without significantly reducing visual task accuracy
- Higher frame rates / lower power consumption on low-power devices (smartphones, embedded systems)

Thank you!