
MTP Phase2 Report on

Scale Invariant Descriptors For Videos

by

Sanjay Viswanath (11410263)

under the supervision of

Dr. P.K.Bora & Dr. Tony Jacob

Department of Electronics and Electrical Engineering

IIT Guwahati


Abstract

Scale invariant features have become extremely important and popular in image processing and computer vision for applications such as object detection, image registration and alignment. Among the various algorithms for detecting such interest points in an image, SIFT is perhaps the most popular and widely applied technique. However, unlike other algorithms such as the Harris corner detector, SIFT does not have an efficient analogue in the spatio-temporal domain of videos. Many techniques have evolved for videos which utilize the concept of SIFT by treating the video as a 3-D voxel volume. While such techniques provide invariance in scale space, they are not truly invariant to temporal scaling such as changes in the frame rate of a video. In the temporal domain, motion is more important than pixel intensities and, as such, existing techniques like optical flow are susceptible to the frame rate of the video, since they use adjacent frames for motion estimation. This report presents these techniques and analyses their limitations, while exploring a framework for frame-rate invariant moment features and their extension into true spatio-temporal scale invariant features.

Details Of Work Done

1. Studied the SIFT[1] algorithm for scale invariant features. [50 hrs]

2. Simulated SIFT feature matching for images. [2 hrs] Explored Lowe's code[2], but it does not provide access to feature detection and descriptor coding.

3. Explored the SIFT algorithm enhancement in terms of ASIFT[3] and the mathematical framework for affine transformation and scale invariance. [25 hrs] Used the online demo code for ASIFT[4]. [2 hrs]

4. Explored literature regarding SIFT usage for medical registration. [100 hrs]

5. Explored SIFT feature matching for medical image registration and found its performance to be poor. [20 hrs]

6. Explored literature on extending SIFT to videos: explored n-SIFT[5] [10 hrs], SIFT Flow[6] and optical flow[7] [50 hrs], other literature such as the SIFT Bag Kernel[8] for video event analysis [100 hrs], and MoSIFT[9] [10 hrs].

7. Selected a framework for exploring temporally invariant moments in videos and am currently simulating the same. Facing issues with the open source SIFT code by Andrea Vedaldi[10] due to compiler issues [30 hrs], and with the open source code for MoSIFT[11] due to OpenCV issues [15 hrs]. Presently simulating SIFT Flow based interest point detection by extracting frames from videos and computing SIFT Flow[12], with the aim of creating a DoSf space for extracting interest points. [25 hrs]


I. INTRODUCTION

Image matching is a fundamental aspect of many problems in computer vision, such as object or scene recognition, solving for 3D structure from multiple images, and motion tracking. Scale invariant features have become extremely popular for such applications. Several such feature detectors are in use, including the Harris corner detector[13], the Scale Invariant Feature Transform (SIFT)[1] and Speeded Up Robust Features (SURF)[14]. Among them, SIFT is the most popular and important technique for extracting and matching scale invariant features, which are robust under affine transformations to a significant extent.

Unlike direct extensions of other scale invariant detectors such as the Harris corner detector or the Hessian detector, SIFT does not have an efficient analogue in the 3-D spatio-temporal space of videos. Currently, extensions of the SIFT descriptor to 2+1-dimensional spatio-temporal data have been studied in the context of human action recognition in video sequences. The computation of local position-dependent histograms in the 2D SIFT algorithm is extended from two to three dimensions to describe SIFT features in a spatio-temporal domain. Some possible applications of a 3-D SIFT descriptor are:

1. Video copy detection, indexing and recovery

2. Video hashing

3. Human action recognition

4. Object recognition and tracking in video

The basic motivation in the spatio-temporal domain is temporal invariance of the interest points and descriptors with respect to frame rate, while maintaining invariance under affine transformation in scale space. Such interest points can be used for synchronizing multiple video streams of the same event captured from different viewpoints at different frame rates, and hence do not require any prior knowledge of camera locations or their individual frame rates. Such points can therefore be used for forensic applications like time-synchronizing multiple video streams for analysing events such as a bomb blast, spatially registering the coverage and tracking events of interest. They can also be applied in sports, surveillance, etc. The challenge is to define a true spatio-temporal scale space wherein events can be analysed in both spatial and temporal resolutions to filter out corner points in scale space or moment points in temporal space.

This project studies SIFT with an aim to apply it to videos in a spatio-temporal domain. The rest of the report is organized as follows: after reviewing the SIFT algorithm for images in the remainder of this section, similar algorithms for videos are reviewed in Section II. In Section III, the proposed frame-rate invariant moment detector is presented, and Section IV concludes the report with suggestions for future work.

A. Scale Invariant Feature Transform

Scale Invariant Feature Transform[1] is an algorithm in computer vision to detect and describe features in an image. The algorithm was proposed by David Lowe in 1999 and is patented in the US by the University of British Columbia. The algorithm has become very popular in image processing and associated fields since its inception. SIFT features are invariant to image scaling and rotation, and partially invariant to changes in illumination and 3D camera viewpoint. They are well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise. Large numbers of features can be extracted from typical images with efficient algorithms. In addition, the features are highly distinctive, which allows a single feature to be correctly matched with high probability against a large database of features, providing a basis for object and scene recognition. The following are the major stages of computation used to generate the set of image features:

1. Scale-space extrema detection: The first stage of computation searches over all scales and image locations. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation.

2. Keypoint localization: At each candidate location, a detailed model is fit to determine location and scale. Keypoints are selected based on measures of their stability.

3. Orientation assignment: One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.

4. Keypoint descriptor: The local image gradients are measured at the selected scale in the region around each keypoint. These are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.

This approach has been named the Scale Invariant Feature Transform (SIFT), as it transforms image

data into scale-invariant coordinates relative to local features.
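To make the pipeline concrete, the following minimal Python sketch (an illustration only, assuming OpenCV with its SIFT implementation available, e.g. opencv-python 4.4 or later, and a hypothetical input file frame.png) runs all four stages and returns keypoints with their 128-dimensional descriptors; it uses the standard OpenCV interface rather than Lowe's original code.

    import cv2

    # Load the image in grayscale; SIFT operates on single-channel intensity data.
    image = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

    # DoG extrema detection, keypoint localization, orientation assignment and
    # 128-d descriptor computation are all performed internally by the detector.
    sift = cv2.SIFT_create()

    # keypoints: list of cv2.KeyPoint (position, scale, orientation)
    # descriptors: N x 128 array, one row per keypoint
    keypoints, descriptors = sift.detectAndCompute(image, None)

    print(len(keypoints), "keypoints detected")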

B. Object Matching Performance

Though some open source codes for SIFT are available, a lot of issues were faced while using them, such as MATLAB compiler problems, and hence the SIFT demo code provided by Lowe was used for matching two images. The result is shown in figure 1. As seen, SIFT is very robust to affine transformations.
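For reference, a comparable matching experiment can be reproduced with open tools; the sketch below is an assumption-laden illustration (OpenCV, hypothetical inputs img1.png and img2.png) of descriptor matching with Lowe's distance-ratio test, not the demo code actually used for figure 1.

    import cv2

    img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matcher with L2 distance; take the two nearest neighbours
    # so that the distance-ratio test can reject ambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < 0.8 * n.distance]

    out = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
    cv2.imwrite("matches.png", out)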

SIFT has been further extended as ASIFT[3] by Morel et al., which can match images under large changes in camera angle as well. The authors have also presented a paper on how SIFT is truly affine invariant. The result of ASIFT from their demo[4] is shown in figure 2, and the corresponding SIFT performance is shown in figure 3.


Figure 1. SIFT Feature Matching of Object

Figure 2. ASIFT Feature Matching of Object under large camera angle shift

C. Applications

SIFT has become one of the most popular techniques used in image processing, especially when affine-invariant features are required. SIFT features can essentially be applied to any task that requires identification of matching locations between images. Work has been done on applications such as recognition of particular object categories in 2D images, 3D reconstruction, motion tracking and segmentation, robot localization, image panorama stitching and epipolar calibration. Some of the important applications are:

1. Object recognition using SIFT features[1]: Given SIFT's ability to find distinctive keypoints that are invariant to location, scale and rotation, and robust to affine transformations (changes in scale, rotation, shear, and position) and changes in illumination, the features are well suited to object recognition.

Figure 3. SIFT Feature Matching of the same Object under large camera angle shift

2. Panorama stitching: SIFT feature matching can be used in image stitching for fully automated panorama reconstruction from non-panoramic images. The SIFT features extracted from the input images are matched against each other to find the k nearest neighbours for each feature. These correspondences are then used to find m candidate matching images for each image. Homographies between pairs of images are then computed using RANSAC, and a probabilistic model is used for verification. Because of the SIFT-inspired object recognition approach to panorama stitching, the resulting system is insensitive to the ordering, orientation, scale and illumination of the images. The input images can contain multiple panoramas and noise images (some of which may not even be part of the composite image), and panoramic sequences are recognized and rendered as output.

3. 3D scene modeling, recognition and tracking: This application uses SIFT features for 3D object recognition and 3D modeling in the context of augmented reality, in which synthetic objects with accurate pose are superimposed on real images. SIFT matching is done for a number of 2D images of a scene or object taken from different angles. This is used with bundle adjustment to build a sparse 3D model of the viewed scene and to simultaneously recover camera poses and calibration parameters.

4. 3D SIFT-like descriptors for human action recognition: Extensions of the SIFT descriptor to 2+1-dimensional spatio-temporal data in the context of human action recognition in video sequences have been studied. The computation of local position-dependent histograms in the 2D SIFT algorithm is extended from two to three dimensions to describe SIFT features in a spatio-temporal domain. For application to human action recognition in a video sequence, sampling of the training videos is carried out either at spatio-temporal interest points or at randomly determined locations, times and scales. The spatio-temporal regions around these interest points are then described using the 3D SIFT descriptor. These descriptors are then clustered to form a spatio-temporal bag-of-words model. 3D SIFT descriptors extracted from the test videos are then matched against these words for human action classification.


II. RELATED WORK

A lot of work has been carried out on spatio-temporal feature detectors for videos. Some of them use 3-D extensions of scale invariant detectors such as the Harris corner detector, while others use SIFT based techniques. The most noteworthy techniques are explored in this section.

A. Structure Based 3-D Spatio-Temporal Detectors

Drawing inspiration from the usefulness of local multiscale salient features for object recognition, an immediate extension has been developed for spatio-temporal feature extraction for action recognition and for video analysis in general. To extend 2D salient features to video, most existing methods consider the sequence of images (2D+t) as a 3D object. As the 2D feature detectors mainly select salient structures in a still image, their extensions to 3D are considered structure-based feature detectors. To detect spatio-temporal structure-based features, existing methods treat the time domain as a third dimension of space and hence apply the same scale-space filter in the spatial and temporal directions. That is, similar to the spatial Gaussian filtering, a temporal Gaussian is applied in the time direction. The performance of structure-based interest point detectors was reviewed by Shabani et al.[15] for human action recognition, and they were found to perform worse than motion-based detectors.

1. 3D Harris: Laptev et al.[16] extended the Harris corner criterion from 2D images to 3D in order to extract corresponding points in a video sequence. To this end, the original video signal I(x, y, t) is smoothed using a spatial Gaussian Gσ and a temporal Gaussian kernel Gτ through the convolution L = Gσ ∗ Gτ ∗ I. The autocorrelation matrix A = Ld^T Ld is then computed from the spatio-temporal derivative vector Ld = [Lx, Ly, Lt]. To compare each pixel to its neighborhood, a spatio-temporal Gaussian weighting G2σ ∗ G2τ is then applied:

M = G2σ ∗ G2τ ∗ A = G2σ ∗ G2τ ∗ [ Lx²   LxLy  LxLt
                                  LyLx  Ly²   LyLt
                                  LtLx  LtLy  Lt²  ]

The autocorrelation matrix M defines the second moment approximation for the local distribution of the gradients within a spatio-temporal neighborhood. Using the eigenvalues λ1, λ2, λ3 of the Harris matrix M, one can compute the spatio-temporal corner map C in which the corners are magnified and the rest are weakened (k = 0.0005):

C = det(M) − k trace³(M) = λ1 λ2 λ3 − k (λ1 + λ2 + λ3)³
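As an illustration of these equations, the following sketch (NumPy/SciPy, assuming the video is already loaded as a floating-point array of shape (T, H, W), with hypothetical choices of σ and τ and the value k = 0.0005 taken from the text) computes the corner map C:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def harris3d(video, sigma=2.0, tau=1.5, k=0.0005):
        # video: float array of shape (T, H, W): T frames of H x W intensities.
        # Spatio-temporal smoothing L = G_sigma * G_tau * I.
        L = gaussian_filter(video, sigma=(tau, sigma, sigma))
        Lt, Ly, Lx = np.gradient(L)

        # Entries of the second-moment matrix, integrated with the larger
        # Gaussian window (2*sigma, 2*tau) as in the expression for M.
        w = (2 * tau, 2 * sigma, 2 * sigma)
        Mxx = gaussian_filter(Lx * Lx, w)
        Myy = gaussian_filter(Ly * Ly, w)
        Mtt = gaussian_filter(Lt * Lt, w)
        Mxy = gaussian_filter(Lx * Ly, w)
        Mxt = gaussian_filter(Lx * Lt, w)
        Myt = gaussian_filter(Ly * Lt, w)

        # Corner map C = det(M) - k * trace^3(M), evaluated per voxel.
        det = (Mxx * (Myy * Mtt - Myt ** 2)
               - Mxy * (Mxy * Mtt - Myt * Mxt)
               + Mxt * (Mxy * Myt - Myy * Mxt))
        trace = Mxx + Myy + Mtt
        return det - k * trace ** 3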

2. 3D Hessian: Willems et al.[17] extended 2D Hessian features to 3D by applying (an approximation of) a 3D Gaussian filter and used the determinant of the Hessian matrix as the saliency criterion. The points with a high-value determinant (S = ||det(H)||) represent the centers of the ellipsoids (3D blob-like structures) in the video.

H = [ Lxx  Lxy  Lxt
      Lyx  Lyy  Lyt
      Ltx  Lty  Ltt ]
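Under the same assumptions as the Harris sketch above, and using finite differences in place of Gaussian derivative filters, the saliency S = ||det(H)|| can be sketched as:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def hessian3d_saliency(video, sigma=2.0, tau=1.5):
        # Smooth the video volume, then approximate second derivatives.
        L = gaussian_filter(video, sigma=(tau, sigma, sigma))
        Lt, Ly, Lx = np.gradient(L)
        Ltt, Lty, Ltx = np.gradient(Lt)
        _,   Lyy, Lyx = np.gradient(Ly)
        _,   _,   Lxx = np.gradient(Lx)

        # det(H) for the symmetric 3 x 3 Hessian at every voxel.
        det = (Lxx * (Lyy * Ltt - Lty ** 2)
               - Lyx * (Lyx * Ltt - Lty * Ltx)
               + Ltx * (Lyx * Lty - Lyy * Ltx))
        return np.abs(det)   # S = ||det(H)||: high at 3D blob-like structures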

3. 3D KLT: 2D KLT[18] features have been widely used in many computer vision tasks such as tracking and structure from motion. The 3D KLT[19] is the extension of its 2D counterpart and can be detected at multiple spatial and temporal scales. To this end, a family of scale-space representations of the video is obtained by performing 2D spatial Gaussian filtering Gσ and temporal Gaussian filtering Gτ. At each scale, the 3D KLT saliency criterion is applied to the 3D autocorrelation matrix A to keep the points whose minimum eigenvalue is above a threshold (i.e., min(λ1, λ2, λ3) > α). The 3D KLT features are then localized at points with maximum saliency value in their spatio-temporal neighborhood.

B. n-SIFT

n-SIFT[5] generalizes SIFT to n-dimensional images and evaluates the extension in the context of medical images. n-SIFT locates positions that are stable in the image, creates a unique feature vector, and matches the feature vectors between two scalar images of arbitrary dimensionality. The authors argue that this generalization can be extended to any arbitrary dimensions, including the spatio-temporal domain. However, a 3-D spatio-temporal domain is very different from a 3-D spatial domain in the sense that motion in the temporal domain cannot be captured by treating the video as a 3-D spatial voxel volume. Hence such a generalization may not be the best one when one of the dimensions is time. The following are the major stages of computation used to generate the set of n-dimensional features:

1. Feature Localization: To achieve invariance to image scaling and to identify maxima/minima in the difference-of-Gaussian scale space, a multilevel image pyramid is created, similar to the one employed by 2-D SIFT, as shown in figure 4.

Figure 4. Image Pyramid

Locate extrema in the DoG space: The procedure is the same as that employed by SIFT[1] in 2-D. At each level, an approximation of the DoG space from σ to 2σ is made by taking the difference of Gaussian-blurred images. Each voxel of a DoG image (scale k^j σ) is compared against the neighboring voxels immediately adjacent, orthogonally and diagonally; the corresponding voxel in the scale above (k^(j+1) σ) at the same pyramid level and all the neighbors of that corresponding voxel; and the corresponding voxel in the scale below (k^(j−1) σ) and all its neighbors. Let us refer to a voxel p from the DoG image at scale k^j σ and position (x1, x2, ..., xn) as p_j(x1, x2, ..., xn).

Definition 1: A voxel p_j(x1, x2, ..., xn) is a local extremum iff, for all i0, i1, ..., in ∈ {−1, 0, 1},

|p_j(x1, x2, ..., xn)| ≥ |p_(j+i0)(x1 + i1, x2 + i2, ..., xn + in)|

or

|p_j(x1, x2, ..., xn)| ≤ |p_(j+i0)(x1 + i1, x2 + i2, ..., xn + in)|
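A direct, unoptimized sketch of Definition 1 for the video case (n = 3) is given below, assuming the DoG responses at scales k^(j−1)σ, k^jσ and k^(j+1)σ are stacked in a NumPy array of shape (3, T, H, W):

    import numpy as np

    def is_local_extremum(dog, t, y, x):
        # dog: DoG responses at scales j-1, j and j+1, shape (3, T, H, W).
        p = abs(dog[1, t, y, x])
        # 3 x 3 x 3 neighbourhood in (t, y, x) across the three scales.
        nb = np.abs(dog[:, t - 1:t + 2, y - 1:y + 2, x - 1:x + 2])
        # Definition 1: |p| must be >= (or <=) every neighbouring |value|.
        return p >= nb.max() or p <= nb.min()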

2. Feature Generation: At each of the localized extrema, an identifying feature is generated. Regardless of the pyramid level or scale where the extremum was localized, or the feature generation method used, the feature vector generated is associated with the physical location in the original image corresponding to the location of the detected extremum. Three related methods for generating features involving histograms of the local gradient were investigated, of which the best was found to be the n-SIFT feature, which summarizes a hypercubic voxel region around the feature position. The n-SIFT feature divides the local area into subregions, each using a bin histogram to summarize the gradients of the voxels in the subregion, resulting in an n-dimensional feature vector. The 3-D case is shown in figure 5.

Figure 5. In 3-D, each of the 4 X 4 X 4 regions (dashed and shaded) summarizes the gradients at 4 X 4 X 4 voxel locations (solid and white).

3. Feature Matching: To match the histogram(s) generated by any one of the three types of features, we convert the histogram(s) of a feature into a single vector and compare the l2 distance between a feature vector in one image and every feature vector of the second image. For a feature v, let u be the best match (lowest distance), u′ the second best match, and d(v, u) the distance between features v and u. We then make sure that d(v, u)/d(v, u′) is below a threshold Tm and that v is, conversely, the best match for u. This decreases mismatches by removing matches where other features are very close in feature space to the best match.
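A sketch of this matching rule in NumPy is shown below, assuming feats1 and feats2 hold one feature vector per row and Tm is the ratio threshold (its default value here is an arbitrary placeholder):

    import numpy as np

    def match_features(feats1, feats2, Tm=0.8):
        # Pairwise l2 distances between every feature of image 1 and image 2.
        d = np.linalg.norm(feats1[:, None, :] - feats2[None, :, :], axis=2)
        matches = []
        for v in range(len(feats1)):
            order = np.argsort(d[v])
            u, u2 = order[0], order[1]            # best and second-best match
            ratio_ok = d[v, u] / d[v, u2] < Tm    # distinctiveness test
            mutual = np.argmin(d[:, u]) == v      # v must also be the best match for u
            if ratio_ok and mutual:
                matches.append((v, u))
        return matches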

C. SIFT Flow

In videos, the motion of objects between frames is captured by optical flow, which is computed as the velocity of pixels between frames. There are local and global methods to compute optical flow. Since optical flow captures motion in videos and sharp motions are moments of interest in videos, it might be useful for detecting spatio-temporal interest points.

The SIFT Flow[6] algorithm is analogous to optical flow and consists of matching densely sampled, pixelwise SIFT features between two images while preserving spatial discontinuities. Optical flow computation between different frames can be done using various algorithms. However, the authors of SIFT Flow argue that since optical flow computation involves the pixel intensity constancy assumption, it is susceptible to noise, and hence they propose the new method as more robust. The technique can be summarized in the following steps:

1. Dense SIFT Descriptors: SIFT[1] is a local sparse descriptor characterizing local gradient information, which consists of both feature extraction and detection. SIFT Flow only uses the feature extraction. For every pixel in an image, we divide its neighborhood (e.g., 16 X 16) into a 4 X 4 cell array, quantize the orientation into 8 bins in each cell, and obtain a 4 X 4 X 8 = 128-dimensional vector as the SIFT representation for the pixel. This per-pixel SIFT descriptor field is called the SIFT image.
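The dense descriptor field can be approximated with standard tools; the sketch below (OpenCV, with a hypothetical helper name sift_image) places a fixed-scale keypoint at every grid location and computes a 128-d descriptor there. It only approximates the dense SIFT image of [6] and is not the authors' implementation.

    import cv2

    def sift_image(gray, step=1, patch_size=16):
        # Place a keypoint at every grid location (spacing `step`) with a fixed
        # patch size, then compute a 128-d SIFT descriptor at each of them.
        h, w = gray.shape
        keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                     for y in range(0, h, step) for x in range(0, w, step)]
        sift = cv2.SIFT_create()
        keypoints, desc = sift.compute(gray, keypoints)
        # desc holds one 128-d row per retained keypoint; stacking these rows
        # over the pixel grid gives the per-pixel "SIFT image".
        return keypoints, desc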

2. Matching Objective: SIFT Flow is formulated in the same way as optical flow[7], with the exception of matching SIFT descriptors instead of pixel intensities. Therefore, the objective function of SIFT Flow is very similar to that of optical flow. Let p = (x, y) be the grid coordinate of the images, and w(p) = (u(p), v(p)) the flow vector at p. We only allow u(p) and v(p) to be integers, and we assume that there are L possible states for u(p) and v(p), respectively. Let s1 and s2 be the two SIFT images that we want to match. The set ε contains all spatial neighborhoods (a four-neighbor system is used). The energy function for SIFT Flow is defined as:

E(w) = Σ_p min( ||s1(p) − s2(p + w(p))||_1, t )
     + Σ_p η ( |u(p)| + |v(p)| )
     + Σ_{(p,q) ∈ ε} [ min( α |u(p) − u(q)|, d ) + min( α |v(p) − v(q)|, d ) ]

which contains a data term, a small-displacement term and a smoothness term (a.k.a. spatial regularization). The data term in the first line constrains the SIFT descriptors to be matched along the flow vector w(p). The small-displacement term in the second line constrains the flow vectors to be as small as possible when no other information is available. The smoothness term in the third line constrains the flow vectors of adjacent pixels to be similar. In this objective function, truncated L1 norms are used in both the data term and the smoothness term to account for matching outliers and flow discontinuities, with t and d as the respective thresholds. A dual-layer loopy belief propagation is used as the base algorithm to optimize the objective function.
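To make the objective concrete, the sketch below evaluates E(w) for a given integer flow field in NumPy; the parameter values t, d, η and α are hypothetical placeholders, and the dual-layer belief-propagation optimizer of [6] is not reproduced, only the energy itself.

    import numpy as np

    def sift_flow_energy(s1, s2, u, v, t=600, d=200, eta=0.005, alpha=2.0):
        # s1, s2: SIFT images of shape (H, W, 128); u, v: integer flow fields (H, W).
        h, w, _ = s1.shape
        ys, xs = np.mgrid[0:h, 0:w]
        yq = np.clip(ys + v, 0, h - 1)
        xq = np.clip(xs + u, 0, w - 1)

        # Data term: truncated L1 distance between matched SIFT descriptors.
        data = np.minimum(np.abs(s1 - s2[yq, xq]).sum(axis=2), t).sum()

        # Small-displacement term: penalize large flow vectors.
        small = eta * (np.abs(u) + np.abs(v)).sum()

        # Smoothness term: truncated L1 difference of flow between 4-neighbours.
        smooth = 0.0
        for field, axis in ((u, 0), (u, 1), (v, 0), (v, 1)):
            smooth += np.minimum(alpha * np.abs(np.diff(field, axis=axis)), d).sum()

        return data + small + smooth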

SIFT Flow has been used for video summarization in terms of key frames, wherein frames with high matching scores are clustered and represented by a single key frame.

D. MoSIFT

The MoSIFT[9] algorithm was developed to achieve robust human action recognition on the TREC Video Retrieval Evaluation (TRECVID 2008) real-world London Gatwick airport surveillance videos. The algorithm detects spatio-temporal interest points as spatial interest points with scale invariance that have substantial motion in the temporal domain, and then builds a descriptor in the following way:

1. Detecting Interest Points: The MoSIFT algorithm detects spatially distinctive interest points with substantial motion. The SIFT[1] algorithm is first applied to find visually distinctive components in the spatial domain, and then spatio-temporal interest points are detected with (temporal) motion constraints. The motion constraint consists of a 'sufficient' amount of optical flow around the distinctive points.

Figure 6 demonstrates the MoSIFT algorithm. The algorithm takes a pair of video frames to find spatio-temporal interest points at multiple scales. Two major computations are applied: SIFT point detection and optical flow computation according to the scale of the SIFT points.

Figure 6. System flow graph of the MoSIFT algorithm: A pair of frames is the input. Local extrema of DoG and optical flow determine the MoSIFT points for which features are described.

SIFT is used to detect distinctive interest points in a still image. The candidate points are distinctive in appearance, but they are independent of the motions or actions in the video. In the interest point detection part of the MoSIFT algorithm, the authors refer to constructing optical flow pyramids over two Gaussian pyramids, though they do not give many details about how this is done. Multiple-scale optical flows are calculated according to the SIFT scales, and a local extremum of the DoG pyramids can only become an interest point if it has sufficient motion in the optical flow pyramid. As long as a candidate interest point contains a minimal amount of movement, the algorithm extracts it as a MoSIFT interest point. MoSIFT interest points are scale invariant in the spatial domain. However, they are not scale invariant in the temporal domain. The authors argue that temporal scale invariance could be achieved by calculating optical flow on multiple scales in time. However, since the main purpose of the MoSIFT algorithm is human action recognition, the authors do not enforce temporal scale invariance and proceed with a simple flow magnitude constraint, so as to retain a large number of scale invariant points which can model the actions.
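The spirit of this detection step can be sketched as follows (OpenCV, single-scale Farneback optical flow instead of the per-SIFT-scale flow pyramids of MoSIFT, with a hypothetical motion threshold min_flow); it is a simplification of the published algorithm, not a reimplementation:

    import cv2
    import numpy as np

    def mosift_like_points(frame_t, frame_t1, min_flow=1.0):
        # frame_t, frame_t1: consecutive grayscale frames (uint8 arrays).
        sift = cv2.SIFT_create()
        keypoints = sift.detect(frame_t, None)

        # Dense optical flow between the two frames (single scale here, unlike
        # the per-SIFT-scale optical-flow pyramids described for MoSIFT).
        flow = cv2.calcOpticalFlowFarneback(frame_t, frame_t1, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)

        # Keep only spatially distinctive points with sufficient motion.
        return [kp for kp in keypoints
                if mag[int(kp.pt[1]), int(kp.pt[0])] > min_flow]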

2. MoSIFT Feature Description: Since MoSIFT point detection is based on DoG and optical flow, the descriptor leverages these two features. MoSIFT builds a single feature descriptor which concatenates both HoG and HoF into one vector.

The magnitude and direction of the gradient are represented through a SIFT feature vector with 128 dimensions (4 x 4 x 8 = 128). Each vector is normalized to enhance invariance to changes in illumination. MoSIFT adapts the idea of grid aggregation in SIFT to describe motion. Optical flow gives the magnitude and direction of a movement and thus has the same properties as appearance gradients. The same aggregation can therefore be applied to optical flow in the neighborhood of interest points to increase robustness to occlusion and deformation. The main difference from the appearance description lies in the dominant orientation. Rotation invariance is important to appearance, since it provides a standard to measure the similarity of two interest points. In surveillance video, rotation invariance of appearance remains important due to varying view angles and deformations. However, since surveillance video is captured by a stationary camera, the direction of movement is actually important (and non-invariant) for recognizing an action. Therefore, there is no adjustment for orientation invariance in the MoSIFT motion descriptors. The two aggregated histograms (appearance and optical flow) are combined into the MoSIFT descriptor, which hence has 256 dimensions.
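A simplified sketch of the grid aggregation for the motion half of the descriptor is given below (NumPy, assuming a 16 x 16 optical-flow patch around an interest point); the appearance half is the usual 128-d SIFT vector, and the two halves are concatenated into 256 dimensions. As described above, no dominant-orientation normalization is applied to the motion part.

    import numpy as np

    def flow_histogram(flow_patch, cells=4, bins=8):
        # flow_patch: (16, 16, 2) optical-flow patch around an interest point.
        mag = np.linalg.norm(flow_patch, axis=2)
        ang = np.arctan2(flow_patch[..., 1], flow_patch[..., 0]) % (2 * np.pi)
        step = flow_patch.shape[0] // cells
        hof = np.zeros((cells, cells, bins))
        for i in range(cells):
            for j in range(cells):
                m = mag[i*step:(i+1)*step, j*step:(j+1)*step].ravel()
                a = ang[i*step:(i+1)*step, j*step:(j+1)*step].ravel()
                b = (a / (2 * np.pi) * bins).astype(int) % bins
                np.add.at(hof[i, j], b, m)   # magnitude-weighted orientation bins
        return hof.ravel()                   # 4 x 4 x 8 = 128-d motion histogram

    # The MoSIFT descriptor concatenates appearance and motion histograms:
    # descriptor = np.concatenate([sift_vector, flow_histogram(flow_patch)])  # 256-d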

III. FRAME RATE INVARIANT MOMENT DETECTOR

Since we need scale and frame-rate invariant moments in the spatio-temporal domain, it is necessary to account for both the spatial domain and motion in the temporal domain. As such, the framework used by MoSIFT[9] seems to be the most natural way to explore such interest points. In this case, SIFT Flow might have additional advantages over conventional optical flow, and hence if points having high acceleration in the temporal domain are detected from SIFT Flow images computed between video frames, they might be frame-rate invariant to some extent. The extrema of the Difference of SIFT Flow (DoSf) need to be tested for frame rate invariance. Figure 7 shows the proposed DoSf scheme.

Figure 7. Difference of SIFT Image scheme: A pair of frames is the input. Local extrema of DoSf determine moments of interest in temporal space.

Currently, simulations of such a framework are being carried out and results are awaited. Simulating SIFT has been a major issue, since open source codes like Andrea Vedaldi's have compilation issues and show variations in performance compared with Lowe's code. Also, the exact details of SIFT are not published, since it is a patented technique. This is a challenge for the simulation, and hence Lowe's code is currently being used, which is limited to detecting SIFT keypoints and their descriptors in an image or matching SIFT descriptors between images.
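Since the DoSf simulations are still in progress, only one possible reading of the scheme can be sketched here, and it is an assumption rather than a finalized design: SIFT Flow fields are computed between successive frame pairs (compute_sift_flow below is a hypothetical wrapper around the code of [12]), their difference approximates acceleration in temporal space, and local maxima of its magnitude are taken as candidate moments of interest.

    import numpy as np
    from scipy.ndimage import maximum_filter

    def dosf_moments(frames, compute_sift_flow, thresh=1.0):
        # frames: list of grayscale frames; compute_sift_flow(f1, f2) -> (H, W, 2)
        # flow field (hypothetical wrapper around the SIFT Flow code of [12]).
        flows = [compute_sift_flow(frames[i], frames[i + 1])
                 for i in range(len(frames) - 1)]
        moments = []
        for i in range(len(flows) - 1):
            dosf = flows[i + 1] - flows[i]            # Difference of SIFT Flow
            mag = np.linalg.norm(dosf, axis=2)
            peaks = (mag == maximum_filter(mag, size=5)) & (mag > thresh)
            ys, xs = np.nonzero(peaks)
            moments.extend((i + 1, y, x) for y, x in zip(ys, xs))
        return moments   # candidate (frame index, y, x) moments of interest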


IV. FUTURE WORK

Spatio-temporal interest points are important for video processing applications like video synchronization. Such interest points with frame-rate invariance are being explored through the Difference of SIFT Flow (DoSf) approach. The simulations of this approach, along with n-SIFT and MoSIFT, have to be completed. The descriptors then have to be studied in detail based on the results, and further possibilities explored. After the descriptors are extracted, they have to be applied to synchronizing events from cameras with different frame rates.

REFERENCES

[1] D. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, 1999, pp. 1150–1157.

[2] D. Lowe, SIFT demo code, http://www.cs.ubc.ca/ lowe/keypoints/.

[3] G. Yu and J.-M. Morel, "ASIFT: An Algorithm for Fully Affine Invariant Comparison," Image Processing On Line, 2011.

[4] J.-M. Morel and G. Yu, ASIFT online demo, http://www.cmap.polytechnique.fr/ yu/research/ASIFT/demo.html.

[5] W. Cheung and G. Hamarneh, "N-SIFT: N-dimensional scale invariant feature transform for matching medical images," in Biomedical Imaging: From Nano to Macro, 2007. ISBI 2007. 4th IEEE International Symposium on, April 2007, pp. 720–723.

[6] C. Liu, J. Yuen, and A. Torralba, "SIFT Flow: Dense correspondence across scenes and its applications," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 5, pp. 978–994, May 2011.

[7] A. Bruhn, J. Weickert, and C. Schnörr, "Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods," International Journal of Computer Vision, vol. 61, pp. 211–231, 2005.

[8] X. Zhou, X. Zhuang, S. Yan, S.-F. Chang, M. Hasegawa-Johnson, and T. S. Huang, "SIFT-bag kernel for video event analysis," in Proceedings of the 16th ACM International Conference on Multimedia, ser. MM '08. New York, NY, USA: ACM, 2008, pp. 229–238. [Online]. Available: http://doi.acm.org/10.1145/1459359.1459391

[9] M.-Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," 2009.

[10] A. Vedaldi, SIFT code, http://www.vlfeat.org/ vedaldi/code/sift.html.

[11] LIBSCOM, http://lastlaugh.inf.cs.cmu.edu/libscom/downloads.htm.

[12] C. Liu et al., SIFT Flow code, http://people.csail.mit.edu/celiu/ECCV2008/.

[13] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. of Fourth Alvey Vision Conference, 1988, pp. 147–151.

[14] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in ECCV, 2006, pp. 404–417.

[15] A. Shabani, D. Clausi, and J. Zelek, "Evaluation of local spatio-temporal salient feature detectors for human action recognition," in Computer and Robot Vision (CRV), 2012 Ninth Conference on, May 2012, pp. 468–475.

[16] I. Laptev and T. Lindeberg, "Space-time interest points," in ICCV, 2003, pp. 432–439.

[17] G. Willems, T. Tuytelaars, and L. Van Gool, "An efficient dense and scale-invariant spatio-temporal interest point detector," in Proceedings of the 10th European Conference on Computer Vision: Part II, ser. ECCV '08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 650–663. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-88688-448

[18] J. Shi and C. Tomasi, "Good features to track," in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on, June 1994, pp. 593–600.

[19] Y. Kubota, K. Aoki, H. Nagahashi, and S.-I. Minohara, "Pulmonary motion tracking from 4D-CT images using a 3D-KLT tracker," in Nuclear Science Symposium Conference Record (NSS/MIC), 2009 IEEE, 2009, pp. 3475–3479.