
Prey: Stapler Detection System Using Clustered HOG Features

    CS543 Final Project Report

    Onur Karaman (karaman1), Sam Liu (liu130), Karan Parikh (parikh8)

    May 7, 2013

    Abstract

For our project we were interested in exploring general object detection. In particular, we wanted to implement a general object detection algorithm that would perform reasonably well regardless of the object. After consulting with Professor Lazebnik during office hours as well as the literature, we decided to reduce the scope of our project and focus only on items found in a particular setting. For the project, our target object was a particular stapler, and all scenes are of a particular desk (with clutter). We chose to implement an approach utilizing machine learning with HOG features.

    1 Background Material & Related Work

For object detection we were inspired by a project called Kittydar [1][2], a sliding-window detection algorithm that uses HOG features and a neural network classifier to find cat faces. Throughout the project, we also explored using SIFT [4] for detection, Exemplar SVMs [3] for better classification, and Latent SVMs [5] for the same reason. We found SIFT to be ill-suited for our use case, and we attempted but were unable to implement Exemplar SVMs.


2 Approach Summary

    2.1 Training

    Our general training algorithm outline is as follows:

    1. Extract HOG features from all our training images.

2. Cluster the HOG features using the k-means clustering algorithm over the HOG feature space for positive training images (sketched in code after this list). 12 clusters are used for our particular stapler.

3. Train one SVM for each cluster, with the positive examples being the HOG descriptors in the ith cluster and the negative examples being all the HOG descriptors for the negative training images.
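
A minimal MATLAB sketch of the clustering step, assuming the positive HOG descriptors have already been flattened and stacked row-wise into a matrix posFeats (the variable names and kmeans options are illustrative, not our verbatim code):

    % posFeats: N-by-D matrix, one flattened HOG descriptor per positive crop.
    % Cluster the descriptors so each SVM can specialize in a narrow band of
    % stapler poses.
    k = 12;                              % cluster count that worked best for us
    [clusterIdx, ~] = kmeans(posFeats, k, ...
        'Replicates', 5, ...             % restart to avoid poor local minima
        'EmptyAction', 'singleton');     % reseed any cluster that empties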

    Figure 1: An example of a positive training image we used


Figure 2: An example of a negative training image we used

    2.2 Detection

    Our general detection algorithm outline is as follows:

    1. Slide a window at increasing scales over the test image.

    2. For each window, compute the HOG descriptor.

3. Use the k SVMs trained earlier to classify the window using the HOG descriptor for that window, and label the window according to the most confident SVM.

4. Use the top n windows, ranked by distance from the separating hyperplane, to produce a heat map (see the sketch after this list).
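
One plausible way to realize step 4, assuming each candidate window is stored as a row [x y w h dist] in a matrix dets, where dist is the signed distance from the winning SVM's hyperplane (the layout and names are our illustrative choices):

    % Keep the n most confident windows and accumulate their confidences
    % into a per-pixel heat map the same size as the scene.
    n = 20;
    [~, order] = sort(dets(:, 5), 'descend');
    top = dets(order(1:min(n, numel(order))), :);
    heat = zeros(imgH, imgW);            % imgH, imgW: scene dimensions
    for i = 1:size(top, 1)               % assumes integer pixel coordinates
        x = top(i, 1); y = top(i, 2); w = top(i, 3); h = top(i, 4);
        heat(y:y+h-1, x:x+w-1) = heat(y:y+h-1, x:x+w-1) + top(i, 5);
    end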

    3 Building the System

    3.1 Data Collection

We began by collecting data for training our classifier. For positives, this consisted of taking pictures of a stapler at different orientations and cropping tight squares containing only the stapler. Because the stapler's aspect ratio isn't square, we were concerned about background noise in these crops. To mitigate this, we took pictures of staplers against a few different backgrounds. Furthermore, we manually erased background noise in a subset of our stapler training images, and added rotated or flipped versions of some of the pictures for robustness to reorientation.


We built up our negative training data using two techniques. We first took images of a typical cluttered desk environment without a stapler, passed them into a sliding window, and saved every window we slid over as a negative training image. After running our detector on a fresh set of negative training scenes, we saw that it was often misclassifying various parts of a desk, such as the keys of a keyboard, cables, or a cup full of pencils.

As a result, we developed a feedback loop for gathering data. It worked as follows: when we ran the algorithm on scenes, we manually selected some of the misclassified windows and fed them into our set of negative training data, so the classifier would learn to better reject those types of patches. We then took brand new positive and negative scene images to test our performance. After checking performance, we proceeded to add more negative windows, take new negative test images, and repeat the process. This improved our results quite effectively until we reached about 1000 negative and 500 positive training images.
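
The mechanical half of this loop, cropping a misclassified window and saving it back into the negative training set, is only a few lines of MATLAB; this sketch uses a hypothetical window layout, directory, and naming scheme:

    % scene: a stapler-free scene; win = [x y s]: a square window of side s
    % that the detector wrongly fired on. Save it as a new hard negative.
    patch = imcrop(scene, [win(1) win(2) win(3)-1 win(3)-1]);
    id = numel(dir(fullfile('train', 'neg', '*.png'))) + 1;  % running index
    imwrite(patch, fullfile('train', 'neg', sprintf('hard_%04d.png', id)));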

Note that at every evaluation stage we used brand new images that the classifier had not yet seen; neither negatives nor positives from these windows had been added to our training set. Our final set of test images was, of course, not used in the active learning at all. In addition, our final set of scenes had some new clutter and new poses, because our goal was to have some robustness to noise. Our final test set consists of 100 negative scenes (containing no stapler) and 61 positive scenes (containing a stapler). Our final experiments use this dataset.

    3.2 Classifier and Training

We chose to use SVMs over neural networks because there was more related literature [3][5] and because there were many libraries to choose from [6][7][8].

Our initial approach was to simply train a single SVM on the HOG features of our training images. The results were inadequate (the 1-cluster results are shown in the experiments section). We believe this is because our positive training data contains images of the stapler from many viewpoints and out-of-plane rotations. If we were to reconstruct a HOG descriptor from the weight vectors of the single SVM after training, the descriptor wouldn't resemble a stapler at all, since it would try to cover all of these different perspectives. So the SVM couldn't find any clear pattern by which to separate the feature space with a hyperplane.

Based on suggestions from the professor and the TA, we realized that it made more sense to instead have a group of classifiers, where each classifier trains on a set of similar positive training image features and thus specializes in a certain class of perspectives. We implemented this by using k-means clustering on the features. We experimented with different numbers of clusters, and this technique generally improved results significantly.

Our SVM uses a quadratic kernel. Our HOG descriptors use a cell size of 10px.

    Our training algorithm was as follows:

1. We extracted HOG features from all our training images.

2. We clustered the HOG features using k-means over the HOG feature space for the positive training images. (The number of clusters is a parameter; Section 3.5 reports results for 1, 4, 8, and 12 clusters.)

3. We trained one SVM for each cluster.

4. For each SVM, the positive examples are the HOG descriptors for the positive training images in the ith cluster.

5. For each SVM, the negative examples are the HOG descriptors for all the negative training images in our training set. (A sketch of this loop follows.)
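
Putting steps 3 through 5 together, a sketch of the per-cluster training loop using MATLAB's svmtrain [7]; here negFeats holds the negative HOG descriptors row-wise and clusterIdx is the k-means assignment from step 2 (both names are illustrative):

    % Train one quadratic-kernel SVM per cluster. Positives for SVM i are
    % the descriptors assigned to cluster i; all SVMs share the negatives.
    models = cell(k, 1);
    for i = 1:k
        pos = posFeats(clusterIdx == i, :);
        X = [pos; negFeats];
        y = [ones(size(pos, 1), 1); -ones(size(negFeats, 1), 1)];
        models{i} = svmtrain(X, y, 'kernel_function', 'quadratic');
    end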

    3.3 Features

We chose HOG over SIFT for detection because we wanted our detection algorithm to be robust to out-of-plane rotations; SIFT is not tilt invariant [9]. We did experiment with SIFT, but our preliminary results were poor for our use case, so we set it aside.


We also experimented with appending a grayscale histogram of the window to each feature vector, figuring that the stapler has a fairly distinctive grayscale histogram compared to the rest of the scene. While this improved detection in some images, it introduced false positives in others. In the end we decided to leave these intensity-based histograms out of the feature vector, since they didn't meaningfully improve our accuracy and complicated the code.
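
For reference, the experiment amounted to concatenating a normalized intensity histogram onto each window's HOG vector; a sketch (the 32-bin count is our illustrative choice):

    % Append a normalized grayscale histogram to a window's HOG descriptor.
    % We ultimately dropped this from the final feature vector.
    g = rgb2gray(window);            % window: the RGB image patch
    h = imhist(g, 32);               % 32-bin intensity histogram
    feat = [hog(:); h / sum(h)];     % normalize and concatenate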

    3.4 Detection Algorithm

    Our detection algorithm behaves as follows:

1. We begin by resizing a given scene image to a height of 600px. This generally lowered the resolution enough that our algorithm wouldn't run too slowly.

2. We then slide a square window at increasing scales over the test image. We chose fixed pixel scales of 64, 128, 192, 256, and 512.

    3. For each window, we computed the HOG descriptor.

4. We used the k SVMs trained earlier to classify the window using the HOG descriptor for that window, and labeled the window according to the most confident SVM (sketched after this list).

5. We sorted windows by confidence and took the top n (we used n = 20) windows to generate a heat map.
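
A sketch of steps 3 and 4 for a single window. MATLAB's svmclassify returns only class labels, so we assume a small helper, svmDecision (a hypothetical name), that evaluates the signed distance of a descriptor from a trained SVM's hyperplane; vl_hog is VLFeat's HOG implementation, shown purely for illustration:

    % Score one window: the largest signed hyperplane distance across the
    % k per-cluster SVMs becomes the window's confidence.
    hog = vl_hog(im2single(window), 10);          % HOG with 10px cells
    feat = hog(:)';
    dists = zeros(k, 1);
    for i = 1:k
        dists(i) = svmDecision(models{i}, feat);  % hypothetical helper
    end
    [conf, best] = max(dists);                    % most confident SVM wins
    isStapler = conf > 0.1;                       % threshold from Section 3.6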

    3.5 Experiments

For our experiments, we varied the number of clusters to better see the effect of clustering on our training data. We performed 4 runs, with the following results:


3.5.1 Experiment 1

Variable                     Value
Clusters                     12
Positive test images used    61
Negative test images used    100
Distance Threshold           0.1
Accuracy                     0.670807
Precision                    0.546512
Recall                       0.770492

    Table 1: Results of experiment 1 (12 clusters)

    Figure 3: A true positive detection using 12 clusters.


Figure 4: A true negative detection using 12 clusters.

    Figure 5: A false positive detection using 12 clusters.


3.5.2 Experiment 2

Variable                     Value
Clusters                     8
Positive test images used    61
Negative test images used    100
Distance Threshold           0.1
Accuracy                     0.596273
Precision                    0.478723
Recall                       0.737705

    Table 2: Results of experiment 2 (8 clusters)

    Figure 6: A true positive detection using 8 clusters.


Figure 7: A true negative detection using 8 clusters.


Figure 8: A false positive detection using 8 clusters.


3.5.3 Experiment 3

Variable                     Value
Clusters                     4
Positive test images used    61
Negative test images used    100
Distance Threshold           0.1
Accuracy                     0.453416
Precision                    0.390244
Recall                       0.786885

    Table 3: Results of experiment 3 (4 clusters)


Figure 9: A true positive detection using 4 clusters.


Figure 10: A true negative detection using 4 clusters.

    Figure 11: A false positive detection using 4 clusters.


3.5.4 Experiment 4

Variable                     Value
Clusters                     1
Positive test images used    61
Negative test images used    100
Distance Threshold           0.1
Accuracy                     0.571429
Precision                    0.441176
Recall                       0.491803

    Table 4: Results of experiment 4 (1 cluster)

    Figure 12: A true positive detection using 1 cluster.


Figure 13: A true negative detection using 1 cluster.


Figure 14: A false positive detection using 1 cluster.


3.6 Analysis

Through our experimentation, we found that 12 clusters was optimal for detecting our particular stapler. We also conducted a few experiments on smaller test sets with the distance-from-hyperplane threshold set to 0, but increased it to 0.1 to better tolerate noise.

Clustering worked well for us because a single SVM was trying to account for too many orientations of the stapler at once. By separating our data across multiple SVMs via clustering, we made our classification more precise, as seen in our experimental results. We noticed that our performance on negative scenes decreased a bit as we increased the number of clusters, but this can be attributed to the unclustered classifier being poorly trained on too many disparate positive images: it did well on negatives largely because it rarely fired at all, and correspondingly did poorly on positives, since it had a hard time finding anything that matched what it considered to be a stapler.

We found our algorithm to run a little slowly, but we sped it up by applying MATLAB's parallel for (parfor) construct. We suspect improvements can be made to our choice of window sizes: in extreme close-up situations our detector highlighted only portions of the stapler, typically the frontal area, because our maximum window size (512px by 512px) was smaller than the stapler in the image. Since we resize images to a height of 600px, our window sizes work for the general case, but some further tuning could be applied here.
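
The parfor speedup is essentially a one-keyword change. A sketch of parallelizing the per-scale loop, assuming a helper scanScale (hypothetical) that slides the window at one scale and returns that scale's detections:

    % Score each window scale on its own worker (Parallel Computing Toolbox).
    scales = [64 128 192 256 512];
    results = cell(numel(scales), 1);
    parfor s = 1:numel(scales)
        results{s} = scanScale(scene, scales(s), models);  % hypothetical helper
    end
    dets = vertcat(results{:});    % merge detections across scales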

We could also probably tune our SVM parameters further. We are currently using MATLAB's SVM implementation, and we tried a few different kernels: RBF, linear, and quadratic. We found our results were best with the quadratic kernel. We considered tuning the slack parameter but ultimately ran short on time; since our results seemed acceptable, we left this as future work.


3.7 Conclusion

Ultimately we found that more clusters aided our detection of staplers in positive scenes. Through this project, we learned that a multiple-classifier approach can improve overall results for general object detection. We also learned that data quality helps machine-learning-based approaches by leaps and bounds; throughout the project we were constantly refining our dataset and thinking of ways to augment it for robustness to noise. The amount of data required by learning-based detection makes it somewhat more demanding than a method such as SIFT, but the resulting detector is more robust and able to handle transformations and rotations of all sorts. We successfully achieved our project goal: an algorithm that can potentially be used for general object detection within a particular setting, handling affine transformations as well as out-of-plane rotations, as long as the correct training data is supplied to the classifier and an appropriate number of clusters is used.


3.8 Statement of Individual Contribution

Our team met a few times each week using Google Hangout. Together we wrote each component of our algorithm and had head-banging debugging sessions. We went to both regular office hours and office hours by appointment with Professor Lazebnik, as well as with the TA, Thomas Paine.

This was our first time implementing any detection algorithm; our previous experience was limited to using OpenCV's [10] built-in face detection and color-based detection.

    3.9 Future Work

We plan to continue working on the project to approach our original goal of general object detection. An early step would be to speed up the algorithm: we currently take advantage of multiple threads, but we compute the HOG descriptor for every window during the testing phase. It would be more efficient to compute the HOG descriptor once for the entire test image and then slide a window over the cells of that descriptor, as sketched below. This would help us develop and try different approaches quickly.
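
Concretely, the idea is to call the HOG extractor once and then index sub-blocks of the resulting cell grid instead of recomputing per window; a sketch using VLFeat's vl_hog (our cell geometry here is illustrative):

    % Compute HOG once for the whole scene; a window is then a contiguous
    % block of cells, so no per-window recomputation is needed.
    cellSize = 10;
    H = vl_hog(im2single(scene), cellSize);  % rows x cols x channels cell grid
    wc = round(64 / cellSize);               % 64px window, measured in cells
    for r = 1:size(H, 1) - wc + 1
        for c = 1:size(H, 2) - wc + 1
            feat = reshape(H(r:r+wc-1, c:c+wc-1, :), 1, []);
            % ...score feat against the k SVMs as in Section 3.4...
        end
    end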

We would also like to improve our accuracy further. We could look into augmenting our existing feature vector with additional features that help the classifier identify discriminative patterns.

From there, we can make our detector more general by training the classifiers on a variety of staplers and environments. Once this works, we can extend the detector to distinguish between different objects, such as identifying whether a stapler or a laptop is in the scene.

Finally, the dream is for a user to simply provide a test image and a noun describing the object they're looking for, and have the system detect said object in the test image. This would involve pulling images from Google Images or ImageNet for training data in real time.


References

[1] Heather Arthur. Kittydar. 2012. https://github.com/harthur/kittydar

[2] Weiwei Zhang, Jian Sun, Xiaoou Tang. Cat Head Detection - How to Effectively Exploit Shape and Texture Features. ECCV, 2008. http://research.microsoft.com/pubs/80582/ECCV_CAT_PROC.pdf

[3] Tomasz Malisiewicz, Abhinav Gupta, Alexei A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond. ICCV, 2011. http://www.cs.cmu.edu/~tmalisie/projects/iccv11/exemplarsvm-iccv11.pdf

[4] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004. http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf

[5] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part-Based Models. PAMI, 2010. http://lear.inrialpes.fr/~oneata/reading_group/dpm.pdf

[6] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. Last updated April 15, 2010. http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

[7] MathWorks. svmtrain documentation. Accessed May 6, 2013. http://www.mathworks.com/help/stats/svmtrain.html

[8] VLFeat. PEGASOS SVM solver. Accessed May 6, 2013. http://www.vlfeat.org/api/pegasos.html

[9] Jean-Michel Morel and Guoshen Yu. ASIFT: A New Framework for Fully Affine Invariant Image Comparison. SIAM Journal on Imaging Sciences, 2009. http://www.cmap.polytechnique.fr/~yu/publications/SIAM_ASIFT.pdf

[10] OpenCV. Accessed May 6, 2013. http://opencv.org/
