Collision Recognition from a Video - Part A
Students: Adi Vainiger, Eyal Yaacoby
Supervisor: Netanel Ratner
Laboratory of Computer Graphics & Multimedia, Faculty of Electrical Engineering, Technion
Semester: Winter 2012


Back Vision part A

Appendix D: Collision Detection - Collision Course
On a collision course, the lines between the camera centers and the object are almost parallel.

Thus, the reconstructions will be very distant from one another.
We identify this by measuring the scattering of the dynamic points.
Note: this property is not unique to collision courses.

Appendix E: Collision Detection - Clustering Algorithm
We want to count how many balls are needed to cover all the reconstructed points:
  While there are points remaining:
    Choose a random point
    Draw a ball around it
    Remove all the points inside the ball
The number of balls used is the result of the algorithm, and it is used as a metric for point scattering (a code sketch appears after Appendix F below).
We also implemented a k-medoids algorithm [8]; it produced almost the same results but its performance was much worse, so we chose the random algorithm above.

Appendix F: Triangulation Ambiguity
The uncertainty of the reconstruction depends on the angle between the triangulation rays: the reconstructed point has more ambiguity along the ray as the rays become more parallel.
In forward/backward motion the rays are almost parallel, so the reconstruction is even weaker.

[Figure: triangulation rays - less ambiguity (wider angle) vs. higher ambiguity (rays nearly parallel)]
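A minimal sketch of the greedy ball-covering metric from Appendix E, written in Python rather than the project's Matlab code; the function and parameter names (count_covering_balls, radius) are illustrative, and the ball radius is a tuning parameter the slides do not specify:

```python
import numpy as np

def count_covering_balls(points, radius, rng=None):
    """Greedy covering (Appendix E): repeatedly pick a random remaining point,
    remove every point within `radius` of it, and count how many balls were
    needed. The count is used as a scatter metric for the reconstructed points."""
    if rng is None:
        rng = np.random.default_rng()
    remaining = np.asarray(points, dtype=float)   # shape (N, 3)
    balls = 0
    while len(remaining) > 0:
        center = remaining[rng.integers(len(remaining))]
        dist = np.linalg.norm(remaining - center, axis=1)
        remaining = remaining[dist > radius]      # drop all covered points
        balls += 1
    return balls
```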

Background: Feature Detection and Matching
Interest point detection
Laplacian pyramids (computed by DoG)

Interest points are the extrema in scale-space (x,y;s)

[2]

[3]

Feature Detection and Matching - SIFT
Image descriptor, computed for each interest point:
  4x4 grid of cells
  Scale normalization by the level in the pyramid
  Orientation normalization by the largest gradient
  Gradient histogram per cell (per-pixel gradients, 8 quantized directions)

Descriptor size: 4x4x8 = 128 dimensions [4]

Feature Detection and Matching - SIFT Matching
Closest neighbor by Euclidean distance between descriptors [5]
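As a rough illustration of this matching step (not the authors' implementation), nearest-neighbour matching of SIFT descriptors by Euclidean distance could look like this; desc1 and desc2 are assumed to be (N, 128) arrays of descriptors from the two frames:

```python
import numpy as np

def match_descriptors(desc1, desc2):
    """For each 128-D descriptor in frame 1, return the index of the
    descriptor in frame 2 with the smallest Euclidean distance."""
    # Pairwise distance matrix, shape (N1, N2)
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    return list(enumerate(nearest))   # pairs (index in frame 1, index in frame 2)
```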

Feature Detection and Matching - ASIFT
Affine extension of SIFT
ASIFT is much more accurate and gives more features, but is slower than SIFT (~50x).
We used ASIFT for accuracy reasons.

Perspective Projection - Camera Pinhole Model
A 3D point (X0, Y0, Z0) is projected to the image point (U0, V0).

Perspective Projection - Matrix Representation
  Translation and rotation
  Projection
  Ideal camera calibration matrix
  Real camera calibration matrix
Final model of the camera transformation, using homogeneous coordinates.
(Xf, Yf, Zf) are the pinhole coordinates; the image coordinates are obtained by normalization (dividing by the third homogeneous coordinate).
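The equations of this slide did not survive extraction; for reference, the standard pinhole model in homogeneous coordinates has the form below, where K is the calibration matrix, R and t the rotation and translation, and the image point is recovered by dividing by the scale λ (the symbol names here are generic, not necessarily the slide's notation):

```latex
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
  = K \,\bigl[\, R \mid t \,\bigr]
    \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix},
\qquad
K = \begin{pmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}
```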

3D Reconstruction - Fundamental Matrix
Encodes the epipolar geometry between two frames:
  x - 2D point in frame 1 (projection of a 3D world point X)
  x' - 2D point in frame 2 (projection of the same X)
  Fx - epipolar line in frame 2 (also the projection of the epipolar plane onto frame 2)
Geometric constraint: x'^T F x = 0, meaning x' must lie on the line Fx.
rank(F) = 2 [6]

[Figure: epipolar geometry - the 3D point X, its images x and x', and the epipolar line l' = Fx] [6]
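A small sketch (an assumed helper, not from the slides) of the epipolar line and the point-to-line error that is later used as an indicator for dynamic points; x1 and x2 are homogeneous image points of the form (u, v, 1):

```python
import numpy as np

def epipolar_error(F, x1, x2):
    """Distance of x2 from the epipolar line l' = F x1 induced by x1.
    Zero means the pair satisfies the constraint x2^T F x1 = 0 exactly."""
    l = F @ x1                                  # epipolar line (a, b, c) in frame 2
    return abs(x2 @ l) / np.hypot(l[0], l[1])   # |a*u + b*v + c| / sqrt(a^2 + b^2)
```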

3D Reconstruction - Fundamental Matrix: Estimation Using RANSAC
Generate many hypotheses (e.g. 500):
  Choose 8 random point correspondences
  Estimate F from these 8 points (eight-point algorithm)
Choose the best hypothesis: the one that minimizes the sum of errors over all points.
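A simplified sketch of this RANSAC loop, under a few assumptions: the eight-point estimate is unnormalized (the usual Hartley normalization is omitted for brevity), the hypothesis score is the sum of epipolar errors over all points as stated on the slide, and epipolar_error is the helper sketched above. x1 and x2 are (N, 3) arrays of homogeneous correspondences:

```python
import numpy as np

def eight_point(x1, x2):
    """Unnormalized eight-point algorithm: estimate F from >= 8 correspondences."""
    A = np.stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0], x1[:, 1], np.ones(len(x1)),
    ], axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)                  # enforce rank(F) = 2
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

def ransac_fundamental(x1, x2, n_hyp=500, rng=None):
    """Draw random 8-point samples and keep the hypothesis with the
    smallest total epipolar error over all correspondences."""
    if rng is None:
        rng = np.random.default_rng()
    best_F, best_err = None, np.inf
    for _ in range(n_hyp):
        idx = rng.choice(len(x1), size=8, replace=False)
        F = eight_point(x1[idx], x2[idx])
        err = sum(epipolar_error(F, a, b) for a, b in zip(x1, x2))
        if err < best_err:
            best_F, best_err = F, err
    return best_F
```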

3D Reconstruction - Estimating the Transformation Between Frames
Essential matrix E: similar to the fundamental matrix, but defined for normalized coordinates.
It can be written as E = [t]× R, where t and R are the translation and rotation between the two frames ([t]× is the skew-symmetric cross-product matrix of t), and it satisfies x'^T E x = 0 for normalized points.

Decomposing E with SVD gives 4 options:
  R is determined up to a 180° rotation (= 2 options)
  t is determined up to sign (= 2 options)
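A minimal sketch of the standard SVD decomposition that yields the four (R, t) candidates mentioned above (two rotations times two translation signs); this is the textbook construction, not necessarily the authors' exact code:

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs for an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:                    # keep the rotations proper
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt           # R determined up to a 180-degree rotation
    t = U[:, 2]                                 # t determined up to sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```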

3D Reconstruction - Triangulation
We now know the relative translation and rotation (R, t) between the two frames.
We set the first camera at the origin, so the camera matrices are P1 = K[I | 0] and P2 = K[R | t].

We can draw two rays in 3D space, one from each interest point through its camera center.
Ideally these two rays intersect at the real 3D point; realistically, due to noise, they do not intersect.
We approximate the intersection by linearization and error minimization; the result is the reconstructed 3D point. [7]
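A sketch of the linear (DLT) triangulation described above; P1 and P2 are the 3x4 camera matrices K[I | 0] and K[R | t], and x1, x2 are the matched image points (u, v). This is the standard linear approximation, not the full polynomial minimization mentioned in the conclusions:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation: the 3D point X minimizing the algebraic error
    of x1 ~ P1 X and x2 ~ P2 X (homogeneous projections)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]          # back from homogeneous coordinates
```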


Our Approach Block Diagram

Our Implementation Feature Detection & Matching using ASIFT

[Block diagram: Frame 1 and Frame 2 → feature detection & image descriptors → matching interest points → matches]

Our Implementation 3D Reconstruction

[*] Assuming the Calibration Matrix is known

Using the methods explained earlier.
Out of the 4 solutions, we eliminate the 3 impossible ones, in which:
  The angular difference between the frames is larger than 180°
  The reconstructed points are behind the camera

[Block diagram: matches → fundamental matrix[*] → estimating the transformation between frames → triangulation → 3D reconstructed points]

Recognition and Differentiation Between Static and Moving Objects
For N frames we create N-1 reconstructions; each reconstruction is between frames i and i-5.
Reconstructions matching: for each 3D point in the newest reconstruction, we find the closest points in the N-2 earlier reconstructions.

[Block diagram: N-1 sets of 3D reconstructed points → reconstructions matching → variance calculation for each point → static / dynamic feature points]

Recognition and Differentiation Between Static and Moving Objects - Indicators
  Dynamic points have a greater epipolar error
  Dynamic points have a higher variance (computed for each point and its matches)

Variance Normalization
We need to normalize by the expected error, which depends on:
  The distance from the camera
  The angle between the triangulation lines

We set a threshold for each indicator:
  Points with variance above the threshold are classified as dynamic
  Points with variance below the threshold are classified as static
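The normalization formulas themselves did not survive extraction, so this sketch only shows the thresholding step on an already-normalized per-point variance (the names are illustrative):

```python
import numpy as np

def classify_points(normalized_variance, threshold):
    """Split point indices into static / dynamic by comparing each point's
    normalized variance across reconstructions to a threshold."""
    v = np.asarray(normalized_variance)
    dynamic = np.flatnonzero(v > threshold)
    static = np.flatnonzero(v <= threshold)
    return static, dynamic
```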


Collision Detection
Reconstruction using the static points: more accurate reconstructions of the dynamic points than the ones we had.
Estimating the scattering of the dynamic points: on a collision course, the reconstructed points are widely scattered.

We count how many balls are needed to cover all the points; if the count is greater than some threshold (e.g. 10), we assume that some object is on a collision course.
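Using the ball-covering sketch from Appendix E above, the collision decision reduces to a single comparison (radius is the same unspecified tuning parameter; 10 is the example threshold from the slide):

```python
# Flag a collision when the dynamic points are widely scattered.
collision = count_covering_balls(dynamic_points, radius) > 10
```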

[Block diagram: N-1 sets of static and dynamic feature points → estimating the fundamental matrix from the static points → reconstruction of the dynamic points → estimating the dynamic point scattering → is there a collision?]

Results

Synthetic Testing Environment

3D Synthetic World
Objects in the scene are represented by trees (static objects) and cars (moving objects):
  Each tree is a blue box
  Each car is a green box
From each object we randomly choose a predetermined number of 3D points (~64).

The vehicle is represented by a moving camera (drawn as a pink pyramid).
  The camera has an angle relative to the direction of motion
  A picture is taken every 1/20 of a second
The interest points are the perspective projections of the chosen 3D points, and Gaussian noise is added to the 2D projected points (a sketch of this follows the scenario examples below).

3D Synthetic World - Scenario Creation
We chose 6 test scenarios in which the direction of the car changes, e.g.:
Collision direction:

Same direction:
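A sketch of the synthetic measurement model described above (perspective projection of the chosen 3D points followed by Gaussian noise); the noise standard deviation is a free parameter, since the slides do not state its value:

```python
import numpy as np

def project_points(K, R, t, X, noise_sigma=1.0, rng=None):
    """Project 3D world points into the synthetic camera and add Gaussian
    noise to the 2D projections. K: 3x3 calibration matrix, (R, t): camera
    pose, X: (N, 3) array of world points."""
    if rng is None:
        rng = np.random.default_rng()
    Xc = X @ R.T + t                     # world -> camera coordinates
    uv = Xc @ K.T                        # apply the calibration matrix
    uv = uv[:, :2] / uv[:, 2:3]          # perspective division
    return uv + rng.normal(scale=noise_sigma, size=uv.shape)
```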

3D Synthetic World - Scenario Reconstruction Results
Collision direction:

Same direction:

3D Synthetic World - Collision Detection Results:

Conclusions: setting the threshold to 10, we can correctly identify a collision.
  2% false negatives on the collision scenario (collision but no alarm)
  12% false positives on the worst scenario (alarm but no collision)

Synthetic Results
Tests: the error in the 3D reconstruction caused by noise, while changing different parameters.

Reconstruction based on static vs. static & dynamic points:
The error is significantly larger when dynamic points are included.
Conclusion: separating static and dynamic objects is crucial for a reliable 3D reconstruction.
Implementation: after separation, we reconstruct the world based on the static points only.

Synthetic Results
Frame rate: 1-20 frames per second.
The error is very large when comparing consecutive frames.
Conclusion: reconstruction should be based on frames farther apart; the bigger difference between frames makes the noise less significant.

Implementation: reconstruction is based on frames that are 5 frames apart.

Synthetic Results
Camera angle: 0°-90°.
The camera angle significantly affects the error: the larger the angle*, the smaller the error.

* relative to the forward direction

Conclusion: the camera angle creates a larger difference between the frames, so the noise has less effect.

Implementation: the camera should be positioned at an angle relative to the forward direction.

Synthetic Results
Tree distance from the camera: 7-31 meters.
The tree position significantly affects the error: the farther the tree, the less accurate the result.

Number of interest points per object: 32-128.
The more points, the better the result.

Movie Results
Two movie types:
  Camera on a cyclist's helmet
  Camera on a Roomba

Movie Results - Calibration
Using an external toolbox for Matlab to obtain the calibration matrix K.
Fixing the radial distortion using an external algorithm.

Movie Results - Feature Detection and Matching, Dynamic Points
The rolling shutter caused distortion due to the vibrations of the Roomba.
ASIFT misses the dynamic points in the majority of the movies.
Solution: manual feature matching (using the cpselect tool).

Movie Results - Estimating Ego-Motion Using the Essential Matrix
Rotation: the camera was fixed to the robot during the shooting, so we expected a rotation of ~0; the result was as expected.
Translation: the translation size was determined by us; we expected an angle of ~30° between the x and y axes, and the result was around 25°.
Conclusion: the ego-motion is estimated correctly, so we assume the fundamental matrix and the camera calibration are correct.

Movie ResultsReconstruction of the world

Movie Results - Recognition and Differentiation Between Static and Moving Objects: Epipolar Error
The epipolar error does not correlate well with the expected result: we get many static points with a high error and some dynamic points with a low error.
We have decided not to use it.

Movie Results - Recognition and Differentiation Between Static and Moving Objects: Variance
Measuring the variance among several 3D reconstructions.
Distant objects have a high variance, so using the un-normalized variance we cannot distinguish between distant and dynamic points.

Movie Results - Recognition and Differentiation Between Static and Moving Objects: Normalized Variance
1) Distance from the camera - threshold = 0.05

2) Angle between the triangulation lines - threshold = 3.3e-6

We get better results than with the previous methods; still, there are scenes where it doesn't work as expected.

Summary and Conclusions
There were several major problems in the project:
1) Matching features of moving objects
   Doesn't work, largely due to the vibrations during video capture; in a real scenario we expect much less vibration.
2) Classifying static and moving objects
   Even the best algorithm fails in many cases; a form of tracking (e.g. KLT) could help solve this problem.
3) Long running time (~3 minutes per frame)
   Most of the time is spent on ASIFT; a faster feature matching algorithm could resolve this.

Summary and Conclusions - Further Research
Using a tracking algorithm (e.g. KLT):
  Should solve the matching problem and give much better classification between static and moving objects.
Identifying vehicles:
  An algorithm that recognizes vehicles (e.g. Viola and Jones) would allow focusing only on interesting objects instead of the entire frame.
Accurate triangulation:
  Using the full polynomial error estimation instead of the linear approximation.

Thank you for listening.

Appendices

References
[1] E. Dagan, O. Mano, G. P. Stein, A. Shashua, "Forward Collision Warning with a Single Camera," 2004.
[2] Mikhail Sizintsev, http://www.cse.yorku.ca/~sizints
[3] http://www.scholarpedia.org/article/File:Strandvagen2-Laplace1500pts.png
[4] David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2), 2004, pp. 91-110.
[5] http://www.scholarpedia.org/article/SIFT
[6] http://www.consortium.ri.cmu.edu/projMultiView.php
[7] Hartley and Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., p. 311.
[8] http://en.wikipedia.org/wiki/K-medoids