


Predicting Failures of Vision Systems

Peng Zhang¹  Jiuling Wang²  Ali Farhadi³  Martial Hebert⁴  Devi Parikh¹
¹Virginia Tech  ²Univ. of Texas at Austin  ³Univ. of Washington  ⁴Carnegie Mellon University

¹{zhangp, parikh}@vt.edu  [email protected]  [email protected]  [email protected]

Abstract

Computer vision systems today fail frequently. They also fail abruptly without warning or explanation. Alleviating the former has been the primary focus of the community. In this work, we hope to draw the community's attention to the latter, which is arguably equally problematic for real applications. We promote two metrics to evaluate failure prediction. We show that a surprisingly straightforward and general approach, that we call ALERT, can predict the likely accuracy (or failure) of a variety of computer vision systems (semantic segmentation, vanishing point and camera parameter estimation, and image memorability prediction) on individual input images. We also explore attribute prediction, where classifiers are typically meant to generalize to new unseen categories. We show that ALERT can be useful in predicting failures of this transfer. Finally, we leverage ALERT to improve the performance of a downstream application of attribute prediction: zero-shot learning. We show that ALERT can outperform several strong baselines for zero-shot learning on four datasets.

1. Introduction

Computer vision systems today are not perfect. Unfortunately, given the ambiguous nature of many vision problems, they will never be perfect. Our community's primary focus has been on making these systems (let's call them BASESYS¹) more and more accurate. To encourage systematic progress, we have established benchmarks like Caltech 101 [16], PASCAL [14], SUN [60], etc., and we strive to minimize the failures of BASESYS on these benchmarks.

Computer vision as part of a system: It is crucial to keep in mind that in the real world, many applications involve pipelines, where the output of one system is fed into another as input. For instance, models trained to classify local image patches may be fed into probabilistic models for semantic segmentation [51]. Semantic segmentation may be fed to a robot's path planning algorithm for navigating through its environment [41]. Estimates of vanishing points in an image may be fed into a 3D layout estimation algorithm [23, 46]. Attribute predictions may be fed to recognition systems to classify previously unseen object categories [15, 31]. The subsequent system may even be a user: an image memorability predictor [25] may be used by a graphics designer.

¹BASESYS may be a segmentation engine, an attribute predictor, a vanishing point estimator, an iPhone app that predicts aesthetic quality, etc.

Figure 1: Qualitative results of our proposed approach (ALERT) on two segmentation datasets: (a) and (c) show images predicted by ALERT as unreliable (poor performing) images; (b) and (d) show images predicted to be the best performing. For a ground robot (a, b), ALERT correctly flagged the images that are corrupted with strong shadows and over/under-exposure while retaining the images with good contrast. For PASCAL segmentation (c, d), the images for which BASESYS [7] generated poor segmentations are marked by a red border. ALERT clearly favored images with easy-to-segment isolated objects over complex images containing a large number of small regions. See authors' webpage for example segmentations.

In all these cases, it is of course desirable for BASESYS to fail infrequently. But arguably, it is equally desirable for BASESYS to fail gracefully, instead of failing abruptly without warning. While minimizing failures has been the primary focus of the community, embracing and effectively dealing with the failures has been mostly ignored. For a variety of reasons, such as verifiable safety standards for autonomous systems, there is a recent push towards BASESYS that can explain their outputs and are interpretable. In this paper we take a small step in this direction.

We push for a capability where BASESYS can generate a warning: "I am unable to make a reliable decision for this input." Not all applications may have the luxury of taking advantage of such a warning. We argue that many do. For instance, if BASESYS issues a warning that it is unlikely to be able to accurately predict the presence of certain attributes in an image, the recognition system downstream can choose to ignore these unreliable attributes, and only use the reliable ones. This will make the overall recognition pipeline more robust as compared to one that accepts all attribute predictions from BASESYS at face value and propagates the errors. Consider a robot segmenting every frame in the video feed during a field experiment, where it often encounters poor quality images. If BASESYS can identify such frames, where the segmentation output is likely to be inaccurate (see Figure 1(a)), the robot can choose to skip those frames altogether, thus saving computational time, and simply wait for a later, more reliable frame. Depending on the application, a delayed but accurate decision may be preferred over a timely inaccurate decision, which may be catastrophic. If the user of an app can be warned that the output cannot be trusted for a particular input, (s)he can make an informed decision. In image search, if a user is looking for black high-heeled shiny shoes, it might be better to not return images where the high-heel predictor cannot be trusted, and only return images where all three attributes can be reliably detected. Generally, the resultant higher precision may be worth the possibly lower recall.
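To make the image-search use case concrete, here is a minimal sketch of how a downstream system might consume per-attribute reliability scores. This is not code from the paper; the threshold, score values, and function names are illustrative assumptions.

```python
import numpy as np

# Hypothetical inputs: for each image, BASESYS attribute scores and
# ALERT's per-attribute reliability estimates (values are made up).
attribute_scores = {        # BASESYS output: confidence attribute is present
    "black":     np.array([0.9, 0.8, 0.7]),
    "high-heel": np.array([0.6, 0.9, 0.2]),
    "shiny":     np.array([0.8, 0.3, 0.9]),
}
reliability = {             # ALERT output: predicted accuracy of BASESYS
    "black":     np.array([0.95, 0.90, 0.40]),
    "high-heel": np.array([0.85, 0.30, 0.90]),
    "shiny":     np.array([0.92, 0.88, 0.91]),
}

RELIABILITY_THRESHOLD = 0.7   # assumed operating point, tuned per application

def search(query_attributes, score_threshold=0.5):
    """Return indices of images where ALL query attributes are both
    predicted present AND deemed reliable by ALERT."""
    n_images = len(next(iter(attribute_scores.values())))
    keep = np.ones(n_images, dtype=bool)
    for attr in query_attributes:
        reliable = reliability[attr] >= RELIABILITY_THRESHOLD
        present = attribute_scores[attr] >= score_threshold
        # Trade recall for precision: drop images with unreliable predictions.
        keep &= reliable & present
    return np.flatnonzero(keep)

print(search(["black", "high-heel", "shiny"]))  # -> [0]
```

Raising RELIABILITY_THRESHOLD shifts the operating point further towards the "higher precision, lower recall" regime described above.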

How to detect failures? There are two ways to equip a BASESYS with such a warning capability. First, we could analyze the output of the vision system to assess its confidence on any given input. In this case, the confidence evaluation has to be designed specifically for each system. Instead, we explore the alternative approach of evaluating the input itself in order to assess BASESYS's confidence. The main thesis of our approach is that while BASESYS may not be able to reliably predict the correct output for a certain input, predicting the difficulty or ambiguity of the input may still be feasible. A benefit of this approach is that it is applicable to any vision system, because the reliability estimate is generated based on the input alone, without ever looking at the inner workings of BASESYS. This approach has the added benefit of not having to run BASESYS (typically computationally expensive) to assess its confidence. Instead, it can quickly reject inputs that are not reliable to begin with. Finally, failure patterns may be different across domains. Designers of BASESYS cannot foresee all domains in which it will be used. Hence, any confidence estimation or early-rejection mechanisms built into BASESYS may not generalize. Our approach, that we call ALERT, provides a simple tool for the consumer of BASESYS to train a failure predictor catered to his domain.
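One plausible instantiation of this idea, sketched below under stated assumptions, is to train a regressor that maps a holistic image descriptor to the accuracy BASESYS achieved on held-out training images, and then threshold the predicted accuracy at test time to issue warnings. The feature extractor, training data, and threshold here are all illustrative placeholders, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.svm import SVR

def holistic_features(image):
    """Stand-in for an image-level descriptor (e.g., gist); here we just
    use coarse grayscale block means as an illustrative placeholder."""
    h, w = image.shape[:2]
    gray = image.mean(axis=2) if image.ndim == 3 else image
    blocks = [gray[i*h//4:(i+1)*h//4, j*w//4:(j+1)*w//4].mean()
              for i in range(4) for j in range(4)]
    return np.array(blocks)

# Training data: images BASESYS was evaluated on, paired with the accuracy
# BASESYS actually achieved on each (both assumed available; random here).
train_images = [np.random.rand(64, 64, 3) for _ in range(200)]   # placeholder
train_accuracy = np.random.rand(200)                             # placeholder

X = np.stack([holistic_features(im) for im in train_images])
alert = SVR(kernel="rbf").fit(X, train_accuracy)   # ALERT: features -> accuracy

def warn(image, threshold=0.5):
    """Return True if ALERT predicts BASESYS will perform poorly."""
    predicted_accuracy = alert.predict(holistic_features(image)[None, :])[0]
    return predicted_accuracy < threshold
```

Note that this sketch never calls BASESYS at test time, which is exactly the cost advantage argued for above: unreliable inputs can be rejected before the expensive system runs.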

Why is it feasible? The idea of analyzing the input to classification and detection algorithms to assess its suitability is not new, and has been around in a variety of application-driven communities such as Automatic Target Recognition (ATR) [32, 59] and biometrics [20, 53]. In these signal and image processing fields, the distinction between errors due to inherent ambiguities in the feature space and errors due to corrupted input becomes critical. While both may be difficult to recover from, the latter can at least be detected. As the vision community inches closer to having real-world applications, and away from hand-curated datasets such as Caltech 101 [16] or PASCAL [14], we argue that recognizing this distinction becomes critical in our community too. Note that, while analyzing the performance of algorithms as a function of biases in these datasets has been addressed [28, 54], we are advocating the distinct task of predicting the performance of a system on a given input. With this work we hope to bring the community's attention to building self-evaluating systems that can reliably predict their own failures.

A valid concern a reader may have is: if a system that can reliably predict failures of BASESYS can be trained, does that not mean BASESYS could have already been trained to be better? Assuming BASESYS has been trained well, should it not be impossible to train ALERT with accuracy better than chance?² We argue that this reasoning is not applicable to a wide range of vision systems. First, many BASESYS are not learning-based approaches in the first place (e.g., vanishing point estimation as in [33]). Among those that are, for applications where the output space is very large (e.g., exponential for segmentation or any structured output task), ALERT identifying that the label predicted by BASESYS is incorrect provides little information about what the correct label in fact is. Further, the features used to train BASESYS (e.g., local image patches for segmentation) may be complementary to the features used to train ALERT (e.g., image-level gist). It is not trivial to incorporate these features (e.g., gist) into BASESYS itself (e.g., vanishing point estimation or segmentation). Hence, generally speaking, research efforts towards systems such as ALERT are orthogonal to efforts towards improving BASESYS. For applications such as attribute predictors, where BASESYS has a small (binary) label space, ALERT may boil down to identifying images where the response cannot be trusted (e.g., gray-scale images for predicting the attribute "red"), and not images where the response can be confidently classified as being wrong. This is a subtle but important difference. While a good classifier should ideally have reliable confidence estimates, most existing training procedures do not explicitly enforce this. They optimize for assigning correct labels to training images, and not for the