A guided multi-scale categorization of plant species in natural images

Jonas Krause, Kyungim Baek, and Lipyeow Lim
Dept. of Information and Computer Sciences, University of Hawai’i at Manoa

1680 East-West Rd, Honolulu, HI 96822
krausej,kyungim,[email protected]

Abstract

Automatic categorization of plant species in natural images is an important computer vision problem with numerous applications in agriculture and botany. The problem is particularly challenging due to the large number of plant species, the inter-species similarity, the large scale variations in natural images, and the lack of annotated data. In this paper, we present a guided multi-scale approach that segments the regions of interest (containing a plant) from the complex background of a natural image and systematically extracts scale-representative patches based on those regions. These multi-scale patches are used to train state-of-the-art Convolutional Neural Network (CNN) models that analyze a given plant image and determine its species. Focusing specifically on the identification of plant species in natural images, we show that the proposed approach is a very effective way of making deep learning models more robust to scale variations. We perform a comprehensive experimental evaluation of our proposed method over several CNN models. Our best result, on the Inception-ResNet-v2 model, achieves a top-1 classification accuracy of 89.21% for 100 plant species, which represents a 5.4% increase over using random cropping to generate training data.

1. Introduction

Traditionally, botanists analyze different characteristics of a plant to categorize its species. However, identifying a plant species accurately based on visual characteristics requires considerable expertise [27] and is almost impossible for the general public. Therefore, an automated plant classification system has important implications for society at large, not only in the preservation of ecosystem biodiversity and in public education, but also in agricultural activities such as automatic crop analysis, species variability analysis, analysis of phylogenetic relationships, identification of pests and diseases, and identification of invasive species.

Computer vision approaches for plant identification using controlled images have shown promising results [12, 14, 17]. However, a real-world plant categorization system needs to deal with natural images, which is a major challenge for any automated method. While there are dozens of plant identification apps available, they typically do not perform well on unconstrained natural images. The analysis of natural images can be extremely difficult due to complex backgrounds, objects appearing at any scale, occlusions, and the presence of numerous different objects in the same image. While the human visual system handles these factors with ease, an equivalent computational model for plant categorization using natural images is still an open problem.

In this paper, we propose a deep learning-based approach that, given a natural image of a plant, categorizes its species while addressing questions such as: i) If there is a plant in the image, where is it located? ii) What is the most representative area of the image for the plant categorization problem? iii) How can the same plant be handled at different scales? And iv) how can state-of-the-art computer vision methods be better used to categorize the plant species? In particular, we focus on a guided multi-scale approach that is used to train existing CNN models for the fine-grained categorization of plants. Our method exploits current deep learning models to identify regions of interest, i.e., regions containing at least one plant in the natural image. Multiple patches of different sizes are then extracted from those regions. These patches are further post-processed and resized for the particular CNN model used for categorization. To the best of our knowledge, no one has proposed a method designed to extract multi-scale representative patches of plants from natural images. The contributions of this paper are:

• We propose a new approach to make CNNs more robust to scale variations when analyzing plants in natural images.

• We implemented our method with different CNN models for the fine-grained categorization of plants.

• We performed a comprehensive experimental validation and evaluation of the proposed method on different data sets containing natural images. Our results show a considerable improvement in accuracy when the proposed approach is used.

In the next section, we present related work, briefly describing methods proposed to address the multi-scale issue of plants in natural images as well as other fine-grained categorization problems that require a similar scale analysis. The proposed guided multi-scale approach, along with its step-by-step process, is explained in Section 3. A detailed analysis and discussion of the experimental results are presented in Sections 4 and 5. We conclude in Section 6.

2. Related Work

Focusing on multi-scale approaches that handle the analysis of objects in natural images, we survey relevant previous work and organize it according to the method used to address the scale issue.

2.1. Human-in-the-Loop

To improve accuracy and address the multi-scale issue in the fine-grained categorization of plant species, a visual analysis (human-in-the-loop) is implemented in [5], where the correct annotation is done by human labelers, allowing incorrectly classified images to be reintegrated into the dataset after this laborious classification. Another human-in-the-loop approach is proposed by Wah et al. [26], designed for the fine-grained categorization of bird species. Their visual recognition system is composed of a machine and a human user, who is asked to provide additional information by clicking on object parts and answering binary questions. Using a dataset called CUB-200 [28] of 200 bird species and their annotated parts, Wah et al. propose to solve the bird classification problem by analyzing specific areas of the image with the assistance of a human user, who can easily indicate the bird parts (head, beak, body, wing, and tail) independent of the image scale.

2.2. Different Feature Representations

Other computer vision techniques have been proposed to solve the multi-scale issue, though they are not specifically designed for plants. Nevertheless, some of these ideas can be adapted to the fine-grained categorization of plant species and may help the training process of the CNNs. For example, Yang and Ramanan [29] propose an approach that combines different feature representations to address the multi-scale recognition problem. Their method extracts features from multiple layers of a single deep network. Essentially, a directed acyclic graph (DAG) structured CNN is used to learn a set of multi-scale features at each level, which are shared with the final output predictor simultaneously. Experiments are performed using a wide variety of environmental scenes with different backgrounds and objects in them. Reported results suggest that encoding scale-specific features may be beneficial for training CNNs both for general image classification and for fine-grained categorization tasks.

2.3. Multi-Scale Fusion

Returning to plant species categorization, Hu et al. [7] propose a CNN with multi-scale fusion designed for leaf recognition. Using the MK Leaf [13] and the LeafSnap Plant Leaf [12] datasets, a customized CNN is trained by slowly infusing images of multiple resolutions, sampled via bilinear interpolation. In this way, down-sampled images are progressively fed to the CNN, and the features extracted at each level of the network are concatenated to perform a multi-scale analysis. Nevertheless, their method is designed to work only with leaf images taken against controlled backgrounds, limiting its application.

Another fusion method is presented by Karpathy et al. [8], where a multi-resolution CNN architecture is proposed for video classification. In their approach, each frame of the video is fed into two separate streams that converge to fully connected layers at the end. The first stream models low-resolution images, while the second stream processes high-resolution patches cropped at the center of the input frames. As a result, the two streams of the CNN enable a fast, dual-scale classification of each frame of the videos. A similar approach is presented by Mo et al. [16], where patches cropped at the center of the images are analyzed by a dedicated CNN while the entire input image passes through another identical network. The outputs of the two networks are concatenated, and a third CNN is applied on top of the extracted features for a deeper representation and classification. Both methods [8, 16] implement their preprocessing stage by cropping representative areas at the center of the images. However, when dealing with plants in natural images, it should be considered that plants may not be centered, making a guided approach necessary to indicate where the plant is located.

2.4. Pose Normalized Feature Spaces

An approach for the categorization of bird species guided by the selection of useful parts is introduced by Branson et al. [2]. Their approach employs pose normalized CNNs, and a graph-based clustering algorithm is used to learn a compact pose normalization space. In this case, cropped patches of the bird’s head and body, as well as the entire image, are used to train each CNN. The multi-scale issue is addressed by randomly extracting cropped image patches of arbitrary sizes, from which the CNNs learn scale-invariant features. However, all cropped areas have to be resized to a square to fit the first layer of the CNNs, changing their aspect ratio. Paying attention to this detail, Liu et al. [15] present a similar multi-scale approach with additional CNNs called attention networks. These auxiliary networks are independent CNNs incorporated to identify representative square sample areas at two different scales. The two-scale features extracted at the identified locations are combined with the entire input image for classification. As a result, this approach focuses on three main scales to extract and classify the bird’s head, its body, and the entire scene, outperforming the previously described methods in the fine-grained categorization of birds. However, this method relies on annotated parts to construct the match between parts and classes, which makes it difficult to apply to the categorization of plant species. To the best of our knowledge, there is no available dataset with annotated plant species and their respective parts (leaf, flower, bark, stem, fruit, etc.). Therefore, alternative methods to extract representative patches for the fine-grained categorization of plants have to be designed.

2.5. Segmentation-based Approaches

An interesting idea that inspired our guided approach is proposed by Krause et al. [9]. Although developed for the fine-grained categorization of bird species and trained with annotated bounding boxes, their approach does not use part annotations in its classification process. Instead, segmentation and alignment are used to generate part images that are combined to represent the entire bird. Nevertheless, annotated bounding boxes are required for training, limiting the approach’s application to datasets such as CUB-200 [28]. Even so, the idea of segmenting the object to extract representative patches can be adapted for the fine-grained categorization of plant species.

A systematic object detection and segmentation approach for the fine-grained categorization of flowers is presented by Angelova and Zhu [1]. Their method first detects low-level regions that could potentially belong to the object of interest and then performs a full-object segmentation within those regions. They also zoom in on the object, center it, and normalize its size to a single scale, discounting the effects of the background. To understand the benefits of the segmentation step for fine-grained categorization tasks, Angelova and Zhu compare their approach with a baseline model. The baseline model does not use segmentation and is outperformed by their model on all tested datasets, suggesting that the segmentation step helps to improve recognition performance. For plants in natural images, it is difficult to correctly segment all the details of a plant from a complex background and from other similar plants that may be in the same image. However, as suggested by previous studies [1, 9, 10, 11, 19, 20], a segmentation step can help guide the extraction of representative samples from a natural image. Section 3 details how segmentation is incorporated for plant species categorization using CNNs.

3. Multi-Scale Plant Categorization

The proposed guided multi-scale approach exists as part of a larger plant categorization system and framework called WTPlant (What’s That Plant?) [10, 11].

3.1. The WTPlant Framework

The WTPlant framework is a system of CNN models for categorizing plants in natural images, designed to address the challenges of segmenting the plant from a complex background and dealing with the large variation in scale of natural images. WTPlant addresses these issues by:

• Using stacked convolutional blocks for scene parsing.

• Implementing a preprocessing stage for multi-scale analysis.

Figure 1 shows a diagram of the basic building block, or pipeline, of the WTPlant framework. In general, the depicted pipeline first extracts a number of multi-scale patches with the guidance of the segmentation process, predicts the plant species at various scales by classifying the patches individually, and combines those predictions to make a final decision on the plant species. In the WTPlant system, multiple such pipelines can be constructed to classify specific parts of plants (e.g., flower, fruit, and bark), and the classification results of each pipeline can be combined using ensemble-like methods or a final softmax layer. The framework is also designed to be modular and extensible, so that newer and better deep learning models can be incorporated with minimal re-customization. Each component of the framework (e.g., segmentation methods, patch extraction processes, and CNN models) can be independently upgraded or swapped with different implementations. A byproduct of this extensibility is that we are able to use WTPlant to train and evaluate different CNN models.

Figure 1. WTPlant framework.
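To make the modular design concrete, the following is a minimal Python sketch of how such a pipeline could be composed. The segmenter, extractor, and classifier components are hypothetical stand-ins for whatever implementations are plugged into the framework, not the framework's actual interfaces.

    import numpy as np

    class PlantPipeline:
        """One WTPlant-style pipeline: segment, extract patches, classify."""

        def __init__(self, segmenter, extractor, classifier):
            self.segmenter = segmenter    # image -> binary plant mask (or None)
            self.extractor = extractor    # (image, mask) -> list of patches
            self.classifier = classifier  # patch -> class-probability vector

        def predict(self, image):
            mask = self.segmenter(image)
            if mask is None or not mask.any():
                return None               # treated as a "No Plant Image"
            patches = self.extractor(image, mask)
            probs = np.stack([self.classifier(p) for p in patches])
            return probs.mean(axis=0)     # combine per-scale predictions

Because each component is a plain callable, swapping in a different segmentation model or classification CNN only changes the constructor arguments, which mirrors the extensibility claim above.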

3.2. Multi-Scale Approach Guided by Segmentation

In the following, we describe the proposed multi-scale approach, in which images of plants are first segmented in order to extract the multi-scale representative patches used for training CNNs. This is an important process when working with natural images due to the possible presence of other objects in the scene that may adversely affect plant categorization.

Segmenting Plant Region: Given a natural image, a broad analysis of the entire scene is performed to detect the presence of plants. This is one of the major problems in computer vision, called scene parsing: the segmentation and recognition of objects in an image. Using a CNN architecture with stacked convolutional blocks, Zhou et al. [30] developed a cascade segmentation module for the scene parsing problem (henceforth referred to as MIT Scene Parsing). They trained a three-level stacked CNN on a dataset called ADE20K to segment common background objects (sky, road, building, etc.), foreground objects (car, people, plant, etc.), and object parts (car wheels, people’s heads and torsos, etc.). The MIT Scene Parsing model is trained to segment 150 different objects in a scene, including plants. Due to the highly accurate results reported on the segmentation of plants and the use of stacked convolutional blocks, we have incorporated this method into the WTPlant framework’s segmentation process to localize plants in natural images.

The segmentation process produces Regions of Interest (RoI) delimiting the plants’ areas in the input image. If more than one RoI is detected, only the one with the largest area is chosen to represent the plant in the image. (We leave the identification of multiple plants in an image to future work.) If an RoI is found, indicating the potential presence of a plant in the image, the RoI is assumed to contain the most representative information about the plant and is further processed to predict the plant species. If no RoI is identified during the segmentation process, the image is considered a “No Plant Image”.
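A minimal sketch of this RoI selection step, under the assumption that the scene parser returns a per-pixel label map and that plant_class_id is the index the parser assigns to the “plant” class:

    import numpy as np
    from scipy import ndimage

    def largest_plant_roi(label_map, plant_class_id):
        """Return a binary mask of the largest connected plant region, or None."""
        plant_mask = (label_map == plant_class_id)
        if not plant_mask.any():
            return None                                # "No Plant Image"
        components, n = ndimage.label(plant_mask)      # connected components
        sizes = ndimage.sum(plant_mask, components, range(1, n + 1))
        return components == (np.argmax(sizes) + 1)    # keep the largest RoI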

Extracting Multi-Scale Patches: Once the RoI (i.e., the plant region) is identified (green boundary in Figure 2(c)), we first define a square bounding box of the RoI based on its minimum and maximum x and y coordinate values (red square enclosing the green boundary in Figure 2(c)). This bounding box forms the largest patch to be extracted. Second, the centroid (center of mass) of the RoI is calculated (red dot in Figure 2(c)), and we define a close-up area centered at the centroid using the input size of the classification CNN (224x224 or 299x299 pixels). This is the most “zoomed in” patch at the minimum resolution, called the close-up patch (smaller red square in Figure 2(c)). Finally, we extract patches at various scales by placing multiple squares evenly between the close-up patch and the bounding box of the RoI (i.e., the largest square for patch extraction). For example, if n multi-scale patches are extracted, n − 2 squares are evenly placed between the close-up patch and the bounding box. All extracted square areas are resized to fit the first layer of the CNN, resulting in a set of multi-scale patches used to train and test the classification model. Since each patch covers a square region in the image, the extracted areas are not stretched in any way by the resizing. In this guided multi-scale patch extraction process, the size of the close-up patch and the number of patches to be extracted can be customized depending on the input resolution of the classification CNN.

Figure 2 illustrates the guided multi-scale patch extraction process described above. The binary mask (Figure 2(b)) generated by the segmentation process defines the RoI, which is used to guide the extraction of representative patches at various scales. Figure 2(d) shows the resulting patches extracted at 10 different scales. The MIT Scene Parsing model works very well for indicating where the plant is located in a natural scene. Therefore, these extracted multi-scale patches provide well-represented samples for the fine-grained categorization of plants, discarding noisy backgrounds and focusing on highly informative regions of the image. This guided approach is initially set to extract 10 representative multi-scale patches, but it can be programmed to extract more if necessary. Close-up patches around the centroids are extracted at the minimum resolution (or maximum zoom-in) and do not need to be resized. They are extracted with an area of 224x224 pixels for residual networks [6, 21] and 299x299 pixels for inception module networks [3, 22, 24], as recommended in the literature. As described above, all other extracted patches are resized to these common sizes according to the input layer of the classification CNN. Furthermore, the guided multi-scale process allows the extraction of patches of arbitrary size at various scales, so it can be used with most CNN models. After this preprocessing stage of segmentation and multi-scale sample collection, the extracted patches are ready to be fed into the CNNs. The detailed steps of the proposed guided multi-scale method are shown in Algorithm 1.

Figure 2. (a) Input image (Koelreuteria formosana), (b) mask produced by the MIT Scene Parsing [30], (c) bounding box of the RoI (red square enclosing green), centroid (red dot), and close-up patch (red square around the centroid), and (d) multi-scale extracted patches.

Algorithm 1: PATCHEXTRACTION(I, m, n, p)
Input: Image I, mask m, number of patches n, patch size p
Output: A set M of multi-scale patches, each of size p

 1  r_largest ← find the largest RoI in m
 2  mbr ← find the minimum square bounding box of r_largest
 3  c ← calculate the centroid of r_largest
 4  q ← coordinates of the square of size p centered at c
 5  δ ← [Area(mbr) − Area(q)] ÷ (n − 1)
 6  M ← ∅
 7  k ← q                            // current square coordinates
 8  for j ← 1 to n do
 9      i ← crop I using k
10      patch ← resize the cropped image i to size p
11      M ← M ∪ {patch}
12      k ← increase the area of k by δ
13  return M
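The following Python sketch implements Algorithm 1 under simplifying assumptions: the image is a PIL image, the mask is a binary NumPy array, and the patch areas are spaced evenly between the close-up patch and the bounding box, mirroring the δ step.

    import numpy as np
    from PIL import Image

    def extract_patches(image, mask, n=10, p=224):
        """Extract n square multi-scale patches of size p x p (Algorithm 1)."""
        ys, xs = np.nonzero(mask)
        # Minimum square bounding box enclosing the RoI (at least close-up size).
        side = max(xs.max() - xs.min(), ys.max() - ys.min(), p)
        cx, cy = xs.mean(), ys.mean()              # centroid of the RoI
        patches = []
        # Space patch areas evenly from the close-up patch to the bounding box.
        for area in np.linspace(p * p, side * side, n):
            half = np.sqrt(area) / 2.0
            box = (int(cx - half), int(cy - half), int(cx + half), int(cy + half))
            # PIL zero-pads any region that falls outside the image borders.
            patches.append(image.crop(box).resize((p, p), Image.BILINEAR))
        return patches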

4. Experimental Analysis

We implemented the proposed guided multi-scale approach using Python 3.6 and the Keras 2.2.4 API. Our testbed runs the Ubuntu 16.04 operating system and uses an NVIDIA GeForce GTX 1080 Ti GPU to train the CNNs. Two datasets of annotated plant images are used in these experiments: BJFU100 [21] and UHManoa100 [10, 11], each containing 100 different plant species, from the Beijing Forestry University campus and the University of Hawai’i at Manoa campus, respectively. All images used to train and test the CNNs are natural images, presenting complex backgrounds, varying illumination, occlusions, shadows, and a rich local covariance structure.




4.1. Training and Evaluating the CNNs

We implement the training process of our CNNs by splitting the datasets into training and testing sets. The training set is balanced with respect to the plant species to prevent the learning from being biased toward specific species. Patches extracted from the training images are randomly divided, with 80% selected for training and 20% for validation, to perform cross-validation and assess the performance of each model during training. After each epoch of training, the predictive accuracy of the model is calculated on the validation set. The final trained model is the one with the minimum validation error rate after training for a pre-defined number of epochs.
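A minimal Keras sketch of this training protocol follows. The patches and labels arrays and the build_model helper are hypothetical stand-ins, and the optimizer and batch size are illustrative assumptions rather than the exact settings used in our experiments.

    from keras.callbacks import ModelCheckpoint
    from sklearn.model_selection import train_test_split

    # 80/20 random split of the extracted patches for training and validation.
    X_train, X_val, y_train, y_val = train_test_split(
        patches, labels, test_size=0.2, shuffle=True)

    model = build_model(num_classes=100)    # any CNN evaluated in Section 4
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Keep the weights with the best validation accuracy seen across epochs,
    # i.e., the model with the minimum validation error rate.
    checkpoint = ModelCheckpoint('best_model.h5', monitor='val_acc',
                                 mode='max', save_best_only=True)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=100, batch_size=32, callbacks=[checkpoint])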

4.2. Testing Process and Metrics

The trained CNN models are evaluated on the test set, which contains at least one image of each plant species, all unseen by the network during training. Multi-scale patches are extracted from each test image to perform the classification for evaluation. The CNN models make predictions for all patches at the different scales extracted from a test image. These predictions are then averaged to classify the plant present in the image. This final averaging process helps the models make a more robust prediction when categorizing plants in natural images, since the plant is analyzed multiple times at different scales. For performance evaluation, we use the prediction accuracy, i.e., the percentage of test images correctly categorized out of the total number of images in the test set, as the metric. An image is considered correctly categorized when the top-1 output prediction matches the annotated species of the plant in the image.
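A small sketch of this test-time procedure, assuming a trained Keras model and the patch stack produced by the guided extraction step:

    import numpy as np

    def classify_image(model, patches):
        """Average per-patch predictions over all scales; return the top-1 class.

        `patches` is an array of shape (n_patches, height, width, 3) produced
        by the guided multi-scale extraction for a single test image.
        """
        probs = model.predict(patches)      # one probability vector per scale
        return int(np.argmax(probs.mean(axis=0)))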

4.3. Identifying Species with BJFU100 Dataset

Recently, a collection of annotated high-resolution images called BJFU100 was presented by Sun et al. [21]. This is one of the few available datasets containing plants in natural images. The BJFU100 dataset has 100 images per plant species, totaling 10,000 natural images of ornamental plant species found on the campus of the Beijing Forestry University. Sun et al. used this dataset to train and test residual networks (ResNets) of different depths.

To explore the capability of WTPlant’s preprocessing stage to handle scale changes, we implement the guided multi-scale approach described in Section 3.2 using the BJFU100 dataset. This new approach differs from previous ones [10, 11] in three aspects: i) it uses 10 multi-scale patches guided by segmentation, ii) it limits the extracted area to the minimum square bounding the segmented plant, and iii) it converges to the center of mass, or centroid, of the plant region instead of the geometric center. In this way, unlike the previous approaches, which extract patches from multiple locations, it focuses exclusively on multi-scale representative patches of the plant extracted from a single location. The resulting system, called WTPlant, is used to train and test ResNets with 18, 34, and 50 layers, similar to the work of Sun et al. For comparison with our multi-scale approach, the ResNets are also trained with resized training images, as well as with the random and central crop methods.

As done by Sun et al., 80% of the dataset is used for training and the rest for testing. Random crop extracts the same number of patches as used in WTPlant, and the extracted patch size is 224x224 pixels, as indicated in [6]. The ResNets are trained for 100 epochs in a two-fold cross-validation process. Table 1 presents the resulting prediction accuracies. It is clear that, regardless of the number of layers of the ResNets, our proposed multi-scale approach guided by segmentation greatly improves the performance of the networks, significantly outperforming all other approaches. Sun et al. also proposed a customized ResNet architecture with 26 layers, which yielded their best accuracy of 91.78%. However, this is still far below the performance of the ResNet trained and tested using the guided multi-scale approach, which yielded the best accuracy of 97.80%.

Although the BJFU100 dataset provides a fair amount of annotated plant images of high quality, the images in the dataset are all of the same size (3120x4208 pixels) and show relatively small variations in plant location (mostly centered), lighting conditions (captured around the same time of day), and, in particular, scale across samples within each species. These aspects make the dataset relatively easy to classify, which explains the high prediction accuracies presented in Table 1. Even so, the deeper ResNets exhibit underfitting problems, requiring more epochs to be fully trained. In addition, these CNNs may not have learned scale-invariant features because of the small intra-species scale variation in the dataset. Therefore, this dataset may not be the best choice for testing a model that aims to recognize plants in natural images over a wide range of scales.

4.4. Identifying Species with UHManoa100 Dataset

Focusing on capturing variations in the scale and appearance of plants in nature, we constructed the UHManoa100 dataset by collecting 4,500 natural images of plants, 45 images for each of the 100 plant species [11]. For each plant species, a set of test images at different scales and of different parts (leaf, flower, bush, and tree) is set aside for performance evaluation, comprising a test set of around 300 images unseen by the trained models. The annotation of the plant species in this dataset indicates the dominant plant present in the image. Different plants may appear in the background or even in front of the dominant plant, but the largest area of each image is covered by the annotated plant species. Another important characteristic of this dataset is that the images have different resolutions (ranging from 300x300 to 6000x4000 pixels) with varying orientations and locations of plants. Therefore, approaches that focus only on the center of the image, such as Central Crop and [7, 8, 16, 25], are unlikely to perform well on this complex dataset.

The UHManoa100 dataset was first used in the work of Krause et al. [10, 11], where multi-location and multi-scale extraction of representative patches was proposed and used to train and test CNNs. Analyzing the individual patches extracted in WTPlant, as well as in the work by Krause et al., we noticed that the most zoomed-in areas (close-up patches) were not helping the CNNs. As an alternative, WTPlant v2.0 is implemented using only the five larger scales and their respective mirrored images, balancing the training data for a fair comparison. The set of multi-scale patches extracted in this work is available online at https://github.com/jonaskrause/UHManoa100.
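A minimal sketch of this augmentation step, assuming the patches are NumPy arrays and the scales are ordered from smallest to largest; the scales variable is a hypothetical stand-in for the output of the extraction step:

    import numpy as np

    def add_mirrored(patches):
        """Double the training data with a horizontal mirror of each patch."""
        return patches + [np.fliplr(p) for p in patches]

    # e.g., keep the five largest of the ten extracted scales, then mirror:
    # training_patches = add_mirrored(scales[5:])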

4.4.1 Comparison with Various CNN Models

Using this guided multi-scale approach, state-of-the-art CNN models, including those with inception modules, are trained to evaluate how helpful the proposed method is for the training of these CNNs.



Table 1. Top-1 prediction accuracy of ResNets for the BJFU100 dataset.

CNN Model   Resizing   Random Crop   Sun et al. [21]   Central Crop   WTPlant
ResNet18    74.33%     87.78%        89.27%            90.05%         97.80%
ResNet34    71.38%     85.53%        88.28%            83.85%         97.58%
ResNet50    53.73%     73.73%        86.15%            75.25%         95.30%

The six CNN models, incorporating residual blocks [6] and inception modules [3, 22, 23, 24], are listed in Table 2. In this experiment, four different ways of preparing the data are compared: i) resizing images to fit the input layer of the CNNs, ii) extracting patches with Random Crop, iii) extracting patches from the largest central square area (Central Crop), and iv) extracting patches with our proposed method, using the selected scales and their mirrored images (WTPlant v2.0). The resulting performance for each case is shown in columns 2 through 5 of Table 2.

In this experiment, all CNNs are trained from scratch, with randomly initialized weights. Training is conducted for 100 epochs using 20% of the training set for validation. Only top-1 accuracy results are considered, meaning that the presented percentages show the ratio of correctly categorized plants among the 100 species over all test images. The results presented in Table 2 show that WTPlant v2.0 performs best, indicating that the guided multi-scale approach significantly improves the performance of all tested CNNs. The results also show that the inception models perform worse, potentially due to overfitting. In that case, the use of pre-trained models and larger augmented data may improve the accuracy even further. As an initial conclusion, the WTPlant preprocessing step generally improves the performance of CNNs, and a guided multi-scale process that extracts more representative patches can further enhance predictive power compared to other approaches commonly used in the literature.

4.4.2 Fine-Tuning Pre-trained CNN Models

Although there are techniques that can help deal with the limited-data problem to some degree, they may not be enough to provide a good-sized dataset for training deep networks to their best performance. CNN models such as Inception-v3 [24], Inception-ResNet-v2 [22], and Xception [3] have a large number of parameters (up to 54 million), and the lack of training data generally leads to overfitting and poor generalization. Consequently, these CNN models are commonly implemented using pre-trained weights [4]. To fully explore the capability of the six CNN models evaluated in the previous section, a similar experiment has been performed using the UHManoa100 dataset but applying fine-tuning to the pre-trained networks. We use CNN models pre-trained on the ImageNet dataset [18] and fine-tune them by initiating a training process using the learned weights as initial parameter values. In this way, filters learned from a general dataset such as ImageNet can be adapted to the classification of plant species. Using data prepared in the same ways as in the experiment described in the previous section, the pre-trained CNNs are fine-tuned for 50 epochs to classify the 100 plant species.
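A minimal Keras sketch of this fine-tuning setup for Inception-ResNet-v2 follows; the new classification head and the optimizer settings are illustrative assumptions rather than the exact configuration used in our experiments.

    from keras.applications import InceptionResNetV2
    from keras.layers import Dense, GlobalAveragePooling2D
    from keras.models import Model
    from keras.optimizers import SGD

    # Load the convolutional base with ImageNet weights, without the
    # original 1000-class head.
    base = InceptionResNetV2(weights='imagenet', include_top=False,
                             input_shape=(299, 299, 3))
    x = GlobalAveragePooling2D()(base.output)
    outputs = Dense(100, activation='softmax')(x)   # new head: 100 species
    model = Model(inputs=base.input, outputs=outputs)

    # Fine-tune with a small learning rate so the pre-learned ImageNet
    # filters are adapted to plant species rather than overwritten.
    model.compile(optimizer=SGD(lr=1e-3, momentum=0.9),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)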

The results presented in Table 3 show that the use of pre-trained models and fine-tuning improves performance more significantly for the CNNs with inception modules than for the ResNets. This is natural since the number of parameters in the CNNs with inception modules is almost twice the number of parameters of the ResNets trained in this experiment (up to 26 million); the pre-learned weights therefore have more impact on the CNNs with inception modules, allowing them to perform well on a relatively small dataset. The consistently superior performance of WTPlant v2.0 across all CNN models reinforces the hypothesis that the guided multi-scale approach is helpful for training CNNs, even when fine-tuning pre-trained models. Also, the use of pre-trained models dramatically improves the performance of large CNNs, resulting in a prediction accuracy of 89.21% for the Inception-ResNet-v2 model, far better than the best performance (65.11%) of a ResNet18 trained from scratch.

5. Observations and Discussions

To better understand the complexity of the UHManoa100 dataset, we investigated the distribution of the geometric centers of the extracted plant regions used by Krause et al. [10, 11] and of the centroids used in WTPlant v2.0. Point locations are estimated by normalizing all the image sizes and calculating the position relative to the center of the image, as shown in Figure 3. Figure 3(a) shows the distribution of the geometric centers of the UHManoa100 dataset used for patch extraction in [10, 11]. It clusters tightly around the center as well as along the major axes, which is reasonable given that the object of interest is often offset from the center horizontally and/or vertically when the picture is taken. The centroids of the segmented areas of the UHManoa100 dataset are shown in Figure 3(b). They also mostly cluster around the center, but a fair number of the points are scattered all over the images, which means that the plants in many images appear off-center in this dataset. These centroids are the reference points for multi-scale patch extraction in the guiding process described in Section 3.2, ensuring that each extracted patch covers the plant area.
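A small sketch of how these normalized point locations can be computed from the segmentation masks:

    import numpy as np

    def relative_centroid(mask):
        """Centroid of a binary plant mask, normalized so that (0, 0) is the
        image center and each axis ranges over [-0.5, 0.5]."""
        h, w = mask.shape
        ys, xs = np.nonzero(mask)
        return (xs.mean() / w - 0.5, ys.mean() / h - 0.5)

    # Accumulating these points over the dataset yields heatmaps like Figure 3(b).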


Table 2. Top-1 prediction accuracy of CNNs trained from scratch on the UHManoa100 dataset.

CNN Model             Resizing   Random Crop   Central Crop   WTPlant v2.0
ResNet18              39.21%     43.89%        44.24%         65.11%
ResNet34              40.29%     44.60%        42.81%         59.71%
ResNet50              28.42%     43.53%        39.21%         57.19%
Inception-v3          30.58%     40.65%        30.94%         49.28%
Inception-ResNet-v2   35.25%     59.35%        37.41%         62.23%
Xception              28.78%     39.57%        29.50%         53.24%

Table 3. Top-1 prediction accuracy of CNNs fine-tuned using the UHManoa100 dataset.

CNN Model             Resizing   Random Crop   Central Crop   WTPlant v2.0
ResNet18              60.79%     53.60%        60.07%         61.51%
ResNet34              56.83%     51.80%        56.83%         57.91%
ResNet50              53.60%     49.28%        54.68%         56.83%
Inception-v3          71.94%     79.50%        78.42%         85.61%
Inception-ResNet-v2   76.26%     83.81%        79.14%         89.21%
Xception              75.90%     82.37%        83.09%         87.05%

Figure 3. Heatmaps of point locations relative to the image center: (a) geometric centers of the UHManoa100 dataset, and (b) centroids of the UHManoa100 dataset.

Regarding the proposed guided multi-scale method, the WTPlant system demonstrates consistent performance improvement across almost all experiments conducted for the fine-grained categorization of plants in natural images. As shown in Table 1, over 97% accuracy is achieved when the proposed approach is applied to classify plants in the BJFU100 dataset. For the UHManoa100 dataset, WTPlant v2.0 achieves over 89% accuracy when the pre-trained and fine-tuned Inception-ResNet-v2 model is used for classification. These results suggest that WTPlant may help address image-based phenotyping of plants in the wild by observing the properties of each species as it appears at various scales in natural images.

6. Conclusion

In this paper, we present a multi-scale approach guided by segmentation for the training of CNNs designed for fine-grained plant classification tasks. Building on previous work on addressing scale issues in object classification, our approach uses a CNN [30] to parse a natural scene and segment the plant in the image from the complex background. The resulting segmentation information drives the extraction of representative patches at various scales. These patches are then fed into state-of-the-art CNNs, allowing them to learn features that can address the large scale variation occurring in natural images of plant species. The proposed approach extends the previous work [10, 11] and uses segmentation information more efficiently, extracting patches limited to the minimum square bounding the segmented plant and using the center of mass to guide the extraction of multi-scale patches. As a direct result, extracting multi-scale samples under better guidance from the centroids helps to select more scale-representative patches.

A series of comparative experiments is conducted with two datasets of plants in natural images, testing several CNN models. Our experimental analysis shows that (1) the proposed approach is effective in dealing with the large scale variations within natural images, (2) it is robust to varying locations of the plant in the scene, and (3) it consistently improves the classification accuracy of CNN models compared to other approaches.

We also notice that the scales of the patches used for training can have a great influence on performance. Hence, estimating an appropriate set of scales at which to extract patches from each image is an important problem to investigate. In future work, we plan to explore whether the use of fractal dimensions can help us find this adequate range of scales for the extraction of more representative patches.

References

[1] Anelia Angelova and Shenghuo Zhu. Efficient Object Detection and Segmentation for Fine-Grained Recognition. In CVPR, pages 811–818. IEEE Computer Society, 2013.
[2] Steve Branson, Grant Van Horn, Serge J. Belongie, and Pietro Perona. Bird Species Categorization Using Pose Normalized Deep Convolutional Nets. CoRR, abs/1406.2952, 2014.
[3] François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. CoRR, abs/1610.02357, 2016.
[4] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge J. Belongie. Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning. In CVPR, pages 4109–4118. IEEE Computer Society, 2018.
[5] Yin Cui, Feng Zhou, Yuanqing Lin, and Serge J. Belongie. Fine-grained Categorization and Dataset Bootstrapping using Deep Metric Learning with Humans in the Loop. CoRR, abs/1512.05227, 2015.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385, 2015.
[7] Jing Hu, Zhibo Chen, Meng Yang, Rongguo Zhang, and Yaji Cui. A Multiscale Fusion Convolutional Neural Network for Plant Leaf Recognition. IEEE Signal Processing Letters, 25(6):853–857, 2018.
[8] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-Scale Video Classification with Convolutional Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[9] Jonathan Krause, Hailin Jin, Jianchao Yang, and Li Fei-Fei. Fine-grained recognition without part annotations. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5546–5555, June 2015.
[10] Jonas Krause, Gavin Sugita, Kyungim Baek, and Lipyeow Lim. What’s That Plant? WTPlant is a Deep Learning System to Identify Plants in Natural Images. In BMVC, page 330. BMVA Press, 2018.
[11] Jonas Krause, Gavin Sugita, Kyungim Baek, and Lipyeow Lim. WTPlant (What’s That Plant?): A Deep Learning System for Identifying Plants in Natural Images. In ICMR, pages 517–520. ACM, 2018.
[12] Neeraj Kumar, Peter N. Belhumeur, Arijit Biswas, David W. Jacobs, W. John Kress, Ida C. Lopez, and João V. B. Soares. Leafsnap: A Computer Vision System for Automatic Plant Species Identification. In Andrew W. Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, ECCV (2), pages 502–516, 2012.
[13] Sue Han Lee, Chee Seng Chan, Simon Mayo, and Paolo Remagnino. How deep learning extracts and learns leaf features for plant classification. Pattern Recognition, 71:1–13, 2017.
[14] Sue Han Lee, Chee Seng Chan, Paul Wilkin, and Paolo Remagnino. Deep-plant: Plant identification with convolutional neural networks. In 2015 IEEE International Conference on Image Processing (ICIP), pages 452–456, 2015.
[15] Xiao Liu, Tian Xia, Jiang Wang, and Yuanqing Lin. Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition. CoRR, abs/1603.06765, 2016.
[16] Jeff Mo, Eibe Frank, and Varvara Vetrova. Large-scale automatic species identification. In W. Peng, D. Alahakoon, and X. Li, editors, Proceedings of the 30th Australasian Joint Conference on Advances in Artificial Intelligence, pages 301–312. Springer, 2017.
[17] Michael P. Pound, Jonathan A. Atkinson, Darren M. Wells, Tony P. Pridmore, and Andrew P. French. Deep learning for multi-task plant phenotyping. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 2055–2063, October 2017.
[18] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[19] Asma R. Sfar, Nozha Boujemaa, and Donald Geman. Vantage Feature Frames for Fine-Grained Categorization. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 835–842, 2013.
[20] Wen Shi, Fanman Meng, and Qingbo Wu. Segmentation quality evaluation based on multi-scale convolutional neural networks. In 2017 IEEE Visual Communications and Image Processing (VCIP), pages 1–4, December 2017.
[21] Yu Sun, Yuan Liu, Guan Wang, and Haiyan Zhang. Deep Learning for Plant Identification in Natural Environment. Computational Intelligence and Neuroscience, vol. 2017, Article ID 7361042, 2017.
[22] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. CoRR, abs/1602.07261, 2016.
[23] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. CoRR, abs/1409.4842, 2014.
[24] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. CoRR, abs/1512.00567, 2015.
[25] Ahmad P. Tafti, Fereshteh S. Bashiri, Eric LaRose, and Peggy Peissig. Diagnostic Classification of Lung CT Images Using Deep 3D Multi-Scale Convolutional Neural Network. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), pages 412–414, June 2018.
[26] Catherine Wah, Steve Branson, Pietro Perona, and Serge Belongie. Multiclass recognition and part localization with humans in the loop. In 2011 International Conference on Computer Vision, pages 2524–2531, November 2011.
[27] Jana Wäldchen and Patrick Mäder. Plant species identification using computer vision techniques: A systematic literature review. Archives of Computational Methods in Engineering, 25(2):507–543, April 2018.
[28] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[29] Songfan Yang and Deva Ramanan. Multi-scale recognition with DAG-CNNs. CoRR, abs/1505.05232, 2015.
[30] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene Parsing through ADE20K Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.