
Vision-Based Recognition of Mice Home-Cage Behaviors

H. Jhuang E. Garrote N. Edelman T. Poggio A. Steele∗ T. Serre†

McGovern Institute for Brain Research, Massachusetts Institute of Technology
∗Broad Fellows in Brain Circuitry Program, Division of Biology, California Institute of Technology
corresponding authors: † [email protected], ∗ [email protected]

Abstract

We describe a trainable computer vision system enabling the automated analysis of complex mouse behaviors. We also collect and manually annotate a very large video database used for training and testing the system. Our system performs on par with human scoring, as measured from the ground-truth manual annotations. Our video-based software should complement existing sensor-based automated approaches and help develop an adaptable, comprehensive, high-throughput, fine-grained, automated analysis of mouse behavior.

1. Introduction

Automated quantitative analysis of mouse behavior will play a significant role in comprehensive phenotypic analysis, both on the small scale of detailed characterization of individual gene mutants and on the large scale of assigning gene functions across the entire mouse genome [1]. One key benefit of automating behavioral analysis arises from inherent limitations of human assessment: namely cost, time, and reproducibility. Although automation in and of itself is not a panacea for neurobehavioral experiments, it allows for addressing an entirely new set of questions about mouse behavior, such as conducting experiments on time scales that are orders of magnitude larger than traditionally assayed.

Most previous automated systems [3, 5] rely on sensors such as infrared beams or on tracking techniques to monitor behavior. These approaches are limited in the complexity of the behavior that they can measure. While such systems can be used effectively to monitor locomotor activity and perform operant conditioning, they cannot be used to study home-cage behaviors such as grooming, hanging, and smaller movements (termed "micromovements" below). Visual analysis is a potentially powerful complement to these sensor-based approaches for the recognition of such fine animal behaviors.

A few computer-vision systems for the recognition of mouse behaviors have been described recently (a commercial system from CleverSys, Inc., and [2, 7]). They have not yet been tested in a real-world lab setting using long uninterrupted video sequences which contain potentially ambiguous behaviors.

In this paper, we describe a trainable, general-purpose, automated, and potentially high-throughput system for the behavioral analysis of mice in their home-cage.

2. System overview

Our system consists of two stages: (1) a feature computation stage, and (2) a classification stage. In the feature computation stage, a 310-dimensional feature descriptor is computed for each frame of an input sequence based on the motion and the position of the mouse. In the classification stage, the classifier is trained from the feature descriptors and labels of temporal sequences and outputs a label for every frame of a test video sequence. The system is illustrated in Figure 1.

2.1. Feature computation stage

The feature computation stage takes as input a video sequence and outputs, for each frame, a feature vector of 310 dimensions. This vector is the concatenation of 300 motion features and 10 position- and velocity-based features, which are normalized separately before concatenation. A background subtraction procedure is first applied to the input video to compute a foreground mask for the pixels belonging to the animal, based on the instantaneous location of the animal in the cage (Figure 1(A)). The background subtraction procedure is adapted from our previous work on the recognition of human actions [4]. A bounding box centered on the animal is derived from the foreground mask (Figure 1(B)). Two types of features are then computed: position- and velocity-based features as well as motion features. Position- and velocity-based features are computed directly from the foreground mask (Figure 1(C)), and motion features are computed on the bounding box within a hierarchical architecture (Figure 1(D)).
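For concreteness, the following is a minimal sketch (not the authors' released code) of how such a foreground mask and bounding box could be computed with NumPy/SciPy. The background model here is the pixel-wise median frame mentioned in Sec. 2.1; the difference threshold and the choice of keeping only the largest connected component are our own assumptions.

```python
import numpy as np
from scipy import ndimage

def median_background(frames):
    """Approximate the static background as the pixel-wise median frame."""
    return np.median(np.stack(frames, axis=0), axis=0)

def foreground_mask(frame, background, thresh=30):
    """Pixels that differ strongly from the background are labeled foreground."""
    diff = np.abs(frame.astype(float) - background)
    mask = diff > thresh
    # Keep only the largest connected component (assumed to be the mouse).
    labels, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    return labels == (np.argmax(sizes) + 1)

def bounding_box(mask):
    """Tight bounding box (x0, y0, x1, y1) around the foreground mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()
```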


Figure 1. Overview of the proposed system for recognizing the home-cage behavior of mice. The system consists of a feature computation stage (A-F) and a classification stage (G); panel (H) shows example ethograms of the eight behaviors over circadian time (minutes and hours). See text for details.

Motion features The use of motion features is taken from our previous work, which models the organization of the dorsal stream (motion pathway) in the visual cortex and was applied to the recognition of human actions [4]. The model computes features for the space-time volume centered at every frame of an input video sequence via a hierarchy of processing stages, whereby features become increasingly complex and invariant with respect to 2D transformations along the hierarchy. The model starts with the S1/C1 stage, consisting of an array of spatio-temporal filters (9 pixels × 9 pixels × 9 frames) tuned to 4 motion directions equally spaced between 0 and 360 degrees. By convolving the input sequence with these filters, we obtain a sequence of C1 maps, each corresponding to the motion present at a frame along the 4 directions (Figure 1(E)). In the S2/C2 stage, we compute, for every C1 map, a vector of scores that measures the similarity between the motion present in the current map and 300 motion templates (Figure 1(F)). These templates are learned from the motion present in the training video sequences. More specifically, at every spatial position of a C1 map, we perform template matching between a motion template and a patch of the same size as the template, centered at the current position; this matching generates a score. The C2 output is the global maximum of all the scores computed over the current map. We repeat this procedure for all the motion templates, and obtain a 300-dimensional C2 feature vector for each frame.
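The S2/C2 step can be sketched as follows. This is an illustration only: a C1 "map" is assumed to be stored as an (H, W, 4) array of motion-energy responses for one frame, and the similarity measure (negative Euclidean distance between a template and a C1 patch) is a common HMAX-style choice that may differ in detail from [4].

```python
import numpy as np

def c2_features(c1_map, templates):
    """Return one score per template: the best match anywhere in the C1 map."""
    H, W, D = c1_map.shape
    scores = np.empty(len(templates))
    for i, tpl in enumerate(templates):          # tpl has shape (n, n, D)
        n = tpl.shape[0]
        best = -np.inf
        for y in range(H - n + 1):
            for x in range(W - n + 1):
                patch = c1_map[y:y + n, x:x + n, :]
                best = max(best, -np.sum((patch - tpl) ** 2))  # similarity score
        scores[i] = best
    return scores  # a 300-dimensional C2 vector when len(templates) == 300
```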

Learning the dictionary of motion templates In order to learn a set of motion templates that are useful for discriminating between behavior categories, we manually collected a set of 4,200 clips containing the best and most exemplary instances of each behavior (each clip contains one single behavior). This set contains different mice (differing in coat color, size, gender, etc.) recorded at different times during day and night across 12 separate sessions. We first draw 12,000 motion templates, each as a patch of a random C1 map computed from the clips. These templates are of size n × n pixels (n = 4, 8, 12). We then perform feature selection as in [4], retaining the 300 most representative motion templates.
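A rough sketch of the template-sampling step is given below, again assuming C1 maps are stored as (height, width, 4) arrays. The feature-selection step of [4], which reduces the 12,000 candidates to 300, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_templates(c1_maps, n_templates=12000, sizes=(4, 8, 12)):
    """Draw random n x n patches from C1 maps of the exemplary clips."""
    templates = []
    for _ in range(n_templates):
        c1 = c1_maps[rng.integers(len(c1_maps))]     # random C1 map
        n = int(rng.choice(sizes))                   # random template size
        y = rng.integers(c1.shape[0] - n + 1)
        x = rng.integers(c1.shape[1] - n + 1)
        templates.append(c1[y:y + n, x:x + n, :].copy())
    return templates
```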

Position- and velocity-based feature computation In addition to the motion features, we compute an additional set of features derived from the instantaneous location of the animal in the cage (Figure 1(C)). We apply a background subtraction technique to obtain a foreground bounding box centered on the animal. For a static camera as used here, the background can be well approximated by a median frame, in which each pixel value is the median across all frames at the same pixel location. Position- and velocity-based measurements are estimated for each frame based on the 2D coordinates (x, y) of the bounding box. These include the position and the aspect ratio of the bounding box (indicating whether the animal is in a horizontal or vertical posture), the distance of the animal from the feeder, as well as the instantaneous velocity and acceleration. In total, 10 features are used; Figure 1(C) illustrates 6 of them.
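The paragraph above can be made concrete with the following illustrative computation. The exact definition and ordering of the 10 features are not fully specified here, so the hypothetical feeder location FEEDER_XY and the finite-difference velocity/acceleration estimates are assumptions rather than the authors' exact feature set.

```python
import numpy as np

FEEDER_XY = np.array([20.0, 35.0])  # assumption: feeder position known from the cage setup

def frame_features(box, prev_center, prev_velocity, dt=1.0):
    """box = (x0, y0, x1, y1) of the foreground bounding box for one frame."""
    x0, y0, x1, y1 = box
    center = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
    w, h = x1 - x0, y1 - y0
    aspect = h / max(w, 1e-6)                  # vertical vs. horizontal posture
    feeder_dist = np.linalg.norm(center - FEEDER_XY)
    velocity = (center - prev_center) / dt     # finite-difference velocity
    accel = (velocity - prev_velocity) / dt    # finite-difference acceleration
    feats = np.concatenate([center, [w, aspect, feeder_dist], velocity, accel])
    return feats, center, velocity
```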

2.2. Classification stage

Existing systems for the recognition of mouse behavior focus on recognizing highly exemplary instances of behaviors present in short clips (fewer than 100 frames). Performing a reliable phenotyping of an animal requires more than the mere detection of stereotypical, non-ambiguous behaviors. In particular, the proposed system aims at classifying every frame of a video sequence, even for those frames whose underlying actions are difficult to categorize. For this challenging task, the temporal context of a specific behavior becomes an essential source of information; thus, learning an accurate temporal model for the recognition of actions becomes critical. Here we use a Hidden Markov Support Vector Machine (SVMHMM) [6], which is an extension of the SVM to sequential tagging.

Figure 2. Snapshots taken from representative videos for the eight home-cage behaviors of interest: drink, eat, groom, hang, micro-movement, rear, rest, and walk.

SVMHMM combines the advantages of SVMs and HMMs by discriminatively training models that are similar to hidden Markov models. Here we use a first-order transition model. Given an input sequence X = (x_1, ..., x_T) of feature vectors, the model predicts a sequence of labels y = (y_1, ..., y_T) according to the following linear discriminant function:

y = argmax_y Σ_{t=1}^{T} [ x_t · w_{y_t} + w_tr(y_{t−1}, y_t) ]    (1)

Here x_t is the motion + position feature vector described above for the t-th frame of a video sequence, and y_t is the label (one of the behaviors of interest) for the t-th frame. w_{y_t} is an emission weight vector for the label y_t, and w_tr is a transition weight for the transition between the labels y_{t−1} and y_t. These weights are learned from 11 labeled videos (see Sec. 3.1 for the training videos and [6] for the training procedure). Each training video is split into non-overlapping 1-minute segments, each used as a training example X.
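Given learned weights, the maximization in Eq. (1) can be carried out exactly with dynamic programming. The sketch below shows a Viterbi-style decoder under the assumption that the emission weights W (one vector per label) and the transition weights T (one scalar per label pair) are stored as dense NumPy arrays; the SVM-HMM training procedure of [6] is not reproduced.

```python
import numpy as np

def decode(X, W, T):
    """X: (num_frames, 310) features; W: (num_labels, 310); T: (num_labels, num_labels)."""
    emis = X @ W.T                               # emission scores x_t . w_y
    n_frames, n_labels = emis.shape
    score = np.full((n_frames, n_labels), -np.inf)
    back = np.zeros((n_frames, n_labels), dtype=int)
    score[0] = emis[0]
    for t in range(1, n_frames):
        # best previous label for each current label under first-order transitions
        total = score[t - 1][:, None] + T + emis[t][None, :]
        back[t] = np.argmax(total, axis=0)
        score[t] = np.max(total, axis=0)
    labels = np.zeros(n_frames, dtype=int)
    labels[-1] = int(np.argmax(score[-1]))
    for t in range(n_frames - 2, -1, -1):        # backtrace the best path
        labels[t] = back[t + 1, labels[t + 1]]
    return labels
```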

In the classification stage, the SVMHMM model takes as input the sequence of feature vectors of an input video and outputs a predicted label for each frame (Figure 1(G)). The resulting time sequence of labels can then be used to construct ethograms of the animal's behavior. For example, the right panel of Figure 1(H) shows the ethogram of an animal over 24 hours, and the left panel provides a zoomed-in version corresponding to the first 30 minutes of recording. The animal is highly active because a human experimenter had just placed the mouse in a new cage prior to starting the video recording. The animal's behavior alternates between 'walking', 'rearing' and 'hanging' as it explores its new cage.
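As an illustration of how such an ethogram could be rendered from the predicted label sequence (the behavior ordering and the frame rate below are assumptions, not values taken from the paper):

```python
import matplotlib.pyplot as plt
import numpy as np

BEHAVIORS = ["rest", "drink", "eat", "groom", "mmove", "walk", "rear", "hang"]
FPS = 30.0  # assumed frame rate

def plot_ethogram(labels):
    """labels: integer array of per-frame behavior codes indexing BEHAVIORS."""
    labels = np.asarray(labels)
    t = np.arange(len(labels)) / FPS / 60.0          # time in minutes
    for k, name in enumerate(BEHAVIORS):
        on = np.where(labels == k)[0]
        plt.scatter(t[on], np.full(on.shape, k), s=1, marker="|")
    plt.yticks(range(len(BEHAVIORS)), BEHAVIORS)
    plt.xlabel("Circadian Time (min)")
    plt.show()
```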

3. Experiments and results

3.1. Video dataset

Currently, the only public dataset for mouse behavior is a set of short clips and is limited in scope [2]. In order to train and test our system in a real-world lab setting, where mouse behaviors are continuously observed and scored over hours or even days, we collect a dataset, the full database. This set contains 12 continuously labeled videos, i.e., each frame is labeled with a behavior of interest. Each video is 30-60 min in length, resulting in a total of over 10 hours of data. These videos contain different mice recorded at different times. We annotate 8 types of common mouse behaviors:

• drinking: the mouse's mouth is juxtaposed to the tip of the drinking spout

• eating: the mouse reaches for and acquires food from the food bin

• grooming: the fore- or hind-limbs sweep across the face or torso, typically as the animal is reared up

• hanging: grasping of the wire bars with the fore-limbs and/or hind-limbs, with at least two limbs off the ground

• micro-movements: small movements of the animal's head or limbs

• rearing: an upright posture with the forelimbs off the ground

• resting: inactivity or nearly complete stillness

• walking: ambulation

These behaviors are shown in Figure 2.

3.2. Data Annotation

The videos were annotated using a freeware subtitle editing tool, Subtitle Workshop from UroWorks. A team of 8 investigators ('Annotators group 1') was trained to annotate mouse home-cage behaviors. Two annotators of 'Annotators group 1' further performed a secondary screening on these annotations to correct mistakes and make sure the annotation style is consistent throughout the whole database. In order to evaluate the agreement between two independent labelers, we consider a small subset of the full database, denoted the doubly annotated subset. It consists of many short video segments randomly selected from the full database. Each segment is 5-10 min long, and the segments add up to a total of about 1.6 hours of video. The doubly annotated subset has a second set of annotations made by 'Annotators group 2', 4 annotators randomly selected from 'Annotators group 1'.


3.3. Training and Testing the system

The evaluation shown in Table 1 was obtained using a leave-one-out cross-validation procedure, i.e., training the system on all but one of the videos and testing on the left-out video, and repeating this procedure (n = 12) times for all videos. The system accuracy is computed as (total # frames correctly predicted by the system) / (total # frames), and the human-to-human agreement as (total # frames correctly labeled by 'Annotator group 2') / (total # frames). Here a prediction or label is considered 'correct' when it matches the annotations by 'Annotator group 1'.
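The evaluation loop can be summarized by the following sketch. Here fit and predict are placeholders standing in for the SVM-HMM training and per-frame decoding steps described in Sec. 2.2; they are not part of any released interface.

```python
import numpy as np

def leave_one_out_accuracy(videos, fit, predict):
    """videos: list of (features, labels) pairs, one per annotated video.
    fit/predict: user-supplied callables wrapping model training and decoding."""
    correct, total = 0, 0
    for i, (feats, labels) in enumerate(videos):
        train = [v for j, v in enumerate(videos) if j != i]   # hold one video out
        model = fit(train)                                    # train on the rest
        pred = predict(model, feats)                          # per-frame labels
        correct += int(np.sum(np.asarray(pred) == np.asarray(labels)))
        total += len(labels)
    return correct / total       # (# correctly predicted frames) / (# frames)
```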

3.4. Comparison with commercial software and with human performance

We used the annotations of 'Annotator group 1' as ground truth, and compared the performance of the system against a commercial software package, HCS (HomeCageScan 2.0, CleverSys, Inc.), for mouse home-cage behavior classification, and against human manual scoring ('Annotator group 2'). Table 1 shows the comparison. Overall, we found that our system achieves 76.6% agreement with the human labelers ('Annotator group 1') on the doubly annotated subset, a result significantly higher than the HCS and on par with humans ('Annotator group 2'). On the full database, we only compare the system against the HCS; our system outperforms the HCS by 17%. Two online videos demonstrating the automatic scoring of the system are available at http://techtv.mit.edu/videos/5561 and http://techtv.mit.edu/videos/5562.

                          Our system   HCS     Human ('Ann. group 2')
doubly annotated subset   76.6%        60.9%   71.6%
full database             77.6%        61.0%   -

Table 1. Accuracy of our system, human annotators, and the HomeCageScan 2.0 (CleverSys, Inc.) system.

3.5. Discussion

A common source of disagreement, both between human annotators and between humans and the system, concerns the precise boundary between two actions (when one action starts and the other ends). Furthermore, some of our behaviors of interest can be very short (10-20 ms), making it hard to allow for longer tolerances in the precise locations of these boundaries. Disagreement also comes from the ambiguity of actions per se. For example, a mouse standing against the back side of the cage (rearing) looks very similar to a mouse reaching for the food bin (eating), because in both cases the head of the mouse appears to touch the food bin as seen from the front side of the cage. Small movements of an animal's limbs (micro-movements) are sometimes associated with a slow movement of the whole body, blurring the boundary between walking and micro-movements. The videos at http://techtv.mit.edu/videos/5563 and http://techtv.mit.edu/videos/5564 show annotations from two humans simultaneously and illustrate the confusions described above. We believe that errors from the system result from inconsistencies in the annotations produced by multiple annotators.

4. Conclusion

We have applied a biological model of motion processing in the dorsal stream to the recognition of human and animal actions. It has also been suggested that analysis of shape in the ventral stream of the visual cortex may be important for the recognition of actions. Future work will extend the present approach to integrate shape/contour, motion, and sensor information. Another important direction is to extend the study of single-mouse behavior to multiple mice and to social behaviors.

5. Acknowledgments

This research was sponsored by grants from DARPA (IPTO and DSO), NSF-0640097, NSF-0827427, IIT, and the McGovern Institute for Brain Research. ADS was funded by the Broad Fellows Program in Brain Circuitry at Caltech. HJ was funded by the Taiwan National Science Council (TMS-094-1-A032).

References

[1] J. Auwerx and A. El. The European dimension for the mouse genome mutagenesis program. Nature Genetics, 36(9):925–927, 2004.

[2] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.

[3] E. H. Goulding, A. K. Schenk, P. Juneja, A. W. Mackay, J. M. Wade, and L. H. Tecott. A robust automated system elucidates mouse home cage behavioral structure. Proceedings of the National Academy of Sciences, 105(52), 2008.

[4] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In IEEE International Conference on Computer Vision, 2007.

[5] L. P. Noldus, A. J. Spink, and R. A. Tegelenbosch. EthoVision: a versatile video tracking system for automation of behavioral experiments. Behavior Research Methods, Instruments, & Computers, 33(3):398–414, August 2001.

[6] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[7] X. Xue and T. C. Henderson. Feature fusion for basic behavior unit segmentation from video sequences. Robotics and Autonomous Systems, 57(3):239–248, March 2009.