
Learning to Segment Actions from Observation and Narration

Daniel Fried‡  Jean-Baptiste Alayrac†  Phil Blunsom†  Chris Dyer†  Stephen Clark†  Aida Nematzadeh†

†DeepMind, London, UK    ‡Computer Science Division, UC Berkeley

[email protected], {jalayrac,pblunsom,cdyer,clarkstephen,nematzadeh}@google.com

Abstract

We apply a generative segmental model of task structure, guided by narration, to action segmentation in video. We focus on unsupervised and weakly-supervised settings where no action labels are known during training. Despite its simplicity, our model performs competitively with previous work on a dataset of naturalistic instructional videos. Our model allows us to vary the sources of supervision used in training, and we find that both task structure and narrative language provide large benefits in segmentation quality.

1 Learning to Segment Actions

Finding boundaries in a continuous stream is a crucial process for human cognition (Martin and Tversky, 2003; Zacks and Swallow, 2007; Levine et al., 2019; Ünal et al., 2019). To understand and remember what happens in the world around us, we need to recognize the action boundaries as they unfold and also distinguish the important actions from the insignificant ones. This process, referred to as temporal action segmentation, is also an important first step in systems that ground natural language in videos (Hendricks et al., 2017). These systems must identify which frames in a video depict actions – which amounts to distinguishing these frames from background ones – and identify which actions (e.g., boiling potatoes) each frame depicts. Despite recent advances (Miech et al., 2019; Sun et al., 2019), unsupervised action segmentation in videos remains a challenge.

The recent availability of large datasets of naturalistic instructional videos provides an opportunity for modeling of action segmentation in a rich task context (Yu et al., 2014; Zhou et al., 2018; Zhukov et al., 2019; Miech et al., 2019; Tang et al., 2019); in these videos, a person teaches a specific high-level task (e.g., making croquettes) while describing the lower-level steps involved in that task (e.g., boiling potatoes). However, the real-world nature of these datasets introduces many challenges. For example, more than 70% of the frames in one of the YouTube instructional video datasets, CrossTask (Zhukov et al., 2019), consist of background regions (e.g., the video presenter is thanking their viewers), which do not correspond to any of the steps for the video's task.

Work begun while DF was interning at DeepMind. Code is available at https://github.com/dpfried/action-segmentation.

These datasets are interesting because they provide (1) narrative language that roughly corresponds to the activities demonstrated in the videos and (2) structured task scripts that define a strong signal of the order in which steps in a task are typically performed. As a result, these datasets provide an opportunity to study the extent to which task structure and language can guide action segmentation. Interestingly, young children can segment actions without any explicit supervision (Baldwin et al., 2001; Sharon and Wynn, 1998), by tapping into similar cues – action regularities and language descriptions (e.g., Levine et al., 2019).

While previous work mostly focuses on building action segmentation models that perform well on a few metrics (Richard et al., 2018; Zhukov et al., 2019), we aim to provide insight into how various modeling choices impact action segmentation. How much do unsupervised models improve when given implicit supervision from task structure and language, and which types of supervision help most? Are discriminative or generative models better suited for the task? Does explicit structure modeling improve the quality of segmentation? To answer these questions, we compare two existing models with a generative hidden semi-Markov model, varying the degree of supervision.

On a challenging and naturalistic dataset of instructional videos (Zhukov et al., 2019), we find that our model and models from past work both benefit substantially from the weak supervision provided by task structure and narrative language, even on top of rich features from state-of-the-art pretrained action and object classifiers. Our analysis also shows that: (1) Generative models tend to do better than discriminative models of the same or similar model class at learning the full range of step types, which benefits action segmentation; (2) Task structure affords strong, feature-agnostic baselines that are difficult for existing systems to surpass; (3) Reporting multiple metrics is necessary to understand each model's effectiveness for action segmentation; we can devise feature-agnostic baselines that perform well on single metrics despite producing low-quality action segments.

2 Related Work

Typical methods (Rohrbach et al., 2012; Singh et al., 2016; Xu et al., 2017; Zhao et al., 2017; Lea et al., 2017; Yeung et al., 2018; Farha and Gall, 2019) for temporal action segmentation consist of assigning action classes to intervals of videos and rely on manually-annotated supervision. Such annotation is difficult to obtain at scale. As a result, recent work has focused on training such models with less supervision: one line of work assumes that only the order of actions happening in the video is given and uses this weak supervision to perform action segmentation (Bojanowski et al., 2014; Huang et al., 2016; Kuehne et al., 2017; Richard et al., 2017; Ding and Xu, 2018; Chang et al., 2019). Other approaches weaken this supervision and use only the set of actions that occur in each video (Richard et al., 2018), or are fully unsupervised (Sener and Yao, 2018; Kukleva et al., 2019).

Instructional videos have gained interest over the past few years (Yu et al., 2014; Sener et al., 2015; Malmaud et al., 2015; Alayrac et al., 2016; Zhukov et al., 2019) since they enable weakly-supervised modeling: previous work most similar to ours consists of models that localize actions in narrated videos with minimal supervision (Alayrac et al., 2016; Sener et al., 2015; Elhamifar and Naing, 2019; Zhukov et al., 2019).

We present a generative model of action segmentation that incorporates duration modeling, narration, and ordering constraints, and can be trained in all of the above supervision conditions by maximizing the likelihood of the data; while these individual components have appeared in past work, they have not previously been combined in a single model.

Our work is also related to work on inducing or predicting scripts (Schank and Abelson, 1977) either from text (Chambers and Jurafsky, 2008; Pichotta and Mooney, 2014; Rudinger et al., 2015; Weber et al., 2018) or in a grounded setting (Bisk et al., 2019).

3 The CrossTask Dataset

We use the recent CrossTask dataset (Zhukov et al., 2019) of instructional videos. To our knowledge, CrossTask is the only available dataset that has tasks from more than one domain, includes background regions, and provides step annotations and naturalistic language. Other datasets lack one of these; e.g., they focus on one domain (Kuehne et al., 2014) or do not have natural language (Tang et al., 2019) or step annotations (Miech et al., 2019). An example instance from the dataset is shown in Figure 1, and we describe each aspect below.

Tasks Each video comes from a task, e.g., make a latte, with tasks taken from the titles of selected WikiHow articles, and videos curated from YouTube search results for the task name. We focus on the primary section of the dataset, containing 2,700 videos from 18 different tasks.

Steps and canonical order Each task has a set of steps: lower-level action step types, e.g., steam milk and pour milk, which are typically completed when performing the task. Step names consist of a few words, typically naming an action and an object it is applied to. The dataset also provides a canonical step order for each task: an ordering, like a script (Schank and Abelson, 1977; Chambers and Jurafsky, 2008), in which a task's steps are typically performed. For each task, the set of step types and their canonical order were hand-constructed by the dataset creators based on section headers in the task's WikiHow article.


Figure 1: An example video instance from the CrossTask dataset (Sec. 3). The video depicts a task, make pancakes, and is annotated with region segments, which can be either action steps (e.g., pour mixture into pan) or background regions. Videos also are temporally aligned with transcribed narration. We learn to segment the video into these regions and label them with the action steps (or background), without access to region annotations during training.

Annotations Each video in the primary section of the dataset is annotated with labeled temporal segments identifying where steps occur. (In the weak supervision setting, these step segment labels are used only in evaluation, and never in training.) A given step for a task can occur multiple times, or not at all, in any of the task's videos. Steps in a video also need not occur in the task's canonical ordering (although in practice our results show that this ordering is a helpful inductive bias for learning). Most of the frames in videos (72% over the entire corpus) are background – not contained in any step segment.

Narration Videos also have narration text (transcribed by YouTube's automatic speech recognition system) which typically consists of a mix of the task demonstrator describing their actions and talking about unrelated topics. Although narration is temporally aligned with the video, and steps (e.g., pour milk) are sometimes mentioned, these mentions often do not occur at the same time as the step they describe (e.g., "let the milk cool before pouring it"). Zhukov et al. (2019) guide weakly-supervised training using the narration by defining a set of narration constraints for each video, which identify where in the video steps are likely to occur, using similarity between the step names and temporally-aligned narration (see Sec. 6.1).

4 Model

Our generative model of the video features and labeled task segments is a first-order semi-Markov model. We use a semi-Markov model for the action segmentation task because it explicitly models temporal regions of the video, their duration, their probable ordering, and their features.[1] It can be trained in an unsupervised way, without labeled regions, to maximize the likelihood of the features.

[1] Semi-Markov models have also been shown to be successful in the similar domain of speech recognition (e.g., Pylkkonen and Kurimo, 2004).

Timesteps Our atomic unit is a one-second region of the video, which we refer to as a timestep. A video with T timesteps has feature vectors x_{1:T}. The features x_t at timestep t are derived from the video, its narration, or both, and in our work (and past work on the dataset) are produced by pre-trained neural models which summarize some non-local information in the region containing each timestep, which we describe in Sec. 6.3.

Regions Our model segments a video with T timesteps into a sequence of regions, each of which consists of a consecutive number of timesteps (the region's duration). The number of regions K in a video and the duration d_k of each region can vary; the only constraint is that the sum of the durations equals the video length: ∑_{k=1}^{K} d_k = T. Each region has a label r_k, which is either one of the task's step labels (e.g., pour milk) or a special label BKG indicating the region is background. In our most general, unconstrained model, a given task step can occur multiple times (or not at all) as a region label in any video for the task, allowing step repetitions, dropping, and reordering.

Structure We define a first-order Markov (bigram) model over these region labels:

P(r_{1:K}) = P(r_1) ∏_{k=2}^{K} P(r_k | r_{k−1})    (1)

with tabular conditional probabilities. While region labels are part of the dataset, they are primarily used for evaluation: we seek models that can be trained in the unsupervised and weakly-supervised conditions where labels are unavailable. This model structure, while simple, affords a dynamic program allowing efficient enumeration over both all possible segmentations of the video into regions and assignments of labels to the regions, allowing unsupervised training (Sec. 4.1).

Duration Our model, following past work (Richard et al., 2018), parameterizes region durations using Poisson distributions, where each label type r has its own mean duration λ_r: d_k ∼ Poisson(λ_{r_k}). These durations are constrained so that they partition the video: e.g., region r_2 begins at timestep d_1 (after region r_1), and the final region r_K ends at the final timestep T.

Timestep labels The region labels r_{1:K} (step, or background) and region durations d_{1:K} together give a sequence of timestep labels l_{1:T} for all timesteps, where a timestep's label is equal to the label for the region it is contained in.

Feature distribution Our model's feature distribution p(x_t | l_t) is a class-conditioned multivariate Gaussian distribution: x_t ∼ Normal(µ_{l_t}, Σ), where l_t is the step label at timestep t. (We note that the assignment of labels to steps is latent and unobserved during unsupervised and weakly-supervised training.) We use a separate learned mean µ_l for each label type l, both steps and background. Labels are atomic and task-specific, e.g., the step type pour milk when it occurs in the task make a latte does not share parameters with the step add milk when it occurs in the task make pancakes.[2] We use a diagonal covariance matrix Σ which is fixed to the empirical covariance of each feature dimension.[3]

[2] We experimented with sharing steps, or step components, across tasks in initial experiments, but found that it was helpful to have task-specific structural probabilities.

[3] We found that using a shared diagonal covariance matrix outperformed using full or unshared covariance matrices.
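To make the generative story concrete, the following sketch samples a labeled segmentation and its features from the model described above: region labels from the bigram structure distribution, durations from the per-label Poisson distributions, and per-timestep features from the class-conditioned Gaussian. This is an illustrative sketch under assumed parameter shapes, not code from our implementation.

```python
import numpy as np

def sample_video(pi, trans, dur_means, emit_means, emit_cov_diag, T, rng):
    """Sample region labels, durations, and features from the semi-Markov model.

    pi:            (L,) initial label distribution P(r_1)
    trans:         (L, L) label transition matrix P(r_k | r_{k-1})
    dur_means:     (L,) Poisson mean duration per label
    emit_means:    (L, D) Gaussian mean per label
    emit_cov_diag: (D,) shared diagonal covariance
    T:             video length in timesteps
    """
    labels, feats, t, prev = [], [], 0, None
    while t < T:
        # Sample the next region label from the bigram structure model.
        probs = pi if prev is None else trans[prev]
        r = rng.choice(len(probs), p=probs)
        # Sample a duration, truncated so that regions partition the video.
        d = min(max(1, rng.poisson(dur_means[r])), T - t)
        # Emit one feature vector per timestep in the region.
        x = rng.normal(emit_means[r], np.sqrt(emit_cov_diag), size=(d, len(emit_cov_diag)))
        labels.extend([r] * d)
        feats.append(x)
        prev, t = r, t + d
    return np.array(labels), np.concatenate(feats, axis=0)
```

Sampling is only for intuition; training and inference instead use the dynamic program described in Sec. 4.1.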

4.1 Training

In the unsupervised setting, labels l are unavailable at training (they are used only in evaluation). We describe training in this setting, as well as two supervised training methods which we use to analyze properties of the dataset and compare model classes.

Unsupervised We train the generative model as a hidden semi-Markov model (HSMM). We optimize the model's parameters to maximize the log marginal likelihood of the features x^{(i)} for all video instances in the training set:

ML = ∑_{i=1}^{N} log P(x^{(i)}_{1:T_i})    (2)

Applying the semi-Markov forward algorithm (Murphy, 2002; Yu, 2010) allows us to marginalize over all possible sequences of step labels to compute the log marginal likelihood for each video as a function of the model parameters, which we optimize directly using backpropagation and mini-batched gradient descent with the Adam (Kingma and Ba, 2015) optimizer.[4] See Appendix A for optimization details.
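For reference, below is a minimal sketch of the semi-Markov forward recursion used to compute Equation (2), written in log space with an assumed maximum region duration to bound the inner loop; the duration term is supplied as a precomputed (e.g., Poisson) log-probability table. Names and the max_dur bound are illustrative, not taken from our released code.

```python
import numpy as np
from scipy.special import logsumexp

def hsmm_log_marginal(emit_ll, log_pi, log_trans, log_dur, max_dur):
    """Log marginal likelihood log P(x_{1:T}) under a hidden semi-Markov model.

    emit_ll:   (T, L) per-timestep emission log-likelihoods log p(x_t | label)
    log_pi:    (L,) log P(r_1)
    log_trans: (L, L) log P(r_k | r_{k-1})
    log_dur:   (max_dur, L) log P(duration = d | label), for d = 1..max_dur
    """
    T, L = emit_ll.shape
    # cum[t, l]: summed emission log-likelihood of timesteps [0, t) under label l.
    cum = np.vstack([np.zeros(L), np.cumsum(emit_ll, axis=0)])
    # alpha[t, l]: log prob of all segmentations of x[0:t] whose last region has label l.
    alpha = np.full((T + 1, L), -np.inf)
    alpha[0] = log_pi  # prior over the first region's label; no region emitted yet
    for t in range(1, T + 1):
        scores = []
        for d in range(1, min(max_dur, t) + 1):
            seg = cum[t] - cum[t - d]  # (L,) emission score of a length-d region ending at t
            if t - d == 0:
                prev = alpha[0]        # first region: use the prior directly
            else:
                # marginalize over the previous region's label
                prev = logsumexp(alpha[t - d][:, None] + log_trans, axis=0)
            scores.append(prev + log_dur[d - 1] + seg)
        alpha[t] = logsumexp(np.stack(scores, axis=0), axis=0)
    return logsumexp(alpha[T])
```

Differentiating this quantity with an autodiff framework gives the gradient-based view of EM noted in footnote [4].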

Generative supervised Here the labels l are observed; we train the model as a generative semi-Markov model (SMM) to maximize the log joint likelihood:

JL = ∑_{i=1}^{N} log P(l^{(i)}_{1:T_i}, x^{(i)}_{1:T_i})    (3)

We maximize this likelihood over the entire training set using the closed-form solution given the dataset's sufficient statistics (per-step feature means, average durations, and step transition frequencies).
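A sketch of these closed-form estimates, assuming each training video is given as a (features, timestep labels) pair and regions are read off as maximal runs of a single label; all names are illustrative.

```python
import numpy as np

def fit_smm_supervised(videos, num_labels):
    """Closed-form ML estimates from labeled videos: a list of (feats (T, D), labels (T,))."""
    feat_dim = videos[0][0].shape[1]
    feat_sums, feat_counts = np.zeros((num_labels, feat_dim)), np.zeros(num_labels)
    dur_sums, dur_counts = np.zeros(num_labels), np.zeros(num_labels)
    init, trans = np.zeros(num_labels), np.zeros((num_labels, num_labels))
    for feats, labels in videos:
        # Per-label feature sums and counts (for the Gaussian means).
        np.add.at(feat_sums, labels, feats)
        np.add.at(feat_counts, labels, 1)
        # Collapse timestep labels into (label, duration) regions.
        regions = []
        for l in labels:
            if regions and regions[-1][0] == l:
                regions[-1][1] += 1
            else:
                regions.append([l, 1])
        init[regions[0][0]] += 1
        for (a, _), (b, _) in zip(regions, regions[1:]):
            trans[a, b] += 1                      # step transition frequencies
        for l, d in regions:
            dur_sums[l] += d                      # average durations -> Poisson rates
            dur_counts[l] += 1
    means = feat_sums / np.maximum(feat_counts, 1)[:, None]
    poisson_rates = dur_sums / np.maximum(dur_counts, 1)
    init_probs = init / init.sum()
    trans_probs = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)
    return means, poisson_rates, init_probs, trans_probs
```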

Discriminative supervised To train the SMM model discriminatively in the supervised setting, we use gradient descent to maximize the log conditional likelihood:

CL = ∑_{i=1}^{N} log P(l^{(i)}_{1:T} | x^{(i)}_{1:T})    (4)

5 Benchmarks

We identify five modeling choices made in recent work: imposing a fixed ordering on steps (not allowing step reordering); allowing for steps to repeat in a video; modeling the duration of steps; using the language (narrations) associated with the video; and using a discriminative/generative model.

[4] This is the same as mini-batched Expectation Maximization using gradient descent on the M-objective (Eisner, 2016).


                     Richard et al. (2018)   Zhukov et al. (2019)   Ours
step reordering               X                                       X
step repetitions              X                                       X
step duration                 X                                       X
language                                              X               X
generative model              X                                       X

Table 1: Characteristics of each model we compare.

We picked the recent models of Zhukov et al. (2019) and Richard et al. (2018) since they have non-overlapping strengths (see Table 1).

ORDEREDDISCRIM (Zhukov et al., 2019). This work uses a discriminative classifier which gives a probability distribution over labels at each timestep: p(l_t | x_t). Inference finds an assignment of steps to timesteps that maximizes ∑_t log p(l_t | x_t) subject to the constraints that: all steps are predicted exactly once; steps occur in the fixed canonical ordering defined for the task; and one background region occurs between each step. Unsupervised training of the model alternates between inferring labels using the dynamic program, and updating the classifier to maximize the probability of these inferred labels.[5]
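As one way to picture this inference, the sketch below implements a simplified variant of the dynamic program: it places exactly one frame per step, in the canonical order, maximizing the summed classifier log-probabilities. The full procedure additionally scores background and step extents; this simplification, and all names, are ours rather than Zhukov et al.'s.

```python
import numpy as np

def ordered_step_assignment(step_logprobs):
    """Assign one frame to each step, with frames in canonical (increasing) order.

    step_logprobs: (T, S) classifier log-probabilities, columns in canonical step order.
    Returns a strictly increasing list of frame indices, one per step.
    """
    T, S = step_logprobs.shape
    assert T >= S, "video must be at least as long as the number of steps"
    # best[t, s]: best score with step s placed at frame t and steps 0..s-1 at earlier frames.
    best = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    best[:, 0] = step_logprobs[:, 0]
    for s in range(1, S):
        run_best, run_arg = -np.inf, -1   # running max of best[:, s-1] over earlier frames
        for t in range(1, T):
            if best[t - 1, s - 1] > run_best:
                run_best, run_arg = best[t - 1, s - 1], t - 1
            best[t, s] = run_best + step_logprobs[t, s]
            back[t, s] = run_arg
    # Backtrace from the best placement of the final step.
    frames = [int(np.argmax(best[:, S - 1]))]
    for s in range(S - 1, 0, -1):
        frames.append(int(back[frames[-1], s]))
    return list(reversed(frames))
```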

ACTIONSETS (Richard et al., 2018). This work uses a generative model which has structure similar to ours, but uses dataset statistics (e.g., average video length and number of steps) to learn the structure distributions, rather than setting parameters to maximize the likelihood of the data. As in our model, region durations are modeled using a class-conditional Poisson distribution. The feature distribution is modeled using Bayesian inversion of a discriminative classifier (a multi-layer perceptron) with an estimated label prior. The structural parameters of the model (durations and class priors) are estimated using the length of each video, and the number of possible step types. As originally presented, this model depends on knowing which steps occur in a video at training time; for fair comparison, we adapt it to the same supervision conditions as Zhukov et al. (2019) by enforcing the canonical step ordering for the task during both training and evaluation.

[5] To allow the model to predict step regions with duration longer than a single timestep, we modify this classifier to also predict a background class, and incorporate the scores of the background class into the dynamic program.

6 Experimental Setting

We compare models on the CrossTask dataset across supervision conditions. We primarily evaluate the models on action segmentation (Sec. 1). Past work on the dataset (Zhukov et al., 2019) has focused on a step recognition task, where models identify individual timesteps in videos that correspond to possible steps; for comparison, we also report performance for all models on this task.

6.1 Supervision Conditions

In all settings, the task for a given video is known (and hence the possible steps), but the settings vary in the availability of other sources of supervision: step labels for each timestep in a video, and constraints from language and step ordering. Models are trained on a training set and evaluated on a separate held-out testing set, consisting of different videos (from the same tasks).

Supervised Labels for all timesteps l_{1:T} are provided for all videos in the training set.

Fully unsupervised No labels for timesteps are available during training. The only supervision is the number of possible step types for each task (and, as in all settings, which task each video is from). In evaluation, the task for a given video (and hence the possible steps, but not their ordering) is known. We follow past work in this setting (Sener et al., 2015; Sener and Yao, 2018) by finding a mapping from model states to region labels that maximizes label accuracy, averaged across all videos in the task. See Appendix C for details.

Weakly supervised No labels for timesteps are available, but two supervision types are used in the form of constraints (Zhukov et al., 2019):

(1) Step ordering constraints: Step regions are constrained to occur in the canonical step ordering (see Sec. 3) for the task, but steps may be separated by background. We constrain the structure prior distribution p(r_1) and transition distributions p(r_{k+1} | r_k) of the HSMM to enforce this ordering. For p(r_1), we only allow non-zero probability for the background region, BKG, and for the first step in the task's ordering. p(r_k | r_{k−1}) constrains each step type to only transition to the next step in the constrained ordering, or to BKG.[6] As step ordering constraints change the parameters of the model, when we use them we enforce them during both training and testing. While this obviates most of the learned structure of the HSMM, the duration model (as well as the feature model) is still learned.

[6] To enforce ordering when steps are separated by BKG, we annotate BKG labels with the preceding step type (but all BKG labels for a task share feature and duration parameters, and are merged for evaluation).
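One way to encode this constraint is as boolean masks over the structure distributions, using a separate background state per step so that BKG "remembers" the preceding step (footnote [6]); the interleaved state indexing below is an illustrative choice, not the layout of our implementation.

```python
import numpy as np

def ordered_structure(num_steps):
    """Build prior and transition masks for the canonical-order constraint.

    States are interleaved as [BKG_0, STEP_1, BKG_1, ..., STEP_K, BKG_K], where
    BKG_i is a background state annotated with the most recent step i.
    """
    n = 2 * num_steps + 1
    bkg = lambda i: 2 * i        # index of BKG_i
    step = lambda i: 2 * i - 1   # index of STEP_i, i = 1..num_steps
    init_allowed = np.zeros(n, dtype=bool)
    init_allowed[bkg(0)] = True   # start in background, or
    init_allowed[step(1)] = True  # directly in the first step
    trans_allowed = np.zeros((n, n), dtype=bool)
    for i in range(1, num_steps + 1):
        trans_allowed[bkg(i - 1), step(i)] = True       # background -> next step
        trans_allowed[step(i), bkg(i)] = True           # step -> following background
        if i < num_steps:
            trans_allowed[step(i), step(i + 1)] = True  # adjacent steps with no background
    return init_allowed, trans_allowed
```

Disallowed entries of p(r_1) and p(r_k | r_{k−1}) can then be set to zero (negative infinity in log space) and the remaining entries renormalized.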

(2) Narration constraints: These give regions in the video where each step type is likely to occur. Zhukov et al. (2019) obtained these using similarities between word vectors for the transcribed narration and the words in the step labels, and a dynamic program to produce constraint regions that maximize these similarities, subject to the step ordering matching the canonical task ordering. See Zhukov et al. for details. We enforce these constraints in the HSMM by penalizing the feature distributions to prevent any step label from occurring outside of one of the allowed constraint regions for that step. Following Zhukov et al., we only use these narration constraints during training.[7]

[7] We also experiment with using features derived from transcribed narration in Appendix G.
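A sketch of one way to impose the narration constraints during training: a step label's per-timestep emission log-likelihood is heavily penalized outside that step's allowed regions, so the dynamic program effectively cannot place the step there. The penalty value and data layout are illustrative assumptions.

```python
import numpy as np

def apply_narration_constraints(emit_ll, allowed_regions, bkg_labels, penalty=1e4):
    """Penalize step emissions outside their narration-derived constraint regions.

    emit_ll:         (T, L) emission log-likelihoods (a penalized copy is returned)
    allowed_regions: dict mapping step label -> list of (start, end) timestep intervals
    bkg_labels:      set of labels that are never penalized (background states)
    """
    T, L = emit_ll.shape
    mask = np.zeros((T, L), dtype=bool)
    for label in range(L):
        if label in bkg_labels:
            mask[:, label] = True                 # background is always allowed
        else:
            for start, end in allowed_regions.get(label, []):
                mask[start:end, label] = True     # step allowed inside its regions
    return np.where(mask, emit_ll, emit_ll - penalty)
```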

6.2 Evaluation

We use three metrics from past work, outlined here and described in more detail in Appendix D. To evaluate action segmentation, we use two varieties of the standard label accuracy metric (Sener and Yao, 2018; Richard et al., 2018): all label accuracy, which is computed on all timesteps, including background and non-background, as well as step label accuracy: accuracy only for timesteps that occur in a non-background region (according to the ground-truth annotations). Since these two accuracy metrics are defined on individual frames, they penalize models if they don't capture the full temporal extent of actions in their predicted segmentations. Our third metric is step recall, used by past work on the CrossTask dataset (Zhukov et al., 2019) to measure step recognition (defined in Sec. 6). This metric evaluates the fraction of step types which are correctly identified by a model when it is allowed to predict only one frame per step type, per video. A high step recall indicates a model can accurately identify at least one representative frame of each action type in a video.

We also report three other statistics to analyze the predicted segmentations: (1) Sequence similarity: the similarity of the sequence of region labels predicted in the video to the ground truth, using inverse Levenshtein distance normalized to be between 0 and 100. See Appendix D for more details. (2) Predicted background percentage: the percentage of timesteps for which the model predicts the background label. Models with a higher percentage than the ground truth background percentage (72%) are overpredicting background. (3) Number of segments: the number of step segments predicted in a video. Values higher than the ground truth average (7.7) indicate overly-fragmented steps. Sequence similarity and number of segments are particularly relevant for measuring the effects of structure, as they do not factor over individual timesteps (as do the all label and step label accuracies and step recall).

We average values across the 18 tasks in the evaluation set (following Zhukov et al., 2019).
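For concreteness, the sketch below computes these metrics for a single video (before averaging over a task's videos and then over tasks). The sequence similarity normalization shown is one plausible choice rather than a specification of Appendix D, and all names are illustrative.

```python
import numpy as np

def all_label_accuracy(pred, gold):
    """Accuracy over every timestep, background included."""
    return float(np.mean(pred == gold))

def step_label_accuracy(pred, gold, bkg):
    """Accuracy restricted to timesteps whose reference label is a non-background step."""
    keep = gold != bkg
    return float(np.mean(pred[keep] == gold[keep])) if keep.any() else 0.0

def step_recall(single_frame_preds, gold, bkg):
    """Fraction of gold step types whose single predicted frame lands in a gold region.

    single_frame_preds: dict step label -> the one predicted frame index for that step type.
    """
    gold_steps = set(np.unique(gold)) - {bkg}
    if not gold_steps:
        return 0.0
    hits = sum(1 for s in gold_steps
               if s in single_frame_preds and gold[single_frame_preds[s]] == s)
    return hits / len(gold_steps)

def sequence_similarity(pred_regions, gold_regions):
    """Inverse Levenshtein distance between region label sequences, scaled to [0, 100]."""
    m, n = len(pred_regions), len(gold_regions)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(m + 1), np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_regions[i - 1] == gold_regions[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return 100.0 * (1.0 - d[m, n] / max(m, n, 1))
```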

6.3 Features

For our features x_{1:T}, we use the same base features as Zhukov et al. (2019), which are produced by convolutional networks pre-trained on separate activity, object, and audio classification datasets. See Appendix B for details. In our generative models, we apply PCA (following Kuehne et al., 2014 and Richard et al., 2018) to project features to 300 dimensions and decorrelate dimensions (see Appendix B for details).[8]

[8] This reduces the number of parameters that need to be learned in the emission distributions, both by reducing the dimensionality and allowing a diagonal covariance matrix. In early experiments we found PCA improved performance.

7 Results

We first define several baselines based on dataset statistics (Sec. 7.1), which we will find to be strong in comparison to past work. We then analyze each aspect of our proposed model on the dataset in a supervised training setting (Sec. 7.2), removing some error sources of unsupervised learning and evaluating whether a given model fits the dataset (Liang and Klein, 2008).


#    Model                               All Label   Step Label   Step     Sequence     Predicted   Num.
                                         Accuracy    Accuracy     Recall   Similarity   Bkg. %      Segments

Dataset Statistic Baselines (Sec. 7.1)
GT   Ground truth                        100.0       100.0        100.0    100.0        71.9        7.7
B1   Predict background                  71.9        0.0          0.0      9.0          100.0       0.0
B2   Sample from train distribution      54.6        7.2          8.3      12.8         72.4        69.5
B3   Ordered uniform                     55.6        8.1          12.2     55.0         73.0        7.4

Supervised (Sec. 7.2)
  Unstructured
S1   Discriminative linear               71.0        36.0         31.6     30.7         73.3        27.1
S2   Discriminative MLP                  75.9        30.4         27.7     41.1         82.8        13.0
S3   Gaussian mixture                    69.4        40.6         31.5     33.3         68.9        23.9
  Structured
S4   ORDEREDDISCRIM                      75.2        18.1         45.4     54.4         90.7        7.4
S5   SMM, discriminative                 66.0        37.3         24.1     50.5         65.9        8.5
S6   SMM, generative                     60.5        49.4         28.7     46.6         52.4        10.6

Un- and Weakly-Supervised (Sec. 7.3)
  Fully Unsupervised
U1   HSMM (with opt. acc. assignment)    31.8        28.8         10.6     31.0         31.1        15.4
  Ordering Supervision
U2   ACTIONSETS                          40.8        14.0         12.1     55.0         49.8        7.4
U3   ORDEREDDISCRIM (without Narr.)      69.5        0.2          2.8      55.0         97.2        7.4
U4   HSMM + Ord                          55.5        8.3          7.3      55.0         70.6        7.4
  Narration Supervision
U5   HSMM + Narr                         65.7        9.6          8.5      35.1         84.6        4.5
  Ordering + Narration Supervision
U6   ORDEREDDISCRIM                      71.0        1.8          24.5     55.0         97.2        7.4
U7   HSMM + Narr + Ord                   61.2        15.9         17.2     55.0         73.7        7.4

Table 2: Model comparison on the CrossTask validation data. We evaluate primarily using all label accuracy and step label accuracy to evaluate action segmentation, and step recall to evaluate step recognition.

Finally, we move to our main setting, the weakly-supervised setting of past work, incrementally adding step ordering and narration constraints (see Sec. 6.1) to evaluate the degree to which each helps (Sec. 7.3).

Results are given in Table 2 for models trained on the CrossTask training set of primary tasks, and evaluated on the held-out validation set. We will describe and analyze each set of results in turn. See Figure 2 for a plot of models' performance on two key metrics, and Appendix I for example predictions.

7.1 Dataset Statistic Baselines

Table 2 (top block) shows baselines that do not use video (or narration) features, but predict steps according to overall statistics of the training data. These demonstrate characteristics of the data, and the importance of using multiple metrics.

Predict background (B1) Since most timesteps are background, a model that predicts background everywhere can obtain high overall label accuracy, showing the importance of also using step label accuracy as a metric for action segmentation.

Figure 2: Baseline and model performance on two key metrics: step label accuracy (x-axis) and step recall (y-axis). Points are colored according to their supervision type, and labeled with their row number from Table 2. We also label particularly important models.


Sample from the training distribution (B2) For each timestep in each video, we sample a label from the empirical distribution of step and background label frequencies for the video's task in the training data.

Ordered uniform (B3) For each video, we predict step regions in the canonical step order, separated by background regions. The length of each region is set so that all step regions in a video have equal duration, and the percentage of background timesteps is equal to the corpus average. See Uniform in Figure 3a for sample predictions.
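A sketch of how such an ordered uniform prediction can be constructed for one video; the rounding and spacing choices here are illustrative.

```python
import numpy as np

def ordered_uniform_baseline(T, ordered_steps, bkg, bkg_fraction=0.72):
    """Predict steps in canonical order, separated by equal-length background regions.

    T:             number of timesteps in the video
    ordered_steps: step labels in the task's canonical order
    bkg:           background label
    bkg_fraction:  corpus-level fraction of background timesteps
    """
    pred = np.full(T, bkg)
    n = len(ordered_steps)
    step_total = int(round((1.0 - bkg_fraction) * T))
    step_len = max(1, step_total // n)
    # n step regions separated by n + 1 background gaps of (roughly) equal length.
    gap = max(0, (T - n * step_len) // (n + 1))
    t = gap
    for s in ordered_steps:
        end = min(t + step_len, T)
        pred[t:end] = s
        t = end + gap
    return pred
```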

Sampling each timestep label independently from the task distribution (row B2), and using a uniform step assignment in the task's canonical ordering with background (B3), both obtain similar step label accuracy, but the ordered uniform baseline improves substantially on the step recall metric, indicating that step ordering is a useful inductive bias for step recognition.

7.2 Full Supervision

Models in the unstructured block of Table 2 are classification models applied independently to all timesteps, allowing us to compare the performance of the feature models used as components in our structured models. We find that a Gaussian mixture model (row S3), which is used as the feature model in the HSMM, obtains comparable step recall and substantially higher step label accuracy than a discriminative linear classifier (row S1) similar to the one used in Zhukov et al. (2019), which is partially explained by the discriminative classifier overpredicting the background class (comparing Predicted Background % for those two rows). Using a higher-capacity discriminative classifier, a neural net with a single hidden layer (MLP), improves performance over the linear model on several metrics (row S2); however, the MLP still overpredicts background, substantially underperforming the Gaussian mixture on the step label accuracy metric.

In the structured block of Table 2, we compare the full models which use step constraints (Zhukov et al., 2019) or learned transition distributions (the SMM) to model task structure. The structured models learn (or in the case of Zhukov et al., enforce) orderings over the steps, which greatly improve their sequence similarity scores when compared to the unstructured models, and decrease step fragmentation (as measured by num. segments). Figure 3a shows predictions for a typical video, demonstrating this decreased fragmentation.[9]

We see two trends in the supervised results:

(1) Generative models obtain substantially higher step label accuracy than discriminative models of the same or similar class. This is likely due to the fact that the generative models directly parameterize the step distribution. (See Appendix E.)

(2) Structured sequence modeling naturally improves performance on sequence-level metrics (sequence similarity and number of segments predicted) over the unstructured models. However, none of the learned structured models improve on the strong ordered uniform baseline (B3), which just predicts the canonical ordering of a task's steps (interspersed with background regions). This will motivate using this canonical ordering as a constraint in unsupervised learning.

Overall, the SMM models obtain strong action segmentation performance (high step label accuracy without fragmenting segments or overpredicting background).

7.3 No or Weak Supervision

Here models are trained without supervision for the labels l_{1:T}. We compare models trained without any constraints to those that use constraints from step ordering and narration, in the Un- and Weakly-Supervised block of Table 2. Example outputs are shown in Appendix I.

Our generative HSMM model affords training without any constraints (row U1). This model has high step label accuracy (compared to the other unsupervised models) but low all label accuracy, and similar scores for both metrics. This hints, and other metrics confirm, that the model is not adequately distinguishing steps from background: the percentage of predicted background is very low (31%) compared to the ground truth (72%, row GT). See HSMM in Figure 3b for predictions for a typical video.

[9] We also perform an ablation study to understand the effect of the duration model. See Appendix F for details.

Figure 3: Step segmentation visualizations for two sample videos in supervised (left) and unsupervised (right) conditions. The x-axes show timesteps, in seconds. (a) Step segmentations in the full supervision condition for a video from the make kimchi fried rice task, comparing the ground truth (GT), the ordered uniform baseline (Uniform), and predictions from the Gaussian mixture (GMM) and semi-Markov (SMM) models. (b) Step segmentations in the no- or weak-supervision conditions for a video from the make pancakes task, comparing the ground truth (GT) to predictions from our model without (HSMM) and with constraint supervision (HSMM+Narr+Ord), and from Zhukov et al. (2019) (ORDEREDDISCRIM). See Appendix I for more visualizations.

These results are attributable to features within a given video (even across step types) being more similar than features of the same step type in different videos (see Appendix H for feature visualizations). The induced latent model states typically capture this inter-video diversity, rather than distinguishing steps across tasks.

We next add in constraints from the canonical step ordering, which our supervised results showed to be a strong inductive bias. Unlike in the fully unsupervised setting, the HSMM model with ordering (HSMM+Ord, row U4) learns to distinguish steps from background when constrained to predict each step region once in a video, with predicted background timesteps (70.6%) close to the ground truth (72%). However, performance of this model is still very low on the task metrics – comparable to or underperforming the ordered uniform baseline with background (row B3) on all metrics.

This constrained step ordering setting also allows us to apply ACTIONSETS (Richard et al., 2018) and ORDEREDDISCRIM (Zhukov et al., 2019). ACTIONSETS obtains high step label accuracy, but substantially underpredicts background, as evidenced by both the all label accuracy and the low predicted background percentage. The tendency of ORDEREDDISCRIM to overpredict background which we saw in the supervised setting (row S4) is even more pronounced in this weakly-supervised setting (row U3), resulting in scores very close to the predict background baseline (B1).

Next, we use narration constraints (U5), which are enforced only during training time, following Zhukov et al. (2019). Narration constraints substantially improve all label accuracy (comparing U1 and U5). However, the model overpredicts background, likely because it doesn't enforce each step type to occur in a given video. Overpredicting background causes step label accuracy and step recall to decrease.

Finally, we compare the HSMM and ORDEREDDISCRIM models when using both narration constraints (in training) and ordering constraints (in training and testing) in the ordering + narration block. Both models benefit substantially from narration on all metrics when compared to using only ordering supervision, more than doubling their performance on step label accuracy and step recall (comparing U6 and U7 to U3 and U4).

Our weakly-supervised results show that:

(1) Both action segmentation metrics – all label accuracy and step label accuracy – are important to evaluate whether models adequately distinguish meaningful actions from background.

(2) Step constraints derived from the canonical step ordering provide a strong inductive bias for unsupervised step induction. Past work requires these constraints, and the HSMM, when trained without them, does poorly, learning to capture diversity across videos rather than to identify steps.

(3) However, ordering supervision alone is not sufficient to allow these models to learn better segmentations than a simple baseline that just uses the ordering to assign labels (ordered uniform); narration is also required.

7.4 Comparison to Past Work

Finally, we compare our full model to the ORDEREDDISCRIM model of Zhukov et al. (2019) in the primary data evaluation setup from that work: averaging results over 20 random splits of the primary data (Table 3).


                    All Label Acc.   Step Label Acc.   Step Recall
ORDEREDDISCRIM           71.3              1.2             17.9
HSMM+Narr+Ord            66.0              5.6             14.2

Table 3: Unsupervised and weakly supervised results in the cross-validation setting.

This is a low-data setting which uses only 30 videos per task as training data in each split.

Accordingly, both models have lower performance, although the relative ordering is the same: higher step label accuracy for the HSMM, and higher all label accuracy and step recall for ORDEREDDISCRIM. Although in this low-data setting models overpredict background even more, this problem is less pronounced for the HSMM: 97.4% of timesteps for ORDEREDDISCRIM are predicted background (explaining its high all label accuracy), and 87.1% for the HSMM.

8 Discussion

We find that unsupervised action segmentation in naturalistic instructional videos is greatly aided by the inductive bias given by typical step orderings within a task, and by narrative language describing the actions being done. While some results are more mixed (with the same supervision, different models are better on different metrics), we do observe that across settings and metrics, step ordering and narration increase performance. Our results also illustrate the importance of strong baselines: without weak supervision from step orderings and narrative language, even state-of-the-art unsupervised action segmentation models operating on rich video features underperform feature-agnostic baselines. We hope that future work will continue to evaluate broadly.

While action segmentation in videos from diverse domains remains challenging – videos contain both a large variety of types of depicted actions, and high visual variety in how the actions are portrayed – we find that structured generative models provide a strong benchmark for the task due to their abilities to capture the full diversity of action types (by directly modeling distributions over action occurrences), and to benefit from weak supervision. Future work might explore methods for incorporating richer learned representations, both of the diverse visual observations in videos and of their narrative descriptions, into such models.

Acknowledgments

Thanks to Dan Klein, Andrew Zisserman, Lisa Anne Hendricks, Aishwarya Agrawal, Gabor Melis, Angeliki Lazaridou, Anna Rohrbach, Justin Chiu, Susie Young, the DeepMind language team, and the anonymous reviewers for helpful feedback on this work. DF is supported by a Google PhD Fellowship.

References

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. 2016. Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Dare A. Baldwin, Jodie A. Baird, Megan M. Saylor, and M. Angela Clark. 2001. Infants parse dynamic action. Child Development, 72(3):708–717.

Yonatan Bisk, Jan Buys, Karl Pichotta, and Yejin Choi. 2019. Benchmarking hierarchical script knowledge. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4077–4085. Association for Computational Linguistics.

Piotr Bojanowski, Remi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. 2014. Weakly supervised action labeling in videos under ordering constraints. In Proceedings of the European Conference on Computer Vision (ECCV).

Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Chien-Yi Chang, De-An Huang, Yanan Sui, Li Fei-Fei, and Juan Carlos Niebles. 2019. D3TW: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).


Li Ding and Chenliang Xu. 2018. Weakly-supervised action segmentation with iterative soft boundary assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proceedings of the Workshop on Structured Prediction for NLP.

Ehsan Elhamifar and Zwe Naing. 2019. Unsupervised procedure learning via joint dynamic summarization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Yazan Abu Farha and Jurgen Gall. 2019. MS-TCN: Multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. 2016. Connectionist temporal modeling for weakly supervised action labeling. In Proceedings of the European Conference on Computer Vision (ECCV).

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Dan Klein and Christopher D. Manning. 2002. Conditional structure versus conditional estimation in NLP models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Hilde Kuehne, Ali Arslan, and Thomas Serre. 2014. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Hilde Kuehne, Alexander Richard, and Juergen Gall. 2017. Weakly supervised learning of actions from transcripts. In CVIU.

Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97.

Anna Kukleva, Hilde Kuehne, Fadime Sener, and Jurgen Gall. 2019. Unsupervised learning of action classes with continuous temporal embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Colin Lea, Michael D. Flynn, Rene Vidal, Austin Reiter, and Gregory D. Hager. 2017. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Dani Levine, Daphna Buchsbaum, Kathy Hirsh-Pasek, and Roberta M. Golinkoff. 2019. Finding events in a continuous world: A developmental account. Developmental Psychobiology, 61(3):376–389.

Percy Liang and Dan Klein. 2008. Analyzing the errors of unsupervised learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(Nov):2579–2605.

Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nicholas Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What's cookin'? Interpreting cooking videos using text, speech and vision. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Bridgette A. Martin and Barbara Tversky. 2003. Segmenting ambiguous events. In Proceedings of the Annual Meeting of the Cognitive Science Society.

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. 2019. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Kevin Murphy. 2002. Hidden semi-Markov models. Unpublished tutorial.

Karl Pichotta and Raymond Mooney. 2014. Statistical script learning with multi-argument events. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 220–229. Association for Computational Linguistics.


Janne Pylkkonen and Mikko Kurimo. 2004. Duration modeling techniques for continuous speech recognition. In Eighth International Conference on Spoken Language Processing.

Alexander Richard, Hilde Kuehne, and Juergen Gall. 2017. Weakly supervised action learning with RNN based fine-to-coarse modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Alexander Richard, Hilde Kuehne, and Juergen Gall. 2018. Action sets: Weakly supervised action segmentation without ordering constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012. A database for fine grained activity detection of cooking activities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Rachel Rudinger, Vera Demberg, Ashutosh Modi, Benjamin Van Durme, and Manfred Pinkal. 2015. Learning to predict script events from domain-specific text. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 205–210.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Psychology Press.

Fadime Sener and Angela Yao. 2018. Unsupervised learning and segmentation of complex activities from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

O. Sener, A. Zamir, S. Savarese, and A. Saxena. 2015. Unsupervised semantic parsing of video collections. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Tanya Sharon and Karen Wynn. 1998. Individuation of actions from continuous motion. Psychological Science, 9(5):357–362.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR).

Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel, and Ming Shao. 2016. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743.

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. COIN: A large-scale dataset for comprehensive instructional video analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ercenur Ünal, Yue Ji, and Anna Papafragou. 2019. From event representation to linguistic meaning. Topics in Cognitive Science.

Noah Weber, Leena Shekhar, Niranjan Balasubramanian, and Nathanael Chambers. 2018. Hierarchical quantized representations for script generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3783–3792.

Huijuan Xu, Abir Das, and Kate Saenko. 2017. R-C3D: Region convolutional 3D network for temporal activity detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2018. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision (IJCV), 126(2-4):375–389.

Shoou-I Yu, Lu Jiang, and Alexander Hauptmann. 2014. Instructional videos for unsupervised harvesting and learning of action examples. In Proceedings of the ACM International Conference on Multimedia (MM).

Shun-Zheng Yu. 2010. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.

Jeffrey M. Zacks and Khena M. Swallow. 2007. Event segmentation. Current Directions in Psychological Science, 16(2):80–84.

Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Proceedings of the Conference on Artificial Intelligence (AAAI).

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. 2019. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).


A Optimization

For both training conditions for our semi-Markov models that require gradient descent (generative unsupervised and discriminative supervised), we initialize parameters randomly and use Adam (Kingma and Ba, 2015) with an initial learning rate of 5e-3 and a batch size of 5 videos, and decay the learning rate when the training log likelihood does not improve for more than one epoch.
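As a rough illustration of this setup, here is a minimal PyTorch sketch; it assumes the semi-Markov model is an nn.Module exposing a hypothetical batch_log_likelihood method, and the decay factor and epoch count are placeholders not specified above.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

def train(model, video_batches, num_epochs=30):
    # Adam with the initial learning rate described above; each batch holds 5 videos.
    optimizer = Adam(model.parameters(), lr=5e-3)
    # Decay the learning rate when training log likelihood has not improved for more
    # than one epoch (mode='max' because we maximize log likelihood); the decay
    # factor is an assumption, not taken from the paper.
    scheduler = ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=1)

    for _ in range(num_epochs):
        epoch_log_likelihood = 0.0
        for batch in video_batches:
            optimizer.zero_grad()
            log_likelihood = model.batch_log_likelihood(batch)  # hypothetical method
            (-log_likelihood).backward()  # maximize log likelihood
            optimizer.step()
            epoch_log_likelihood += log_likelihood.item()
        scheduler.step(epoch_log_likelihood)
```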

B Features

For our features x1:T , we use the same base features as Zhukov et al. (2019). There are three feature types: activity recognition features, produced by an I3D model (Carreira and Zisserman, 2017) trained on the Kinetics-400 dataset (Kay et al., 2017); object classification features, from a ResNet-152 (He et al., 2016) trained on ImageNet (Russakovsky et al., 2015); and audio classification features10 from the VGG model (Simonyan and Zisserman, 2015) trained by Hershey et al. (2017) on a preliminary version of the YouTube-8M dataset (Abu-El-Haija et al., 2016).11

For the generative models, which use Gaussian emission distributions, we apply PCA to the base features above to reduce the feature dimensionality and decorrelate dimensions. We perform PCA separately for features within each task and within each feature group (I3D, ResNet, and audio features), but on features from all videos within that task. We use 100 components for each feature group, which explained roughly 70-100% of the variance in the features, depending on the task and feature group. The 100-dimensional PCA representations for the I3D, ResNet, and audio features for each frame at timestep t are then concatenated to give a 300-dimensional vector for the frame, xt.
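A minimal sketch of this preprocessing using scikit-learn follows; the input arrays and their loading are hypothetical, and in practice the PCA would be fit once per task as described above.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_features_for_task(i3d, resnet, audio, n_components=100):
    """Fit PCA per feature group on all frames from all videos of one task.

    i3d, resnet, audio: arrays of shape (total_frames_in_task, group_dim),
    formed by stacking the frames of every video in the task.
    Returns an array of shape (total_frames_in_task, 3 * n_components).
    """
    reduced_groups = []
    for group_features in (i3d, resnet, audio):
        pca = PCA(n_components=n_components)
        reduced_groups.append(pca.fit_transform(group_features))
    # Concatenate the 100-dimensional projections into a 300-dimensional frame vector.
    return np.concatenate(reduced_groups, axis=1)
```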

C Unsupervised Evaluation

The HSMM model, when trained in a fully unsupervised setting, induces class labels for regions in the video; however, while these class labels are distinct, they do not correspond a priori to any of the actual region labels (which can be step types, or background) for our task. Just as with other unsupervised tasks and models (e.g., part-of-speech induction), we need a mapping from these classes to step types (and background) in order to evaluate the model's predictions. We follow the evaluation procedure of past work (Sener and Yao, 2018; Sener et al., 2015) by finding the mapping from model states to region labels that maximizes label accuracy, averaged across all videos in the task, using the Hungarian method (Kuhn, 1955). This evaluation condition is only used in the "Unsupervised" section of Table 2 (in the rows marked with optimal accuracy assignment).
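A sketch of this mapping step using SciPy's Hungarian-method solver is shown below; the names are illustrative, and for simplicity it pools timesteps across a task's videos rather than averaging per-video accuracies as described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_states_to_labels(induced_states, reference_labels, num_states, num_labels):
    """Find the state-to-label assignment that maximizes timestep label accuracy."""
    # counts[s, l]: number of timesteps with induced state s whose true label is l.
    counts = np.zeros((num_states, num_labels), dtype=np.int64)
    for state, label in zip(induced_states, reference_labels):
        counts[state, label] += 1
    # The Hungarian method minimizes total cost, so negate counts to maximize overlap.
    state_indices, label_indices = linear_sum_assignment(-counts)
    return dict(zip(state_indices.tolist(), label_indices.tolist()))
```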

10 https://github.com/tensorflow/models/tree/master/research/audioset/vggish

11 We also experiment with using features derived from transcribed narration in Appendix G.


D Evaluation Metrics

Label accuracy The standard metric for action segmentation (Sener and Yao, 2018; Richard et al., 2018) is timestep label accuracy or, in datasets with a large amount of background, label accuracy on non-background timesteps. The CrossTask dataset has multiple reference step labels in the ground truth for around 1% of timesteps, due to noisy region annotations that overlap slightly. We obtain a single reference label for these timesteps by taking the step that appears first in the canonical step ordering for the task. We then compute accuracy of the model predictions against these reference labels across all timesteps and all videos for a task (in the all label accuracy condition), or by filtering to those timesteps which have a step label (non-background) in the reference, to focus on the model's ability to accurately predict step labels, in the step label accuracy condition.
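For concreteness, a small sketch of the two accuracy conditions, assuming integer label arrays in which multi-label timesteps have already been resolved as described and an illustrative background id:

```python
import numpy as np

BACKGROUND = 0  # illustrative integer id for the background label

def label_accuracies(predicted, reference):
    """Compute all label accuracy and step label accuracy over a task's timesteps."""
    predicted = np.asarray(predicted)
    reference = np.asarray(reference)
    all_label_accuracy = float(np.mean(predicted == reference))
    step_mask = reference != BACKGROUND  # keep only timesteps with a step label
    step_label_accuracy = float(np.mean(predicted[step_mask] == reference[step_mask]))
    return all_label_accuracy, step_label_accuracy
```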

Step recall This metric (Zhukov et al., 2019) measures a model's ability to pick out instants for each of the possible step types for a task, if they occur in a video. The model of Zhukov et al. (2019) predicted a single frame for each step type; while our extension of their model, ORDEREDDISCRIM, and our HSMM model can predict multiple, when computing this metric we obtain a single frame for each step type to make the numbers comparable to theirs. When a model predicts multiple frames per step type, we obtain a single one by taking the one closest to the middle of the temporal extent of the predicted frames for that step type. We then apply their recall metric: first, count the number of recovered steps, i.e., step types from the true labels for the video that were identified by one of the predicted labels (a predicted label of the same type occurs at one of the true label's frames). These recovered step counts are summed across videos in the evaluation set for a given task, and normalized by the maximum number of possible recovered steps (the number of step types in each video, summed across videos) to produce a step recall fraction for the task.
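A sketch of the resulting computation is given below; it assumes each video contributes a mapping from step type to the single predicted frame for that type and per-frame sets of true step types, and is illustrative rather than the reference implementation.

```python
def step_recall(videos):
    """videos: iterable of (predicted_frame_for_step, true_steps_per_frame) pairs.

    predicted_frame_for_step: dict mapping a step type to the single frame index
    predicted for it (absent if the model never predicted that type).
    true_steps_per_frame: list whose entry t is the set of true step types at
    frame t (empty for background frames).
    """
    recovered, possible = 0, 0
    for predicted_frame_for_step, true_steps_per_frame in videos:
        step_types_in_video = set().union(*true_steps_per_frame)
        possible += len(step_types_in_video)
        for step_type in step_types_in_video:
            frame = predicted_frame_for_step.get(step_type)
            if frame is not None and step_type in true_steps_per_frame[frame]:
                recovered += 1
    return recovered / possible if possible else 0.0
```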

Sequence similarity This measures the similarity of the predicted sequence of regions in a video against the true sequence of regions. As in speech recognition, we are interested in the high-level sequence of steps recognized in a video (and wish to abstract away from noise in the boundaries of the annotated regions). We first compute the negated Levenshtein distance between the true sequence of steps and background r1, ..., rK for a video and the predicted sequence r′1, ..., r′K′. The negated distance for a given video is scaled to be between 0 and 100, where 0 indicates the Levenshtein distance is the maximum possible between two sequences of their respective lengths, and 100 corresponds to the sequences being identical. These similarities are then averaged across all videos in a task.
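A sketch of this similarity follows; the Levenshtein distance is computed with a standard dynamic program, and the maximum possible distance between two sequences is the length of the longer one.

```python
def levenshtein(seq_a, seq_b):
    """Edit distance between two sequences of region labels."""
    previous_row = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        current_row = [i]
        for j, b in enumerate(seq_b, start=1):
            current_row.append(min(previous_row[j] + 1,              # deletion
                                   current_row[j - 1] + 1,           # insertion
                                   previous_row[j - 1] + (a != b)))  # substitution
        previous_row = current_row
    return previous_row[-1]

def sequence_similarity(true_regions, predicted_regions):
    """Scale the negated edit distance to [0, 100]; 100 means identical sequences."""
    max_distance = max(len(true_regions), len(predicted_regions))
    if max_distance == 0:
        return 100.0
    distance = levenshtein(true_regions, predicted_regions)
    return 100.0 * (1.0 - distance / max_distance)
```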

E Comparing Generative and Discriminative Models

We observe that the generative models tend to obtain higher performance on the action segmentation task, as measured by step label accuracy, than discriminative models of the same or similar class. We attribute this finding to two factors: first, the generative models explicitly parameterize probabilities for the steps, allowing better modeling of the full distribution of step labels. Second, the discriminative models are trained to optimize p(lt | xt) for all timesteps t. We would expect that this would produce better accuracies on metrics aligned with this objective (Klein and Manning, 2002), and indeed the all timestep accuracy is higher for the discriminative models. However, the discriminative models' high accuracy often comes at the expense of predicting background more frequently, leading to lower performance on step label accuracy.

F Duration Model Ablation

Model              All Label Acc.   Step Label Acc.   Step Recall   Seq. Sim.
Supervised
SMM, gen.          60.5             49.4              28.7          46.6
MM, gen.           60.1             48.6              28.2          46.8
SMM, disc.         66.0             37.3              24.1          50.5
MM, disc.          62.8             32.2              20.1          41.8
Weakly-Supervised
HSMM               31.8             28.8              10.6          31.0
HMM                28.8             30.8              10.3          29.9
HSMM+Ord+Narr      61.2             15.9              17.2          55.0
HMM+Ord+Narr       60.6             17.0              20.0          55.0

Table 4: Comparison between the semi-Markov and hidden semi-Markov models (SMM and HSMM) with the Markov and hidden Markov (MM and HMM) models, which ablate the semi-Markov's duration model.

We examine the effect of the (hidden) semi-Markov model's Poisson duration model by comparing to a (hidden) Markov model (HMM in the unsupervised/weakly-supervised settings, or MM in the supervised setting). We use the model as described in Sec. 4 except for fixing all durations to be a single timestep. We then train as described in Sec. 4.1. While this does away with explicit modeling of duration, the transition distribution still allows the model to learn expected durations for each region type by implicitly parameterizing a geometric distribution over region length. Results are shown in Table 4. We observe that results are overall very similar, with the exceptions that removing the duration model decreases performance substantially on all metrics in the discriminative supervised setting, and increases performance on step label accuracy and step recall in the constrained unsupervised setting (HSMM+Ord+Narr and HMM+Ord+Narr). This suggests that the HMM transition distribution is able to model region duration as well as the HSMM's explicit duration model, or that duration overall plays a small role in modeling in most settings relative to the importance of the features.
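To make the implicit duration model concrete: if a region type's self-transition probability is p, then the probability of remaining in that region for exactly d timesteps is geometric (a standard property of Markov chains, not a result specific to this work),

```latex
P(d) = p^{\,d-1}\,(1 - p), \qquad \mathbb{E}[d] = \frac{1}{1 - p},
```

so the transition parameters can still encode an expected duration per region type, but only with a monotonically decreasing distribution over lengths, whereas the semi-Markov model's Poisson duration distribution can place its mode at a longer duration.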

G Narration Features

The benefit of narration-derived hard constraints on labels (following past work by Zhukov et al. 2019) raises the question of how much narration would help when used to provide features for the models. We obtain narration features for each video using FastText word embeddings (Mikolov et al., 2018) for the video's time-aligned transcribed narration (see Zhukov et al. 2019 for details on this transcription), pooled within a sliding window to allow for imperfect alignment between activities mentioned in the narration and their occurrence in the video. The features for a given timestep t are produced by a weighted sum of embeddings for all the words in the transcribed narration within a 5-second window of t (i.e., from t−2 to t+2), weighted using a Hanning window12 (so that words in the center of each window are most heavily weighted for that window). We did not tune the window size, or experiment with other weighting functions. The word embeddings are pretrained on Common Crawl, and are not fine-tuned with the rest of the model parameters.

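A rough sketch of this pooling is shown below, assuming one feature timestep per second, precomputed FastText vectors with the time at which each word is spoken, and NumPy's hanning window (which down-weights the outermost words); all names here are hypothetical.

```python
import numpy as np

def pooled_narration_features(word_embeddings, word_times, num_timesteps, window_size=5):
    """Hanning-weighted sum of word embeddings in a 5-second window around each timestep.

    word_embeddings: array of shape (num_words, embedding_dim) of FastText vectors.
    word_times: array of length num_words giving each word's time in seconds.
    """
    weights = np.hanning(window_size)  # symmetric weights, largest at the window centre
    half = window_size // 2
    dim = word_embeddings.shape[1]
    features = np.zeros((num_timesteps, dim))
    for t in range(num_timesteps):
        for embedding, time in zip(word_embeddings, word_times):
            offset = int(round(time)) - t       # word position relative to timestep t
            if -half <= offset <= half:         # within the window from t-2 to t+2
                features[t] += weights[offset + half] * embedding
    return features
```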

Figure 4: t-SNE plots of feature vectors for 20 videos from the change a tire task, showing feature similarity. (a) Feature vectors colored by their step label in the reference annotations. (b) Feature vectors colored by the id of the video they occur in.


Once these narration features are produced, as above, we treat them in the same way as the other feature types (activity recognition, object classification, and audio) described in Appendix B: reducing their dimensionality with PCA, and concatenating them with the other feature groups to produce the features xt.

In Table 5, we show performance of key supervised and weakly-supervised models on the validation set, when using these narration features in addition to activity recognition, object classification, and audio features. Narration features improve performance over the corresponding systems from Table 2 (differences are shown in parentheses) in 13 out of 15 cases, typically by 1-4%.

12 https://docs.scipy.org/doc/numpy/reference/generated/numpy.hanning.html

Model              All Label Acc.   Step Label Acc.   Step Recall
Supervised
Gaussian mixture   70.4 (+1.0)      43.7 (+3.1)       34.9 (+3.4)
SMM, generative    63.3 (+2.8)      53.2 (+3.8)       32.1 (+3.4)
Weakly-Supervised
HSMM+Ord           53.6 (-1.9)       9.5 (+1.2)        8.5 (+1.2)
HSMM+Narr          68.9 (+3.2)       8.0 (-1.6)       12.6 (+4.1)
HSMM+Narr+Ord      64.3 (+3.1)      17.9 (+2.0)       21.9 (+4.7)

Table 5: Performance of key supervised and weakly-supervised models on the validation data when adding narration vectors as features. Numbers in parentheses give the change from adding narration vectors to the systems from Table 2.

H Feature Visualizations

To give a sense for feature similarities both within step types and within a video, we visualize feature vectors for 20 videos randomly chosen from the change a tire task, dimensionality-reduced using t-SNE (Maaten and Hinton, 2008) so that similar feature vectors are close in the visualization.

Figure 4a shows feature vectors colored by step type: we see little consistent clustering of feature vectors by step. On the other hand, we observe a great deal of similarity across step types within a video (see Figure 4b); when we color feature vectors by video, different steps from the same video are close to each other in space. Together, these observations suggest that better featurization of videos could improve action segmentation.
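A sketch of how such a plot can be produced with scikit-learn and matplotlib (array names are illustrative; colors encode either step labels or video ids, as in Figure 4):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_tsne(features, step_labels, video_ids):
    """features: (num_frames, dim) vectors pooled over 20 videos of one task;
    step_labels and video_ids are integer color keys, one per frame."""
    points = TSNE(n_components=2).fit_transform(features)
    fig, (ax_steps, ax_videos) = plt.subplots(1, 2, figsize=(12, 5))
    ax_steps.scatter(points[:, 0], points[:, 1], c=step_labels, cmap='tab20', s=4)
    ax_steps.set_title('colored by step label')
    ax_videos.scatter(points[:, 0], points[:, 1], c=video_ids, cmap='tab20', s=4)
    ax_videos.set_title('colored by video id')
    plt.show()
```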


I Segmentation Visualizations

In the following pages, we show example segmentations from the various systems. Figures 5 and 6 visualize predicted model segmentations for the unstructured Gaussian mixture and the structured semi-Markov model in the supervised setting, in comparison to the ground truth and the ordered uniform baseline. We see that while both models typically make similar predictions in the same temporal regions of the video, the structured model produces steps that are much less fragmented.

Figures 7 and 8 visualize segmentations in the unsupervised and weakly-supervised settings for the HSMM model and the ORDEREDDISCRIM model of Zhukov et al. (2019). The unsupervised HSMM has difficulty distinguishing steps from background (see Appendix H), while the model trained with weak supervision from ordering and narration (HSMM+Ord+Narr) is better able to induce meaningful steps. The ORDEREDDISCRIM model, although it has been modified to allow predicting multiple timesteps per step, collapses to predicting a single label, background, nearly everywhere; we conjecture this is because the model is discriminatively trained, jointly inferring labels that are easy to predict and the model parameters to predict them.


Figure 5: Supervised segmentations. We visualize segmentations from the validation set for a video from the task make kimchi fried rice. We show the ground truth (GT), the ordered uniform baseline (Uniform), and predictions from the unstructured Gaussian mixture model (GMM) and the structured semi-Markov model (SMM) trained in the supervised setting. Predictions from the unstructured model are more fragmented than predictions from the SMM. The x-axis gives the timestep in the video.


Figure 6: Supervised segmentations. We visualize segmentations from the validation set for a video from the task build simple floating shelves. We show the ground truth (GT), the ordered uniform baseline (Uniform), and predictions from the unstructured Gaussian mixture model (GMM) and the structured semi-Markov model (SMM) trained in the supervised setting. Predictions from the unstructured model are more fragmented than predictions from the SMM. The x-axis gives the timestep in the video.


Figure 7: Unsupervised and weakly-supervised segmentations. We visualize segmentations from the validation set for a video from the task make pancakes. We show the ground truth (GT), the ordered uniform baseline (Uniform), predictions from the hidden semi-Markov model trained without constraints (HSMM) and with constraints from narration and ordering (HSMM+Narr+Ord), and the system of Zhukov et al. (2019). The x-axis gives the timestep in the video.


Figure 8: Unsupervised and weakly-supervised segmentations. We visualize segmentations from the validation set for a video from the task grill steak. We show the ground truth (GT), the ordered uniform baseline (Uniform), predictions from the hidden semi-Markov model trained without constraints (HSMM) and with constraints from narration and ordering (HSMM+Narr+Ord), and the system of Zhukov et al. (2019). The x-axis gives the timestep in the video.