

Proceedings of the 14th International Conference on Natural Language Generation (INLG), pages 103–113, Aberdeen, Scotland, UK, 20-24 September 2021. ©2021 Association for Computational Linguistics


Generating Racing Game Commentary from Vision, Language, and Structured Data

Tatsuya Ishigaki† Goran Topić† Yumi Hamazono†◦ Hiroshi Noji†⋄ Ichiro Kobayashi†◦ Yusuke Miyao†‡ Hiroya Takamura†

†National Institute of Advanced Industrial Science and Technology, Japan, ◦Ochanomizu University, ‡The University of Tokyo, ⋄LeepMind Inc.

{ishigaki.tatsuya, goran.topic, hamazono-yumi, noji, takamura.hiroya}@aist.go.jp, [email protected], [email protected]

Abstract

We propose the task of automatically generating commentaries for races in a motor racing game, from vision, structured numerical, and textual data. Commentaries provide information to support spectators in understanding events in races. Commentary generation models need to interpret the race situation and generate the correct content at the right moment. We divide the task into two subtasks: utterance timing identification and utterance generation. Because existing datasets do not have such alignments of data in multiple modalities, this setting has not been explored in depth. In this study, we introduce a new large-scale dataset that contains aligned video data, structured numerical data, and transcribed commentaries that consist of 129,226 utterances in 1,389 races in a game. Our analysis reveals that the characteristics of commentaries change depending on time and viewpoints. Our experiments on the subtasks show that it is still challenging for a state-of-the-art vision encoder to capture useful information from videos to generate accurate commentaries. We make the dataset and baseline implementation publicly available for further research.1

1 Introduction

Live commentary plays an important role in sports matches and video games; it makes spectators more excited, more immersed, and better informed about the matches or games (e.g., Schaffrath (2003)), as in the example of racing game commentary "We are approaching the final long straight. I wonder who is going to win!". Live commentary also enhances the value of online videos and home videos. However, providing a live commentary requires a certain level of commenting skills and knowledge of the target sports or video games; the majority of online videos and home videos are left without live commentary.2 The application of natural language generation technology would be a solution to this problem. Thus, we select the racing game domain as an example, and propose a task of generating live commentary on it.

1 https://kirt.airc.aist.go.jp/RacingCommentary

Examples of utterances in a live commentary for a racing game are shown in Figure 1. Live commentaries should describe each important event in the race at the moment when the event occurs, within a short period of time. Thus, we have to make decisions on when to speak and how long/elaborately to speak, in addition to what to say and how to say it, which have long been studied. The importance of each event would have to be assessed in the context of a competition in which participants are striving to win. This suggests that what to say for race commentary generation should be different from, for example, image captioning. In this sense, the task of live commentary generation contains inherent challenges that are not addressed in well-studied generation problems such as summary generation for basketball (Puduppully and Lapata, 2021) and image/video captioning (Vinyals et al., 2015; Yao et al., 2015); however, some techniques developed for such existing generation tasks are also useful for live commentary generation.

Figure 1: Translated examples of commentaries (original utterances in Japanese are in brackets). For the commentaries at the top, the commentator is watching the race from the aerial viewpoint. For those at the bottom, the commentator is watching the race through a camera just behind the driver.

As input, vision data, such as the videos shown in Figure 1, are common in many tasks. However, it is not a trivial task to capture important information from vision data, because many of the frames in a race would be similar to each other, unlike images for image captioning data. Therefore, we propose the use of structured data, that is, telemetry data, including the positions and the speeds of cars, and the steering wheel angles. The assumption that such telemetry data are available is not unrealistic. It is the general trend that many sensors are used to obtain telemetry data even in real sports matches and motor races. For example, each race car in F1 races is monitored by 300 sensors.3 Additionally, players are tracked by GPS technology during football matches to obtain positional data (Memmert and Raabe, 2018). We work on video games because telemetry or vision data are easier to obtain than in real sports. This can address the huge demand in the gaming community and serve as a favorable test bed for live commentary generation. As a result, the task addressed in this paper is live commentary generation for racing games from vision, structured, and textual data.4

2 For example, many gameplay videos on Twitch do not have live commentary (https://www.twitch.tv).

Live commentary generation has not been studied in depth, partly because of the lack of datasets. Thus, we create a new dataset for live commentary generation, which includes 129,226 utterances of live commentary, aligned with gameplay video and telemetry data of a racing game. The telemetry data contain the positions and speeds of race cars and various types of information about circuits and cars. There are two types of live commentary. One is provided by the game players while playing and watching the racing game from the virtual camera behind the car. The other is provided by another person watching the game from a virtual helicopter. We analyze the differences in the characteristics of commentaries from different viewpoints.

3 https://aws.amazon.com/f1/
4 We include textual data as input because we use the previous utterances as additional input.

We split live commentary generation into two subtasks: utterance timing identification and utterance generation. We propose multimodal models for these subtasks and also provide an empirical evaluation. Our experiments suggest that the use of telemetry data works well for this task, whereas it is difficult for a state-of-the-art vision encoder to extract useful information from race videos, especially for utterance generation.

Our contributions are threefold: (1) we propose a novel task of automatically generating motor racing game commentaries, (2) we create a dataset and analyze its characteristics, and (3) we propose methods for this task and argue that combining multimodal data is challenging, which is worth exploring in depth. We make the dataset and baseline implementations publicly available to enhance further studies on this task.

2 Related Work

Existing studies on commentary generation can be divided into real-time commentary (Taniguchi et al., 2019; Kim and Choi, 2020) and commentaries written afterwards (Puduppully and Lapata, 2021). Our focus is on the former. Live commentary generation is formulated as the extraction of tweets (Kubo et al., 2013), the combination of rules and keyword extraction from videos (Kim and Choi, 2020), and neural network-based data-to-text (Taniguchi et al., 2019). To generate commentary in real time, we need to solve at least two tasks: timing identification and utterance generation. However, existing studies focus on the latter, where the timings are given, for example, minute-by-minute updates (Kubo et al., 2013). Unlike baseball, the timing identification task for race commentary is not trivial because a race cannot be segmented simply.

Our setting can be considered as a combination of two different research topics: video captioning (Kim and Choi, 2020) and data-to-text (Taniguchi et al., 2019). Various methods for encoding video frames have been actively studied (Dosovitskiy et al., 2021); commentaries often include comments that focus on the positional relation between cars, which requires a more fine-grained understanding of video frames. The performance of current vision encoders still needs to be evaluated. Data-to-text is the task of converting structured data into natural language, which has been applied to the domains of finance (Murakami et al., 2017; Aoki et al., 2018, 2021; Uehara et al., 2020), weather forecasts (Murakami et al., 2021), summaries of sports matches (Puduppully and Lapata, 2021; Iso et al., 2019), and live sports commentary (Taniguchi et al., 2019). The inputs used in existing studies are time-series numerical data (Murakami et al., 2017), tables (Puduppully and Lapata, 2021; Gardent et al., 2017), or simulated images (Murakami et al., 2021). These models focus on neural network-based approaches; however, data-to-text tasks have been studied for a long time (see the survey by Gatt and Krahmer (2018) for details).

Existing studies on generation mostly focus on generating text from a single viewpoint, i.e., they generate objective descriptions of video frames in video captioning, and a data-to-text model generates a text that focuses on the main content of the input data. A few existing studies state that live commentaries change depending on the viewpoints of commentators. For example, Kubo et al. (2013) found that generated commentaries on soccer matches are not objective, and are biased toward mentioning more popular teams. Viewpoints are key to characteristic commentaries, but most studies have ignored the differences caused by viewpoints, which our dataset addresses.

Datasets play important roles in studies on generation. Existing datasets for generation tasks contain data in a single modality, such as videos (Zhou et al., 2018; Krishna et al., 2017) or structured data (Puduppully and Lapata, 2021; Gardent et al., 2017). We propose a new large-scale dataset that contains transcribed commentaries aligned with videos and structured numerical data.

3 Dataset

We describe the procedure used to create our dataset. We then show its statistics and the analysis to characterize the task.

3.1 Procedure for creating our dataset

Collecting recordings and spoken commentaries: We hired five workers who regularly play e-sports games. Thus, some of the workers are familiar with playing racing games, but some are not. They are not professional commentators. As a racing game, we used Assetto Corsa.5 For each race consisting of two laps, one worker plays while simultaneously commentating on it from the viewpoint of the virtual camera just behind the car (driver's view). Another worker is assigned to commentate the race from the viewpoint of a virtual helicopter (aerial view), without playing the game. Note that the commentaries are in Japanese. Drivers used a physical steering controller to achieve a situation close to real sports competitions.

For both commentaries, we ask the commentators to mainly mention the car driven by the player; however, the commentators could also mention other cars if they found them worth mentioning. Circuit maps, in which each turn is numbered, are available to commentators so that they can refer to turns by number (e.g., Turn 15). Well-known turns or straights are given names such as Casanova for turn six in the Laguna Seca circuit.6

Collecting transcriptions of commentaries: After the collection of recordings, we hired 149 workers on a crowdsourcing service, Lancers7, to transcribe all the recordings. Workers are supposed to transcribe the recordings and add the start and end timestamps to each utterance. Utterances are basically sentences, with some exceptions; some utterances do not form complete sentences because they are truncated owing to speech repair. Finally, we manually checked whether the transcriptions aligned with the videos correctly.

5 Assetto Corsa is a game title developed and published by Kunos Simulazioni: http://www.kunos-simulazioni.com
6 The numbers and names are obtained from Wikipedia or other websites that describe circuits.
7 http://lancers.jp

Telemetry data types               Example values
current lap [0..]                  1
is current lap invalidated?        false
lap time of current lap (ms)       256
lap time of previous lap (ms)      156164
progress on current lap [0, 1]     0.002780
projected diff. from best lap      0.0
speed (km/h)                       177.693130
steer rotation (rad)               -59.793526
world position (x, y, z) (m)       (5.372770, 64.056038, -749.219971)
position on track (L=−1, R=1)      -0.515301
distance from ideal path (m)       0.854022

Table 1: List of collected structured telemetry data with example values. The last two types of data are not from the API, but are calculated by the authors.

Collecting structured telemetry data: We also collected structured telemetry data. Using Assetto Corsa's API, we extracted various structured numerical data, including the speeds of the cars participating in the race, the % of progress over the entire race, the angles of the steering wheel, and other types of numerical values; 13 types in total are listed in Table 1. We repeated the extraction of these values every 0.01 seconds on average.
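As a rough illustration of this kind of logging loop (not the authors' actual tooling), the sketch below polls a hypothetical read_telemetry() callable that would return the fields of Table 1, collecting timestamped records at roughly the 0.01-second interval mentioned above.

```python
import time

def log_telemetry(read_telemetry, duration_s=60.0, interval_s=0.01):
    """Poll a telemetry source at a fixed interval and collect timestamped
    records. `read_telemetry` is a hypothetical callable returning a dict
    with the fields of Table 1 (speed, steer rotation, world position, ...)."""
    records = []
    start = time.time()
    while time.time() - start < duration_s:
        sample = read_telemetry()              # e.g. {"speed_kmh": 177.69, ...}
        sample["timestamp"] = time.time() - start
        records.append(sample)
        time.sleep(interval_s)                 # ~0.01 s between samples
    return records
```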

3.2 Statistics and Analysis

In total, the five workers played 1,389 races. 1,084 out of the 1,389 races are given commentaries from both the driver's and aerial viewpoints. The remaining 305 races are given only commentaries from the driver's viewpoint. Thus, we collected a total of 2,473 video recordings aligned with commentaries and multimodal data.8 The total duration of the recordings is 226 hours, and the number of collected utterances is 129,226, which is more than the number of descriptions in the ActivityNet Captions dataset (Krishna et al., 2017), a large dataset for dense video captioning. Also, as a non-English dataset, it might provide some valuable linguistic diversity, as most datasets are in English. On average, the commentators produced an utterance with a length of 2.73 seconds and then kept silent for 3.46 seconds. The other statistics are listed in Table 2.

8 1,084+305 videos from the driver's viewpoint, and 1,084 videos from the aerial viewpoint.

# of unique circuits                    4
# of commentators                       5
total # of races                        1,389
total # of recordings                   2,473
  - driver's viewpoint                  1,389
  - aerial viewpoint                    1,084
total recording duration                226:37:53
  - driver's viewpoint                  126:11:19
  - aerial viewpoint                    100:26:34
total # of utterances                   129,226
avg. # of utterances per race           52.25
avg. # of characters per utterance      22.22
avg. length of an utterance             2.73 s
avg. length of silence                  3.46 s

Table 2: Statistics of the dataset.

We manually designed labels for the utterances to capture their characteristics. Each label is defined as a pair of two sub-labels, the target label and the content label, as presented in Table 3.

The target label represents the target subject of the utterance, such as the player's car, other cars, all cars, or the circuit. For example, the utterance "All the cars now start" is labeled as all cars, because it focuses on all the cars participating in the race, whereas "The player is now approaching Turn 15" is labeled as the player's car, because it mentions only the target car. The content label represents the content of the utterance, such as the relative position, movement, lap time, and other content types as presented in Table 3. As an example, "The player is now approaching Turn 15" is labeled as the player's car:movement, because it mentions the movement of the player's car.

We randomly extracted 874 utterances from 20 videos, and then manually annotated them using the listed labels. It should be noted that this manual labelling task is performed under the multi-label setting, which allows us to assign one or more labels to an utterance. We analyze the distributions of the labels to capture the characteristics of the dataset. In this analysis, we are particularly interested in (1) how the label distribution changes over time, and (2) how the label distribution differs depending on the commentator's viewpoint.
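The per-quarter distributions analyzed below can be tallied with a few lines of bookkeeping; a minimal sketch is shown here, with illustrative record fields rather than the authors' actual annotation format.

```python
from collections import Counter

def label_distribution_by_quarter(utterances):
    """utterances: list of dicts with 'start' (s), 'race_duration' (s), and
    'labels' (list of 'target:content' strings); multiple labels per
    utterance are allowed. Field names are illustrative."""
    counts = [Counter() for _ in range(4)]
    for u in utterances:
        quarter = min(3, int(4 * u["start"] / u["race_duration"]))
        counts[quarter].update(u["labels"])
    distributions = []
    for q in counts:
        total = sum(q.values())
        distributions.append({lab: c / total for lab, c in q.items()} if total else {})
    return distributions
```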

How do utterances change over time?

We split a race into quarters according to the timeline (e.g., the first quarter corresponds to the interval from the beginning of a race to the 25% point).


Target labels        Example utterances
player's car         This was a very elegant overtake by the player.
other cars           The car behind just overtook the player.
all cars             All the cars has now started.
circuit              Laguna Seca is well known for its very long straight.
none                 Ah!

Content labels       Example utterances
relative position    The player is now at the second making the distance close to the first.
location on map      The blue car is approaching Turn 2 and others follow.
lap time             The player now crossed the finish line at the time 3.15
previous event       Maybe this mistake might cause big impact on the time
future event         Can the player successfully pass the difficult Turn 15?
movement             The player overtook the red car on this long straight.
stable race          All cars are stable without any problems.
features             All the cars are the same, Porsche Macan.
greetings            Ok, now I start my commentary on this race.
reaction             Oh!
others               —

Table 3: List of sub-labels and example utterances. A label assigned to an utterance is defined as a pair of Target and Content sub-labels.

Figure 2 shows the label distributions for different quarters. In the figure, the proportions of the labels in the first quarter are represented by the top bar of each group of four bars, which is colored blue. Similarly, the second (orange), third (black), and fourth (yellow) bars from the top represent the proportions in the second, third, and fourth quarters, respectively.

For the first quarter, indicated by the top bar for each label (colored blue in the figure), the labels with features (i.e., circuit:features, player's car:features, and other cars:features) are frequent compared with the other quarters. This suggests that commentators often start the commentary by mentioning the features of the circuit or the cars.

For the final quarter, indicated by the bottom bars (colored yellow), none:greetings and player's car:lap time are frequent, suggesting that the commentators mention the elapsed time after the cars cross the finish line and finally conclude the session with greetings.

Next, we focus on the differences between the two middle quarters, indicated by the second and third bars, which are colored orange and black.

Figure 2: Distribution over utterance labels in different periods of timestamps.

The proportion of player's car:movement in the second quarter (the second bars, colored orange) is higher than in the third quarter (the third bars, colored black). Thus, the commentators tend to mention more facts in the second quarter. In contrast, the third quarter contains more future event and previous event labels, which often include commentators' comments, concerns, or opinions on previous and future events. This suggests that there are more mentions of objective facts in the early stages of races, whereas subjective utterances increase toward the end of the races.

How do utterances differ depending on the viewpoint?

We examined the differences between commentaries from two different viewpoints: the driver's and the aerial. Figure 3 shows the label distribution, where the upper bars colored orange correspond to the driver's viewpoint, and the lower bars colored blue correspond to the aerial viewpoint.

The proportion of player's car:location for the aerial viewpoint is almost double that for the driver's viewpoint. This is because commentators with the aerial viewpoint can capture locations on the map more easily, whereas commentators with the driver's viewpoint cannot see the entire circuit. Additionally, the proportion of player's car:stable race is very high for the aerial viewpoint. Because the aerial viewpoint is farther from the cars than the driver's viewpoint, the commentaries from the aerial viewpoint hardly mention slight movements of the cars; they more often say that the race is stable.

Figure 3: Distribution over utterance labels annotated from different viewpoints: driver's (the upper bars colored orange) or aerial (the lower bars colored blue).

In contrast, the proportion of player's car:previous event for the driver's viewpoint is higher than that for the aerial viewpoint. If the commentaries are spoken by the players themselves, they often comment on the events that just happened, e.g., "OK, yes, the car turned successfully!".

The analysis above shows that the viewpoint influences the characteristics of the utterances.

4 Tasks and Models

We formulate a live commentary generation task and introduce the baseline models, as shown in Figure 4. We report the performance of the baseline models on both subtasks to better understand the commentary generation task.

4.1 Task Formulation

To generate a live commentary, one needs to find multiple timepoints and generate an utterance at each timepoint. We solve this task in a sequential fashion; given the previous timepoint and its utterance, we find the next timepoint and generate its utterance, as described below.

The task of timing identification is to determine the timestamp t at which an utterance should be generated. We formulate this problem as a binary classification for each second. Given the timepoint of the previously generated utterance, we iteratively classify each successive second according to whether it is the next timepoint for generation or not. If the second is classified as positive, the model goes on to the generation step. If the second is classified as negative, the model goes on to the classification of the next second. If the model does not output positive for m seconds, the next second is forced to be positive. We set m = 7, which is double the average interval between two consecutive utterances.
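A minimal sketch of this sequential decision loop is given below, assuming a classify(t) predicate that wraps the binary classifier described in Section 4.3 (the interface is an assumption, not the authors' released code).

```python
def next_utterance_time(prev_time, classify, m=7):
    """Sequentially decide when the next utterance should start.

    `classify(t)` is an assumed wrapper around the binary classifier of
    Section 4.3: it returns True if second t is predicted as an utterance
    timepoint given the current (V, D, T). If m consecutive seconds are
    classified as negative, the following second is forced to be positive
    (m = 7 is roughly double the average interval between utterances).
    """
    for t in range(prev_time + 1, prev_time + m + 1):
        if classify(t):
            return t
    return prev_time + m + 1  # forced positive after m negative decisions
```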

For the classification of each second, we encode a given tuple (V, D, T). V denotes a sequence of the previous k video frames V = (img_1, ..., img_k), captured every second. We set k = 10 in our experiments. We used the torchvision9 library to extract these images from videos. D denotes the structured data D = {D_1, ..., D_N}, consisting of N sets of structured telemetry data, where each D_n = {val_{1,n}, ..., val_{k,n}} consists of k values tracked at each of the previous k seconds. T represents the textual information, which is the previous utterance in our setting.

The task of utterance generation is to generate a sequence of characters as an utterance, given the tuple (V, D, T) at the given/estimated timepoint. In other words, we use the same information for both the second classification above and the utterance generation. We use a multimodal encoder-decoder architecture to generate an utterance.

4.2 Multi-modal Encoder

The models for both subtasks use the same network for encoding the input vision, structured telemetry, and textual data. The encoded representation is then used in the network for each subtask. For video frames V, each video frame is converted to an image embedding by using Vision Transformer (Dosovitskiy et al., 2021).10

9 https://pytorch.org/vision
10 We used an open implementation at https://github.com/lucidrains/vit-pytorch.


Figure 4: Baseline models for the timing identification and commentary generation tasks. Each sequence of numerical data, i.e., speed, rotation, position, and so on, is considered as a vector, and we obtain a compressed vector. Vision information is encoded by using Vision Transformer and an LSTM-based encoder. Textual information is encoded by using another LSTM-based encoder. The concatenated vector of the encoded numerical, vision, and textual information is passed to the models for the sub-tasks.

The image embeddings are then sequentially encoded by using a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997): h_{i,V} = LSTM_V(h_{i-1,V}, ViT(img_i)), where ViT returns the output vector for the [CLS] token calculated by Vision Transformer. We treat the final state h_{k,V} of the LSTM as the representation of V. For each sequence D_n of the structured data D tracked for the previous k seconds, we consider D_n as a vector for the n-th type of data in the structured telemetry data, and we transform it into another representation by using a linear transformation: d_n = ReLU(D_n W_d + b), where W_d is a weight matrix, b is a vector, and ReLU activates the vector. The concatenated vector of d_1 to d_N is the representation of D. For the textual input, we simply embed the characters and then sequentially encode the embeddings by using another LSTM: h_{i,T} = LSTM_T(h_{i-1,T}, emb_e(x_i)), where emb_e returns the character embedding. We treat the final state of the LSTM as the representation of the textual input T. Finally, the concatenated vector of the encoded representations of V, D, and T is passed to the networks for the sub-tasks explained next.
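A compact PyTorch sketch of this encoder is given below, with dimensions borrowed from Section 5.1 (100-dimensional LSTM states, one 10-dimensional vector per telemetry type, k = 10); the ViT module, vocabulary size, and exact interfaces are assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Sketch of the encoder in Section 4.2. Dimensions follow Section 5.1;
    `vit` stands in for a Vision Transformer returning the [CLS] vector."""

    def __init__(self, vit, vit_dim=100, n_struct=3, k=10,
                 vocab_size=3000, hidden=100):
        super().__init__()
        self.vit = vit
        self.vision_lstm = nn.LSTM(vit_dim, hidden, batch_first=True)
        # one linear map per telemetry type: d_n = ReLU(D_n W_d + b)
        self.struct_proj = nn.ModuleList(nn.Linear(k, 10) for _ in range(n_struct))
        self.char_emb = nn.Embedding(vocab_size, hidden)
        self.text_lstm = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, frames, struct, prev_chars):
        # frames: (batch, k, C, H, W); struct: list of n_struct tensors of
        # shape (batch, k); prev_chars: (batch, len) ids of the previous utterance
        b, k = frames.shape[:2]
        img_emb = self.vit(frames.flatten(0, 1)).view(b, k, -1)   # ViT [CLS] vectors
        _, (h_v, _) = self.vision_lstm(img_emb)                   # final state h_{k,V}
        d = torch.cat([torch.relu(proj(s))
                       for proj, s in zip(self.struct_proj, struct)], dim=-1)
        _, (h_t, _) = self.text_lstm(self.char_emb(prev_chars))   # final state for T
        return torch.cat([h_v[-1], d, h_t[-1]], dim=-1)           # [V; D; T], 230-dim
```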

4.3 Timing Identification Model

For the timing identification, the encoded representation is passed to a network that consists of a linear transformation and the softmax function: Softmax(encode([V; D; T]) W_t), where encode() returns the output of the encoder and W_t converts the concatenated vector into a two-dimensional vector that represents the scores for the decisions to utter or not to utter at this timepoint. We obtain the probability distribution over the decisions by using the softmax function.

For training, we use the gold start timestamps from commentators as positive instances. We use the midpoint of the silence between consecutive utterances as a negative instance. We train this classifier by using the cross-entropy loss. For testing, we classify every second after the time at which the previous utterance was given. We output the timestamp first classified as positive by our model.
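The instance construction and the classification head might look as follows; this is a sketch under the assumptions that utterances carry start/end timestamps in seconds and that the encoder output is the 230-dimensional vector of Section 5.1.

```python
import torch.nn as nn

def timing_instances(utterances):
    """Build (timestamp, label) pairs from one race: gold start times are
    positives; the midpoint of each silence between consecutive utterances
    is a negative. Utterances are assumed to carry 'start'/'end' in seconds."""
    instances = [(u["start"], 1) for u in utterances]
    for prev, nxt in zip(utterances, utterances[1:]):
        instances.append(((prev["end"] + nxt["start"]) / 2.0, 0))
    return instances

class TimingHead(nn.Module):
    """Linear layer over the concatenated [V; D; T] encoding, producing two
    logits ('utter' vs. 'do not utter'); trained with cross-entropy, which
    applies the softmax internally."""
    def __init__(self, enc_dim=230):
        super().__init__()
        self.linear = nn.Linear(enc_dim, 2)

    def forward(self, enc):          # enc: (batch, enc_dim)
        return self.linear(enc)

# training step sketch: loss = nn.CrossEntropyLoss()(head(enc), labels)
```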

4.4 Utterance Generation Model

This second subtask generates an utterance as a sequence of characters, given the encoded representations of V, D, and T. We use an encoder-decoder architecture with an attention mechanism, which consists of an LSTM-based decoder initialized by the representation passed from the encoder:

h_{j,d} = LSTM_d(h_{j-1,d}, emb_d(y_{j-1})),    (1)

a_{ji} = exp(h_{j,d} W h_{i,V}) / \sum_{i'=1}^{10} exp(h_{j,d} W h_{i',V}),    (2)

c_j = \sum_i a_{ji} h_{i,V},    (3)

o_j = Softmax([h_{j,d}; c_j] W_d),    (4)


where emb_d returns the embedding of a character,11 c_j is a vector produced by an attention mechanism over the outputs of LSTM_V, and y_{j-1} is the previously generated character. W_d is a matrix that converts the concatenation [h_{j,d}; c_j] into a vector of scores over the predefined vocabulary for the target utterances, and Softmax converts it into a probability distribution. This generator is trained by using the cross-entropy loss.

11 Note that emb_d is different from emb_e.
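Equations (1)-(4) correspond to a standard LSTM decoder with bilinear attention over the vision LSTM states; a PyTorch sketch is shown below (dimensions are assumed from Section 5.1, and this is an illustration rather than the authors' implementation).

```python
import torch
import torch.nn as nn

class AttnDecoder(nn.Module):
    """Sketch of the character-level decoder in Eqs. (1)-(4). The hidden size
    (230) and vision-state size (100) follow Section 5.1; the bilinear
    attention matrix W is realised as a bias-free linear layer."""

    def __init__(self, vocab_size, emb_dim=100, hidden=230, vision_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)               # emb_d
        self.cell = nn.LSTMCell(emb_dim, hidden)                   # Eq. (1)
        self.attn_W = nn.Linear(vision_dim, hidden, bias=False)    # W in Eq. (2)
        self.out_W = nn.Linear(hidden + vision_dim, vocab_size)    # W_d in Eq. (4)

    def step(self, y_prev, state, vision_states):
        # y_prev: (batch,) previous character ids; state: (h, c), initialised
        # from the encoder; vision_states: (batch, k, vision_dim) = h_{i,V}
        h, c = self.cell(self.emb(y_prev), state)                       # Eq. (1)
        scores = torch.bmm(self.attn_W(vision_states), h.unsqueeze(2))  # h_{j,d} W h_{i,V}
        a = torch.softmax(scores.squeeze(2), dim=1)                     # Eq. (2)
        ctx = torch.bmm(a.unsqueeze(1), vision_states).squeeze(1)       # Eq. (3)
        logits = self.out_W(torch.cat([h, ctx], dim=-1))                # Eq. (4), pre-softmax
        return logits, (h, c)
```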

5 Experiments

We conduct experiments on the two subtasks to further investigate the characteristics of the task.

5.1 Data and Parameters

We use 100 tuples of videos, commentaries, and structured data for validation, another 100 tuples for testing, and the remaining tuples for training. For Vision Transformer, we set the number of heads to six and the layer size to two, and each head is represented as a 100-dimensional vector. The patch size is 30×30. The dropout rate is set to 0.1. Each type of telemetry data is represented as a 10-dimensional vector. We use three types of data, i.e., speed, progress in a lap, steer rotation, and position on track. The dimensions of both the hidden states and the input vectors of the LSTMs in the encoders are set to 100. Thus, the dimension of the hidden state of the LSTM on the decoder side is 230, which is the sum of the sizes of the encoded images, textual information, and structured data. The size of the character embeddings in the decoder is set to 100. We use separate vocabularies for the textual input and the target text. We use Adam (Kingma and Ba, 2015) with several initial learning rates ranging from 10^-3 to 10^-5 for optimizing the parameters. We continue the training iterations until the loss on the validation dataset does not decrease for 10 epochs. We conduct the utterance generation experiments with the gold timestamps.
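The optimization setup can be summarized by the following sketch; the model.loss(batch) and dev_loss_fn hooks are placeholders for illustration, not part of the released baseline.

```python
import torch

def train(model, train_loader, dev_loss_fn, lr=1e-4, patience=10, max_epochs=1000):
    """Adam with a fixed initial learning rate (1e-3, 1e-4, or 1e-5 were tried)
    and early stopping once the validation loss has not decreased for
    `patience` epochs. `model.loss(batch)` and `dev_loss_fn(model)` are
    assumed hooks."""
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for batch in train_loader:
            optim.zero_grad()
            model.loss(batch).backward()   # cross-entropy loss of the subtask model
            optim.step()
        dev_loss = dev_loss_fn(model)
        if dev_loss < best:
            best, stale = dev_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model
```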

5.2 Timing Identification

We evaluate the models by using the average gap, in seconds, between the gold timestamp and the predicted timestamp. We propose a simple baseline that outputs the timestamp 3.46 seconds after the end timestamp of the previous utterance; 3.46 seconds is the average interval between two consecutive utterances, as shown in Table 2. As a result, the average gap between the gold and predicted timestamps obtained from the baseline model was 3.66. When we use only structured data as input, we obtain an average gap of 3.27 seconds. Adding textual information achieves a slightly better value of 3.26, but the difference is negligible. Adding vision information improves the performance to 3.12.

Model                          Avg. gap
baseline: average interval     3.66
struct                         3.27
struct+text                    3.26
struct+text+vision             3.12

Table 4: The average gap in seconds between the gold and predicted timestamps. Lower values are better.

Model                    10^-3    10^-4    10^-5
struct                   18.22    22.78    23.39
struct + text            18.03    23.78    23.86
struct + text + vision   17.49    22.58    24.01
only vision               0.30     2.74     7.46

Table 5: BLEU scores on the test dataset for the compared models trained with different learning rates. The model with the learning rate 10^-5 achieves the best performance on the validation dataset.
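For concreteness, the average-gap metric reported in Table 4 and the average-interval baseline can be computed as follows (a sketch; the list-based interface is an assumption).

```python
def average_gap(gold_times, pred_times):
    """Mean absolute difference, in seconds, between gold and predicted
    utterance start times (the metric reported in Table 4)."""
    return sum(abs(g - p) for g, p in zip(gold_times, pred_times)) / len(gold_times)

def interval_baseline(prev_end_times, avg_silence=3.46):
    """Baseline: predict each next utterance 3.46 s (the average length of
    silence from Table 2) after the end of the previous utterance."""
    return [t + avg_silence for t in prev_end_times]
```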

5.3 Utterance Generation

We use BLEU (Papineni et al., 2002) to evaluate the baseline models for this task. The scores are shown in Table 5. The model based only on telemetry data works well. Adding textual information improves the BLEU score if the learning rate is set to a lower value, i.e., 10^-4 or 10^-5. However, we obtain a very low BLEU score when we use only vision-based input. Adding vision information to the struct+text model degrades the score if the learning rate is set to 10^-3 or 10^-4. Even with a smaller learning rate, 10^-5, vision information does not significantly improve the performance.
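BLEU can be computed, for example, with NLTK's corpus_bleu; the paper does not state which implementation or tokenization was used, so the character-level tokenization below is an assumption.

```python
from nltk.translate.bleu_score import corpus_bleu

def character_bleu(references, hypotheses):
    """references, hypotheses: lists of utterance strings, one reference per
    hypothesis. Character-level tokenisation is an assumption; the paper does
    not state how Japanese utterances were tokenised for BLEU."""
    refs = [[list(r)] for r in references]   # one reference list per hypothesis
    hyps = [list(h) for h in hypotheses]
    return corpus_bleu(refs, hyps) * 100     # scale to 0-100 as in Table 5
```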

6 Discussion

We list the gold utterances and the utterances generated by the model with learning rate 10^-4 in Table 6. Gold utterances often focus on relative positions, as in Example 1, which requires capturing the physical relations between cars. However, as shown in the first example of an utterance generated by data+text, we found only a few generated utterances that mention the relative positions of the player's car and other cars. Integrating vision information further reduces such utterances mentioning relative positions and other detailed information, and also makes utterances less specific. To generate utterances with detailed information, a model must accurately capture the information displayed in a small area of the image. However, it may be too hard for the model to, for example, capture the distance between the car driven by the player and the car just behind, or the drastic changes of speed, from the video frames shown in Figure 1, whereas the telemetry data provide this useful information. From the perspective of studies on vision, methods to properly capture such features are worth exploring.

Example 1 (timestamp 00:55)
Gold:              The player is now following very close to the car ahead.
data+text:         Now we're on Turn 10, the player is now accelerating
data+text+vision:  I want to step on the brakes firmly here.

Example 2 (timestamps 02:04 and 02:07)
Gold:              We are now approaching the chicane on Turn 11 and 12.
                   The player should properly use the curb and go on a straight line here, and the player showed a stable race here.
data+text:         The player should brake properly here.
                   The player should brake properly here.

Table 6: The gold and automatically generated commentaries. Texts are translated from Japanese.

We also observed that the generated commentaries contain many repetitions of the same utterance, especially those generated by the model with vision information. The utterances in Example 2 in Table 6 exemplify such repetitions. It should be noted that the two utterances are only three seconds apart. The input to the model does not change significantly during such a short period of time, resulting in two identical utterances. Some mechanisms to increase the diversity of utterances might alleviate this problem, which is a particular challenge in commentary generation.

We also found errors involving country names; e.g., the Nürburgring in Germany was generated as the Nürburgring in Italy. Such errors are also known as a common problem in other generation tasks.

7 Future Research Directions

Finally, we discuss future directions. We noticed that evaluation is very difficult for this task. BLEU scores alone cannot capture correctness, because this evaluation ignores the relation between a commentary and the race represented in the input. However, manually checking videos, language, structured data, and generated utterances would incur a huge labor cost. An exploration into correct and efficient automatic and manual evaluation methods that consider all of vision, language, and structured data should be conducted in the future. For evaluation by using BLEU, it might be helpful if we had multiple reference utterances for one timestamp. However, it is difficult to collect multiple utterances simultaneously in this task because different commentators give utterances at different timings. We leave these as important future research directions.

Extending the models would be one of the main steps toward producing better commentaries. However, more importantly, we need to explore an essential research question: "what is a good commentary?". Further analysis of the characteristics that contribute to making commentaries better needs to be conducted.

8 Conclusion

In this paper, we proposed the task of generating commentaries for motor racing games. Our analysis reveals that the characteristics of utterances change over time in a race, and that such changes are also caused by differences in viewpoints. Our experiments also show that combining vision, language, and structured data is challenging, which is worth studying in depth. Exploring better methods to combine vision, language, and structured data will be a promising direction for future work. We release the data to enhance further studies on generation tasks from multimodal inputs.

Acknowledgements

This paper is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). For the experiments, the computational resources of the AI Bridging Cloud Infrastructure (ABCI) provided by the National Institute of Advanced Industrial Science and Technology (AIST) were used. We thank KUNOS Simulazioni for granting us permission to distribute our dataset of Assetto Corsa.


References

Kasumi Aoki, Akira Miyazawa, Tatsuya Ishigaki, Tatsuya Aoki, Hiroshi Noji, Keiichi Goshima, Hiroya Takamura, Yusuke Miyao, and Ichiro Kobayashi. 2021. Controlling contents in data-to-document generation with human-designed topic labels. Computer Speech & Language, 66:101154.
Tatsuya Aoki, Akira Miyazawa, Tatsuya Ishigaki, Keiichi Goshima, Kasumi Aoki, Ichiro Kobayashi, Hiroya Takamura, and Yusuke Miyao. 2018. Generating market comments referring to external resources. In Proceedings of the 11th International Conference on Natural Language Generation, pages 135–139.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, pages 1–21.
Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188.
Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61(1):65–170.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, and Hiroya Takamura. 2019. Learning to select, track, and generate for data-to-text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2102–2113.
Byeong Jo Kim and Y. Choi. 2020. Automatic baseball commentary generation using deep learning. In Proceedings of the 35th Annual ACM Symposium on Applied Computing, pages 1056–1065.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), pages 706–715.
Mitsumasa Kubo, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2013. Generating live sports updates from Twitter by finding good reporters. In 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 1, pages 527–534.
Daniel Memmert and Dominik Raabe. 2018. Data Analytics in Football: Positional Data Collection, Modelling and Analysis. Routledge.
Soichiro Murakami, Sora Tanaka, Masatsugu Hangyo, Hidetaka Kamigaito, Kotaro Funakoshi, Hiroya Takamura, and Manabu Okumura. 2021. Generating weather comments from meteorological simulations. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL): Main Volume, pages 1462–1473.
Soichiro Murakami, Akihiko Watanabe, Akira Miyazawa, Keiichi Goshima, Toshihiko Yanase, Hiroya Takamura, and Yusuke Miyao. 2017. Learning to generate market comments from stock prices. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1374–1384.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Ratish Puduppully and Mirella Lapata. 2021. Data-to-text generation with macro planning. Transactions of the Association for Computational Linguistics, 9:510–527.
Michael Schaffrath. 2003. Mehr als 1:0! Bedeutung des Live-Kommentars bei Fußballübertragungen – eine explorative Fallstudie [More than 1:0! The importance of live commentary on football matches – an exploratory case study]. Medien und Kommunikationswissenschaft, 51.
Yasufumi Taniguchi, Yukun Feng, Hiroya Takamura, and Manabu Okumura. 2019. Generating live soccer-match commentary from play data. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1):7096–7103.
Yui Uehara, Tatsuya Ishigaki, Kasumi Aoki, Hiroshi Noji, Keiichi Goshima, Ichiro Kobayashi, Hiroya Takamura, and Yusuke Miyao. 2020. Learning with contrastive examples for data-to-text generation. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4507–4515.
Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590–7598.