

Proceedings of NAACL-HLT 2018, pages 163–172, New Orleans, Louisiana, June 1–6, 2018. ©2018 Association for Computational Linguistics

A Melody-conditioned Lyrics Language Model

Kento Watanabe¹, Yuichiroh Matsubayashi¹, Satoru Fukayama², Masataka Goto², Kentaro Inui¹,³, Tomoyasu Nakano²

¹Graduate School of Information Sciences, Tohoku University, ²National Institute of Advanced Industrial Science and Technology (AIST),

³RIKEN Center for Advanced Intelligence Project
{kento.w, y-matsu, inui}@ecei.tohoku.ac.jp, {s.fukayama, m.goto, t.nakano}@aist.go.jp

Abstract

This paper presents a novel, data-driven language model that produces entire lyrics for a given input melody. Previously proposed models for lyrics generation suffer from an inability to capture the relationship between lyrics and melody, partly due to the unavailability of lyrics-melody aligned data. In this study, we first propose a new practical method for creating a large collection of lyrics-melody aligned data and then create a collection of 1,000 lyrics-melody pairs augmented with precise syllable-note alignments and word/sentence/paragraph boundaries. We then provide a quantitative analysis of the correlation between word/sentence/paragraph boundaries in lyrics and melodies. Finally, we propose an RNN-based lyrics language model conditioned on a featurized melody. Experimental results show that the proposed model generates fluent lyrics while maintaining the compatibility between the boundaries of the lyrics and the melody structure.

1 Introduction

Writing lyrics for a given melody is a challenging task. Unlike prose text, writing lyrics requires both knowledge and consideration of music-specific properties such as the structure of the melody, rhythms, etc. (Austin et al., 2010; Ueda, 2010). A simple example is the correlation between word boundaries in lyrics and the rests in a melody. As shown in Figure 1, a single word spanning beyond a long melody rest can sound unnatural. When writing lyrics, a lyricist must consider such constraints in content and lexical selection, which can impose extra cognitive load.

This consideration when writing lyrics has motivated a wide range of studies on the task of computer-assisted lyrics writing (Barbieri et al., 2012; Abe and Ito, 2012; Potash et al., 2015; Watanabe et al., 2017).

[Figure 1: Examples of awkward and natural lyrics (score omitted). The awkward example, 知らない明日へ行く ("Proceed to an unknown tomorrow"), splits the word a-shi-ta across a melody rest; the natural example, 一人で歩いた この道 ("I walked alone... this road"), places rests at word boundaries. FUNC indicates a function word. The song is from the RWC Music Database (RWC-MDB-P-2001 No.20) (Goto et al., 2002).]

Such studies aim to model the language in lyrics and to design computer systems for assisting lyricists in writing. They propose to constrain their models to generate only lyrics that satisfy given conditions on syllable counts, rhyme positions, etc. However, such constraints are assumed to be manually provided by a human user, which requires the user to interpret a source melody and transform that interpretation into a set of constraints. To assist users with transforming a melody into constraints, a language model that automatically captures the relationship between lyrics and melody is required.

Some studies (Oliveira et al., 2007; Oliveira, 2015; Nichols et al., 2009) have quantitatively analyzed the correlations between melody and phonological aspects of lyrics (e.g., the relationship between a beat and a syllable stress). However, these studies do not address the relationship between melody and the discourse structure of lyrics. Lyrics are not just a sequence of syllables but a meaningful sequence of words. Therefore, it is desirable that sentence/paragraph boundaries be determined based on both melody rests and context words.

Considering such line/paragraph structure of lyrics, we present a novel language model that generates lyrics whose word, sentence, and paragraph boundaries are appropriate for a given melody, without manually transforming the melody into syllable constraints. This direction of research has received less attention because it requires a large dataset of aligned pairs of melodies and lyrics segment boundaries, which had yet to exist.

To address this issue, we leverage a publicly available collection of digital music scores and create a dataset of digital music scores, each of which specifies a melody score augmented with syllable information for each melody note. We collected 1,000 Japanese songs from an online forum where many amateur music composers upload their music scores. We then automatically aligned each music score with the raw text data of the corresponding lyrics in order to augment it with word, sentence, and paragraph boundaries.

The availability of such aligned, parallel data opens a new area of research where one can conduct a broad range of data-oriented studies investigating and modeling correlations between melodies and the discourse structure of lyrics. In this paper, with our melody-lyrics aligned songs, we investigate the phenomena that (i) words, sentences, and paragraphs rarely span beyond a long melody rest and (ii) the boundaries of larger components (i.e., paragraphs) tend to coincide more with longer rests. To the best of our knowledge, there is no previous work that provides a quantitative analysis of these phenomena with this size of data (see Section 7).

Following this analysis, we build a novel, data-driven language model that generates fluent lyrics whose sentence and paragraph boundaries fit an input melody. We extend a Recurrent Neural Network Language Model (RNNLM) (Mikolov et al., 2010) so that its output can be conditioned on a featurized melody. Both our quantitative and qualitative evaluations show that our model captures the consistency between the melody and the boundaries of the lyrics while maintaining word fluency.

2 Melody-lyrics alignment data

Our goal is to create a melody-conditioned language model that captures the correlations between melody patterns and discourse segments of lyrics. The data we need for this purpose is a collection of melody-lyrics pairs where the melody and lyrics are aligned at the level of not only note-syllable alignment but also discourse components (i.e., word/sentence/paragraph boundaries) of a lyric, as illustrated at the bottom of Figure 2. We create such a dataset by automatically combining two types of data available from online forum sites: digital music score data (the top of Figure 2) and raw lyrics data (the middle).

[Figure 2: Melody-lyrics alignment using the Needleman-Wunsch algorithm (score omitted). The top panel shows digital musical score data with syllables, the middle panel shows lyric text data with syllable and boundary information, and the bottom panel shows the resulting melody-lyrics alignment. 〈BOL〉 denotes a line boundary.]


A digital music score specifies a melody score augmented with syllable information for each melody note (see the top of Figure 2). Score data augmented in this way is sufficient for analyzing the relationship between the phonological aspects of lyrics and melody, but it is insufficient for our goal since the structural information of the lyrics is not included. We thus further augment the score data with the boundaries of sentences and paragraphs, where we assume that the sentences and paragraphs of lyrics are approximately captured by the lines and blocks,¹ respectively, of the lyrics in the raw text.

The integration of music scores and raw lyrics is achieved by (1) applying a morphological analyzer² to the raw lyrics for word segmentation and Chinese-character pronunciation prediction and (2) aligning the music score with the raw lyrics at the syllable level, as illustrated in Figure 2. For this alignment, we employ the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970). This alignment process is reasonably accurate because, in principle, it fails only when the morphological analyzer fails at Chinese-character pronunciation prediction, which occurs for less than 1% of the words in the dataset.
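As a rough illustration of this alignment step, the following is a minimal Needleman-Wunsch sketch over syllable sequences; the scoring scheme and function names are assumptions for illustration, not the paper's implementation.

```python
def needleman_wunsch(score_syls, lyric_syls, match=1, mismatch=-1, gap=-1):
    """Globally align two syllable sequences; returns aligned pairs,
    with None marking a gap on one side (e.g., an unmatched syllable)."""
    n, m = len(score_syls), len(lyric_syls)
    # DP table of best alignment scores.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if score_syls[i - 1] == lyric_syls[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Traceback from the bottom-right corner.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                match if score_syls[i - 1] == lyric_syls[j - 1] else mismatch):
            pairs.append((score_syls[i - 1], lyric_syls[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((score_syls[i - 1], None)); i -= 1
        else:
            pairs.append((None, lyric_syls[j - 1])); j -= 1
    return list(reversed(pairs))

# Example: syllables read off the score vs. syllables from the raw lyric text.
print(needleman_wunsch(["na", "ni", "ka"], ["na", "ka"]))
```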

With this procedure, we obtained 54,181 Japanese raw lyrics and 1,000 digital musical scores from online forum sites³; we thus created 1,000 melody-lyrics pairs. We refer to these 1,000 melody-lyrics pairs as the melody-lyrics alignment data⁴ and to the remaining 53,181 lyrics without melody as the raw lyrics data.

¹ Blocks are assumed to be segmented by empty lines.
² To extract word boundaries and syllable information for Japanese lyrics, we apply the MeCab parser (Kudo et al., 2004).



[Figure 3: Example boundaries appearing immediately after a rest (score omitted). 〈BOB〉 indicates a block boundary.]

[Figure 4: Distribution of the number of boundaries in the melody-lyrics alignment data, by position (note vs. rest) and by rest duration in MIDI ticks.]


We randomly split the 1,000 melody-lyrics alignments into two sets: 90% for analysis/training and the remaining 10% for testing. From the vocabulary, we use the 20,000 most frequent words whose syllable counts are equal to or less than 10, and convert all others to a special symbol 〈unknown〉. All of the digital music score data we collected were distributed in the UST format, a common file format designed specifically for recently emerging computer vocal synthesizers. While we focus on Japanese music in this study, our method for data creation is general enough to be applied to other score formats such as MusicXML and ABC, because transferring such formats to UST is straightforward.

³ For selecting the 1,000 songs, we chose only frequently downloaded or highly popular songs to ensure the quality of the resulting dataset.
⁴ We publicly release all source URLs of the 1,000 songs (https://github.com/KentoW/melody-lyrics).

3 Correlations between melody and lyrics

In this section, we examine two phenomena related to the boundaries of lyrics: (1) the positions of lyrics segment boundaries are biased toward melody rest positions, and (2) the probability of a boundary occurring depends on the duration of the rest, i.e., a shorter rest tends to mark a word boundary and a longer rest a block boundary, as shown in Figure 3. All analyses were performed on the training split of the melody-lyrics alignment data described in Section 2.

For the first phenomenon, we calculated the distribution of boundary appearances at the positions of melody notes and rests. Here, by the boundary of a line (or block), we refer to the position of the beginning of the line (or block).⁵ In Figure 3, we say, for example, that the boundary of the first block beginning "te-ra-shi te" coincides with Rest#1. The result, shown at the top of Figure 4, indicates that line and block boundaries are strongly biased toward rest positions and are far less likely to appear at note positions. Words, lines, and blocks rarely span beyond a long melody rest.

The bottom of Figure 4 shows the detailed distributions of boundary occurrences for different durations of melody rests, where durations of 480 and 1920 correspond to a quarter rest and a whole rest, respectively. The results exhibit a clear, strong tendency for the boundaries of larger segments to coincide more with longer rests. To the best of our knowledge, this is the first study to provide such strong empirical evidence for the correlations between lyrics segments and melody rests. It is also important to note that the choice of segment boundaries looks like a probabilistic process (e.g., some long rests carry no block boundary). This observation suggests the difficulty of describing the correlations of lyrics and melody in a rule-based fashion and motivates the probabilistic approach we present in the next section.
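This kind of analysis reduces to counting boundary types conditioned on rest duration. A minimal sketch under a hypothetical data layout (the paper's actual preprocessing code is not shown):

```python
from collections import Counter

# Hypothetical layout: each aligned song is a list of
# (event_type, duration, boundary) tuples, where event_type is
# "note" or "rest" and boundary is one of None/"word"/"BOL"/"BOB"
# for the lyric boundary that begins at that position.
def boundary_distribution(songs):
    counts = {}  # rest duration (MIDI ticks) -> Counter over boundary types
    for song in songs:
        for event_type, duration, boundary in song:
            if event_type != "rest":
                continue
            counts.setdefault(duration, Counter())[boundary] += 1
    # Normalize to P(boundary type | rest duration).
    return {d: {b: n / sum(c.values()) for b, n in c.items()}
            for d, c in counts.items()}
```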

4 Melody-conditioned language model

Our goal is to build a language model that generates fluent lyrics whose discourse segments fit a given melody, in the sense that the generated segment boundaries follow the distribution observed in Section 3. We propose to pursue this goal by conditioning a standard RNNLM on a featurized input melody. We call this model a Melody-conditioned RNNLM.

⁵ The beginning of a line/block and the end of a line/block are equivalent since there is no melody between the end and beginning of a line/block.




[Figure 5: Melody-conditioned RNNLM (architecture diagram omitted).]


The network structure of the model is illustrated in Figure 5. Formally, we are given a melody $m = m_1, \ldots, m_i, \ldots, m_I$, a sequence of notes and rests, where each $m_i$ carries pitch and duration information. Our model generates lyrics $w = w_1, \ldots, w_t, \ldots, w_T$, a sequence of words and segment boundary symbols 〈BOL〉 and 〈BOB〉, special symbols denoting a line and a block boundary, respectively. At each time step $t$, the model outputs a single word or boundary symbol, taking as input the previously generated word $w_{t-1}$ and the musical feature vector $\mathbf{n}_t$ for the current word position, which includes the context-window-based features described in the following section. In this model, we assume that the syllables of the generated words and the notes in the input melody have a one-to-one correspondence. Therefore, the position of the incoming note/rest for a word position $t$ (referred to as the target note for $t$) is uniquely determined by the syllable counts of the previously generated words.⁶ The target note for $t$ is denoted as $m_{i(t)}$, where the function $i(\cdot)$ maps time step $t$ to the index of the next note at $t$.
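A minimal sketch of this mapping under the one-to-one assumption (all names hypothetical):

```python
def target_note_index(generated_words, syllable_count, melody_types):
    """Melody index of the incoming note for the next word position.
    `melody_types` is the sequence of "note"/"rest" labels for m_1..m_I;
    boundary symbols <BOL>/<BOB> consume zero syllables."""
    consumed = sum(syllable_count(w) for w in generated_words)
    note_positions = [i for i, e in enumerate(melody_types) if e == "note"]
    # The k-th generated syllable fills the k-th note, so the next word
    # starts at the first note not yet filled (IndexError if exhausted).
    return note_positions[consumed]
```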

Here, the challenging issue with this model is training. Generally, language models require a large amount of text data to learn well, and this is also the case for learning the correlation between rest positions and syllable counts. As shown in Figure 4, most words should not overlap a long rest.

⁶ Note that our melody-lyrics alignment data used in training does not make this assumption, but we can still uniquely identify the positions of target notes based on the obtained melody-word alignment.

This means, for example, that when the incoming melody sequence for the next word position is note, note, (long) rest, note, note, as in the sequence following $m_{i(t-1)}$ in Figure 5, it is desirable to select a word whose syllable count is two or less so that the generated word does not overlap the long rest. If sufficient data were available, this tendency might be learned directly from the correlation between rests and words, without explicitly considering the syllable count of a word. However, our melody-lyrics alignments for 1,000 songs are insufficient for this purpose.

We take two approaches to address this data sparsity problem. First, we propose two training strategies that increase the number of training examples using raw lyrics, which can be obtained in much greater quantities. Second, we construct a model that predicts the number of syllables in each word, as well as the words themselves, to explicitly supervise the correspondence between rest positions and syllable counts.

In the following sections, we first describe the details of the proposed model and then present the training strategies used to obtain better models with our melody-lyrics alignment data.

4.1 Model construction

The proposed model is based on a standard RNNLM (Mikolov et al., 2010):

$$P(\mathbf{w}) = \prod_{t=1}^{T} P(w_t \mid w_0, \ldots, w_{t-1}) \qquad (1)$$

where the context words are encoded using an LSTM (Hochreiter and Schmidhuber, 1997) and the probabilities over words are calculated by a softmax function. $w_0 = \langle\mathrm{B}\rangle$ is a symbol denoting the beginning of the lyrics. We extend this model such that each output is conditioned on the context melody vectors $\mathbf{n}_1, \ldots, \mathbf{n}_t$ as well as the previous words:

$$P(\mathbf{w} \mid \mathbf{m}) = \prod_{t=1}^{T} P(w_t \mid w_0, \ldots, w_{t-1}, \mathbf{n}_1, \ldots, \mathbf{n}_t) \qquad (2)$$

The model simultaneously predicts the syllable counts of the words, sharing the LSTM parameters with the word prediction model above, in order to learn the correspondence between melody segments and syllable counts:

$$P(\mathbf{s} \mid \mathbf{m}) = \prod_{t=1}^{T} P(s_t \mid w_0, \ldots, w_{t-1}, \mathbf{n}_1, \ldots, \mathbf{n}_t) \qquad (3)$$

where $\mathbf{s} = s_1, \ldots, s_T$ is the sequence of syllable counts corresponding to $\mathbf{w}$.



At each time step $t$, the model outputs a word distribution $\mathbf{y}_t^w \in \mathbb{R}^V$ and a distribution over syllable counts $\mathbf{y}_t^s \in \mathbb{R}^S$ using a softmax function:

$$\mathbf{y}_t^w = \mathrm{softmax}(\mathrm{BN}(W_w \mathbf{z}_t)) \qquad (4)$$
$$\mathbf{y}_t^s = \mathrm{softmax}(\mathrm{BN}(W_s \mathbf{z}_t)) \qquad (5)$$

where $\mathbf{z}_t$ is the output of the LSTM at each time step, $V$ is the vocabulary size, and $S$ is the syllable-count threshold.⁷ $W_w$ and $W_s$ are weight matrices, and BN denotes batch normalization (Ioffe and Szegedy, 2015).

The input to the LSTM at each time step $t$ is the concatenation of the embedding vector of the previous word $\mathbf{v}(w_{t-1})$ and the context melody representation $\mathbf{x}_t^n$, a nonlinear transformation of the context melody vector $\mathbf{n}_t$:

$$\mathbf{x}_t = [\mathbf{v}(w_{t-1});\, \mathbf{x}_t^n] \qquad (6)$$
$$\mathbf{x}_t^n = \mathrm{ReLU}(W_n \mathbf{n}_t + \mathbf{b}_n) \qquad (7)$$

where $W_n$ is a weight matrix and $\mathbf{b}_n$ is a bias.

To generate lyrics, the model searches for the word sequence with the greatest probability (Eq. 2) using beam search. The model stops generating lyrics when the syllable count of the lyrics reaches the number of notes in the input melody.

Note that our model is not specific to the language of the lyrics: it requires only the sequences of melody, words, and syllable counts, and does not use any language-specific features.
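As a compact sketch of Eqs. (2)-(7), one might implement the network as below. PyTorch is an assumption (the paper does not state its framework); the layer sizes follow Section 5.1, and details such as initialization follow library defaults rather than the paper.

```python
import torch
import torch.nn as nn

class MelodyConditionedRNNLM(nn.Module):
    """Sketch of Eqs. (2)-(7): an LSTM over [word embedding; melody
    representation] with two heads, one over words and one over syllable
    counts. Softmax is folded into the cross-entropy loss at training time."""

    def __init__(self, vocab_size, max_syllables, melody_dim,
                 emb_dim=512, melody_hid=256, lstm_hid=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Eq. (7): nonlinear transformation of the context melody vector n_t.
        self.melody_proj = nn.Sequential(nn.Linear(melody_dim, melody_hid), nn.ReLU())
        self.lstm = nn.LSTM(emb_dim + melody_hid, lstm_hid, batch_first=True)
        self.word_head = nn.Linear(lstm_hid, vocab_size)     # Eq. (4)
        self.syll_head = nn.Linear(lstm_hid, max_syllables)  # Eq. (5)
        self.bn_w = nn.BatchNorm1d(vocab_size)
        self.bn_s = nn.BatchNorm1d(max_syllables)

    def forward(self, prev_words, melody_feats, state=None):
        # prev_words: (B, T) ids of w_{t-1}; melody_feats: (B, T, melody_dim) n_t.
        x = torch.cat([self.embed(prev_words),
                       self.melody_proj(melody_feats)], dim=-1)  # Eq. (6)
        z, state = self.lstm(x, state)
        # BatchNorm1d expects (B, C, T), hence the transposes.
        logits_w = self.bn_w(self.word_head(z).transpose(1, 2)).transpose(1, 2)
        logits_s = self.bn_s(self.syll_head(z).transpose(1, 2)).transpose(1, 2)
        return logits_w, logits_s, state
```

At generation time, the paper's procedure runs beam search (width 10, per Section 5.1) over Eq. (2) and stops once the cumulative syllable count of the output equals the number of notes in the melody.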

4.2 Context melody vector

In Section 3, we showed that the positions of rests and their durations are important factors for modeling the boundaries of lyrics. Thus, we collect the sequence of notes and rests around the current word position (i.e., time step $t$) and encode their information into the context melody vector $\mathbf{n}_t$ (see the bottom of Figure 5).

The context melody vector $\mathbf{n}_t$ is a binary feature vector that includes the musical notation type (i.e., note or rest), the duration,⁸ and the pitch of each note/rest in the context window. We collect the notes and rests around the target note $m_{i(t)}$ for the current word position $t$ with a window size of 10 (i.e., $m_{i(t)-10}, \ldots, m_{i(t)}, \ldots, m_{i(t)+10}$).

For pitch information, we use the gap (pitch interval) between the target note $m_{i(t)}$ and its previous note $m_{i(t-1)}$.

⁷ The syllable counts of 〈BOL〉 and 〈BOB〉 are zero.
⁸ We rounded each duration to one of the values 60, 120, 240, 360, 480, 720, 960, 1200, 1440, 1680, 1920, and 3840, and use a one-hot encoding for each rounded duration.

Algorithm 1 Pseudo-melody generation
1: for each syllable in the input lyrics do
2:   b ← the boundary type next to the syllable
3:   sample note pitch p ~ P(p_i | p_{i-2}, p_{i-1})
4:   sample note duration d_note ~ P(d_note | b)
5:   assign a note with (p, d_note) to the syllable
6:   sample binary variable r ~ P(r | b)
7:   if r = 1 then
8:     insert a rest with duration d_rest ~ P(d_rest | b)
9:   end if
10: end for

[Figure 6: Distribution of the number of boundaries in the pseudo-data, by rest duration in MIDI ticks.]

Here, the pitch is represented by a MIDI note number in the range 0 to 127. For example, if the target and its previous notes are 68 and 65, respectively, the gap is +3.
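One plausible featurization consistent with this description is sketched below; the exact feature layout and the encoding of the pitch gap are assumptions (the paper only states that the features are binary and window-based).

```python
import numpy as np

NOTE_TYPES = ["note", "rest"]
DURATIONS = [60, 120, 240, 360, 480, 720, 960, 1200, 1440, 1680, 1920, 3840]

def context_melody_vector(melody, i_t, window=10):
    """Features for events m_{i(t)-10 .. i(t)+10}: a type one-hot and a
    rounded-duration one-hot per event, plus the target note's pitch gap.
    `melody` is a list of dicts, e.g. {"type": "note", "duration": 480,
    "pitch": 68} (rests carry no pitch)."""
    feats = []
    for j in range(i_t - window, i_t + window + 1):
        type_oh = [0, 0]
        dur_oh = [0] * len(DURATIONS)
        if 0 <= j < len(melody):  # out-of-range positions stay all-zero
            ev = melody[j]
            type_oh[NOTE_TYPES.index(ev["type"])] = 1
            nearest = min(DURATIONS, key=lambda d: abs(d - ev["duration"]))
            dur_oh[DURATIONS.index(nearest)] = 1
        feats.extend(type_oh + dur_oh)
    # Pitch interval to the previous event (e.g. 68 - 65 = +3); simplified
    # here to the previous event, whereas the paper compares with the
    # previous *note*.
    prev = melody[i_t - 1] if i_t > 0 else None
    gap = (melody[i_t]["pitch"] - prev["pitch"]) if prev and prev.get("pitch") is not None else 0
    return np.array(feats + [gap], dtype=np.float32)
```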

4.3 Training strategies

Pretraining: The size of our melody-lyrics alignment data is limited, but raw lyrics can be obtained in large quantities. We therefore pretrain the model with the 53,181 raw lyrics and then fine-tune it with the melody-lyrics alignment data. In pretraining, all context melody vectors $\mathbf{n}_t$ are zero vectors. We refer to the pretrained-only and fine-tuned models as the Lyrics-only and Fine-tuned models, respectively.

Learning with pseudo-melody: We propose a method to augment the melody-lyrics alignment data by attaching pseudo-melodies to the 53,181 raw lyrics. We refer to the model trained on this data as the Pseudo-melody model.

Algorithm 1 shows the details of pseudo-melody generation. For each syllable in the lyrics, we first assign a note to the syllable by sampling from the probability distributions. The pitch of each note is generated from a trigram probability. We then determine whether to generate a rest next to it. Since we established the correlations between rests and the boundaries of lyrics in Section 3, the probability of a rest and its duration are conditioned on the boundary type next to the target syllable. All probabilities are estimated from the training split of the melody-lyrics alignment data.
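A runnable rendering of Algorithm 1 might look as follows. The probability tables (p_pitch, p_dur, p_rest, p_rest_dur) are estimated from the training split as described above; their exact representation here (dicts of value-to-probability) is an assumption.

```python
import random

def sample(dist):
    """Draw a value from a dict {value: probability}."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_pseudo_melody(syllables, boundaries, p_pitch, p_dur, p_rest, p_rest_dur):
    """Algorithm 1: attach a sampled note to every syllable, then
    optionally insert a rest whose probability and duration depend on
    the boundary type b following the syllable. Assumes p_pitch has an
    entry for the initial context (None, None)."""
    melody, p2, p1 = [], None, None  # last two pitches for the trigram
    for syl, b in zip(syllables, boundaries):
        p = sample(p_pitch[(p2, p1)])     # pitch ~ P(p_i | p_{i-2}, p_{i-1})
        d_note = sample(p_dur[b])         # duration ~ P(d_note | b)
        melody.append(("note", p, d_note, syl))
        if random.random() < p_rest[b]:   # r ~ P(r | b)
            melody.append(("rest", None, sample(p_rest_dur[b]), None))
        p2, p1 = p1, p
    return melody
```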

Figure 6 shows the distributions of the number of boundaries in the pseudo-data. The distributions closely resemble those of the gold data in Figure 4.

5 Quantitative evaluation

We evaluate the proposed Melody-conditioned RNNLMs quantitatively with two evaluation metrics: (1) test-set perplexity, for measuring fluency, and (2) a line/block boundary replication task, for measuring the consistency between the melody and the boundaries in the generated lyrics.

5.1 Experimental setup

We set the dimensions of the word embedding vectors and the context melody representation vectors to 512 and 256, respectively, and the dimension of the LSTM hidden state to 768. We used a categorical cross-entropy loss for the outputs $\mathbf{y}_t^w$ and $\mathbf{y}_t^s$, Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001 for parameter optimization, and a mini-batch size of 32. We applied an early-stopping strategy with a maximum of 100 epochs; training was terminated after five epochs of unimproved loss on the validation set. For lyrics generation, we used beam search with a width of 10. An example of the generated lyrics is shown in the supplemental material.

5.2 Evaluation metrics

Perplexity: Test-set perplexity (PPL) is a standard evaluation measure for language models. PPL measures the predictability of the wording in the original lyrics; a lower PPL indicates that the model can generate more fluent lyrics. We used PPL and its variant PPL-W, which excludes line/block boundary symbols, to investigate the predictability of words.
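Concretely, PPL is the exponentiated average negative log-likelihood of the test tokens; a minimal sketch:

```python
import math

def perplexity(log_probs):
    """log_probs: natural-log probabilities the model assigns to each
    test token (boundary symbols are excluded when computing PPL-W)."""
    return math.exp(-sum(log_probs) / len(log_probs))
```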

Accuracy of boundary replication: Under the assumption that the line and block boundaries of the original lyrics are placed at appropriate positions in the melody, we evaluated the consistency between the melody and the boundaries in the generated lyrics by measuring the reproducibility of the boundaries in the original lyrics. The metric we used is the F1-measure over boundary positions. We also asked a person to place line and block boundaries at plausible positions for 10 randomly selected input melodies that the evaluator had never heard. This person is not a professional musician but an experienced performer educated in musicology. The bottom part of Table 1 shows the human performance.
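For concreteness, this metric reduces to a set F1 over boundary positions; a minimal sketch, assuming boundaries are encoded as sets of note indices (an assumption about the encoding):

```python
def boundary_f1(gold_positions, pred_positions):
    """F1 over the note indices at which a given boundary type
    (e.g. <BOL>) occurs; both arguments are sets of indices."""
    tp = len(gold_positions & pred_positions)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_positions)
    recall = tp / len(gold_positions)
    return 2 * precision * recall / (precision + recall)
```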

Model          | PPL   | PPL-W | F1 (BOB) | F1 (BOL) | F1 (UB)
Lyrics-only    | 138.0 | 225.0 | 0.121    | 0.061    | 0.106
Full-data      | 135.9 | 222.1 | 0.122    | 0.063    | 0.108
Alignment-only | 173.3 | 314.8 | 0.298    | 0.287    | 0.477
Heuristic      | 175.8 | 284.7 | 0.373    | 0.239    | 0.402
Fine-tuned     | 152.2 | 275.5 | 0.260    | 0.302    | 0.479
Pseudo-melody  | 115.7 | 197.5 | 0.318    | 0.241    | 0.406
(w/o y^s)
Fine-tuned     | 155.1 | 278.1 | 0.318    | 0.241    | 0.366
Pseudo-melody  | 118.0 | 201.5 | 0.312    | 0.250    | 0.406
Human          | -     | -     | 0.717    | 0.671    | 0.751

Table 1: Results of the quantitative evaluation. "UB" denotes the score for unlabeled matching of line/block boundaries. "w/o y^s" denotes the exclusion of the syllable-count output layer.


5.3 Effect of the Melody-conditioned RNNLM

To investigate the effect of our language models, we compared the following six models. The first is (1) a Lyrics-only model, a standard RNNLM trained on 54,081 song lyrics without melody information. The second and third are baseline Melody-conditioned RNNLMs to which the proposed training strategies are not applied: (2) a Full-data model trained on mixed data (54,081 song lyrics plus the 900 melody-lyrics alignments among them), and (3) an Alignment-only model trained on only the 900 melody-lyrics alignments. The fourth is a strong baseline for evaluating the proposed approaches: (4) a Heuristic model that (i) assigns a line/block boundary to a rest based on its duration, with the probabilities reported in Figure 4, and (ii) fills the space between any two boundaries with lyrics of the appropriate syllable counts. This Heuristic model computes the following word probability:

$$P(w_t \mid w_0, \ldots, w_{t-1}, \mathbf{m}) =
\begin{cases}
Q(\langle\mathrm{BOB}\rangle \mid m_{i(t+1)}) & \text{if } w_t = \langle\mathrm{BOB}\rangle \\
Q(\langle\mathrm{BOL}\rangle \mid m_{i(t+1)}) & \text{if } w_t = \langle\mathrm{BOL}\rangle \\
\bigl(1 - Q(\langle\mathrm{BOB}\rangle \mid m_{i(t+1)}) - Q(\langle\mathrm{BOL}\rangle \mid m_{i(t+1)})\bigr) \times \dfrac{P_{\mathrm{LSTM}}(w_t \mid w_0, \ldots, w_{t-1})}{1 - P_{\mathrm{LSTM}}(\langle\mathrm{BOL}\rangle \mid w_0, \ldots, w_{t-1}) - P_{\mathrm{LSTM}}(\langle\mathrm{BOB}\rangle \mid w_0, \ldots, w_{t-1})} & \text{otherwise}
\end{cases} \qquad (8)$$

where $Q$ is the probability reported in Figure 4 and $P_{\mathrm{LSTM}}$ is the word probability calculated by a standard LSTM language model. The remaining two are Melody-conditioned RNNLMs with the proposed learning strategies: (5) the Fine-tuned and (6) the Pseudo-melody models.



[Figure 7: Distribution of the number of boundaries (by rest duration in MIDI ticks) in the test set and in the lyrics generated by the Heuristic and Pseudo-melody models.]

The top part of Table 1 summarizes the performance of these models. Regarding boundary replication, the Heuristic, Alignment-only, Fine-tuned, and Pseudo-melody models achieved higher performance than the Lyrics-only model for unlabeled matching of line/block boundaries (UB). This result indicates that our Melody-conditioned RNNLMs successfully capture the consistency between the melody and the boundaries of the lyrics. The result of the Full-data model is low, as expected, because the melody-lyrics alignment data is far smaller than the raw lyrics data, which harms the learning of the dependency between melody and lyrics. For block boundaries, the Heuristic model achieved the best performance; for line boundaries, the Fine-tuned model did.

Regarding PPL and PPL-W, the Lyrics-only, Full-data, and Pseudo-melody models show better results than the other models. The Fine-tuned model performs worse than the Lyrics-only model because fine-tuning with a small amount of data causes the language model to overfit; likewise, the training data of the Alignment-only model is insufficient for learning a language model of lyrics. Interestingly, the Pseudo-melody model performed better than the Full-data model and achieved the best overall score. This result indicates that the Pseudo-melody model uses the information in a given melody to better predict its lyrics word sequence. On the other hand, the Heuristic model had the worst performance despite being trained on a large amount of raw lyrics; we analyze the reason in Section 5.5. It is not necessarily clear whether to choose the Fine-tuned or the Pseudo-melody model, which may also depend on the size and diversity of the training and test data. However, one can conclude at least that combining a limited-scale collection of melody-lyrics alignment data with a far larger collection of lyrics-only data boosts the model's capability of generating fluent lyrics that structurally fit the input melody.

5.4 Effect of predicting syllable counts

To investigate the effect of predicting syllable counts, we compared the proposed models to models that exclude the syllable-count output layer $\mathbf{y}^s$. The middle part of Table 1 summarizes the results. For the pretraining strategy, the use of $\mathbf{y}^s$ successfully alleviates the data sparsity encountered when learning the correlation between syllable counts and melodies from words alone: the model without $\mathbf{y}^s$ performs worse on both PPL measures and on boundary replication. For the pseudo-melody strategy, on the other hand, the two models are competitive on both measures. This suggests that the Pseudo-melody model obtains a sufficient amount of word-melody input pairs to learn the correlation.

5.5 Analysis of melody and generated lyrics

To examine whether the models capture the correlations between rests and the boundaries of lyrics, we calculated the proportions of word, line, and block boundaries in the original lyrics and in the lyrics generated by the Heuristic and Pseudo-melody models for the test set (Figure 7). The proportions of 〈BOL〉 and 〈BOB〉 generated by the Heuristic model are almost equivalent to those of the original lyrics. For the Pseudo-melody model, on the other hand, the proportion of line/block boundaries at longer rests is smaller than in the original lyrics.

Although the Heuristic model reproduces the proportions of the original line/block boundaries, it performed poorly in terms of PPL, as shown in Section 5.3.



[Figure 8: Distribution of the syllable counts of the generated lines/blocks for the lyrics in the test set and the lyrics generated by the Heuristic and Pseudo-melody models, with Jensen-Shannon divergences from the test-set distributions.]

Measure | Heuristic     | Lyrics-only   | Fine-tuned    | Pseudo-melody | Human (upper bound)
L       | 2.06±1.08 / 2 | 2.33±1.23 / 2 | 2.85±1.20 / 3 | 2.93±1.14 / 3 | 3.56±1.33 / 4
G       | 2.28±1.07 / 2 | 2.81±1.16 / 3 | 2.79±1.06 / 3 | 2.97±1.08 / 3 | 3.50±1.25 / 4
LM      | 2.34±1.07 / 2 | 2.91±1.15 / 3 | 2.70±1.13 / 3 | 2.96±1.09 / 3 | 3.49±1.35 / 4
DM      | 2.33±1.10 / 2 | 2.80±1.06 / 3 | 2.59±1.11 / 3 | 2.89±1.07 / 3 | 3.49±1.30 / 4
OQ      | 2.01±1.01 / 2 | 2.59±1.15 / 3 | 2.42±1.08 / 2 | 2.65±1.01 / 3 | 3.32±1.19 / 4

Table 2: Results of the qualitative evaluation (each cell: mean ± SD / median).

By investigating the lyrics generated by the Heuristic model, we found that it tends to generate line/block boundaries after melody rests even when two rests are quite close together. Figure 8 shows the distributions of syllable counts per line/block together with the corresponding Jensen-Shannon divergences. While the Heuristic model tends to generate short lines/blocks, our model generates lyrics whose lines/blocks do not become too short. This result supports that (i) our model exploits both melody and lyric contexts, and (ii) the heuristic approach, which simply generates line/block boundaries based on the distribution in Figure 4, cannot generate fluent lyrics with well-formed line/block lengths.
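For reference, the Jensen-Shannon divergence between two such syllable-count distributions can be computed as below; a minimal sketch, with the base-2 logarithm being an assumption (the paper does not state the base).

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    given as dicts {syllable_count: probability}."""
    def kl(a, b):
        return sum(pa * math.log2(pa / b[k]) for k, pa in a.items() if pa > 0)
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```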

6 Qualitative evaluation

To assess the quality of the generated lyrics, inspired by Oliveira (2015), we asked 50 Yahoo! crowdsourcing workers to answer the following five questions using a five-point Likert scale:

Listenability (L): When listening to the melody and lyrics, are the positions of words, lines, and segments natural? (1 = Poor to 5 = Perfect)
Grammaticality (G): Are the lyrics grammatically correct? (1 = Poor to 5 = Perfect)
Line-level meaning (LM): Is each line of the lyrics meaningful? (1 = Unclear to 5 = Clear)
Document-level meaning (DM): Are the entire lyrics meaningful? (1 = Unclear to 5 = Clear)
Overall quality (OQ): What is the overall quality of the lyrics? (1 = Terrible to 5 = Great)

For the evaluation sets, we randomly selected four melodies from the RWC Music Database (Goto et al., 2002). For each melody, we prepared four lyrics generated by the Heuristic, Lyrics-only, Fine-tuned, and Pseudo-melody models. Moreover, to obtain an upper bound for this evaluation, we used lyrics created by amateur writers: we asked four native Japanese speakers to write lyrics for the evaluation melodies. One writer was a junior high school music teacher who had experience in music composition and lyrics writing. The other three writers were graduate students with varying levels of musical expertise; two had experience with music composition, but none had experience with writing lyrics.⁹ As a result, we obtained 50 (workers) × 4 (melodies) × 5 (lyrics) samples in total. We note that the workers did not know whether the lyrics were created by a human or generated by a computer.

Table 2 shows the average scores, standard deviations, and medians for each measure. Regarding "Listenability", workers gave high scores to the Fine-tuned and Pseudo-melody models, which are trained using both melody and lyrics; this is consistent with the perplexity evaluation. Regarding "Grammaticality" and the two "Meaning" measures, workers gave high scores to the Lyrics-only and Pseudo-melody models, which are well trained on a large amount of text data. This result is consistent with the results of the boundary replication task.

⁹ We release the lyrics and audio files used in the qualitative evaluation on the Web (https://github.com/KentoW/deep-lyrics-examples).



Regarding "Overall quality", the Pseudo-melody model outperformed all other models, indicating that our pseudo-data learning strategy contributes to generating high-quality lyrics. However, the quality of automatically generated lyrics is still worse than that of lyrics produced by humans, and developing computational models that generate high-quality lyrics remains an open challenge for future research.

7 Related work

In the literature, a broad range of research efforts has been reported on computationally modeling lyrics-specific properties such as meter, rhythm, rhyme, stress, and accent (Greene et al., 2010; Reddy and Knight, 2011; Watanabe et al., 2014, 2016). While these studies provide insightful findings on the properties of lyrics, none of them uses melody-lyrics parallel data for modeling the correlations between lyrics and melody structures. One exception is the work of Nichols et al. (2009), who used melody-lyrics parallel data to investigate, for example, the correlation between syllable stress and pitch; however, their exploration covers only correlations at the prosody level, not structural correlations.

The same trend can be seen in the literature on automatic lyrics generation, where most studies utilize only lyrics data. Barbieri et al. (2012) and Abe and Ito (2012) propose models for generating lyrics under a range of constraints specified in terms of rhyme, rhythm, part of speech, etc. Potash et al. (2015) propose an RNNLM that generates rhymed lyrics under the assumption that rhymes tend to coincide with the ends of lines. In those studies, the melody is considered only indirectly: the input prosodic/linguistic constraints and preferences on the lyrics are assumed to be manually provided by a human user, because the proposed models are not capable of interpreting a given melody and transforming it into constraints or preferences.

We have so far found two studies in the literature that propose a method for generating lyrics for a given melody. Oliveira et al. (2007) and Oliveira (2015) manually analyze correlations among melodies, beats, and syllables using 42 Portuguese songs and propose a set of heuristic rules for lyrics generation. Ramakrishnan A et al. (2009) attempt to induce a statistical model for generating melodic Tamil lyrics from melody-lyrics parallel data using only ten songs. However, the former captures only phonological aspects of melody-lyrics correlations and can generate only a small fragment of lyrics (not entire lyrics) for a given piece of melody, while the latter suffers from a severe shortage of data and does not report empirical experiments.

8 Conclusion and future work

This paper has presented a novel data-driven approach to building a melody-conditioned lyrics language model. We created a 1,000-song melody-lyrics alignment dataset and conducted a quantitative investigation of the correlations between melodies and the segment boundaries of lyrics; no prior work has conducted such a quantitative analysis of melody-lyrics correlations with this size of data. We have also proposed an RNN-based, melody-conditioned language model that generates fluent lyrics whose word/line/block boundaries fit a given input melody. Our experimental results have shown that: (1) our Melody-conditioned RNNLMs capture the consistency between the melody and the boundaries of the lyrics while maintaining word fluency; (2) combining a limited-scale collection of melody-lyrics alignment data with a far larger collection of lyrics-only data for training boosts the model's competence; (3) applying a multi-task learning scheme in which the model is trained for syllable-count prediction as well as word prediction has a positive empirical effect; and (4) the human judgments collected via crowdsourcing showed that our model improves the quality of the generated lyrics.

For future directions, we plan to extend the proposed model to capture other aspects of lyrics/melody discourse structure, such as repetition, verse-bridge-chorus structure, and the topical coherence of discourse segments. The proposed method for creating melody-lyrics alignment data enables us to explore such a broad range of aspects of melody-lyrics correlations.

Acknowledgments

This study utilized the RWC Music Database (Popular Music). This work was partially supported by a Grant-in-Aid for JSPS Research Fellows, Grant Number JP16J05945, JSPS KAKENHI Grant Number JP15H01702, and JST ACCEL Grant Number JPMJAC1602. The authors would like to thank Dr. Paul Reisert for the English language review.



References

Chihiro Abe and Akinori Ito. 2012. A Japanese lyrics writing support system for amateur songwriters. In Proceedings of the Asia-Pacific Signal & Information Processing Association Annual Summit and Conference 2012 (APSIPA ASC 2012), pages 1–4.

Dave Austin, Jim Peterik, and Cathy Lynn Austin. 2010. Songwriting for Dummies. Wileys.

Gabriele Barbieri, François Pachet, Pierre Roy, and Mirko Degli Esposti. 2012. Markov constraints for generating lyrics with style. In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI 2012), pages 115–120.

Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. 2002. RWC Music Database: Popular, classical and jazz music databases. In Proceedings of the 3rd International Conference on Music Information Retrieval (ISMIR 2002), volume 2, pages 287–288.

Erica Greene, Tugba Bodrumlu, and Kevin Knight. 2010. Automatic analysis of rhythmic poetry with applications to generation and translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 524–533.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), volume 37, pages 448–456.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 230–237.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech 2010, pages 1045–1048.

Saul B. Needleman and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453.

Eric Nichols, Dan Morris, Sumit Basu, and Christopher Raphael. 2009. Relationships between lyrics and melody in popular music. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009), pages 471–476.

Hugo Gonçalo Oliveira. 2015. Tra-la-lyrics 2.0: Automatic generation of song lyrics on a semantic domain. Journal of Artificial General Intelligence, 6(1):87–110.

Hugo R. Gonçalo Oliveira, F. Amílcar Cardoso, and Francisco C. Pereira. 2007. Tra-la-lyrics: An approach to generate text based on rhythm. In Proceedings of the 4th International Joint Workshop on Computational Creativity, pages 47–55.

Peter Potash, Alexey Romanov, and Anna Rumshisky. 2015. GhostWriter: Using an LSTM for automatic rap lyric generation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1919–1924.

Ananth Ramakrishnan A, Sankar Kuppan, and Sobha Lalitha Devi. 2009. Automatic generation of Tamil lyrics for melodies. In Proceedings of the Workshop on Computational Approaches to Linguistic Creativity, pages 40–46.

Sravana Reddy and Kevin Knight. 2011. Unsupervised discovery of rhyme schemes. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 77–82.

Tatsuji Ueda. 2010. よくわかる作詞の教科書 (The writing lyrics textbook which is easy to understand; in Japanese). YAMAHA Music Media Corporation.

Kento Watanabe, Yuichiroh Matsubayashi, Kentaro Inui, and Masataka Goto. 2014. Modeling structural topic transitions for automatic lyrics generation. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computation (PACLIC 2014), pages 422–431.

Kento Watanabe, Yuichiroh Matsubayashi, Kentaro Inui, Tomoyasu Nakano, Satoru Fukayama, and Masataka Goto. 2017. LyriSys: An interactive support system for writing lyrics based on topic transition. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (ACM IUI 2017), pages 559–563.

Kento Watanabe, Yuichiroh Matsubayashi, Naho Orita, Naoaki Okazaki, Kentaro Inui, Satoru Fukayama, Tomoyasu Nakano, Jordan Smith, and Masataka Goto. 2016. Modeling discourse segments in lyrics using repeated patterns. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016), pages 1959–1969.
